
Hey guys, I'm a total noob, new to this field. This post is open to criticism and suggestions.

So I want to forecast slippage (the dependent variable) using % WT Plan (planned work, the predictor).

I get a correlation of -0.35 between them.

I'm using KNN regression.

How does this look to you? Insights and comments are appreciated. Thanks!
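
Roughly what I'm doing, as a simplified sketch (not my exact code; the file name is made up, and the column names match my plot labels):

```python
# Simplified sketch of the setup: KNN regression of Slippage on % WT Plan.
# "slippage.csv" is a made-up file name; PctWTP / Slippage match the plot labels.
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("slippage.csv")
X = df[["PctWTP"]]      # predictor: planned work %
y = df["Slippage"]      # target

print(X["PctWTP"].corr(y))        # this is where the -0.35 comes from

model = KNeighborsRegressor(n_neighbors=20)
model.fit(X, y)
print(model.score(X, y))          # R^2 score
```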

Comments
  • 1
    Apart from that, what in the unholy name of C# is that?
    Looks cool and all but wha-
  • 2
Your accuracy for the 2nd model is 84%.
If that's real-life data you're using, then the result is pretty good, in my opinion.
I don't know the nature of your data or why you chose KNN regression.
Maybe you can compare with other models and see if you can squeeze out more accuracy.
  • 2
@cho-uc thanks!

Those are real data. Yeah, I was also told that KNN regression isn't normally used with volatile data.

I'll probably do SVM regression or LDA next, idk. Do you have any model to recommend that's good with volatility?

Thanks again :)
  • 2
@cuburt
I just try everything from Polynomial Regression, Lasso, and Ridge to ElasticNet, and pick the best one.
If you have time and compute power, you can even try a neural net for regression.

@NoMad is more of an expert on this.
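
For the "try everything" part, a rough sketch (assumes scikit-learn and that X and y are already loaded; the alpha values are just starting points):

```python
# Rough sketch: compare a few regressors with cross-validation and pick the best.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.neighbors import KNeighborsRegressor

models = {
    "knn":        KNeighborsRegressor(n_neighbors=5),
    "poly2":      make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "lasso":      make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "ridge":      make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "elasticnet": make_pipeline(StandardScaler(), ElasticNet(alpha=0.1)),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:10s} R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```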
  • 3
Disclaimer: I'm not an expert; it's hit or miss with me.

First off, why are you plotting prediction dots (the green ones)? What are you trying to see? The prediction line is basically all you need.

Your model is meh accurate. I don't get why KNN, but maybe it could work. You may need more data points to get a better answer, though.

What are you using (method/function) to get that correlation? Correlation without a point of reference doesn't have much meaning. But all in all, that number most likely says there's a small-ish to moderate negative correlation, which your prediction line kind of supports. (Or so I think; correct me, anybody, if I'm wrong.)
  • 2
Also, maybe play with k (currently 20 neighbors) and see if your results are any different. I'd suggest 3, 4, 5, 9, or 10 instead. I don't think a higher k necessarily gives you a better answer (your data is a tad too cluttered to see any definite patterns here).
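
A quick sketch of what I mean (assumes scikit-learn and an existing train/test split):

```python
# Sketch: sweep a few k values and compare test-set scores.
from sklearn.neighbors import KNeighborsRegressor

for k in [3, 4, 5, 9, 10, 20]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(k, model.score(X_test, y_test))
```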
  • 1
@NoMad now that I think about it, idrk why I plotted those green dots, lol. I based the variable selection on a correlation matrix.
  • 1
@NoMad yep, I already did that. I made a line graph for that as well.
  • 3
@cuburt wait, why is accuracy on the training set soooo different from the test set??? Like, accuracy on the training set shouldn't be computed on the whole set at once... I'm confused about the training-set accuracy here, tbh.
  • 2
Like, you train the model, regardless of accuracy, and then the accuracy on the test set is your measure. While training, your model is still learning; accuracy has no meaning for it.
  • 2
And on your test, k = 2, 3, and 14 work too.
  • 0
@NoMad oh, maybe it's because the sets are being fitted separately and scored separately.
  • 1
@cuburt accuracy on training has no meaning, afaik ¯\_(ツ)_/¯
Unless you mini-batch the jobs; then yes, your loss/accuracy are relevant to overfitting.

My best suggestion would be to use SVM instead. Use an RBF kernel and play with its parameters until you get a proper "Test" result.
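
Roughly like this (a sketch; the C and gamma grids are just guesses to start from):

```python
# Sketch: SVR with an RBF kernel; grid-search C and gamma, score on the held-out test set.
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = GridSearchCV(
    svr,
    {"svr__C": [0.1, 1, 10, 100], "svr__gamma": ["scale", 0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X_train, y_train)          # fit on train only
print(grid.best_params_, grid.score(X_test, y_test))
```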
  • 1
    @cho-uc @NoMad

    Thank you so much! I'll keep those in mind
  • 2
@cuburt actually, one more thing. Try k = 2 and 3, and compare the test results to this one. I'd be interested to know how it goes.
  • 1
@NoMad so I should fit the train set and score the test set?

The reason I did it that way is so I can compare their accuracies; that way I can tell if it's overfitting.
  • 2
@cuburt well, are you batching the jobs? If yes, then yeah, you can compare accuracy/loss.
And yes, bingo, ding ding ding, correct, hooray! Fit the train and score the test.
It's better if you don't train/fit on the test set as well, or you're actually doing magic on the model. (Except if you're doing lifelong learning, which, tbh, I personally have problems wrapping my head around, so I'll leave you to it.)
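
Anyway, the flow should look roughly like this (sketch, assuming scikit-learn):

```python
# Sketch of the fit-on-train / score-on-test flow being described.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsRegressor(n_neighbors=3)
model.fit(X_train, y_train)              # the model only ever sees the training set
print(model.score(X_train, y_train))     # train score: only a rough overfitting signal
print(model.score(X_test, y_test))       # test score: the number that matters
```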
  • 1
@NoMad

k=3. I fitted only the training set and scored the test set. It's gotten low. LOL
  • 2
@cuburt accuracy went near zero, but the error rate stayed the same 🤔 Interesting indeed.
  • 1
Can you get the same accuracy graph across k values, on the test set again, without fitting on the test set? That may change now.
  • 1
Just to point out something: look at PctWTP around 1. The variation there is more than half the range of Slippage in the training set. I mean, maybe this is why your model is confused; it finds neighborhoods with very high variation.
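
You can check that numerically with something like this (sketch; the 0.05 window is arbitrary):

```python
# Sketch: spread of Slippage among training points with PctWTP near 1.
import numpy as np

mask = np.abs(X_train["PctWTP"] - 1.0) < 0.05    # arbitrary window around 1
local_spread = y_train[mask].max() - y_train[mask].min()
full_range = y_train.max() - y_train.min()
print(local_spread / full_range)   # > 0.5 matches "more than half the range"
```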
  • 1
@cuburt interesting. What accuracy would k=25 give you?
  • 1
@NoMad sometimes it goes to -0.22.
  • 2
@cuburt 😂😂😂 I'm legit confused about how accuracy could go below 0 😂😂😂
  • 1
@NoMad I'm not sure it's accuracy; it's the model score or something.
  • 1
Because 0 accuracy means none of the examples were predicted right. What would accuracy below zero mean? They were predicted right, but not? 🤔 Like, it can't mean the reverse, unless you're not getting accuracy but the distance of something.
  • 1
@NoMad here I used RMSE.
  • 3
Assuming you're using `sklearn.neighbors.KNeighborsRegressor`, I'd guess the `score` method you're calling is this one:
https://scikit-learn.org/stable/...

which does say: "The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0."
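
A tiny sketch of how that plays out:

```python
# Sketch: .score() here is R^2, and R^2 goes negative when the model is worse
# than just predicting the mean of y.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
print(r2_score(y_true, np.full(3, y_true.mean())))  # 0.0: constant-mean baseline
print(r2_score(y_true, np.array([3.0, 1.0, 2.0])))  # negative: worse than the baseline
```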
  • 2
@cuburt that looks nicer. Still, 30 neighbors is a shitton. Like, I don't know, maybe I'm wrong; if your model responds to it, maybe that's how it should go. Can you find out how many test points it gets wrong, within a reasonable margin? That could give you a better measure of accuracy.
  • 2
Actually, another idea could be to find another feature. (X in your case is only PctWTP; maybe add another feature that could be linked to how these neighborhoods are defined. AKA make the model 3D, or more.)
Ofc, it'd take time, so it's up to you if you want to go down that rabbit hole.
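
Mechanically, it's just a wider X (sketch; "OtherFeature" is a placeholder, not a real column in your data):

```python
# Sketch: same KNN regressor, just with more than one feature column.
# "OtherFeature" is hypothetical; pick something plausibly related to Slippage.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = df[["PctWTP", "OtherFeature"]]   # 2+ features = "3D or more"
y = df["Slippage"]

# Scaling matters here: KNN uses distances, so an unscaled feature with a
# big range would dominate the neighborhoods.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
```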
  • 2
On second thought, ignore that accuracy measure. RMSE is your error rate in this case.
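
E.g. (sketch, assuming scikit-learn):

```python
# Sketch: RMSE on the test set, in the same units as Slippage.
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
```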
  • 0
@NoMad is that what a multivariate model is? Because that's where I'm supposed to be heading.
  • 2
    @cuburt I think so. Because this is showing you that PctWTP is not the only factor in estimating the Slippage.
  • 1
Though, iirc, MANOVA (which deals with multivariate stuff) deals more with variation. 🤔 At this point, I'm even confusing myself, so let's say idk. ¯\_(ツ)_/¯
  • 1
One last request (cuz curiosity totally didn't kill the cat): can you show me how your test graph looks using k=25 or 30 or 40 or whatever you like?
(Titled "KNN regression, slippage vs PctWTP".)
  • 0
@NoMad k=30. As the previous line graph showed, accuracy also increased (although accuracy doesn't really make much sense to me now; it's not even the R squared I thought it was).
  • 2
@cuburt if you're actually curious, what you can do next is plot two more lines: your prediction line +/- the RMSE. Then calculate how many data points fall between those two lines. Then you can say your model has a variation of RMSE and an error rate of {the number of points that didn't fall between the two lines, divided by the total number of samples}. That could be a much better model than just the line of prediction (which is practically what SVM does, much more nicely).
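
In code, roughly (sketch; reuses the fitted model and the RMSE from before):

```python
# Sketch: how many test points fall inside prediction +/- RMSE.
import numpy as np

y_pred = model.predict(X_test)
inside = np.abs(y_test - y_pred) <= rmse     # within the band
print(f"coverage: {inside.mean():.0%}, error rate: {1 - inside.mean():.0%}")
```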
  • 1
@NoMad okay, I'll definitely keep that in mind. I'm actually planning to do SVM regression next.

Thank you so much! It makes much more sense to me now.
  • 0
@pythonInRelay I keep seeing this; what is it from?