Hi,
Sorry I was away from the forum; I had some urgent issues at work.
The most important thing to consider here is the RMSE. I don't think I explained it well earlier.
A high RMSE is a problem if the errors are large across a sizeable share of your data set, say 10-20% of it.
However, if only a few events have very large errors, your model may still be fine, because a handful of badly predicted events is enough to inflate the RMSE.
So, how many (or what percent) of the 51K events that you have are giving very large errors? You did not mention that.
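To answer that question, something like the sketch below could help. It reports the overall RMSE together with the percentage of events whose absolute error is above a cutoff. The function name `error_summary`, the cutoff of 50 and the numbers are all made up for illustration; they are not from your data.

```python
import math

def error_summary(actual, predicted, threshold):
    """Return (RMSE, percent of events whose absolute error exceeds threshold)."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    large = sum(1 for e in errors if e > threshold)
    return rmse, 100.0 * large / len(errors)

# Made-up durations: nine close predictions and one very bad one.
actual = [100, 120, 90, 110, 105, 95, 130, 115, 100, 1000]
predicted = [102, 118, 93, 108, 107, 96, 127, 116, 99, 400]

rmse, pct_large = error_summary(actual, predicted, threshold=50)
print(f"RMSE = {rmse:.1f}, events with |error| > 50: {pct_large:.0f}%")
```

In this toy data a single bad event out of ten drives almost the whole RMSE, which is the situation where a high RMSE does not mean the model is bad overall.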
Also, you say that without pre-processing you were already getting an RMSE in the range of 150-200, and after pre-processing you get about 198. Considering the jump in your R-squared to 0.98 after pre-processing, it might be worth a try.
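As a rough sanity check on those two numbers: if R-squared and RMSE are computed on the same data, then R-squared = 1 - RMSE^2 / Var(y), so together they tell you how spread out the durations must be. This is only a sketch under that assumption, and `implied_target_sd` is a name I made up.

```python
import math

def implied_target_sd(rmse, r_squared):
    """Standard deviation of the target implied by R^2 = 1 - RMSE^2 / Var(y).

    Assumes both metrics were computed on the same data set.
    """
    if r_squared >= 1.0:
        raise ValueError("r_squared must be below 1")
    return rmse / math.sqrt(1.0 - r_squared)

sd = implied_target_sd(rmse=198, r_squared=0.98)
print(f"implied standard deviation of the target: {sd:.0f}")
```

If the spread that comes out is far larger than the RMSE, the error is small relative to how much the durations vary, which would be consistent with the high R-squared.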
And now for my summary:
1- Point to consider here (and always remember): RMSE is in the same units as your dependent variable. So, assuming you are predicting duration, what are some typical duration values? Are they in the range of 100-300 or something like that? If your typical values are much larger than the RMSE (say, in the thousands), your RMSE is pretty good! However, if your dependent variable is always in the range of, say, 50-150, this is too high and indicates some cases where the predicted and actual values are very far apart.
2- If you can quantify your job groupings (90 distinct values is OK), you can try running the random forest again with both start time and job group as independent variables, job duration being the sole dependent one, AND also try having just job group as the sole independent variable. Look at the models generated by these two runs and compare their R-squared and RMSE. If either of them looks better than what you have now, maybe you should consider that model.
3- Finally, the RMSE for your test and sample (training) data should be similar. Is your test data RMSE much higher in general than your sample data RMSE? If yes, you have overfit your data.
4- If our model does much better on the training set than on the test set, then we’re likely overfitting. For example, it would be a big red flag if our model saw 99% accuracy on the training set but only 55% accuracy on the test set.
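Points 2 to 4 could be tried out with something like the sketch below. It assumes scikit-learn is installed and that job group has already been turned into numbers (integer codes or one-hot columns, for example). The function names, the 70/30 split and the 100 trees are my own choices for illustration, not anything from your setup.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def compare_feature_sets(feature_sets, y, test_size=0.3, seed=0):
    """Fit one random forest per feature set; report train/test RMSE and test R^2.

    feature_sets maps a label to a numeric 2-D feature matrix whose rows line up with y.
    """
    # scikit-learn is imported here so that rmse() above works without it installed.
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    results = {}
    for label, X in feature_sets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X_tr, y_tr)
        test_pred = model.predict(X_te)
        results[label] = {
            "train_rmse": rmse(y_tr, model.predict(X_tr)),
            "test_rmse": rmse(y_te, test_pred),
            "test_r2": r2_score(y_te, test_pred),
        }
    return results
```

You would call it with something like `compare_feature_sets({"start time + job group": X_both, "job group only": X_group}, durations)`, where `X_both`, `X_group` and `durations` are placeholders for your own data. A test RMSE far above the train RMSE for a given feature set is the overfitting sign from points 3 and 4.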
Keep me posted 🙂 and sorry once again for the delay, but I was really caught up in office work.