
which of the following is good way for predicting any field?

nasrinmulani — Thu, 17 May 2018 09:03:40 GMT

Hi All,

I am working on predicting the start time of a job, and I have the scheduled time as an independent variable.

Approach 1:
I am thinking of converting the H:M:S start time and scheduled time into seconds, then predicting the start time in seconds using the scheduled time in seconds and the hour of the scheduled time as independent variables. I would then convert the prediction back into H:M:S and append it to the respective date.
Approach 2:
Another approach is to convert the start time and scheduled time into epoch time, take the difference between them, and predict that difference using the scheduled time in epoch, the hour of the scheduled time, and the type of the job as independent variables.
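
To make the two targets concrete, here is a minimal Python sketch of the conversions involved. The function names (`hms_to_seconds`, `seconds_to_hms`, `delay_seconds`) are hypothetical, chosen only for illustration, and are not part of any Splunk or Toolkit API.

```python
def hms_to_seconds(hms: str) -> int:
    """Approach 1: turn an 'H:M:S' string into seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total_seconds: int) -> str:
    """Approach 1, reverse step: turn predicted seconds back into 'HH:MM:SS'."""
    total_seconds %= 86400  # wrap values that run past midnight
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def delay_seconds(scheduled_epoch: int, start_epoch: int) -> int:
    """Approach 2: the target is the gap between the two epoch timestamps."""
    return start_epoch - scheduled_epoch
```

Note that the seconds-since-midnight target wraps around at midnight, so a job scheduled at 23:59 that starts at 00:01 looks like a large negative jump, whereas the epoch difference does not have that problem.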

Please let me know which approach is better, and whether the RandomForestRegressor algorithm is feasible here.

Thanks in advance!

Re: which of the following is good way for predicting any field?

aoliner_splunk — Thu, 17 May 2018 19:52:49 GMT

This question is impossible to answer well without knowing more about the data, but here are a few suggestions based on what you've provided:

1. Predict delay (the difference between scheduled and start time) rather than start time.
2. Use derived features of the scheduled time (like hour or day of the week) in addition to the epoch time.
3. Try different algorithms, including random forest, and see which works best. If you stick with the defaults in the Toolkit, you only need to run the assistant a handful of times.
4. Think about what you want from this model. If minimal RMSE is your goal, #3 is sufficient. If you want an interpretable model that tells you what features are important, for example, some models are better choices than others (models that support the summary command will be automatically summarized at the bottom of the assistant).
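
As a rough illustration of the first two suggestions, the sketch below builds one training row from a pair of epoch timestamps. The function name `build_row` and the field names are assumptions made up for this example, and the timestamps are interpreted as UTC, which may not match how your data is recorded.

```python
from datetime import datetime, timezone


def build_row(scheduled_epoch: int, start_epoch: int) -> dict:
    """Derive features from the scheduled time and use delay as the target."""
    # Assumption: epochs are in seconds and should be read as UTC.
    scheduled = datetime.fromtimestamp(scheduled_epoch, tz=timezone.utc)
    return {
        "scheduled_epoch": scheduled_epoch,       # raw epoch feature
        "hour": scheduled.hour,                   # derived: hour of day, 0-23
        "day_of_week": scheduled.weekday(),       # derived: Monday=0 ... Sunday=6
        "delay": start_epoch - scheduled_epoch,   # target: seconds late (or early)
    }
```

The resulting `hour` and `day_of_week` columns can then be offered to whichever algorithm you are trying, alongside the raw epoch value.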

Re: which of the following is good way for predicting any field?

nasrinmulani — Mon, 21 May 2018 11:12:21 GMT

Thanks Aoliner,

I have tried both approaches, but I got good results with Approach 1, converting the start time into seconds.
Random forest is working fine for me, but I have some outliers, and because of them my RMSE is high even though the R-squared value comes out at 0.99.

I have one question: should we remove the outliers (deviations in the data), or should they stay in?

Re: which of the following is good way for predicting any field?

aoliner_splunk — Mon, 21 May 2018 15:58:34 GMT

Do you consider the outliers to be noise (e.g., measurement error, external interference, etc.) or a phenomenon you want to model?

Also, perfect prediction isn't always possible, especially in the presence of random noise or factors missing from your dataset. You may find it difficult to do better than R^2=0.99.
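
One way to see how a high RMSE and R^2=0.99 can coexist is to compute both metrics by hand. The sketch below uses four made-up jobs (invented purely for illustration, not taken from the poster's data) and a naive "always 60 seconds late" prediction; `rmse` and `r_squared` are plain implementations of the standard formulas, not Toolkit functions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, i.e. error relative to the target's own spread."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical data: scheduled times in seconds since midnight,
# with one job (the last) delayed by an hour as the outlier.
scheduled = [0, 21600, 43200, 64800]
delays = [60, 120, 30, 3600]
starts = [s + d for s, d in zip(scheduled, delays)]

# Naive model: every job starts 60 seconds after its scheduled time.
pred_delays = [60] * len(delays)
pred_starts = [s + 60 for s in scheduled]

print("RMSE, start-time target:", rmse(starts, pred_starts))
print("RMSE, delay target:     ", rmse(delays, pred_delays))
print("R^2,  start-time target:", r_squared(starts, pred_starts))
print("R^2,  delay target:     ", r_squared(delays, pred_delays))
```

The errors are identical for both targets, so RMSE is the same, but R^2 is measured against the spread of the target. Start times range across the whole day, so the start-time target can score above 0.99 while the same predictions score poorly on the delay target. Comparing RMSE with and without the outlier rows is a quick way to judge how much they drive the error.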