Hello,
I use the Splunk Machine Learning Toolkit. I would like to predict a rare event. The predicted variable has two values : "GOOD" and "BAD". The "BAD" only represents 13% of the data.
I use RandomForestClassifier to do the prediction, but it has serious difficulty predicting the "BAD". The confusion matrix is:
| Actual      | Predicted GOOD | Predicted BAD |
| Actual BAD  | 11.9%          | 88.1%         |
| Actual GOOD | 19.4%          | 80.6%         |
Of course, this model shows good overall results, with a precision of 0.87 and an F1 of 0.85, because most of the time the result is GOOD, but it doesn't work for the "BAD".
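To see why a high overall precision can hide a useless model on a rare class, here is a small illustration with scikit-learn metrics. The labels are synthetic, matching only the 13%/87% split from the post, not the actual data:

```python
# Hypothetical labels: 100 events with a 13% minority class, and a "lazy"
# model that predicts GOOD every single time.
from sklearn.metrics import precision_score, recall_score

y_true = ["GOOD"] * 87 + ["BAD"] * 13
y_pred = ["GOOD"] * 100

# Precision on GOOD looks fine because GOOD dominates the data...
print(round(precision_score(y_true, y_pred, pos_label="GOOD"), 2))  # 0.87

# ...but recall on BAD is zero: the model never finds the rare event.
print(recall_score(y_true, y_pred, pos_label="BAD"))  # 0.0
```

This is why per-class recall (or the full confusion matrix) is a better yardstick than overall precision when one class is rare.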
How can I improve my model? Is it possible to use class_weight or something like that?
Thank you in advance for your answer
You can use this as reference for adding class weight in your algo or you can use Github algo directly for your case: https://github.com/splunk/mltk-algo-contrib/blob/master/src/bin/algos_contrib/CustomDecisionTreeClas...
Thank you for your answer. I tried to use the algorithm after registering it, creating the Python script file and adding the GitHub algo.
But when I do the following search :
... | fit CustomDecisionTreeClassifier splitter=best criterion=gini class_weight="{'GOOD':7,'BAD':1}" "explained_variable" from "explanatory_variable_1" "explanatory_variable_2" "explanatory_variable_3" .... into "test" as prediction
I get the following error: Error in 'fit' command: Error while saving model "test": Not JSON serializable: algos.CustomDecisionTreeClassifier.CustomDecisionTreeClassifier
I think it's because of the SimpleObjectCodec, but I don't really know how to fix it.
No, it's not a SimpleObjectCodec problem. I know the reason behind it; let me explain.
Change:
codecs_manager.add_codec('algos_contrib.CustomDecisionTreeClassifier', 'CustomDecisionTreeClassifier', SimpleObjectCodec)
to:
codecs_manager.add_codec('algos.CustomDecisionTreeClassifier', 'CustomDecisionTreeClassifier', SimpleObjectCodec)
i.e. you are replacing "algos_contrib" with "algos".
- Make sure to register your algorithm under algos.conf
- Restart Splunk and it will work for you 🙂
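For reference, the corrected register_codecs would then look roughly like this. This is a sketch of a Splunk-only fragment (it runs inside the MLTK, not standalone), assuming the standard MLTK codec framework imports; the scikit-learn module path may differ by version:

```python
# Sketch: inside the copied CustomDecisionTreeClassifier.py, assuming the
# usual MLTK imports at the top of the file:
#     from codec import codecs_manager
#     from codec.codecs import SimpleObjectCodec

@staticmethod
def register_codecs():
    # 'algos', not 'algos_contrib', since the file now lives under bin/algos
    codecs_manager.add_codec('algos.CustomDecisionTreeClassifier',
                             'CustomDecisionTreeClassifier', SimpleObjectCodec)
    # The wrapped scikit-learn estimator also needs a codec so that
    # "| fit ... into" can serialize the saved model
    codecs_manager.add_codec('sklearn.tree.tree', 'DecisionTreeClassifier',
                             SimpleObjectCodec)
```

The "Not JSON serializable" error appears precisely when the module path passed to add_codec does not match where the class actually lives.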
Here is the syntax to use class_weight
| fit DecisionTreeClassifier class_weight="{'Yes':1,'No':0.1}"
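Under the hood this maps onto scikit-learn's class_weight parameter. A minimal sketch with synthetic data (the feature values and weights here are made up for illustration; note that the larger weight goes to the rare class you want the model to stop missing):

```python
# Synthetic imbalanced dataset: ~13% "BAD", like the thread's split.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = rng.rand(200, 2)
y = np.where(rng.rand(200) < 0.13, "BAD", "GOOD")

# Up-weighting BAD makes misclassifying a BAD event 7x costlier, so the
# tree is pushed to split in favour of the rare class.
clf = DecisionTreeClassifier(class_weight={"GOOD": 1, "BAD": 7},
                             random_state=0)
clf.fit(X, y)
print(clf.classes_)
```

class_weight="balanced" is also accepted and computes the weights automatically as inversely proportional to class frequencies.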
Hi
Class weight is something you can access by using the ML APIs and exposing that parameter in the code. https://docs.splunk.com/Documentation/MLApp/4.0.0/API/Introduction
Or you can change the events in your search, by sampling by class manually using SPL to balance the classes, and then using the |fit command on that balanced data.
Thank you for your answer. But how can I sample classes manually with SPL?
There are many ways depending on the type of sampling you wish to use.
https://docs.splunk.com/Documentation/MLApp/4.0.0/User/Customsearchcommands#sample
.. | search fieldforclass="class_label_A" | sample partitions=100 seed=1001 | where partition_number<=70 | outputlookup class_label_A.csv
.. | search fieldforclass="class_label_B" | sample partitions=100 seed=1001 | where partition_number<=70 | outputlookup class_label_B.csv
combine the two like so
| inputlookup class_label_A.csv | append [| inputlookup class_label_B.csv]
Note that there are far more performant options if you use summary indexes or maybe even use the proportional option on the sample command itself.
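For comparison, the same per-class downsampling can be sketched outside Splunk with pandas (the field and class names are the placeholder ones from the SPL above, and the data is synthetic):

```python
# Synthetic imbalanced data: 90 rows of class A, 10 rows of class B.
import pandas as pd

df = pd.DataFrame({
    "fieldforclass": ["class_label_A"] * 90 + ["class_label_B"] * 10,
    "value": range(100),
})

# Downsample every class to the size of the rarest one (here, 10 rows).
n = df["fieldforclass"].value_counts().min()
balanced = df.groupby("fieldforclass").sample(n=n, random_state=1001)

print(balanced["fieldforclass"].value_counts().to_dict())
```

As with the SPL version, fixing the random seed keeps the sample reproducible between runs.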
As per your comment, your prediction accuracy for BAD is low, but in the table you shared, for actual BAD values the predicted-BAD accuracy is 88.1%, which seems great, while for GOOD the prediction is only 19.4% accurate, which is not good. It seems like the table is inverted.
Coming back to your problem of improving your accuracy of predicting bad , there are three options:
1) Getting more data for the BAD cases; this would help the model understand those cases better. Also, it's possible that the fields used for prediction don't have a good relation with the target variable, so including new predictor variables could also help.
2) Trying different algorithms. (Although RandomForestClassifier is a good one)
3) Using the MLSPL API https://docs.splunk.com/Documentation/MLApp/4.0.0/API/Introduction , bringing a resampling algorithm into MLTK and using it to resample your data to 50% BAD and 50% GOOD.
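A rough sketch of that 50/50 resampling idea in plain pandas (synthetic data with the thread's 13% split; this upsamples the minority class with replacement, rather than using the MLTK resampling algorithm itself):

```python
# Synthetic data: 87 GOOD rows, 13 BAD rows, mirroring the 13% imbalance.
import pandas as pd

df = pd.DataFrame({
    "explained_variable": ["GOOD"] * 87 + ["BAD"] * 13,
    "x": range(100),
})

majority = df[df["explained_variable"] == "GOOD"]
minority = df[df["explained_variable"] == "BAD"]

# Draw from BAD with replacement until it matches GOOD in size,
# yielding a 50/50 training set.
upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, upsampled], ignore_index=True)

print(balanced["explained_variable"].value_counts().to_dict())
```

Upsampling duplicates rare events rather than discarding common ones, which matters when you have few BAD examples to begin with; the trade-off is a higher risk of overfitting to the repeated rows.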
Algorithm which can help:
Hope this helps.