
Using the Splunk Machine Learning Toolkit, how do you predict rare events?

New Member

Hello,

I use the Splunk Machine Learning Toolkit and would like to predict a rare event. The target variable has two values: "GOOD" and "BAD". "BAD" represents only 13% of the data.

I use RandomForestClassifier for the prediction, but it has serious difficulty predicting "BAD". The confusion matrix is:

Actual | Predicted GOOD | Predicted BAD |
BAD    | 11.9%          | 88.1%         |
GOOD   | 19.4%          | 80.6%         |

Of course, this model has great overall results, with a precision of 0.87 and an F1 of 0.85, because the result is GOOD most of the time, but it doesn't work for "BAD".

How can I improve my model? Is it possible to use class_weight or something similar?

Thank you in advance for your answer.


Splunk Employee

You can use this as a reference for adding class weight to your algorithm, or you can use the GitHub algorithm directly for your case: https://github.com/splunk/mltk-algo-contrib/blob/master/src/bin/algos_contrib/CustomDecisionTreeClas...


New Member

Thank you for your answer. I tried to use the algorithm after registering it, creating the Python script file, and adding the GitHub algorithm.
But when I run the following search:
... | fit CustomDecisionTreeClassifier splitter=best criterion=gini class_weight="{'GOOD':7,'BAD':1}" "explained_variable" from "explanatory_variable_1" "explanatory_variable_2" "explanatory_variable_3" .... into "test" as prediction
I get an error: Error in 'fit' command: Error while saving model "test": Not JSON serializable: algos.CustomDecisionTreeClassifier.CustomDecisionTreeClassifier

I think it's because of the SimpleObjectCodec, but I don't really know how to fix it.


Splunk Employee

No, it's not a SimpleObjectCodec problem. I know the reason behind it. Let me explain.

  • When you use the GitHub algorithms, you can install them as an app. Instructions are given in the README file.
  • If you want to use an algorithm inside MLTK by copying it into the Toolkit, do the following:
  • Copy the algorithm file to $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/bin/algos
  • Open the file and change line 64 as shown below

Change : codecs_manager.add_codec('algos_contrib.CustomDecisionTreeClassifier', 'CustomDecisionTreeClassifier', SimpleObjectCodec)
To : codecs_manager.add_codec('algos.CustomDecisionTreeClassifier', 'CustomDecisionTreeClassifier', SimpleObjectCodec)

i.e., you are replacing "algos_contrib" with "algos".

  • Make sure to register your algorithm in algos.conf
  • Restart Splunk and it will work for you 🙂
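
A toy model may help show why the module path passed to add_codec matters. The registry class below is hypothetical and is not MLTK's implementation; it only assumes that a codec is looked up by the module and class name the saved object reports, which is consistent with the error message quoted earlier in the thread.

```python
# Toy model of a codec registry keyed by (module path, class name).
# This is NOT the MLTK implementation; every name here is hypothetical.

class ToyCodecRegistry:
    def __init__(self):
        self._codecs = {}

    def add_codec(self, module_name, class_name, codec):
        self._codecs[(module_name, class_name)] = codec

    def get_codec(self, obj):
        # The lookup key comes from the object itself, not from the file location.
        key = (type(obj).__module__, type(obj).__name__)
        if key not in self._codecs:
            raise TypeError("Not JSON serializable: %s.%s" % key)
        return self._codecs[key]


def make_algo_class(module_name):
    """Build a stand-in class that reports `module_name` as its module."""
    return type("CustomDecisionTreeClassifier", (object,), {"__module__": module_name})


registry = ToyCodecRegistry()

# Codec registered under the contrib package path, as in the GitHub source.
registry.add_codec("algos_contrib.CustomDecisionTreeClassifier",
                   "CustomDecisionTreeClassifier", "SimpleObjectCodec")

# The file was copied into bin/algos, so the class reports a different module.
model = make_algo_class("algos.CustomDecisionTreeClassifier")()

try:
    registry.get_codec(model)
    lookup_failed = False
    error_message = ""
except TypeError as err:
    lookup_failed = True
    error_message = str(err)

# Registering under the path the class actually reports fixes the lookup.
registry.add_codec("algos.CustomDecisionTreeClassifier",
                   "CustomDecisionTreeClassifier", "SimpleObjectCodec")
codec_after_fix = registry.get_codec(model)
```

In this sketch the first lookup fails with the same kind of message as the 'fit' error, and it succeeds once the codec is registered under "algos" instead of "algos_contrib".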

Here is the syntax for class_weight:
| fit DecisionTreeClassifier class_weight="{'Yes':1,'No':0.1}"
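
If you are unsure which numbers to put in class_weight, a common starting point is the "balanced" heuristic used by scikit-learn: n_samples / (n_classes * class_count). With it, the rare class gets the larger weight. The sketch below uses made-up labels with the 87% / 13% split from the question; the helper name is hypothetical, and whether your copy of the algorithm accepts these exact values is an assumption you should verify.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical data with the 87% GOOD / 13% BAD split described in the question.
labels = ["GOOD"] * 87 + ["BAD"] * 13
weights = balanced_class_weights(labels)

# Render the weights in the quoted-dictionary form used by the fit examples.
class_weight_arg = "{" + ",".join(
    "'%s':%.2f" % (label, weight) for label, weight in sorted(weights.items())
) + "}"
print(class_weight_arg)
```

The resulting string can be pasted into class_weight="..." as a first guess and tuned from there.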


Splunk Employee

Hi,
Class weight is something you can access by using the ML-SPL API and exposing that parameter in the code: https://docs.splunk.com/Documentation/MLApp/4.0.0/API/Introduction
Alternatively, you can change the events in your search by manually sampling each class with SPL to balance the classes, and then run the | fit command on that balanced data.

New Member

Thank you for your answer. But how can I sample by class manually with SPL?


Splunk Employee

There are many ways, depending on the type of sampling you wish to use:
https://docs.splunk.com/Documentation/MLApp/4.0.0/User/Customsearchcommands#sample

.. | search fieldforclass="class_label_A" | sample partitions=100 seed=1001 | where partition_number<=70 | outputlookup class_label_A.csv

.. | search fieldforclass="class_label_B" | sample partitions=100 seed=1001 | where partition_number<=70 | outputlookup class_label_B.csv

Combine the two like so:
| inputlookup class_label_A.csv | append [| inputlookup class_label_B.csv]

Note that there are far more performant options, such as using summary indexes or even the proportional option of the sample command itself.
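
Note that the two searches above keep the same share of each class, so to actually balance the classes the majority class needs a smaller share than the minority class. The Python sketch below only illustrates that idea (random downsampling to the size of the rarest class); the function, the field name "status", and the event counts are made up for the example and are not part of MLTK.

```python
import random


def downsample_to_minority(events, label_field, seed=1001):
    """Randomly keep as many events from each class as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for event in events:
        by_class.setdefault(event[label_field], []).append(event)

    # Size of the rarest class is the target size for every class.
    target = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced


# Hypothetical events: 870 GOOD and 130 BAD, matching the 13% rate in the question.
events = [{"id": i, "status": "GOOD"} for i in range(870)]
events += [{"id": i, "status": "BAD"} for i in range(870, 1000)]

balanced = downsample_to_minority(events, "status")
good_count = sum(1 for e in balanced if e["status"] == "GOOD")
bad_count = sum(1 for e in balanced if e["status"] == "BAD")
print(good_count, bad_count)
```

In SPL terms this corresponds to using a much lower partition_number threshold for the majority class than for the minority class.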


Splunk Employee

From your comment, your prediction accuracy for BAD is low, but in the table you shared, 88.1% of actual BAD values are predicted as BAD, which seems great, while only 19.4% of actual GOOD values are predicted as GOOD, which is poor.
It seems the table is inverted.
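
A quick arithmetic check supports this reading. The sketch below assumes each table row is normalized over the actual class and that 13% of events are BAD, as stated in the question; the function name is only for illustration.

```python
def overall_accuracy(recall_by_class, prior_by_class):
    """Overall accuracy = sum over classes of P(class) * recall(class)."""
    return sum(prior_by_class[c] * recall_by_class[c] for c in prior_by_class)


# Class shares taken from the question: BAD is 13% of the data.
prior = {"BAD": 0.13, "GOOD": 0.87}

# Reading the table as printed: correct cells are BAD->BAD and GOOD->GOOD.
as_printed = overall_accuracy({"BAD": 0.881, "GOOD": 0.194}, prior)

# Reading it with the two prediction columns swapped.
columns_swapped = overall_accuracy({"BAD": 0.119, "GOOD": 0.806}, prior)

print(round(as_printed, 3), round(columns_swapped, 3))
```

Read as printed, the table implies an overall accuracy of roughly 28%, which is hard to reconcile with the reported precision of 0.87 and F1 of 0.85. With the columns swapped it is roughly 72%, with only about 12% of BAD events caught, which matches the problem described in the question.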

Coming back to your problem of improving accuracy when predicting BAD, there are three options:

1) Get more data for BAD cases; this would help the model learn those cases better. It is also possible that the fields used for prediction are only weakly related to the target variable, so including new variables could also help.
2) Try different algorithms (although RandomForestClassifier is a good one).
3) Use the ML-SPL API (https://docs.splunk.com/Documentation/MLApp/4.0.0/API/Introduction) to bring a resampling algorithm into MLTK, and use it to resample your data so that BAD and GOOD each make up 50%.
Algorithm which can help:
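
One way to get the 50/50 split described in option 3 is random oversampling, where minority-class events are duplicated at random until both classes have the same size. The sketch below is an illustrative stand-in only, not the specific algorithm this reply refers to; the function, the field name "status", and the counts are hypothetical.

```python
import random


def oversample_to_majority(events, label_field, seed=1001):
    """Duplicate random events of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for event in events:
        by_class.setdefault(event[label_field], []).append(event)

    # Size of the largest class is the target size for every class.
    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)  # keep every original event
        extra = target - len(group)
        balanced.extend(rng.choice(group) for _ in range(extra))  # sample with replacement
    rng.shuffle(balanced)
    return balanced


# Hypothetical events: 870 GOOD and 130 BAD, matching the 13% rate in the question.
events = [{"id": i, "status": "GOOD"} for i in range(870)]
events += [{"id": i, "status": "BAD"} for i in range(870, 1000)]

resampled = oversample_to_majority(events, "status")
good_total = sum(1 for e in resampled if e["status"] == "GOOD")
bad_total = sum(1 for e in resampled if e["status"] == "BAD")
print(good_total, bad_total)
```

Resample only the training data; the test set should keep the original class ratio so that the evaluation reflects real conditions.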

Hope this helps.
