About yangzd

yangzd · ‎08-01-2018

Great point. Definitely other algorithms should support the probability prediction, as long as it's available in scikit-learn. Please stay tuned for the next MLTK release(s)!

yangzd · ‎07-30-2018

That's great!

yangzd · ‎07-30-2018

The generated fields will be something like predicted(target_field) for the class prediction, and bunch of probability(target_field=class_a) , probability(target_field=class_b) ... for the probabilities of predicting the fields as class_a , class_b , etc. Can you double check if you are using LogisticRegression properly and at least seeing the predicted(target_field) ?

yangzd · ‎12-20-2017

Hi jsinnott_, that sounds great. Yes please do let us know the comparison results. Look forward to it. -zd

yangzd · ‎12-18-2017

Hi, Thank you for asking, this is an incredibly valuable question! You have a very good understanding of dummy variables. First, about the bias in the model. Let's assume you have dummy variables x1 , x2 , x3 , such that x1 + x2 + x3 = 1 , With m-1 dummy variables, your linear model can be expressed as y = α0 + α1 * x1 + α2 * x2 With m dummy variables, your linear model is now: y = β0 + β1 * x1 + β2 * x2 + β3 * x3 Since x3 = 1 − x1 − x2 , you get y = β0 + β1 * x1 + β2 * x2 + β3 * (1 − x1 − x2) = (β0 + β3) + (β1 − β3) * x1 + (β2 − β3) * x2 Essentially you have α0 = β0 + β3 , α1 = β1 − β3 , α2 = β2 − β3 So, these two models are equivalent, and there is no bias introduced as you see in this exercise. Now, the question is, what's introduced here? Collinearity is what you are after, since you can always tell the value of the left out dummy variable if you know m-1 of them. Collinearity can cause computational problems for linear regression since the matrix inversion can not be performed. But for logistic regression, depending on the computational scheme under the hood, e.g. gradient descent, numerical instability may not be an issue. Moreover, the LogisticRegression model in sklearn uses a regularization, penalty='l2' and C=1.0 , which means feature collinearity will be penalized. Therefore, using the full m dummy variables instead of m-1 does not introduce bias to the model, except for potential numerical instability. In practice, to avoid the potential numerical instability issue, if you decide to go with m-1 dummy variables, you may have the following options: 1) With latest version of MLTK (you are right it uses pandas 0.17), you can modify the prepare_features_and_target method in df_util.py , instead of doing X = pd.get_dummies(X, prefix_sep='=', sparse=True) you can use the following code to drop the first column of the created dummy variables for each categorical variable: columns_to_encode = X.select_dtypes(include=['object', 'category']).columns for col in columns_to_encode: X = X.join(pd.get_dummies(X.pop(col), prefix=col, prefix_sep='=').iloc[:, 1:]) 2) As you already mentioned in your post, drop_first=True is supported in pandas 0.18+, you could use this when a future version of Python for Scientific Computing is released. On the other hand, if you want to reduce the effect of collinearity in your model, you can also use some preprocessing methods, e.g. Field Selector to select features, or PCA to remove collinearity. You can also use algorithms like Random Forest that are least affected by feature multicollinearity. Hope it helps clarify some of the issues. zd

yangzd · ‎10-25-2017

The output number of a fit command can be set by maxresultrows in limits.conf and max_inputs in mlspl.conf, which might not be the case if you did not change the default setting. But you can double check that. Also, you did check every single event for null value, right? 5209 events seem a lot to check visually, you may append "| stats count by countNull" to your above search.

yangzd · ‎10-20-2017

Hey _jgpm_ The parameters you are trying to set are computing resource related, and supposed to be configured in mlspl.conf, as documented here: http://docs.splunk.com/Documentation/MLApp/3.0.0/User/Configurefitandapply In your real problem, if you see number of results are smaller than the original inputs, it's likely that a lot of events are dropped because of missing value - they are not used in model fitting. Can you check if that's the case, and post here if it's not? Hope it helps~

yangzd · ‎03-01-2017

Hi, As of now, MLTK does not support importing a trained model from outside of Splunk. There are several ways that you could try: Use random forest algorithm in MLTK to train a native MLTK model, you can just change the max_memory_usage_mb stanza in the mlspl.conf file to allow higher memory usage. Write a custom algorithm that reads your trained model and translate the parameters and pass into sklearn models, then run on your data. It may not be trivial in your use case due to the complexity of decision tree. Write a custom search command that sends Splunk data to your environment where you trained your model, make predictions and send back to Splunk. This requires good understanding of Splunk custom search command and extra integration work.

yangzd · ‎01-16-2017

Predicted probabilities are supported in LogisticRegression. You can simply include probabilities=true in your fit query and it will show the predicted probabilities alongside the predicted classes. Please refer to the MLTK documentation here: http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#LogisticRegression

yangzd · ‎12-20-2016

Hi Blake, This is an interesting algorithm! I am proposing two solutions for you: MLTK/SPL only: Single-linkage clustering is not yet a supported algorithm in the current release of MLTK. But MLTK offers some other clustering algorithms such as kmeans, spectral clustering, DBSCAN and Birch. So if you don't mind trying one of the supported clustering algorithms, then here is one possible solution: (1) transpose your raw data to flip events/fields, (2) calculate differences per pair of columns and generate the M * N^2/2 matrix (you may need SPL commands such as map, join, and/or foreach), (3) perform binary classification via fit command, (4) since step (1)-(3) can be done using SPL and ML-SPL, you can copy/paste the SPL into the search bar in the Clustering dashboard of MLTK and try out different supported clustering methods. Via ML-SPL API: If you already have your custom script ready to perform the transformation and clustering as you described, you can wire it up with ML-SPL and it could be more convenient than the first method. A reference script will be SpectralClustering.py and DBSCAN.py in the Splunk_ML_Toolkit/bin/algos directory, where you can follow the way fit_predict is implemented and replace it with your own script. Hope it's useful.

yangzd · ‎10-11-2016

Awesome. You are very welcome!

yangzd · ‎10-11-2016

The error message indicates the Python for Scientific Computing Add-on is not installed properly. Can you double check the app installation, specifically: It is the right PSC bits on your platform, that is Linux-64bit in your environment. PSC is installed to $SPLUNK_HOME/etc/apps/, and has unchanged name, that is "Splunk_SA_Scientific_Python_linux_x86_64" in your environment. You may need to reinstall PSC and restart Splunk.

Posts	12
Solutions	4
Karma Given	3
Karma Received	16
Member Since	‎10-11-2016

Online Status	Offline
Date Last Visited	‎06-05-2020 02:04 AM

Re: How to display scored probabilities from Machi...

Re: How to display scored probabilities from Machi...

Re: How to display scored probabilities from Machi...

Re: Splunk Machine Learning Toolkit: Does ML Toolk...

Re: Splunk Machine Learning Toolkit: Does ML Toolk...

Re: MLTK v2.4 returning max 1000 results and max_m...

Re: MLTK v2.4 returning max 1000 results and max_m...

Re: Machine Learning Toolkit: Is there a way to im...

Re: How to display scored probabilities from Machi...

Re: Machine Learning Toolkit: Event Field Differen...

Re: Why is the fit command resulting in error "Fai...

Re: Why is the fit command resulting in error "Fai...