
Understanding KMeans Clustering

jpawloski
Path Finder

So I'm new to the Machine Learning Toolkit and I'm trying to model something that I thought would be somewhat straightforward, but I'm beginning to realize that I might need more of an understanding of what Splunk needs from me to create an accurate report.

What I want is a model that will report any event_id that shows an uncharacteristic volume of events. I started off by throwing as many stats commands as I could at the data, but I ended up with a model that may have been modeling something completely different. I simplified the search to the following:

base search earliest=-30d | bin _time span=1h as event_window | stats sum(count) avg(count) dc(hostname) by event_id event_window
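
For reference, here is roughly what I have been feeding into the toolkit on top of that search. The renamed fields, the k value, and the model name are just placeholders I picked, so don't read too much into them:

base search earliest=-30d
| bin _time span=1h as event_window
| stats sum(count) as total_events avg(count) as avg_count dc(hostname) as host_count by event_id event_window
| fit KMeans total_events avg_count host_count k=5 into event_volume_model

At the moment event_id only appears in the by clause, not in the fields handed to fit, which is exactly what my first question below is about.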

So at this point, I have several questions:

Do I include the event_id in the k-means algorithm? I've been told that the answer you hope to find should not be included in the clustering, but here I feel like it's necessary to describe the data. In my mind, we have to attribute the data point to an event_id or the frame of reference is lost. Am I correct?
The bin/timespan. Right now I'm grouping everything by an hour because I anticipate running a saved search over this data on an hourly basis. Do I have to have like-for-like values here, or can I potentially run the clustering with different bucket spans and still obtain accurate reports?
Is this even a decent model? Like I said previously, I went nuts with multiple stats commands, deltas, etc. but felt like I began modeling something else instead of the volume of event_ids. I have run this model and tested it with some data out in the wild, and I see outliers, but I don't understand why Splunk is reporting them as such. I'm not sure if it's the variance in the overall volume of event_ids, but these numbers are well within the limits I would have expected the clustering to set. Are there stats that I should be including to address this?

Again, I am a super newb at this. I've looked for a primer to ML in Splunk but I haven't found anything that goes into this level of explanation. Any assistance would be greatly appreciated.


Sukisen1981
Champion

@jpawloski - This is rather open ended, but I will try.
I've been told that the answer you hope to find should not be included in the clustering - Wrong, the answer you hope (or do not hope) to find has nothing to do with the ML algorithm that you use. For the sake of argument, let us assume you have 3 event_ids - 11, 22 & 33. Now, let us look at your use case:
'What I want is a model that will report any event_id that reports an uncharacteristic volume of events'
What does this mean? Are all the 'uncharacteristic events' the same / the same type? Say the events are of the same type, something like 'user logged in but got booted', and say event_id 11 has 56 events like this, event_id 22 has 75, and event_id 33 has 81. Now, what is an uncharacteristic volume of events? Is it something like the total count being over 70, in which case you expect event_ids 22 & 33 to fall in the same cluster or type?

If all you need is the count of the total (or some specific event types) associated with the respective event_ids, you should go for outlier detection in the MLTK app.
If your use case is something wherein you want to cluster based on the event types AND want to predict which event_id has a certain type of event, then you might want to cluster based on the event type while excluding the event_id. This assumes, a priori, that your event_ids are a set of repeating identifiers. If your event_id is randomly generated, you cannot cluster on the event types and event_ids together. To give you an example, say I have a dump of 20k+ incidents with their ticket#, description and resolution. Clustering by including the ticket# means nothing here, because it is NOT a repeating set of numbers. So what can I cluster? I can cluster my incident description. How does this help? It helps because when a new incident comes in, based on the entered description I can check the incident resolution.
For example, say the description 'app AAA is down' has been reported 20 times over my set of 20k historical records. When I cluster the incident descriptions, incidents with this description land in a single cluster. I can then pass the newly reported incident description (through a text input in a dashboard) and check whether it matches the description in any of the clusters I have already created. If I find a hit, all I need to do is display the resolution fields from the matching cluster/clusters. You need to check this out - https://docs.splunk.com/Documentation/Splunk/7.3.1/SearchReference/Cluster
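As a rough sketch of what I mean (the index and the description/resolution fields are just placeholders from my example, adjust them to your data), the search could look something like this:

index=ticket_data
| cluster field=description t=0.8 match=termlist labelonly=true labelfield=cluster_label
| stats count as cluster_size values(resolution) as resolutions by cluster_label
| sort - cluster_size

With labelonly=true every event is kept and tagged with its cluster number, so the stats afterwards shows how big each description cluster is and which resolutions it carries.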
The cluster command is very powerful and supports different text-matching modes (termlist, termset, ngramset) with a configurable similarity threshold. I do not fully understand your need, or why you are trying to use k-means clustering.
I suspect you need to give some data samples and some measure of your expected output. It could be that you do need k-means clustering, but we need to see some data samples first.


jpawloski
Path Finder

Sorry for the delay in my reply. An uncharacteristic volume of events would be based on the history of that specific event ID. So if event ID=1 has around 3 events every hour and then suddenly shoots up to 500, I want to see that. This would be unique to that specific event ID. In terms of an event type, these events are unique; they won't have anything in common with each other. These event IDs are repeating identifiers, not randomly generated, and I also have something like a description field that describes the event in detail.
I tried creating a numeric outlier experiment in the MLTK, but it doesn't appear you can save the experiment as a model and reference it within other searches as you can with clustering. Could I potentially incorporate the description into the model and get more accurate results?


Sukisen1981
Champion

You probably do not need the MLTK app for your needs. If you look at the numeric outlier examples:
| inputlookup hostperf.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S.%3Q%z")
| timechart span=10m max(rtmax) as responsetime
| head 1000
| streamstats window=200 current=true median("responsetime") as median
| eval absDev=(abs('responsetime'-median))
| streamstats window=200 current=true median(absDev) as medianAbsDev
| eval lowerBound=(median-medianAbsDev*exact(20)), upperBound=(median+medianAbsDev*exact(20))
| eval isOutlier=if('responsetime' < lowerBound OR 'responsetime' > upperBound, 1, 0)
All of these commands are available in normal search; the point is to establish your own thresholds. Instead of using exact(20) in the lower and upper bounds, you can say your thresholds are exact(10) instead.
You can also cluster those events which fall in the outlier range without the MLTK app. For example, say you are running your search for the last 6 hours split into intervals of 1 hour each. For the hour where the event count shoots up to 500, this will be an outlier category (identifiable by the isOutlier field), and you can then simply run the cluster command on the _raw events falling under isOutlier=1. A rough sketch of how the same idea applies to your per-event_id counts follows below.
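Here is that adaptation as a sketch, under a few assumptions: I am counting raw events per event_id per hour (if you already have a count field, use sum(count) instead), I picked the 7 day lookback and the multiplier of 6 arbitrarily, and the by clause on streamstats keeps a separate rolling baseline for each event_id:

base search earliest=-7d
| bin _time span=1h
| stats count as event_count by event_id _time
| streamstats window=24 current=false median(event_count) as median by event_id
| eval absDev=abs(event_count-median)
| streamstats window=24 current=false median(absDev) as medianAbsDev by event_id
| eval upperBound=median+(medianAbsDev*exact(6))
| eval isOutlier=if(event_count > upperBound, 1, 0)
| where isOutlier=1

Because the stats command collapses the raw events, you would then take the event_id / hour pairs this flags back into a raw-event search for those windows and run the cluster command over them, rather than chaining it onto this pipeline.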
Here is a blog - https://www.splunk.com/blog/2014/07/28/splunk-command-cluster.html
and here is the official doc - https://docs.splunk.com/Documentation/Splunk/7.3.1/SearchReference/Cluster
You can then save these queries just like ordinary searches or dashboard panels. Since you mentioned wanting to run this hourly, a sketch of a scheduled saved search follows below.
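A minimal savedsearches.conf stanza for an hourly schedule might look like this - the stanza name is just a placeholder, and the search line is whatever outlier search you settle on:

[event_volume_outliers_hourly]
search = <your outlier search here>
dispatch.earliest_time = -7d
dispatch.latest_time = now
cron_schedule = 5 * * * *
enableSched = 1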
