How do I rank nodality from kmeans data?

MonkeyK · 11-14-2017

I have been trying to do kmeans analysis of some data. I see some of my evaluation points falling into lots of clusters, but with heavy weighting towards 1-2 clusters. Is there a way to call this out?

[my search]
| kmeans k=100 col1 col2 col3
| eventstats count as clusterConnectionEvents by CLUSTERNUM
| eval culusterConectionEvents=CLUSTERNUM."(".clusterConnectionEvents.")"
| stats dc(CLUSTERNUM) as clusterCount values(culusterConectionEvents) values(connectionCount) count by

gives me a first item with

clusterCount=26

values(clusterConnectionEvents)
1(7)
10(1)
100(14)
12(100)
14(19)
16(9)
2(50247)
20(2)
37(203)
39(122)
4(472)
40(75)
48(17)
5(2)
50(16)
52(8)
53(39)
59(8)
73(3)
74(20)
75(142)
80(2)
81(13)
83(4)
84(58)
87(96)

This clearly has a huge node at cluster 2,

and another

clusterCount=12

values(clusterConnectionEvents)
1(4)
14(2)
16(5)
2(59)
32(3)
4(11)
48(2)
59(148)
75(170)
84(2)
87(69)
89(5)

which clearly has nodes at clusters 59 and 75 (and maybe 2 and 87 as well).

For other items, the nodes are less pronounced. These are less interesting to me.

Is there a way to score such data so that items with the vast majority of their values falling into 1-2 buckets come to the top of a list?

DalJeanis · 11-14-2017

Hmmm. I'm not sure I agree with your distinction between interesting and non-interesting with regard to clusters. Until you know the characteristics of a cluster, you don't know why the system decided it WAS a cluster. But we can agree that identifying the largest clusters is initially more critical, since they hold the bulk of your data.

It is easy enough to do something like this...

[my search] 
| kmeans k=100 col1 col2 col3
| eval rectype="detail" 
| appendpipe 
    [| stats count as CLUSTERCOUNT by CLUSTERNUM | sort 0 - CLUSTERCOUNT + CLUSTERNUM | eval rectype="ClusterSummary"]

This gives you a set of data at the end that summarizes your clusters.

Or you could do this to get rid of all events that are not in your biggest 3 clusters...

[my search] 
| kmeans k=100 col1 col2 col3
| eval rectype="detail" 
| appendpipe 
    [| stats count as CLUSTERCOUNT by CLUSTERNUM | sort 3 - CLUSTERCOUNT + CLUSTERNUM | eval rectype="ClusterSummary"]
| eventstats max(CLUSTERCOUNT) as keepme by CLUSTERNUM
| where isnotnull(keepme) AND rectype="detail"
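
If the goal is to rank items rather than filter clusters, one option is a concentration score per item: the share of the item's events that land in its largest cluster, plus the sum of the squared cluster shares (a Herfindahl-style index, which is a suggestion here and not part of the original query). This is an untested sketch. It assumes a field named item stands in for whatever field the original stats was grouped by, and clusterEvents, itemEvents, share, shareSq, topShare and concentration are made-up names.

```spl
[my search]
| kmeans k=100 col1 col2 col3
| stats count as clusterEvents by item CLUSTERNUM
| eventstats sum(clusterEvents) as itemEvents by item
| eval share=clusterEvents/itemEvents
| eval shareSq=share*share
| stats dc(CLUSTERNUM) as clusterCount max(share) as topShare sum(shareSq) as concentration by item
| sort 0 - concentration
```

concentration is 1 when every event for an item sits in one cluster, 0.5 for an even split across two clusters, and 1/clusterCount for a perfectly even spread. Taking the counts posted in the question, and assuming they are per-item counts, the first item would come out at roughly 0.97 for topShare and 0.94 for concentration, and the second at roughly 0.35 and 0.26, so the first item sorts to the top.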

MonkeyK · 11-14-2017

Sorry Dal, I left out the meaning of the query to keep my question from getting too complex. Generally the query looks for malware beacons by looking for traffic that is similar in period, size, and duration. The clusters are clustering on those values.

I tightened up my ability to see the nodes by throwing away all clusters that have fewer than 1% of the total clustered events.
In the case of my first example, this left just the one cluster, which is what I wanted to see. So maybe I could just play with the percentage threshold for what I throw away.
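
For anyone wanting to try the same kind of cutoff, here is a minimal, untested sketch. It assumes the 1% is measured against the count of all clustered events, which may not be exactly how the filter above was written. totalEvents and clusterEvents are made-up field names, and 0.01 is the threshold to adjust.

```spl
[my search]
| kmeans k=100 col1 col2 col3
| eventstats count as totalEvents
| eventstats count as clusterEvents by CLUSTERNUM
| where clusterEvents >= 0.01 * totalEvents
```

To apply the cutoff per item instead, both eventstats counts would need the item field added to their by clause.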
