Splunk Search

how do I rank nodality from kmeans data

MonkeyK
Builder

I have been trying to do kmeans analysis of some data. I see some of my evaluation points falling into lots of clusters, but with heavy weighting towards 1-2 clusters. Is there a way to call this out?

[my search]
| kmeans k=100 col1 col2 col3
| eventstats count as clusterConnectionEvents by CLUSTERNUM
| eval clusterConnectionEvents=CLUSTERNUM."(".clusterConnectionEvents.")"
| stats dc(CLUSTERNUM) as clusterCount values(clusterConnectionEvents) values(connectionCount) count by

gives me a first item with

clusterCount=26

values(clusterConnectionEvents)
1(7)
10(1)
100(14)
12(100)
14(19)
16(9)
2(50247)
20(2)
37(203)
39(122)
4(472)
40(75)
48(17)
5(2)
50(16)
52(8)
53(39)
59(8)
73(3)
74(20)
75(142)
80(2)
81(13)
83(4)
84(58)
87(96)

This clearly has a huge node at cluster 2

and another

clusterCount=12

values(clusterConnectionEvents)
1(4)
14(2)
16(5)
2(59)
32(3)
4(11)
48(2)
59(148)
75(170)
84(2)
87(69)
89(5)

which clearly has nodes at clusters 59 and 75 (and maybe 2 and 87 as well).

For other items, the nodes are less pronounced. These are less interesting to me.

Is there a way to score such data so that items with the vast majority of their values falling into 1-2 buckets come to the top of a list?


DalJeanis
SplunkTrust

Hmmm. I'm not sure I agree with your distinction between interesting and non-interesting with regard to clusters. Until you know the characteristics of a cluster, you don't know why the system decided it WAS a cluster. But we can agree that identifying the largest clusters is initially more critical, since they hold the bulk of your data.

It is easy enough to do something like this...

[my search] 
| kmeans k=100 col1 col2 col3
| eval rectype="detail" 
| appendpipe 
    [| stats count as CLUSTERCOUNT by CLUSTERNUM | sort 0 - CLUSTERCOUNT + CLUSTERNUM | eval rectype="ClusterSummary"]

This gives you a set of data at the end that summarizes your clusters.

Or you could do this to get rid of all events that are not in your biggest 3 clusters...

[my search] 
| kmeans k=100 col1 col2 col3
| eval rectype="detail" 
| appendpipe 
    [| stats count as CLUSTERCOUNT by CLUSTERNUM | sort 3 - CLUSTERCOUNT + CLUSTERNUM | eval rectype="ClusterSummary"]
| eventstats max(CLUSTERCOUNT) as keepme by CLUSTERNUM
| where isnotnull(keepme) AND rectype="detail"
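
For the scoring part of the question, here is a minimal sketch of one possible approach, written in Python rather than SPL so the arithmetic is easy to check. It scores each item by the fraction of its events that land in its two largest clusters. The function names (`top_n_share`, `rank_items`) and the item labels are hypothetical, and the counts are copied from the two examples in the question on the assumption that they are per-item counts.

```python
from typing import Dict, List, Tuple


def top_n_share(cluster_counts: Dict[int, int], n: int = 2) -> float:
    """Fraction of an item's events that fall into its n largest clusters."""
    total = sum(cluster_counts.values())
    if total == 0:
        return 0.0
    largest = sorted(cluster_counts.values(), reverse=True)[:n]
    return sum(largest) / total


def rank_items(items: Dict[str, Dict[int, int]], n: int = 2) -> List[Tuple[str, float]]:
    """Sort items so the most concentrated ones come first."""
    scored = [(name, top_n_share(counts, n)) for name, counts in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Cluster number -> event count, copied from the two examples in the question.
# The labels "first_item" and "second_item" are placeholders.
items = {
    "second_item": {1: 4, 14: 2, 16: 5, 2: 59, 32: 3, 4: 11, 48: 2,
                    59: 148, 75: 170, 84: 2, 87: 69, 89: 5},
    "first_item": {1: 7, 10: 1, 100: 14, 12: 100, 14: 19, 16: 9, 2: 50247,
                   20: 2, 37: 203, 39: 122, 4: 472, 40: 75, 48: 17, 5: 2,
                   50: 16, 52: 8, 53: 39, 59: 8, 73: 3, 74: 20, 75: 142,
                   80: 2, 81: 13, 83: 4, 84: 58, 87: 96},
}

ranking = rank_items(items)
for name, score in ranking:
    print(f"{name}: top-2 share = {score:.4f}")
```

On these two examples the first item scores about 0.98 and the second 0.6625, so the first sorts to the top. Items whose events are spread evenly over many clusters would score much lower.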

MonkeyK
Builder

Sorry Dal, I left out the meaning of the query to keep my question from getting too complex. Generally the query looks for malware beacons by looking for traffic that is similar in period, size, and duration. The clusters are clustering on those values.

I tightened up my ability to see the nodes by throwing away all clusters that have fewer than 1% of the total clustered events.
In the case of my first example, this left just the one cluster, which is what I wanted to see. So maybe I could just play with the percentage that I throw away.
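
The cutoff described above can be sketched as follows, in Python rather than SPL so the numbers are easy to verify. This is an illustration only: it assumes the 1% is taken against each item's own event total, the function name `significant_clusters` is hypothetical, and the counts are copied from the two examples in the question.

```python
from typing import Dict


def significant_clusters(cluster_counts: Dict[int, int],
                         min_share: float = 0.01) -> Dict[int, int]:
    """Keep only clusters holding at least min_share of the item's events."""
    total = sum(cluster_counts.values())
    if total == 0:
        return {}
    return {cluster: count
            for cluster, count in cluster_counts.items()
            if count / total >= min_share}


# Cluster number -> event count, copied from the question's two examples.
first_item = {1: 7, 10: 1, 100: 14, 12: 100, 14: 19, 16: 9, 2: 50247,
              20: 2, 37: 203, 39: 122, 4: 472, 40: 75, 48: 17, 5: 2,
              50: 16, 52: 8, 53: 39, 59: 8, 73: 3, 74: 20, 75: 142,
              80: 2, 81: 13, 83: 4, 84: 58, 87: 96}
second_item = {1: 4, 14: 2, 16: 5, 2: 59, 32: 3, 4: 11, 48: 2,
               59: 148, 75: 170, 84: 2, 87: 69, 89: 5}

# With a 1% cutoff, only cluster 2 survives for the first item.
print(sorted(significant_clusters(first_item, 0.01)))

# Raising the cutoff to 10% leaves clusters 2, 59, 75 and 87 for the second.
print(sorted(significant_clusters(second_item, 0.10)))
```

Under that per-item assumption, the 1% cutoff reproduces the single surviving cluster for the first example. A 10% cutoff on the second example leaves the four clusters called out in the question, which suggests the threshold is worth tuning per data set.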
