Solved: Can I do kmeans on one column, grouped by another column?

MonkeyK · ‎11-11-2017

I am hoping to do kmeans analysis on my firewall traffic in a way that gives me 10 buckets for each destination port. Something like:

search network traffic | kmeans k=10 bytes_out by dest_port

I know that kmeans doesn't support "by", but is there a different way to do this?

niketn · ‎11-11-2017

How about the following search, which uses the map command to pass the k value to kmeans? If you have this on a dashboard, you can run a separate dummy search to get the multiplier and pass it to a single kmeans search instead of using the map command:

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

MonkeyK · ‎11-12-2017

Map is interesting, I have not used that command before. As long as I am using it, I can just feed it a list of dest_ports, no?

<search network traffic> | fields + dest_port | dedup dest_port | map maxsearches=25 search="<search network traffic>  dest_port=$dest_port$ | kmeans k=10 bytes_out "

Seems to work. I tried limiting to dest_port IN (80,443) and got different centroid_bytes_out values for CLUSTERNUM 1 on each port.
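A minimal sketch of that check, for comparing the centroids side by side. It assumes the events returned by each map sub-search still carry dest_port next to the CLUSTERNUM and centroid_bytes_out fields that kmeans adds. <search network traffic> is the same placeholder as above, not a real search string.

```spl
<search network traffic> dest_port IN (80,443)
| fields + dest_port
| dedup dest_port
| map maxsearches=25 search="<search network traffic> dest_port=$dest_port$ | kmeans k=10 bytes_out"
| stats count by dest_port, CLUSTERNUM, centroid_bytes_out
| sort dest_port, CLUSTERNUM
```

Each row should then show one cluster for one port, its centroid, and how many events fell into it.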

MonkeyK · ‎11-12-2017

Something in this answer confuses me. Am I limited in the number of values that I can give to map? If so, how much should I limit the input set by?

Also, if I augment the k value for kmeans by doing (10*number of ports * desired k), does that mean that kmeans will arrange ports as a decimal position? If not, then ports will be distributed across different buckets. I will do some studies on this. I think that I could do stats on the buckets to determine ports per CLUSTERNUM and CLUSTERNUM values per port.
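One possible shape for that stats check, as a sketch only. It assumes kmeans writes the cluster id to its default CLUSTERNUM field, and k=40 is purely illustrative (4 ports at 10 buckets each).

```spl
<search network traffic>
| kmeans k=40 bytes_out dest_port
| stats dc(dest_port) as ports_in_cluster, values(dest_port) as dest_ports, count by CLUSTERNUM
```

If ports_in_cluster is greater than 1 for any row, that cluster mixes traffic from several ports. Replacing the last line with `| stats dc(CLUSTERNUM) as clusters_for_port by dest_port` gives the inverse view, the number of clusters each port ended up in.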

I will also try out your dashboard after I work out a few kinks in the query that I am building.

Either way on both concerns, map does solve my problem. If you want to convert a comment to an answer, I will accept it.

niketn · ‎11-12-2017

@MonkeyK, the limit will be the number of subsearches your system allows. You can change that limit in your configuration file or from the UI settings, but increasing it puts more load on your Splunk instance. By default, if you do not specify a limit (maxsearches) in the map command, it takes up to 10 values (https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Map). You can change this value in the search, but if your subsearch limit is less than 10, you will see a warning when you run the search. PS: This description could be confusing, so I just want to reiterate that the subsearch limit and the limit in the map command are different.

Coming back to the use of the token: if you have 4 ports and you want 10 clusters of bytes_out for each port, then you will have 40 CLUSTERNUM values, and you should use both bytes_out and dest_port in your query.

| kmeans k=40 bytes_out dest_port

But do try out all scenarios.
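Outside a dashboard, the k value can be derived from the data rather than hard-coded. The following is a sketch under assumptions, not necessarily the search from the original answer: it counts distinct ports, multiplies by 10, and hands the result to kmeans through map's $field$ substitution. The field names ports and k are made up for illustration, and <search network traffic> is a placeholder.

```spl
<search network traffic>
| stats dc(dest_port) as ports
| eval k=10*ports
| map search="<search network traffic> | kmeans k=$k$ bytes_out dest_port"
```

Because the outer search returns a single row, map runs only one sub-search here, so the default maxsearches value is not a concern.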

MonkeyK · ‎11-25-2017

Since map does not like my queries, here is what I came up with:
https://answers.splunk.com/answers/594332/pattern-loopable-lookup-table-to-bypass-map-subsea.html?mi...

niketn · ‎11-12-2017

Yes, this should also work. However, map is a subsearch, so there is a limit on the maximum number of values you can pass from the main query to the subsearch. It will also affect query performance.

Hence I was retaining only the distinct count of all port numbers, to adjust the k value for kmeans, i.e. if there are 3 unique ports, the k value should be 30. If your existing query is working, then that is fine. However, if you will be moving the above code to a dashboard, you can also try the following approach (the following is a run-anywhere dashboard that gets the day of the month for the last 24 hours, i.e. it can have at most 2 values, hence the multiplier can be at most 20):

<dashboard>
  <label>kmeans</label>
  <search>
    <query>index=_internal sourcetype=splunkd log_level!="INFO"
| stats dc(date_mday) as multiplier
| eval multiplier=10*multiplier
    </query>
    <earliest>-24h@h</earliest>
    <latest>now</latest>
    <done>
      <set token="tokMultiplier">$result.multiplier$</set>
    </done>
  </search>
  <row depends="$tokMultiplier$">
    <panel>
      <table>
        <search>
          <query>index=_internal sourcetype=splunkd log_level!="INFO"
| kmeans k=$tokMultiplier$ date_hour date_mday</query>
          <earliest>-24h@h</earliest>
          <latest>now</latest>
          <sampleRatio>1</sampleRatio>
        </search>
        <option name="count">20</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
</dashboard>

PS: The run-anywhere dashboard above works on Splunk's _internal index; you can replace it with your own.

MonkeyK · ‎11-13-2017

This dashboard does not work for me. I think that it treats $result.multiplier$ as a literal.

I get:
Error in 'kmeans' command. The number of clusters ($result.multiplier$) is invalid

niketn · ‎11-12-2017

@MonkeyK, let me know if it worked for you. I can convert to answer so that you can Accept to mark the question as answered.

niketn · ‎11-11-2017

@MonkeyK how about

<YourBaseSearch>
| kmeans k=10 bytes_out dest_port

MonkeyK · ‎11-11-2017

Niketnilay, I tried that, but I only get 10 buckets total. I want to be able to work with 10 buckets per destination port.

So
port 80 (HTTP) would get 10 buckets
port 443 (HTTPS) would get 10 buckets
port 53 (DNS) would get 10 buckets

This is worth doing because some apps send less data than others, so what is anomalous for one app may be different from what is anomalous for another.

I could write a separate query for each app (destination port), but then I would have a hard time accounting for anomalous apps.
