Solved: How to search for fields that cross correlate with a specified field?

zeophlite · ‎03-03-2016

From my data below, I can see peaks in the CPU usage of a machine. I can add other fields to the graph, and visually compare the shapes to see when the two fields cross correlate, but how can I ask Splunk to look at the fields and tell me automatically what other fields cross correlate with my first field?

jeffland · ‎03-08-2016

The Splunk command correlate analyzes the co-occurrence of fields, which I think is not what you are after. You want to see which fields have values that correlate with the values of a given field (your CPU usage). AFAIK, there is no built-in command in Splunk that does this for you "automagically"; associate does something else as well. Apart from what you mentioned about visually comparing possible factors, I'd say you have a couple of options:

a) you could try the kmeans command. Without going into details (read up on the algorithm on Wikipedia if you like - it's really simple, yet powerful), this algorithm will try to cluster your data, and the resulting graph may be easier to interpret for a human than the manual timechart method.

See this run-anywhere example:

index=_internal group!=queue group!=pipeline | timechart usenull=f useother=f count by group | fields - _time | kmeans k=4 | fields - centroid_*

On my test environment (no real load), this produces the following chart of messages from all groups (except for queue and pipeline for demonstration purposes):

What I can immediately see from that is how per_source_thruput and per_sourcetype_thruput correlate, but others not so much. If I already know I'm interested in per_sourcetype_thruput, I can stick a sort in there as well like so:

index=_internal group!=queue group!=pipeline | timechart usenull=f useother=f count by group | fields - _time | kmeans k=4 | fields - centroid_* | sort CLUSTERNUM per_sourcetype_thruput

and the pattern may become even more visually apparent. I don't really know how this method scales to your thousand fields, but it might help you figure out where to look and where not to look (you might need to raise k=4 to a higher value).
If you're willing to go beyond Splunk's out-of-the-box functionality, you could try the other clustering algorithms that come with the (already mentioned) Splunk Machine Learning app. The procedure would still involve a human looking at graphs, however. I'm still listing the app here because I like it and I hope that when more people use it, it will become even better 🙂
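
To get a feel for what clustering the rows of such a timechart means, here is a minimal sketch of the textbook k-means procedure (Lloyd's algorithm) in plain Python. It is not Splunk's implementation of the kmeans command. The sample rows, the column meanings, and the choice to seed the centroids with the first k rows are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster points (lists of numbers) with Lloyd's algorithm.

    Returns one cluster index per point. The first k points seed the
    centroids, an arbitrary but deterministic choice for this sketch.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical time buckets; the three columns stand in for three fields.
# The first two columns rise and fall together, the third moves the other way.
rows = [
    [1, 1, 9],
    [2, 2, 8],
    [9, 9, 1],
    [10, 10, 2],
    [1, 2, 9],
    [9, 10, 1],
]
labels = kmeans(rows, k=2)
for row, label in zip(rows, labels):
    print(label, row)
```

Each cluster groups time buckets in which the fields take similar values, which is why sorting by cluster number can make fields that move together easier to spot.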

b) Speaking of going beyond Splunk's out-of-the-box functionality, the R Project app, which has already been mentioned as well, definitely has something to offer. With the following query, I was able to confirm the correlation above pretty easily (of course this is where knowledge of R and/or statistics comes in handy; this query is very basic):

index=_internal group!=queue group!=pipeline | timechart usenull=f useother=f count by group | fields per* | r "
library(reshape2)
output = melt(cor(input))
" | where Var1!=Var2 AND value >= 0 | sort - value

which produces the following nice table:

This should be what you are looking for.

c) I'm pretty sure there is also an option c): computing the correlation by hand, which I think is non-trivial if you want to do it right. I tried my hand at a simple construct below, which basically does three things. It first normalizes the values of all fields so that they lie between 0 and 1. It then subtracts each normalized field, row by row, from the one I know I am interested in (per_sourcetype_thruput in my case, which could be CPU in yours) and takes the absolute value of the difference. Finally, it sums these differences per field. That sum is a rough measure of how closely two fields track each other (not a true correlation coefficient): the smaller the sum, the more similar the two shapes. See for yourself:

index=_internal | timechart usenull=f useother=f count by group | eventstats max(*) as max_* min(*) as min_* | foreach * [eval normalized_<<MATCHSTR>>=(<<MATCHSTR>>-min_<<MATCHSTR>>)/(max_<<MATCHSTR>>-min_<<MATCHSTR>>)] | fields - min_* max_* | fields _time norm* | foreach normalized_* [eval diff_<<MATCHSTR>>_with_sourcetype_thruput=abs(normalized_per_sourcetype_thruput-normalized_<<MATCHSTR>>)] | stats sum(diff_*) as sum_diff_* | transpose

It's not exactly a pretty search, but it produces a nice result table:

I hope this gives you some ideas to work with. You can certainly take approach c) and develop something on your own, but depending on how far you take it, you might end up reinventing the wheel, which is why I would recommend the R approach. R offers many tools specifically suited to your purposes, but you'll have to invest some time in learning the syntax and how to use it with Splunk.
Feel free to come back with any questions!
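
For anyone who wants to check the arithmetic of approach c) outside of Splunk, here is a minimal sketch in plain Python. It assumes min-max normalization, (value - min) / (max - min), and the field names and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no shape to compare; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def summed_abs_difference(table, target):
    """Sum of |normalized target - normalized field| for every other field.

    A smaller sum means the field follows the target's shape more closely.
    """
    normalized = {name: min_max_normalize(vals) for name, vals in table.items()}
    reference = normalized[target]
    return {
        name: sum(abs(a - b) for a, b in zip(reference, vals))
        for name, vals in normalized.items()
        if name != target
    }

# Hypothetical sample data: one list of values per field, one entry per time bucket.
table = {
    "cpu":        [10, 20, 80, 30, 15],
    "scaled_cpu": [20, 40, 160, 60, 30],  # same shape as cpu, different scale
    "inverse":    [80, 70, 10, 60, 75],   # falls whenever cpu rises
}
scores = summed_abs_difference(table, "cpu")
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(name, round(score, 3))
```

Note that a field which moves exactly opposite to the target gets a large sum here, so this measure only finds fields with the same shape. A correlation coefficient would also flag the inverse relationship as a strong negative correlation.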

andreasz · ‎06-10-2017

All you need is the Machine Learning Toolkit and its pandas-based CorrelationMatrix algorithm:
http://docs.splunk.com/Documentation/MLApp/2.2.0/API/CorrelationMatrix

Simply copy and paste the code from that page and then run:

index=_internal sourcetype=splunkd group=* | timechart usenull=f useother=f count by group | fields - _time | fit CorrelationMatrix * | table index,*

Here is the description of the available methods:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html
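
To make concrete what a correlation matrix reports for one field of interest, here is a minimal sketch in plain Python (no pandas required) that computes the Pearson coefficient of a target field against every other field and ranks the results. The function names, field names, and sample values are hypothetical and only serve as an illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def rank_by_correlation(table, target):
    """Return (field, r) pairs for every other field, strongest |r| first."""
    scores = [(name, pearson(table[target], values))
              for name, values in table.items() if name != target]

    def sort_key(item):
        r = item[1]
        # Undefined correlations (NaN) go to the end of the ranking.
        return (math.isnan(r), 0.0 if math.isnan(r) else -abs(r))

    return sorted(scores, key=sort_key)

# Hypothetical sample data: one list of values per field, one entry per time bucket.
table = {
    "cpu":     [10, 20, 80, 30, 15],
    "disk_io": [12, 22, 75, 33, 14],  # rises and falls with cpu
    "mem":     [50, 50, 50, 50, 50],  # constant
    "net":     [90, 70, 20, 60, 85],  # moves opposite to cpu
}
ranked = rank_by_correlation(table, "cpu")
for name, r in ranked:
    print(name, r)
```

Values near 1 mean a field follows the target's shape, values near -1 mean it mirrors it, and values near 0 mean no linear relationship.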

javiergn · ‎03-03-2016

Hi, I'm not 100% sure what you are asking for, but if you want to correlate the occurrence of two fields you can use the correlate command:

http://docs.splunk.com/Documentation/Splunk/6.3.3/SearchReference/Correlate

I'm sure you can do something very similar with stats, but without knowing a bit more about what you are trying to achieve it's hard for me to make any other suggestion. If correlate doesn't work for you, please provide a tabular example of what you are trying to achieve, for example:

Input
-------
FieldA, FieldB
Foo1, Bar1
Foo2, Bar2

Output
-----------
FieldC
FooBar

zeophlite · ‎03-03-2016

Suppose the graph above is field001, and I want to see which of the fields field002 to field999 follow the same shape, even if not the same values.

javiergn · ‎03-04-2016

I see, in that case maybe the associate command can help:

http://docs.splunk.com/Documentation/Splunk/6.3.3/SearchReference/Associate

But ideally, pre-filter your fields before running associate if possible, as it is not a very fast command. You can try discretising the values of your signal to speed things up (see the bin command).

There are more advanced alternatives using stats, but as far as I know they would take time to develop.

You could also try to use the Splunk App for R: https://github.com/rfsp/r

Hope that helps.
