I'm indexing log events en masse... and I know that I have events that always occur together and within the same time window BUT I don't know what the common field / value is to use with the transaction cmd etc. Any suggestions on how I could find (group) these "unknowingly related" events together?
Yeah I kinda feel I do know my data - it's a common JSON schema and over the past few months we've gleaned tons of valuable stuff by putting it into Splunk. This latest thing - finding groups of events that always co-occur without knowing what to group on - is tricky (250k events a day). It's almost like I want to group events based on timespans (like 15mins) and then look for repeats of those groups over time.
OR maybe it's an ML Toolkit job. Hmm
I'll poke around with your suggestions 🙂
I think hunters' answer is more likely what you need, but just in case a simple answer is needed:
If your problem is really just about finding fields in common, one thing I found useful is to craft your base search so that it returns the two types of data. Like
index=X OR index=B or
sourcetype=A OR (index=Y AND sourcetype=Z)
Once you have that, at the bottom of "Interesting Fields" on the left click the "All fields" button. This will show you a list of all fields that (by default - there's a selector at the top to change it) are in like 1% or 5% or something of all your data. From that list, check the column labeled "Event Coverage". You can click that column header to sort by that to make it easier.
If in your original data the split between the two types of events is about even (50% are one and 50% are the other), look for fields that have perhaps 75% or greater "Event Coverage". These are the fields that both data sets have in common.
If it's skewed one way or the other (like, 90% are type A, 10% type B) then this technique will only work if you check for coverage greater than about 95% or so.
Neither of those means they're actually related, but it's a start of things to look into more specifically (think of it as converting your list of 100 possible fields into a short list of 5 to check into manually).
If you don't have any fields in common, all is still not lost. Sometimes just taking a few events from each and eyeballing them may tell you important linkages. Look for similar, repeated data structures in each data type, and see if there aren't a couple of those that appear to connect the two events together.
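To see why the coverage threshold has to move with the mix of event types, here is a rough Python sketch of what an "Event Coverage" figure amounts to. It is not how Splunk computes the column; the function name `event_coverage` and the sample events are made up for illustration.

```python
from collections import Counter

def event_coverage(events):
    """Return {field: fraction of events that contain the field}."""
    if not events:
        return {}
    counts = Counter()
    for event in events:
        counts.update(event.keys())
    total = len(events)
    return {field: n / total for field, n in counts.items()}

# Hypothetical 90/10 mix: nine type-A events and one type-B event.
type_a = [{"host": f"a{i}", "sysID": i, "level": "info"} for i in range(9)]
type_b = [{"host": "b0", "sysID": 3, "error": "timeout"}]
coverage = event_coverage(type_a + type_b)

# Fields shared by both types reach 100%; "level" exists only in type A
# yet still shows 90%, which is why a 75% cutoff would be misleading here.
for field, share in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(f"{field:6} {share:.0%}")
```

With an even split, a field that exists in only one type tops out around 50%, so a 75% cutoff separates shared fields cleanly. With a 90/10 split the cutoff has to sit above 90% to do the same job.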
If you find two but they're differently named, you can do something like the below to make them the same name for testing, then try a transaction on them to see*. So if one event type has a field "sysID" and another has a field "id" and they look like they may fit, you could try
index=X OR index=B | eval id=coalesce(id, sysID) | transaction maxspan=5m id
Which essentially fills in id from sysID on the events that only have sysID (a plain eval id=sysID would blank out id on the events that have no sysID), then we transaction on that to see if the groupings actually make sense.
So, while not perfect, these are some of the techniques I use. The real answer to this question, and the tips we've put down here all lead to this, is to "learn your data." And it's not always easy, but some day perhaps you too will find this a very enjoyable pastime!
*Transaction usually isn't the way to build a solution that scales well (stats is usually the better choice), but a) when you need transaction it may be the best answer and b) in THIS case you are just testing anyway!
Another thought - I found a previous article on how to search for something and then get surrounding events at the same time based on _time (_time-150 and _time+150)
Could I take each of these results and then pipe into transaction to find recurrences of events occurring together?
Re using cluster - yeah great cmd but in this context the events don't have any textual similarity to group them by. I.e.
There might be a switch degradation event that causes host X to log an OS msg and for host Y to log an application error. I want to group events based on how often they recur together.
Back thinking (not tinkering unfortunately) about this....if I could use transaction* and maxspan=5m then I'd get "groups" with an eventcount BUT how could I then work out how many times those groups have recurred together?
*once I get it working, I'll move to stats and eventstats for efficiency !
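The "bucket by time, then look for repeats" idea can be sketched outside Splunk like this. It is only an illustration under assumptions: events are reduced to (epoch seconds, kind) tuples, "kind" is whatever label you choose (here a hypothetical host plus message type), and the name `recurring_pairs` is made up.

```python
from collections import Counter
from itertools import combinations

def recurring_pairs(events, span_seconds=300, min_windows=2):
    """Count how many fixed time windows each pair of event kinds shares.

    events: iterable of (epoch_seconds, kind) tuples, where kind is any
    hashable, sortable label such as (host, message_type).
    """
    # Step 1: collect the distinct kinds seen in each fixed-size window.
    windows = {}
    for ts, kind in events:
        bucket = int(ts // span_seconds)
        windows.setdefault(bucket, set()).add(kind)

    # Step 2: count, per pair of kinds, the windows in which both appear.
    pair_counts = Counter()
    for kinds in windows.values():
        for pair in combinations(sorted(kinds), 2):
            pair_counts[pair] += 1

    # Step 3: keep only pairs that recur in at least min_windows windows.
    return {pair: n for pair, n in pair_counts.items() if n >= min_windows}

# Hypothetical data: the same three events fire together twice, an hour apart.
sample = [
    (0, ("switch1", "link degraded")),
    (20, ("hostX", "os msg")),
    (45, ("hostY", "app error")),
    (1000, ("hostZ", "login")),
    (3600, ("switch1", "link degraded")),
    (3630, ("hostX", "os msg")),
    (3650, ("hostY", "app error")),
    (3700, ("hostZ", "login")),
]
repeats = recurring_pairs(sample)
for pair, n in sorted(repeats.items()):
    print(n, pair)
```

Two caveats with this sketch: fixed windows can split a group that straddles a window boundary, and very chatty event kinds will pair up with almost everything by coincidence, so they probably need filtering out or a higher `min_windows`.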
You can do all sorts of things. Your idea could work, though I think you'll still need some way to connect them. The idea is intriguing now that I'm thinking on it more.
Report back what you find out in your forays, I think these different ideas and techniques might make a good Splunk Blog topic.
Will do - I'll have a play around. Blog idea - yes, very much so. I can supply more context.. I've spoken to my presales guys about this previously actually and one of the Instructors on the Data Science course too 🙂
Another thought: the cluster command (and similar ones) can help find groups of similar data, which you'll then have to confirm and work out how to actually hook together. These commands may also help you decide whether two particular fields really are very similar or not.
I think you can use the correlate and contingency commands to help you identify relationships between fields and field values first. Then, you can use the transaction command on strongly-correlated fields to group events.
The correlate command calculates the correlation between different fields. Use this command to see an overview of the co-occurrence between fields in your data. The results are presented in a matrix format, where the cross tabulation of two fields is a cell value. The cell value represents the percentage of times that the two fields exist in the same events.
For details, see: http://docs.splunk.com/Documentation/Splunk/6.5.1/SearchReference/Correlate
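As a rough picture of what a field co-occurrence matrix tells you, here is a small Python sketch. The exact formula Splunk uses is not given here, so the definition below (events containing both fields divided by events containing either) is an assumption, and `cooccurrence` is a made-up name.

```python
from itertools import combinations

def cooccurrence(events):
    """For each pair of fields, return the share of events containing either
    field that contain both (an assumed definition, not Splunk's formula)."""
    fields = sorted({field for event in events for field in event})
    result = {}
    for a, b in combinations(fields, 2):
        either = sum(1 for e in events if a in e or b in e)
        both = sum(1 for e in events if a in e and b in e)
        result[(a, b)] = both / either
    return result

# Hypothetical events with partly overlapping fields.
sample_events = [
    {"host": "X", "sysID": 1},
    {"host": "X", "sysID": 2},
    {"host": "Y", "error": "timeout"},
    {"host": "Y"},
]
matrix = cooccurrence(sample_events)
for pair, share in sorted(matrix.items()):
    print(pair, f"{share:.0%}")
```

Pairs with a high share are the ones worth checking as possible join fields; a share of zero means the two fields never appear in the same event.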
The contingency command builds a contingency table, which is used to record and analyze the relationship between two or more (usually categorical) variables.
For details, see: http://docs.splunk.com/Documentation/Splunk/6.5.1/SearchReference/contingency
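For anyone who hasn't met contingency tables before, this is the general idea in a few lines of Python. It only illustrates the concept, not the command's output format; `contingency_table` and the sample fields are hypothetical.

```python
from collections import Counter

def contingency_table(events, row_field, col_field):
    """Count events for each (row_field value, col_field value) combination.

    Events missing either field are skipped.
    """
    table = Counter()
    for event in events:
        if row_field in event and col_field in event:
            table[(event[row_field], event[col_field])] += 1
    return table

# Hypothetical events: which hosts log which levels?
log_events = [
    {"host": "X", "level": "error"},
    {"host": "X", "level": "error"},
    {"host": "X", "level": "info"},
    {"host": "Y", "level": "info"},
    {"host": "Y"},
]
table = contingency_table(log_events, "host", "level")
for (host, level), count in sorted(table.items()):
    print(host, level, count)
```

If the counts pile up in a few cells (host X is almost always "error", say), the two fields are related; if they spread evenly, they probably are not.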
Hope this helps. Thanks!