Say I have the following log, where I have separate input and output parts, however, they are processed as batch in between:
    input id=1 input_id=1
    input id=2 input_id=2
    input id=3 input_id=3
    process id=4 input_ids=1,2,3
    output id=5 input_id=1
    output id=6 input_id=2
    output id=7 input_id=3
I'd like to be able to trace the above in one transaction, such that when I search for input_id=1, I get this:
    input id=1 input_id=1
    process id=4 input_ids=1,2,3
    output id=5 input_id=1
Is that possible (including modifying the log to fit Splunk searches)? I'd like to avoid spreading the logging, i.e. doing something like this:
    input id=1 input_id=1
    input id=2 input_id=2
    input id=3 input_id=3
    process id=4 input_id=1
    process id=4 input_id=2
    process id=4 input_id=3
    output id=5 input_id=1
    output id=6 input_id=2
    output id=7 input_id=3
as there are many more lines here, which would make the log unreadable for human consumption outside of Splunk, if that's ever needed. Keeping everything on one line could also be useful for non-Splunk scripts that depend on the whole batch being in a single entry.
Another thing, batches are not demarcated, they are time-based. Think of the "process" part being something that's executed every 15 seconds or so. In that time frame, the number of input lines can be 1 or 1000. Outputs also, they depend on what process spits out.
Processing is also not ordered. 1000 inputs can arrive, it can pick 1 and 1000 in the next batch, then 2-999 in the following batch, as an example, due to priorities or other specifications by the user who pushed the inputs. The only way to know what was picked up in a batch is to look at input_ids.
I'm fine changing the format of individual lines. For instance, if I should change

    process id=4 input_ids=1,2,3

to

    process id=4 input_ids=1_2_3

that is doable; it keeps the same information.
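As an aside, the compact one-event-per-line format is also easy for a non-Splunk script to trace. Here is a minimal Python sketch (my own illustration, not part of the question; the `trace` helper and the sample `LOG` are hypothetical) that pulls every event touching a given input_id, accepting either the `1,2,3` or the `1_2_3` delimiter:

```python
import re

# Sample log in the compact format from the question, one event per line.
LOG = """\
input id=1 input_id=1
input id=2 input_id=2
input id=3 input_id=3
process id=4 input_ids=1,2,3
output id=5 input_id=1
output id=6 input_id=2
output id=7 input_id=3
"""

def trace(log, wanted_id):
    """Return every line whose input_id (or input_ids list) contains wanted_id."""
    hits = []
    for line in log.splitlines():
        # \w includes "_", so this also matches the input_ids=1_2_3 variant
        m = re.search(r"input_ids?=([\w,]+)", line)
        if m and str(wanted_id) in re.split(r"[,_]", m.group(1)):
            hits.append(line)
    return hits

for line in trace(LOG, 1):
    print(line)
# input id=1 input_id=1
# process id=4 input_ids=1,2,3
# output id=5 input_id=1
```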
I worked up an example just using some multivalue stuff and a transaction that makes this much easier, given a particular format of data.
I assume you don't have this data in Splunk yet? It can be made to work either way. I created a test CSV input to work with, using something like your data. I'm including it here since it's probably not quite what you have; this way you can compare and hopefully adapt it to your own data:
    time, process_type, process_number, input_ids
    "10/5/15 7:09:20.000 PM",input,1,1
    "10/5/15 7:09:21.000 PM",input,2,2
    "10/5/15 7:09:22.000 PM",input,3,3
    "10/5/15 7:09:23.000 PM",process,4,1 3
    "10/5/15 7:09:24.000 PM",output,5,1
    "10/5/15 7:09:25.000 PM",output,6,2
    "10/5/15 7:09:26.000 PM",output,7,3
    "10/5/15 7:09:27.000 PM",input,8,8
    "10/4/15 7:09:28.000 PM",process,9,2 8
    "10/4/15 7:09:29.000 PM",output,10,8
So, process_type ends up being your input, output, or process. The number after that is your process_number, and as far as I recall it means nothing; I don't use it at all. It's in the data because I was playing around and it's just sequential. The last field is input_ids: it's a single number when it's a single number (duh), and a space-separated set of numbers when it's the batch process events. I know I forgot to process input_id 2 in the first batch; sorry, it was just a test. 🙂
Now, given that,
    ... | eval input_ids=split(input_ids, " ") | mvexpand input_ids | transaction input_ids
What did I do?
First, given a search that returns the above 10 rows of data in Splunk, I split your input_ids into a multivalue field on spaces. This only changes the rows where there's something to split, like the "2 8" one. It doesn't create new rows; it takes a row that used to have `input_ids=2 8` and turns that into two values in the same row, like `input_ids=2` and `input_ids=8`.
Then I mvexpand it, which takes the rows I just split and creates one row per value, so

    "10/5/15 7:09:23.000 PM",process,4,1 3

becomes two rows:

    "10/5/15 7:09:23.000 PM",process,4,1
    "10/5/15 7:09:23.000 PM",process,4,3
Then it's a simple transaction away to group them together. One of the resulting groups:

    "10/5/15 7:09:27.000 PM",input,8,8
    "10/4/15 7:09:28.000 PM",process,9,2 8
    "10/4/15 7:09:29.000 PM",output,10,8
There's another group of events for input_id=2 (it matches that same process event), then one group each for input_ids 1 and 3.
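If it helps to see the mechanics outside Splunk, here is a rough Python sketch of what the split / mvexpand / transaction pipeline does to those rows. This is a simplification of my own (real `transaction` also considers timestamps and event ordering, which this skips); the tuples mirror the test CSV above:

```python
from collections import defaultdict

# Rows shaped like the test CSV: (time, process_type, process_number, input_ids)
rows = [
    ("10/5/15 7:09:20.000 PM", "input",   1, "1"),
    ("10/5/15 7:09:21.000 PM", "input",   2, "2"),
    ("10/5/15 7:09:22.000 PM", "input",   3, "3"),
    ("10/5/15 7:09:23.000 PM", "process", 4, "1 3"),
    ("10/5/15 7:09:24.000 PM", "output",  5, "1"),
    ("10/5/15 7:09:25.000 PM", "output",  6, "2"),
    ("10/5/15 7:09:26.000 PM", "output",  7, "3"),
    ("10/5/15 7:09:27.000 PM", "input",   8, "8"),
    ("10/4/15 7:09:28.000 PM", "process", 9, "2 8"),
    ("10/4/15 7:09:29.000 PM", "output", 10, "8"),
]

# eval split + mvexpand: emit one copy of each row per id in its input_ids field
expanded = [
    (time, ptype, num, one_id)
    for (time, ptype, num, ids) in rows
    for one_id in ids.split(" ")
]

# transaction input_ids: group the expanded rows by their single input_id
groups = defaultdict(list)
for row in expanded:
    groups[row[3]].append(row)

for input_id in sorted(groups):
    print(input_id, [(e[1], e[2]) for e in groups[input_id]])
# 1 [('input', 1), ('process', 4), ('output', 5)]
# 2 [('input', 2), ('output', 6), ('process', 9)]
# 3 [('input', 3), ('process', 4), ('output', 7)]
# 8 [('input', 8), ('process', 9), ('output', 10)]
```

Each group is the full input-process-output trace for one input_id, which is exactly the one-transaction view asked for.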
This is an excellent test example rich7177, thank you for that, I'll try it out! One more question: with big files, are there any performance concerns with this approach compared to splitting the "process" lines into separate lines? I'm thinking about cases where there are a lot of these, say hundreds of millions per week. It probably depends on the data, but do you have any hunches?
I would expect that the difference between the two techniques would be relatively minor. If you can do one of them well on hardware X, you can probably do the other way at least acceptably well with that same hardware.
To wit, the data ingestion will be nearly the same either way, so no significant difference there. Also, split, mvexpand, and those sorts of commands aren't particularly loathsome performance-wise, and they only run at search time anyway, so they will probably not be the straw that breaks the camel's back.
For those data volumes you may want to consider building a summary index. I'm not sure if you'll find great use for it or not, but it seems like you might be able to use the summary for most of your dashboards and searches, and only ever have to see the "raw" data when you need to see the raw data. Here's some info on using summaries. There are also blogs on this and a great .conf2013 session that you can download and watch from here, look for "Automating Operational Intelligence: Stats and Summary Indexes" by Jesse Trucks.
Unless you say not to, f8899, I'm going to "rearrange" this answer a bit later, promoting this comment to the top-level answer, to make it clearer to people who stumble across this thread later what the answer turned out to be.
Great, much appreciated rich7177! I am totally fine with rearranging, it is a sensible thing to me to have it as a top level answer, absolutely go ahead. All the information you presented I think is very useful, worth keeping it and exposing at the top level.