Hi,
I have a bunch of files that I need to push into Splunk, and I am struggling to parse them correctly. The format is the following:
00:00:00,059: htsxml1|133d2e11-cebb-4f3a-9156-a75c51f4a57e|38706|920253|635161|110|P|2011-05-02|4|1:2-0#0:2-0#0:2-0|ESAGP|||160|2875
00:00:00,293: htsxml1|80e4f08e-0795-48fd-9bf0-b7fd2f47ee8c|17116|108051|130889|110|I|RS#TOURICO|OK|0|2|0|0|0|0|0|0|0|0
00:00:00,293: htsxml1|133d2e11-cebb-4f3a-9156-a75c51f4a57e|38706|920253|635161|110|I|RS#SERHS|OK|77|2787|1|0|0|0|0|0|0|0
00:00:00,293: htsxml1|80e4f08e-0795-48fd-9bf0-b7fd2f47ee8c|17116|108051|130889|110|B|default|OK|0|100
The first field is the timestamp, which only has hour:minute:second,millisecond (no date). The rest of the fields follow, separated by the "|" character. The layout is not fixed: depending on the field that has the values P, I, or B, the meaning of the following fields changes.
Let's make it a bit more clear with an example:
00:00:00,293: htsxml1|80e4f08e-0795-48fd-9bf0-b7fd2f47ee8c|17116|108051|130889|110|B|default|OK|0|100
As this is a "B" type line (as can be seen in the 7th field), the 8th field is "Request Type", the 9th is "Result", the 10th is errors, and the 11th is the time taken
00:00:00,059: htsxml1|133d2e11-cebb-4f3a-9156-a75c51f4a57e|38706|920253|635161|110|P|2011-05-02|4|1:2-0#0:2-0#0:2-0|ESAGP|||160|2875
This is a "P" type line. 8th field is "Query Date", 9th is "Days", and so on...
So, my questions:
(1) How can I get Splunk to timestamp these events correctly, given that the lines have no date?
(2) How can I extract fields whose meaning depends on the record type (P, I, B)?
Many thanks!
(1) Splunk should be able to timestamp your events correctly. If the filename has a date as part of its name we'll use it; otherwise we'll assume today's date. Two caveats, though: (a) since we assign the date based on the filename, events in the file that cross the midnight boundary will end up being assigned the wrong date; (b) by default we break the file stream into events whenever we see a date, and since your file does not have dates we won't correctly break the stream into events. Thus you'll need to force Splunk to treat a single line as an event:
props.conf
[your_sourcetype]
SHOULD_LINEMERGE=false
(2) You can have multiple regex KV extraction rules for your different record types. Here's an example for the B record type; the above stanza now becomes:
props.conf
[your_sourcetype]
SHOULD_LINEMERGE=false
EXTRACT-B-Record=(?:[^|]+[|]){6}B[|](?<RequestType>[^|]*)[|](?<Result>[^|]*)[|]
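As a sanity check, the same pattern can be tried against the sample B line with Python's `re` (illustration only; note that Python spells named groups `(?P<name>...)`, whereas Splunk's PCRE accepts `(?<name>...)`):

```python
import re

# The B-record extraction from the stanza above, in Python's named-group syntax.
# [^|]* (rather than [^|]+) means empty fields still match.
B_RECORD = re.compile(r"(?:[^|]+\|){6}B\|(?P<RequestType>[^|]*)\|(?P<Result>[^|]*)\|")

line = ("00:00:00,293: htsxml1|80e4f08e-0795-48fd-9bf0-b7fd2f47ee8c"
        "|17116|108051|130889|110|B|default|OK|0|100")
m = B_RECORD.search(line)
print(m.group("RequestType"), m.group("Result"))  # default OK
```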
If I'm not wrong, the date should be added automatically by Splunk - check the documentation at http://www.splunk.com/base/Documentation/4.2/Data/HowSplunkextractstimestamps, specifically the section "Precedence rules for timestamp assignment".
You can see that Splunk will try to deduce the date from various fields (I can't test this now unfortunately so let us know if you do more testing).
Regarding question 2, you should be able to add multiple REPORT extractions under a single sourcetype stanza to handle your cases. The only requirement is that they have different names, like this:
[mylogsourcetype]
REPORT-extractP = fieldP
REPORT-extractB = fieldB
And then in transforms.conf:
[fieldP]
REGEX = (.*?\|){6}P\|(.*?)\|...
FORMAT = myextractedfield::$2
[fieldB]
REGEX = (.*?\|){6}B\|(.*?)\|...
FORMAT = myextractedfield2::$2
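To see which capture group the FORMAT's $2 refers to, here is a quick check of the P pattern with Python's `re` (the trailing "..." from the answer, which stands for the remaining fields, is simply dropped here; group 1 is the repeated prefix group, group 2 is the extracted value):

```python
import re

# transforms.conf REGEX for the P record, minus the trailing "...":
p_regex = re.compile(r"(.*?\|){6}P\|(.*?)\|")

line = ("00:00:00,059: htsxml1|133d2e11-cebb-4f3a-9156-a75c51f4a57e"
        "|38706|920253|635161|110|P|2011-05-02|4|1:2-0#0:2-0#0:2-0|ESAGP|||160|2875")
m = p_regex.search(line)
print(m.group(2))  # 2011-05-02  <- this is what FORMAT's $2 picks up
```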
Just keep in mind that these will get called for every event so watch for performance hits.
It depends on what you're trying to do here (other experts on index-time modifications can help with this). Keep in mind that REPORT is applied at search time and TRANSFORMS is applied at index time. However, with TRANSFORMS you have to be extra careful, as it might seriously impact Splunk's performance; see the props.conf documentation for more details.
One thing that can maybe help is to not use REPORT/EXTRACT but instead create searches that use the rex command to extract fields. That way you can create precise searches so regular expressions are applied only on small subsets of results.
Hi,
It works if I add to props.conf:
SHOULD_LINEMERGE=false
But as you mention, performance suffers. What about using TRANSFORMS instead of REPORT? Would it work, and would it be more performant?
Thanks!
props.conf
[your_sourcetype]
SHOULD_LINEMERGE=false
EXTRACT-B-Record=(?:[^|]+[|]){6}B[|](?<RequestType>[^|]*)[|](?<Result>[^|]*)[|]
I've modified the regex to match empty fields. As far as performance goes, I would recommend you try our "Advanced Charting" view, in which we aggressively optimize field extraction. Moving extractions to index time would definitely help; however, you lose the flexibility. You can look here for more info on how to configure index-time extraction: http://www.splunk.com/base/Documentation/latest/Data/Configureindex-timefieldextraction
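A quick way to convince yourself that `[^|]*` handles the empty-field case from question 2.1 is a Python check (the B line with an empty Request Type below is hypothetical, built just for this test; Python needs `(?P<name>...)` for named groups):

```python
import re

# Same extraction as the stanza above; [^|]* matches zero characters,
# so an empty Request Type field ("B||OK|...") still produces a match.
B_RECORD = re.compile(r"(?:[^|]+\|){6}B\|(?P<RequestType>[^|]*)\|(?P<Result>[^|]*)\|")

# Hypothetical event with an empty Request Type field:
line = "00:00:01,000: htsxml1|aaaa|1|2|3|110|B||OK|0|100"
m = B_RECORD.search(line)
print(repr(m.group("RequestType")), m.group("Result"))  # '' OK
```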
Many thanks:
1.- Works great, the date is added correctly
2.- It works great, but two caveats:
2.1.- I am not matching all the events correctly: if the Request Type field is empty, nothing matches. Any ideas on how to solve that? It should be a change to the REGEX, I assume.
2.2.- Performance is not very good. Would it improve if we moved from search-time to index-time extractions? I tried using TRANSFORMS, but to no avail.
Thanks again!