Hi,
I have a bunch of files that I need to push into Splunk, and I am struggling to parse them correctly. The format is the following:
00:00:00,059: htsxml1|133d2e11-cebb-4f3a-9156-a75c51f4a57e|38706|920253|635161|110|P|2011-05-02|4|1:2-0#0:2-0#0:2-0|ESAGP|||160|2875
00:00:00,293: htsxml1|80e4f08e-0795-48fd-9bf0-b7fd2f47ee8c|17116|108051|130889|110|I|RS#TOURICO|OK|0|2|0|0|0|0|0|0|0|0
00:00:00,293: htsxml1|133d2e11-cebb-4f3a-9156-a75c51f4a57e|38706|920253|635161|110|I|RS#SERHS|OK|77|2787|1|0|0|0|0|0|0|0
00:00:00,293: htsxml1|80e4f08e-0795-48fd-9bf0-b7fd2f47ee8c|17116|108051|130889|110|B|default|OK|0|100
The first field is the timestamp, which only has hour:minute:second,millisecond (no date). The rest of the fields follow, separated by the "|" character. The layout is not fixed: depending on the field that has the values P, I, or B, the meaning of the following fields changes.
Let's make it a bit more clear with an example:
00:00:00,293: htsxml1|80e4f08e-0795-48fd-9bf0-b7fd2f47ee8c|17116|108051|130889|110|B|default|OK|0|100
As this is a "B" type line (as can be seen in the 7th field), the 8th field is "Request Type", the 9th is "Result", the 10th is errors, and the 11th is the time taken
00:00:00,059: htsxml1|133d2e11-cebb-4f3a-9156-a75c51f4a57e|38706|920253|635161|110|P|2011-05-02|4|1:2-0#0:2-0#0:2-0|ESAGP|||160|2875
This is a "P" type line. 8th field is "Query Date", 9th is "Days", and so on...
So, my questions:
(1) How can I get Splunk to timestamp these events correctly, given that the lines have no date?
(2) How can I extract fields whose meaning depends on the record type (P, I, B)?
Many thanks!
(1) Splunk should be able to timestamp your events correctly. If the filename has a date as part of its name we'll use it; otherwise we'll assume today's date. Two caveats, though: (a) since we assign the date based on the filename, events in the file that cross the midnight boundary will end up being assigned the wrong date; (b) by default we break the file stream into events whenever we see a date, and since your file does not have dates we won't correctly break the stream into events. Thus you'll need to force Splunk to treat a single line as an event:
props.conf
[your_sourcetype]
SHOULD_LINEMERGE=false
(2) You can have multiple regex KV extraction rules for your different record types. Here's an example for the B record type; the above stanza now becomes:
props.conf
[your_sourcetype]
SHOULD_LINEMERGE=false
EXTRACT-B-Record=(?:[^|]+[|]){6}B[|](?<RequestType>[^|]*)[|](?<Result>[^|]*)[|]
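As a sanity check, the same pattern can be tried against the sample B line with Python's `re` (illustration only; note that Python spells named groups `(?P<name>...)`, whereas Splunk's PCRE accepts `(?<name>...)`):

```python
import re

# The B-record extraction from the stanza above, in Python's named-group syntax.
# [^|]* (rather than [^|]+) means empty fields still match.
B_RECORD = re.compile(r"(?:[^|]+\|){6}B\|(?P<RequestType>[^|]*)\|(?P<Result>[^|]*)\|")

line = ("00:00:00,293: htsxml1|80e4f08e-0795-48fd-9bf0-b7fd2f47ee8c"
        "|17116|108051|130889|110|B|default|OK|0|100")
m = B_RECORD.search(line)
print(m.group("RequestType"), m.group("Result"))  # default OK
```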
If I'm not wrong, the date should be added automatically by Splunk - check the documentation at http://www.splunk.com/base/Documentation/4.2/Data/HowSplunkextractstimestamps, specifically the section "Precedence rules for timestamp assignment".
You can see that Splunk will try to deduce the date from various fields (I can't test this now unfortunately so let us know if you do more testing).
Regarding question 2, you should be able to add multiple REPORT extractions under a single sourcetype stanza to handle your cases. The only requirement is that they have different names, like this:
[mylogsourcetype]
REPORT-extractP = fieldP
REPORT-extractB = fieldB
And then in transforms.conf:
[fieldP]
REGEX = (.*?\|){6}P\|(.*?)\|...
FORMAT = myextractedfield::$2
[fieldB]
REGEX = (.*?\|){6}B\|(.*?)\|...
FORMAT = myextractedfield2::$2
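To see which capture group the FORMAT's $2 refers to, here is a quick check of the P pattern with Python's `re` (the trailing "..." from the answer, which stands for the remaining fields, is simply dropped here; group 1 is the repeated prefix group, group 2 is the extracted value):

```python
import re

# transforms.conf REGEX for the P record, minus the trailing "...":
p_regex = re.compile(r"(.*?\|){6}P\|(.*?)\|")

line = ("00:00:00,059: htsxml1|133d2e11-cebb-4f3a-9156-a75c51f4a57e"
        "|38706|920253|635161|110|P|2011-05-02|4|1:2-0#0:2-0#0:2-0|ESAGP|||160|2875")
m = p_regex.search(line)
print(m.group(2))  # 2011-05-02  <- this is what FORMAT's $2 picks up
```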
Just keep in mind that these will get called for every event so watch for performance hits.
It depends on what you're trying to do here (other experts on index-time modifications can help with this). Keep in mind that REPORT is applied at search time and TRANSFORMS is applied at index time. However, with TRANSFORMS you have to be extra careful, as it might seriously impact Splunk's performance; see the props.conf documentation for more details.
One thing that can maybe help is to not use REPORT/EXTRACT but instead create searches that use the rex command to extract fields. That way you can create precise searches so regular expressions are applied only on small subsets of results.
Hi,
It works if I add to props.conf:
SHOULD_LINEMERGE=false
But as you mention, performance suffers. What about using TRANSFORMS instead of REPORT? Would it work, and would it be more performant?
Thanks!
props.conf
[your_sourcetype]
SHOULD_LINEMERGE=false
EXTRACT-B-Record=(?:[^|]+[|]){6}B[|](?<RequestType>[^|]*)[|](?<Result>[^|]*)[|]
I've modified the regex to match empty fields. As far as performance goes, I would recommend you try our "Advanced Charting" view, in which we aggressively optimize field extraction. Moving extractions to index time would definitely help; however, you lose the flexibility. You can look here for more info on how to configure index-time extraction: http://www.splunk.com/base/Documentation/latest/Data/Configureindex-timefieldextraction
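A quick way to convince yourself that `[^|]*` handles the empty-field case from question 2.1 is a Python check (the B line with an empty Request Type below is hypothetical, built just for this test; Python needs `(?P<name>...)` for named groups):

```python
import re

# Same extraction as the stanza above; [^|]* matches zero characters,
# so an empty Request Type field ("B||OK|...") still produces a match.
B_RECORD = re.compile(r"(?:[^|]+\|){6}B\|(?P<RequestType>[^|]*)\|(?P<Result>[^|]*)\|")

# Hypothetical event with an empty Request Type field:
line = "00:00:01,000: htsxml1|aaaa|1|2|3|110|B||OK|0|100"
m = B_RECORD.search(line)
print(repr(m.group("RequestType")), m.group("Result"))  # '' OK
```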
Many thanks:
1.- Works great, the date is added correctly
2.- It works great, but two caveats:
2.1.- I am not matching all the events correctly: if the Request Type field is empty, nothing matches. Any ideas on how to solve that? It should be a change to the REGEX, I assume.
2.2.- Performance is not very good. Would it improve if we moved from search-time to index-time extractions? I tried using TRANSFORMS, but to no avail.
Thanks again!