Splunk Search

How to configure props and transforms to extract fields from multiple CSV log formats in the same source?

JWBailey
Communicator

Good afternoon,

I have some syslog data coming into Splunk. I am trying to write the props and transforms to add the field extractions and want to make sure I am doing it the best way.

Question:
How do I accommodate different log formats in the same input source? Assume all my events are in CSV format with commas as clear delimiters; only the headers differ. For example:

Log 1 format:
Source, random_characters, log_TYPE, Login_Name, Last_Name, First_Name, Event_ID, Severity, Status

Log 2 format:
Source, random_characters, log_TYPE, Login_Name, Last_Name, Time, Date, Status, Resolution

Key point to notice: the fields are different after the log_TYPE field.

How should I build the props and transforms to most effectively and accurately extract the fields for each log type?

So far, the approach I have thought of is to use a regex that reads into the middle of the event, up to the point where the log type can be identified. With that information I can build out the rest of the REGEX line to extract the fields that belong to that log type.

Example of my idea:
Log example:

2015-04-16 13:27:37,278, some words and random characters, AUDIT_LOG, Interesting field 1, not needed, not needed, Interesting field 2, not needed, Interesting field 3.

My transforms.conf

[audit_log_field_extractions]
REGEX = [\w.\s:-]+,[\w.\s:-]+,[\w.\s:-]+,\sAUDIT_LOG,\s(\w+),[\w.\s:-]+,[\w.\s:-]+,\s(\w+),[\w.\s:-]+,\s(\w+)
FORMAT = Field1::$1 Field2::$2 Field3::$3

All of this in English now:
[\w.\s:-]+,[\w.\s:-]+,[\w.\s:-]+,\s is built to get all the way up to the point where the log type appears in the event.
AUDIT_LOG is the exact text I use to identify this specific event type. It tells me which fields I consider relevant and what the headers should be.
The rest of the regex either captures data into groups or skips past it.
Then the FORMAT line assigns a name to each of the groups captured by the REGEX line.
In theory I would have a similar stanza for each log type: look at the event, identify the specific log type, and then use a customized REGEX to grab what I want.

I plan on calling these transforms from REPORT settings in props.conf, which means the extraction is done at search time. My concern is whether this will be too resource intensive at search time. Is this a real concern? Is there a better way to do this?
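For readers following along, the search-time wiring described here might look like the sketch below. The sourcetype name appliance_syslog and the class name audit_log are hypothetical placeholders, not values from the original post; only the transform stanza name comes from the transforms.conf shown above.

```ini
# props.conf (search-time; stanza name is a hypothetical sourcetype)
[appliance_syslog]
# REPORT-<class> points at a stanza name in transforms.conf.
# "audit_log" is an arbitrary class name chosen for this sketch.
REPORT-audit_log = audit_log_field_extractions
```

A REPORT setting can list several transform stanzas separated by commas, so one line per log type or one combined line should both work.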

1 Solution

JWBailey
Communicator

To update this with my solution:

I ended up writing a separate regular expression for each format to do the extractions. In each expression I hard-coded the unique identifier for that specific format, so each regular expression matches only events of its own type and applies the correct extraction pattern.
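The poster did not share the final expressions, so the following is only a sketch of what one stanza per format could look like. It assumes that three comma-terminated chunks precede the log type, as in the AUDIT_LOG example event from the question (the comma inside the timestamp counts as one of them). The SYSTEM_LOG identifier, the second stanza, and all field names are hypothetical.

```ini
# transforms.conf
# One stanza per log type; each regex hard-codes its own identifier,
# so it can only match events of that type.

[audit_log_field_extractions]
# Skip three comma-terminated chunks, require AUDIT_LOG, then capture
# the 1st, 4th and 6th values after it. [^,]+ allows spaces in values.
REGEX = ^(?:[^,]*,){3}\s*AUDIT_LOG,\s*([^,]+),(?:[^,]*,){2}\s*([^,]+),[^,]*,\s*([^,]+)
FORMAT = Field1::$1 Field2::$2 Field3::$3

[system_log_field_extractions]
# Hypothetical second type: capture the 1st and 3rd values after SYSTEM_LOG.
REGEX = ^(?:[^,]*,){3}\s*SYSTEM_LOG,\s*([^,]+),[^,]*,\s*([^,]+)
FORMAT = Field4::$1 Field5::$2
```

Anchoring at the start of the event and putting the literal identifier early should let a non-matching event fail quickly, which helps keep the search-time cost of running every stanza against every event low.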

masonmorales
Influencer

Write two EXTRACT statements directly in props.conf (using different names). For your second log format, use either the date or the time field with a more specific capture pattern (for example, 04-17-2015 would be captured using \d{2}-\d{2}-\d{4}). An EXTRACT statement only applies when an event matches its entire pattern, so two statements with specific capture patterns will help your data extract correctly.
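A rough sketch of this suggestion, using the two header layouts from the question, might look like the following. The sourcetype name appliance_syslog and the class names are hypothetical, and the sketch assumes that Event_ID is purely numeric and that Date looks like 04-17-2015; if either assumption is wrong, the patterns would need a different distinguishing field.

```ini
# props.conf (stanza name is a hypothetical sourcetype)
[appliance_syslog]

# Log 1: the seventh field must be all digits (assumed Event_ID format).
EXTRACT-log1_fields = ^(?:[^,]*,){3}\s*(?<Login_Name>[^,]+),\s*(?<Last_Name>[^,]+),\s*(?<First_Name>[^,]+),\s*(?<Event_ID>\d+),\s*(?<Severity>[^,]+),\s*(?<Status>[^,]+)

# Log 2: the seventh field must look like a date such as 04-17-2015.
EXTRACT-log2_fields = ^(?:[^,]*,){3}\s*(?<Login_Name>[^,]+),\s*(?<Last_Name>[^,]+),\s*(?<Time>[^,]+),\s*(?<Date>\d{2}-\d{2}-\d{4}),\s*(?<Status>[^,]+),\s*(?<Resolution>[^,]+)
```

Because each pattern is strict about the seventh field, a Log 1 event should fail the Log 2 pattern and vice versa, so each event picks up only the field names that belong to its format.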

JWBailey
Communicator

Any other possible solutions or thoughts on the efficiency of my proposed solution?

JWBailey
Communicator

Thanks for the help in advance.

jdunlea
Contributor

Are these from the same source or the same sourcetype? If you can make either the source or the sourcetype different for the two log formats, that will solve all of your problems.

You can even force the sourcetyping of one log format based on the number of commas in the event, using some regex trickery.

That is what I would do. Then you will end up with two sourcetypes that you can build your extractions off of easily.
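As an illustration of the sourcetype override, here is a minimal sketch. It keys on the log type text rather than on a comma count, because the two example formats in the question have the same number of fields. The stanza names and the sourcetype names appliance_syslog and appliance_audit are hypothetical, and the regex assumes three comma-terminated chunks precede the log type, as in the AUDIT_LOG example event.

```ini
# props.conf (index-time, so it belongs on the indexer or heavy forwarder)
[appliance_syslog]
TRANSFORMS-set_audit_sourcetype = set_sourcetype_appliance_audit

# transforms.conf
[set_sourcetype_appliance_audit]
# Matched against the raw event; only AUDIT_LOG events are rewritten.
REGEX = ^(?:[^,]*,){3}\s*AUDIT_LOG,
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::appliance_audit
```

Events that do not match keep the original sourcetype. The search-time extractions for the rewritten events would then go under an [appliance_audit] stanza in props.conf. This only affects newly indexed data.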

JWBailey
Communicator

The sourcetype and source are both the same right now. I would like to leave these fields the same if possible.

Basically, I have an appliance that writes its logs to a single file, and it writes 2 or 3 different types of events: System events, Admin events, and Realtime events, each with slightly different fields.
