Solved: How best to test data input to determine source type?

jwalzerpitt · ‎12-11-2014

I am writing logs to a local file and was wondering what the best way is to determine the proper source type. Are there any best practices to follow?

I was planning on creating a test index, adding a data input, and then selecting the various source types until I find one that parses the logs as well as possible. The questions I have are as follows:

1) Is it best to go with a built-in source type that parses the logs as well as possible and then use the field extractor for the remaining fields I need? Or is there an alternative recommended way?

2) If none of the built-in source types work and I need to create a custom source type, are there any recommended best practices? In this case, I'm trying to wrap my head around logs that contain, say, 5 unique log formats and how best to create a custom source type that parses all 5 correctly.

Thx,
Jeff

jwalzerpitt · ‎12-12-2014

Martin,

As always, thx for the feedback.

Re: building a new sourcetype, I'm basically trying to figure out how to attack it. For example, I'm trying to break down a log file that has five unique formats, and I'm in the "Set Sourcetype" screen of the Add Data process. To break the file down as much as I can, I loaded it into Excel (it's not .csv or anything; I used 'Text to Columns' to help break it down), and apart from the time, there are five standard columns, like below:

hostname sshd[17058]: [ID 800047 auth.notice]
hostname sshd[17058]: [ID 649047 auth.info]
hostname sshd[17058]: [ID 800047 auth.notice]
hostname sshd[17058]: [ID 649047 auth.info]
hostname sshd[17058]: [ID 800047 auth.notice]
hostname sshd[17058]: [ID 649047 auth.info]
hostname sshd[17058]: [ID 800047 auth.notice]
hostname sshd[17058]: [ID 649047 auth.info]
hostname sshd[17058]: [ID 800047 auth.notice]
hostname sshd[17061]: [ID 649047 auth.info]
hostname sshd[17061]: [ID 800047 auth.notice]
hostname sshd[17061]: [ID 649047 auth.info]

To the left of the first column above is the time, and to the right of the last column above is the generic text message of each event.

Could I create a regex for these standard columns, and then extract fields after the fact from the message text (IP, user name, etc.), or is there a better way to parse the file by modifying the props/transforms conf files?

Thx

martin_mueller · ‎12-12-2014

That's the gist of schema-on-the-fly or search-time extractions, yes.

There obviously are exceptions, but in general all you need to worry about is when something happened, where in the log the next thing starts, and how to store it (index, sourcetype, host basically).

Extracting your ID, for example, happens at search time, and that's good in 99.98% of all cases.
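To make that split concrete, here is a rough Python sketch of the idea, not of how Splunk actually implements it. It assumes each event starts with a syslog-style timestamp such as "Dec 12 09:15:01" (the thread never shows the real time format), and the message text, the `break_events`, `event_time` and `search_msg_id` helpers, and the sample data are all made up for illustration. The "index-time" part only decides where events start and when they happened; the ID is pulled out of the raw text later, when someone searches for it.

```python
import re
from datetime import datetime

# ASSUMPTION: every event begins with a timestamp like "Dec 12 09:15:01".
# The real time format is not shown in the thread.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}")

def break_events(raw_text):
    """'Index time': start a new event at each line that begins with a timestamp."""
    events = []
    for line in raw_text.splitlines():
        if TIMESTAMP.match(line) or not events:
            events.append(line)
        else:
            # No timestamp at the start: treat it as a continuation line.
            events[-1] += "\n" + line
    return events

def event_time(event, year):
    """'Index time': work out when the event happened (syslog omits the year)."""
    match = TIMESTAMP.match(event)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(0)}", "%Y %b %d %H:%M:%S")

# 'Search time': the ID is not stored separately, it is extracted on demand.
MSG_ID = re.compile(r"\[ID (?P<msg_id>\d+) ")

def search_msg_id(events, msg_id):
    """Return the raw events whose extracted ID equals msg_id."""
    hits = []
    for event in events:
        match = MSG_ID.search(event)
        if match and match.group("msg_id") == msg_id:
            hits.append(event)
    return hits

# Hypothetical sample: prefix columns from the thread, invented times and messages.
SAMPLE = (
    "Dec 12 09:15:01 hostname sshd[17058]: [ID 800047 auth.notice] first message\n"
    "Dec 12 09:15:02 hostname sshd[17058]: [ID 649047 auth.info] second message\n"
    "    continuation line of the second message\n"
    "Dec 12 09:15:03 hostname sshd[17061]: [ID 800047 auth.notice] third message\n"
)
events = break_events(SAMPLE)
```

If the ID pattern turns out to be wrong, only `MSG_ID` changes and nothing has to be re-ingested, which is the practical upside of doing that part at search time.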

martin_mueller · ‎12-12-2014

Those are two entirely different kettles of fish. While setting the sourcetype in the add data process you have to set index-time configuration, mostly timestamping and event breaking.

Field extraction can be configured later and is run at search time. After indexing, you can, for example, use the Patterns tab on the search page to identify common patterns (a 6.2 feature).

...and yeah, you can create any number of regular expressions for one sourcetype. Each one may or may not match a given event, and it extracts fields only where it does.
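As a rough illustration of several regular expressions living side by side, the Python sketch below runs one pattern for the standard columns from the sample lines and two more for the message text. The message patterns, the field names, and the `extract_fields` helper are hypothetical, since the thread does not show the actual message text. Python writes named groups as `(?P<name>...)`, so the syntax would need adapting before reuse elsewhere.

```python
import re

# Standard columns as they appear in the sample lines from this thread.
PREFIX = re.compile(
    r"(?P<host>\S+) (?P<process>\w+)\[(?P<pid>\d+)\]: "
    r"\[ID (?P<msg_id>\d+) (?P<facility>\w+)\.(?P<severity>\w+)\]"
)

# HYPOTHETICAL message patterns: the real message text is not in the thread.
IPV4 = r"\d{1,3}(?:\.\d{1,3}){3}"
ACCEPTED = re.compile(
    r"Accepted \w+ for (?P<user>\S+) from (?P<src_ip>" + IPV4 + r")"
)
FAILED = re.compile(
    r"Failed \w+ for (?:invalid user )?(?P<user>\S+) from (?P<src_ip>" + IPV4 + r")"
)

# Any number of extractions can be attached to the same kind of event.
EXTRACTIONS = [PREFIX, ACCEPTED, FAILED]

def extract_fields(event):
    """Run every pattern; patterns that do not match simply add nothing."""
    fields = {}
    for pattern in EXTRACTIONS:
        match = pattern.search(event)
        if match:
            fields.update(match.groupdict())
    return fields

login = extract_fields(
    "hostname sshd[17058]: [ID 800047 auth.notice] "
    "Accepted password for jeff from 10.1.2.3 port 22 ssh2"
)
other = extract_fields("hostname sshd[17061]: [ID 649047 auth.info] some other text")
```

Here `login` ends up with both the column fields and `user`/`src_ip`, while `other` only gets the column fields because neither message pattern matches it.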

jwalzerpitt · ‎12-12-2014

Thx for that explanation, as it really helps me understand the true function of the add data process. I was under the assumption that, in addition to setting the timestamp and event breaking, it was meant to extract fields as well.

So basically, as long as timestamps and event breaking are correct when adding data, I can then work on field extraction during searches, correct?

martin_mueller · ‎12-11-2014

The best start is to upload a sample log file into the data preview, which I think you're already doing. Indexing sample data into a testing index is a good approach as well.

As for your specific questions, I'd recommend using built-in sourcetypes if you really are dealing with that type of source (duh) and the sourcetype is specific enough. Some examples:

If you're dealing with Apache's combined log format, use the pre-built access_combined sourcetype. It's exactly the kind of data, and quite specific.

If you're dealing with CSV files, start with the pre-built csv sourcetype and adapt it to your needs, for example by setting a timestamp format and field, then save it as your own. It's exactly the kind of data, but not specific enough.

If you're dealing with a custom log that kinda looks like a pre-built sourcetype from an entirely different system, don't use the built-in sourcetype. You can of course look at its settings for inspiration, but do build your own sourcetype. Else this will end in tears, for example if you later add that other system to your Splunk as well, or if the built-in sourcetype turns out to not be quite precise enough.
