Getting Data In

Should I "normalize" data prior to indexing?

msutfin1
New Member

I have the opportunity to pull in some ticket system data and create some statistics / visualizations. The data consists of many “categories”. However, there are some details in the SUMMARY field that keep me from grouping/counting, etc., by SUMMARY, as the SUMMARY value is unique in the last couple of characters. Here’s a sample of the SUMMARY field data:

Pastebin extraction fn:23l4dixr
Pastebin extraction fn:xx3l9dib
Pastebin extraction fn:dk244diL

I would like to group/count by "Pastebin extraction". My first attempt (successful) was to build regexes that I applied to the file BEFORE pulling it into Splunk, which remove the unique fn:xxxxxxxx at the end of the SUMMARY field. I then created a separate index and pulled the data in using the CSV sourcetype. Thanks to the column headers, it appears Splunk had no issues parsing the field data. This allowed me to group/count, which was a good learning experience in and of itself. But now I have no details if I need them.

It seems that most folks likely don’t massage data prior to a forwarder picking up the data. Perhaps then, the normalization, if you will, occurs just prior to indexing? Or perhaps during query? Maybe it’s possible either way?

At any rate, I’d appreciate a breadcrumb / link to some reading on how to remove the pre-processing step and perform this work a bit further down the line.

Is learning to properly use props.conf and transforms.conf my only (or best) approach?

What if I want to retain the unique details “just-in-case” and don’t want it removed prior to indexing?

Apologies if my terminology is not up to snuff.. just getting started with Splunk.

Thanks,
Sudsy


DalJeanis
SplunkTrust

Setting aside the rest of your question, and the discussion about indexing volumes: you can use a regex to extract the fn:23l4dixr portion into a different field at index time or search time, your choice. Why wouldn't you?
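For example, something like this at search time would do it (assuming the field is literally named SUMMARY; summary_base and fn_id are just placeholder names I made up):

    ... | rex field=SUMMARY "^(?<summary_base>.+)\s+fn:(?<fn_id>\S+)$"
        | stats count by summary_base

rex does not touch the original SUMMARY value, so the unique fn: detail is still there "just-in-case", while summary_base gives you something to group and count by.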


msutfin1
New Member

That would certainly allow running stats on the leading portion of the field. I will begin looking for documentation, or perhaps a tutorial, on how one does that.

These are summary fields from a ticketing system, so each "type" of ticket (the above being one example) has something that makes it unique (an incident number, a vuln assessment tag, the internal team that initiated it).

Fortunately, these unique portions of each summary appear in the same position and have a distinct format, so writing a regex for each type of field should be straightforward once I learn how to perform that "separation" at index time.

It will be a bit time-consuming, as there are 300-400 summary "templates". Once done, though, the obligatory maintenance as new types of tickets are added and old types are deprecated shouldn't be unbearable.
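If I'm reading the docs right so far, a search-time props.conf along these lines might cover a couple of the templates (the sourcetype ticket_csv and the field names are placeholders, and the incident pattern is invented just for the example):

    # props.conf -- one EXTRACT per summary "template" (hypothetical sketch)
    [ticket_csv]
    EXTRACT-pastebin = ^(?<summary_base>Pastebin extraction)\s+fn:(?<fn_id>\S+)$ in SUMMARY
    EXTRACT-incident = ^(?<summary_base>.+)\s+INC(?<incident_id>\d+)$ in SUMMARY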

Thanks much..
Mark


woodcock
Esteemed Legend

The Splunk philosophy is to send data in exactly as it is. When you need to schematize, do it at search time. If that is too slow, then normalize everything at search time, pull it into a datamodel using eventtypes and tags, accelerate that, and use tstats. That is pretty much what the CIM is:

Read the “Use the CIM to normalize data at search time” documentation page:
http://docs.splunk.com/Documentation/CIM/latest/User/UsetheCIMtonormalizedataatsearchtime

Read the “Use the CIM to normalize OSSEC data” documentation page.
Most of the time, maybe always, there will be an app to help you map a sourcetype into a datamodel, but sometimes we have to do this ourselves. Even if you don’t have to, this page is both very short and highly educational, so it is well worth the time. It shows a minimal configuration that allows you to use your sourcetypes with the CIM datamodels:
http://docs.splunk.com/Documentation/CIM/4.8.0/User/UsetheCIMtonormalizeOSSECdata
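As a rough sketch of that flow (every name below -- the sourcetype, eventtype, tag, datamodel, and fields -- is made up for illustration, not taken from your environment):

    # eventtypes.conf -- hypothetical eventtype for one ticket template
    [pastebin_extraction]
    search = sourcetype=ticket_csv SUMMARY="Pastebin extraction*"

    # tags.conf -- tag the eventtype so a datamodel can pick it up
    [eventtype=pastebin_extraction]
    ticket = enabled

    # after building and accelerating a hypothetical "Ticketing" datamodel on top of that:
    | tstats count from datamodel=Ticketing.Tickets by Tickets.summary_base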


ddrillic
Ultra Champion

-- It seems that most folks likely don’t massage data prior to a forwarder picking up the data. Perhaps then, the normalization, if you will, occurs just prior to indexing? Or perhaps during query? Maybe it’s possible either way?

You are absolutely right - most folks likely don’t massage data prior to a forwarder picking up the data.
Then, when they hit the license limit at 100TB+, they wonder what went wrong. Splunk, as a company, chose to encourage us to stream data in as is and to worry about normalization, validation, and schema association a bit later. I'm not clear why...

Yesterday, I attended a demo of the open source competitor, Graylog, which encourages you to do the exact opposite. So, I guess the right answer might be somewhere in between. Maybe we should stream data as is into dev, understand it, handle it, and once everything is normalized, validated, etc., stream it to production...


niketn
Legend

@msutfin1, Splunk reads time-series data, so the most important things for you to tell Splunk about your data are (1) how to identify the timestamp and (2) how to break events.

You do this through either a built-in sourcetype (for industry-standard logs already defined in Splunk) or a custom sourcetype (for your custom logs or use case). The sourcetype provides Splunk with a "schema on the fly": fields are extracted, transformed, aliased, and calculated based on which sourcetype they belong to. In other words, you should definitely become well versed with props.conf and transforms.conf.
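For illustration, here is a search-time extraction sketch showing how props.conf and transforms.conf work together (the sourcetype ticket_csv and the field names are assumptions based on your sample, not a definitive configuration):

    # props.conf -- point the sourcetype at a transform (hypothetical names)
    [ticket_csv]
    REPORT-summary_fields = ticket_summary_fields

    # transforms.conf -- split SUMMARY into a base string and the unique id
    [ticket_summary_fields]
    SOURCE_KEY = SUMMARY
    REGEX = ^(.+)\s+fn:(\S+)$
    FORMAT = summary_base::$1 fn_id::$2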

Even if you do not define the timestamp and event breaks, in most cases Splunk's automatic/default logic does the job for you. But if it fails, data might not be indexed as you expect. So it is always best to take some sample logs in a file, upload them to a test/POC Splunk machine, and confirm in data preview mode that the data is getting indexed the way you expect.
Refer to some of the documentation:
Configure event line breaking: https://docs.splunk.com/Documentation/Splunk/latest/Data/Configureeventlinebreaking
Get the tutorial data into Splunk: https://docs.splunk.com/Documentation/Splunk/latest/SearchTutorial/GetthetutorialdataintoSplunk

Having said that, ideally Splunk should not drop the final piece of your event by default unless there is a newline character before it (Splunk uses the newline regex ([\r\n]+) as the default LINE_BREAKER, as you might have seen in props.conf). So you might need to define event line breaking properly. For us to assist you, you might have to post complete sample events (mock/anonymize sensitive information in your data before posting).
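A minimal props.conf sketch for the timestamp and line-breaking part could look like this (the sourcetype name and TIME_FORMAT are assumptions, since we have not seen your raw events):

    # props.conf -- hypothetical sourcetype; adjust to match your actual timestamps
    [ticket_csv]
    SHOULD_LINEMERGE = false
    LINE_BREAKER = ([\r\n]+)
    TIME_PREFIX = ^
    TIME_FORMAT = %Y-%m-%d %H:%M:%S
    MAX_TIMESTAMP_LOOKAHEAD = 19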

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

gcusello
SplunkTrust

Hi msutfin1,
usually you don't need to normalize your logs before indexing.
It can be useful to transform some logs (if your requirements permit modifying them) when you have specific needs (e.g. masking values such as passwords or credit card numbers) or when the format of your logs is variable and sometimes wrong (e.g. I receive logs from multiple sources and sometimes some of them have a wrong date format).
Usually you can extract your fields using regexes.
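For example, masking can be done at parsing time with a SEDCMD in props.conf; this is only a sketch, and the sourcetype name and card-number pattern are assumptions:

    # props.conf -- hypothetical masking rule
    [ticket_csv]
    SEDCMD-mask_cc = s/\d{4}-\d{4}-\d{4}-(\d{4})/XXXX-XXXX-XXXX-\1/g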
I hope this is useful for you.
Bye.
Giuseppe
