Getting Data In

Ignoring massive amounts of data at index time

msarro
Builder

Hey everyone. We are working on taking in large amounts of CSV data. Each line of the CSV is a single event, and each line comprises about 270 fields. Currently only about 40 of those fields are useful. Right now our props.conf and transforms.conf are set up to index each field with a specific field name.

How would we best proceed to strip out the fields we don't need so they don't get indexed? It would be a substantial cost saving for us. I'd prefer not to write the world's nastiest regex, but if I have to I will.


dwaddle
SplunkTrust

If you are looking to strip "columns" out of the CSV data at index time, about the only way you'd be able to do it is with a SEDCMD. I'd wager this would be a nontrivial regex to write.
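For illustration only, a SEDCMD stanza along these lines could drop a run of unwanted columns by position; the sourcetype name and the column positions are assumptions, not something from this thread, and it breaks down if any field contains embedded commas inside quotes:

    [my_csv_sourcetype]
    # Keep the first 5 comma-separated fields, delete the next 10 (illustrative positions only)
    SEDCMD-drop_unused_columns = s/^(([^,]*,){5})([^,]*,){10}/\1/

With 270 columns and non-contiguous keepers, you would need several expressions like this, which is where the regex gets ugly.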

Maybe you could use a scripted input to read the CSV file and feed it through (say) Python's csv module to emit only the fields of interest? I think the biggest issue here would be keeping track of how much of the CSV file you have previously read/transformed/sent to Splunk. This could be easy, or it could require you to re-implement much of the tailing processor's functionality around file rotations and such.
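As a rough sketch of that idea (the file path and column names are made-up assumptions, and it deliberately ignores the "how far have I already read" problem mentioned above), something like this could sit behind a scripted input:

    #!/usr/bin/env python
    # Sketch of a scripted input: read a CSV and write only the wanted
    # columns to stdout, which Splunk then indexes as events.
    # INPUT_PATH and WANTED are illustrative placeholders, not real values.
    # This does not remember how much of the file earlier runs consumed.
    import csv
    import sys

    INPUT_PATH = "/var/data/feed.csv"            # hypothetical source file
    WANTED = ["timestamp", "user_id", "status"]  # the ~40 useful columns go here

    def main():
        with open(INPUT_PATH, newline="") as infile:
            reader = csv.DictReader(infile)
            # extrasaction="ignore" silently drops the columns we don't want
            writer = csv.DictWriter(sys.stdout, fieldnames=WANTED, extrasaction="ignore")
            writer.writeheader()
            for row in reader:
                writer.writerow(row)

    if __name__ == "__main__":
        main()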


msarro
Builder

Hm, so it does look like it will be a massive regex. What sort of overhead would a regex of that size incur? Sadly, one issue we'd run into with that approach is that our source file moves multiple times.
