Splunk Search

What are the possible gains from an index-time extraction of a large JSON log?

Builder

All,

I have a JSON log coming in from Akamai. 99% of searches against this data use the field cliIP (e.g. "cliIP":"1.2.3.4"). Mind you, it's a dump from a cloud service, so there is no host field right now.

Given that, it stands to reason that we should give that field some sort of priority in the index. My understanding is that an index-time extraction is a solution for this?
1) Thoughts on that?
2) How would I build an index-time extraction against JSON? I'm worried there is some special option I'll miss.


Re: What are the possible gains from an index-time extraction of a large JSON log?

SplunkTrust

For JSON, I'd recommend enabling INDEXED_EXTRACTIONS=json in props.conf, which gives you automatic index-time fields.

http://docs.splunk.com/Documentation/Splunk/6.4.0/Data/Extractfieldsfromfileswithstructureddata
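
A minimal props.conf sketch of what that looks like (the sourcetype name akamai:json is a placeholder - use whatever your data arrives as):

```ini
# props.conf (sketch -- "akamai:json" is a placeholder sourcetype)
[akamai:json]
INDEXED_EXTRACTIONS = json
# With fields already written at index time, you can disable the
# duplicate search-time JSON extraction to avoid doubled-up fields:
KV_MODE = none
AUTO_KV_JSON = false
```

Note that for monitored files, INDEXED_EXTRACTIONS is applied at the first parsing point (typically the forwarder), so the setting belongs there rather than only on the indexers.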


Re: What are the possible gains from an index-time extraction of a large JSON log?

Builder

Sorry, I am not following the documentation very well. Does this turn every value into an index-time extraction?


Re: What are the possible gains from an index-time extraction of a large JSON log?

SplunkTrust

In an automated way, yes.


Re: What are the possible gains from an index-time extraction of a large JSON log?

Builder

Wouldn't making every field an index-time extraction be a really big performance hit?


Re: What are the possible gains from an index-time extraction of a large JSON log?

Builder

I wouldn't suggest turning on indexed extractions in production without testing the effect on index size and real-world performance. Testing with a JSON audit log I had available (containing 10 fields), indexed extractions doubled the storage cost, and I imagine it could cost much more. It may not be worth the extra storage, because searches on IP addresses should be fairly efficient to begin with.
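
One way to quantify the storage cost before committing (a sketch - the index names are placeholders): index the same sample of data into two test indexes, once with and once without INDEXED_EXTRACTIONS, then compare their sizes with dbinspect:

```
| dbinspect index=test_raw index=test_indexed
| stats sum(sizeOnDiskMB) by index
```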


Re: What are the possible gains from an index-time extraction of a large JSON log?

SplunkTrust

The indexing performance hit is not that bad; after all, it's only one (complicated) extraction running, not hundreds for every imaginable field in your data.

There will be some space consumed, of course; how much depends on your data. Based on my limited use, it's not too bad. Search-time speed certainly makes up for it - you can skip building an accelerated data model for many use cases, for example.
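
For example, with cliIP as an indexed field, tstats can answer common counting questions straight from the index without touching raw events (a sketch - the index name is a placeholder):

```
| tstats count where index=akamai by cliIP
| sort - count
```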


Re: What are the possible gains from an index-time extraction of a large JSON log?

@martin_mueller,

I plan to run a process on a remote computer that:

  1. Parses a proprietary-format binary log file containing thousands of events
  2. Converts those events into JSON
  3. Sends them via TCP to Splunk

I'm already doing this successfully on a small scale (with a few events). Currently, I'm using KV_MODE=json to perform search-time extraction, but I think you're recommending INDEXED_EXTRACTIONS=json instead, to perform index-time extraction, right?

I'm very curious about this: the choice has been on my mind, too. I'd be very interested to hear from anyone in a similar situation. I'm concerned not just about index size, but also about whether, with the extra processing index-time extraction introduces, I'll need to load balance across more indexers to handle the incoming TCP stream of thousands of events.
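
For reference, the two alternatives under discussion as props.conf stanzas (a sketch - the sourcetype name my:json is a placeholder, and you would use one stanza or the other, not both):

```ini
# Alternative A: search-time extraction (current setup).
# Cheap to index; JSON is parsed on every search.
[my:json]
KV_MODE = json

# Alternative B: index-time extraction (proposed).
# Fields are written to the index; larger on disk, faster to search.
[my:json]
INDEXED_EXTRACTIONS = json
KV_MODE = none
```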


Re: What are the possible gains from an index-time extraction of a large JSON log?

SplunkTrust

Indexers are typically busier running searches than indexing... so a little more indexing load that potentially takes a lot off the search load can actually save net indexer capacity.

How this tradeoff turns out depends on your environment, search load, and data. I'd just run it and watch what happens.
