Large JSON event: auto KV not extracting all fields

shannan2
Explorer

I have an event ingesting into Splunk via HEC that is around 13k characters, with approximately 260 fields in its JSON. Currently we do not see all of the fields being extracted by automatic KV at search time, and I do not want to make these indexed fields because doing so would greatly balloon the index size.

For some other large non-JSON events we have increased the limits.conf [kv] maxchars value up to 100000 so that key/value pairs are extracted as users expect in larger events. I figured this new JSON scenario was similar, so within the same stanza I have so far also increased: limit = 0 (unlimited), maxcols = 1024, avg_extractor_time = 3000, and max_extractor_time = 6000. After these updates I am still not seeing all fields extracted.
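For reference, a sketch of the [kv] stanza with the values described above, applied in limits.conf on the search tier:

# limits.conf
[kv]
# characters of _raw that auto kv will scan (raised from the default)
maxchars = 100000
# 0 = unlimited number of auto kv fields
limit = 0
maxcols = 1024
avg_extractor_time = 3000
max_extractor_time = 6000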

I also tried running spath on the entire _raw, which did not work, so I raised the limits.conf [spath] stanza to extraction_cutoff = 100000. Even then it did not extract everything when run against the whole _raw. I can call out a specific field with "| spath path=<field_name>", but I do not want to do that for 50+ fields, especially if more are added or removed later.
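For clarity, this is what was tried: the [spath] limit raised in limits.conf, and spath run with no arguments so that it attempts to extract every path from _raw (index and sourcetype are placeholders):

# limits.conf
[spath]
# extract-all spath mode only looks at the first N bytes of an event
extraction_cutoff = 100000

index=<index> sourcetype=<sourcetype>
| spath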

I have been trying to determine whether the issue is occurring on ingestion via HEC, but these are all search-time extractions, not indexed extractions.

 

Are there any other configurations around auto KV extraction that we should look into testing with an increased limit? For the best user experience I want this all to continue to happen automatically, rather than calling out many fields explicitly in an spath.


efavreau
Motivator

@shannan2 
First: Consider whether every character in that event is needed. Maybe pre-process to get the size down, log less, etc. Throwing the kitchen sink into your logs isn't good for many reasons.
Second: JSON is a verbose logging format. It has its fans, but in large amounts or with complex nesting it can be a problem. Consider plain key/value text pairs to keep it simple (see the short sketch after this list).
Last: Elsewhere in the community there are references to adjusting props.conf. You only mentioned limits.conf, so maybe that will help some.
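As a hypothetical illustration of the key/value suggestion above (field names and values are made up), the same data as plain text pairs, which search-time auto kv picks up without any KV_MODE setting:

status=200 duration_ms=41 user=alice region=us-east-1

versus the equivalent JSON:

{"status": 200, "duration_ms": 41, "user": "alice", "region": "us-east-1"}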
References:
https://community.splunk.com/t5/Knowledge-Management/What-is-the-maximum-length-of-a-tag-and-an-even...
https://community.splunk.com/t5/Getting-Data-In/Size-limit-for-an-event/m-p/16410

###

If this reply helps you, an upvote would be appreciated.

shannan2
Explorer

It seems that simply adding a props.conf on the SH cluster tier, in conjunction with my limits.conf changes, is allowing all fields to be automatically extracted at search time as I expect. I will need to test removing individual limits.conf stanza values to see whether any of those had an impact as well.

[<sourcetype>]
KV_MODE = json
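For context, a sketch of where that props.conf sits, assuming it is distributed to the search head cluster members from the deployer (the app name is a placeholder):

# on the deployer
$SPLUNK_HOME/etc/shcluster/apps/<app_name>/local/props.conf

# then push the configuration bundle to the cluster members
splunk apply shcluster-bundle -target https://<sh_member>:8089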

But yes, JSON is very "verbose" logging since it calls out field names and such. This team, though, is using HEC, which in general prefers JSON (if you use the /event endpoint; we don't want them using /raw and needing extractions there). HEC/JSON has allowed us to give the users some flexibility in choosing how and what they log.
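For illustration, a minimal sketch of sending one of these JSON events to the HEC /event endpoint (host, port, token, sourcetype, and fields are placeholders):

curl -k https://<splunk-host>:8088/services/collector/event \
  -H "Authorization: Splunk <hec-token>" \
  -d '{"sourcetype": "<sourcetype>", "event": {"field_1": "value_1", "field_2": "value_2"}}'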

These events are actually already pre-processed, which cuts the event count down greatly; they come from another system that ingests metrics from many, many sources, and we are using these datasets for machine learning. So we don't have much of an option for bringing down the event count or size.

This is also the only dataset like this that we work with; every other one is ~20 fields from similar use cases. But more than anything we needed to show that we could technically do this, even if it isn't ideal.
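As a quick sanity check while testing which limits.conf values matter, a search like this shows how many fields are actually being extracted at search time (index and sourcetype are placeholders):

index=<index> sourcetype=<sourcetype>
| fieldsummary
| stats count AS extracted_fields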
