
Why is my sourcetype configuration for JSON events with INDEXED_EXTRACTIONS making each extracted field multivalue with duplicate values?

Path Finder

I have a Python script configured as a data input that generates one JSON object per line containing events. This is how I configured props.conf for the source type:

[mysourcetype]
INDEXED_EXTRACTIONS = JSON
TIMESTAMP_FIELDS = date
TIME_FORMAT = %Y%m%d
TZ = UTC
detect_trailing_nulls = auto
SHOULD_LINEMERGE = false
description = My source type
pulldown_type = true
disabled = false

However, what is happening is as follows:
- Each event's _raw contains a valid JSON object, as expected.
- Every field of the JSON object was extracted under its own name, as expected.
- The event timestamp is correctly set to the date contained in the date field of the JSON object.
- Unexpectedly, all extracted fields are multivalued, each holding exactly two copies of the correct value from the JSON object.

Funnily enough, if I use KV_MODE = JSON instead of using INDEXED_EXTRACTIONS with the same data everything works perfectly.

Any ideas on what might be going on?

1 Solution

Path Finder

Found it. Inspired by the comments and answer provided by @dsdb_splunkadmin and @fdi01 I found the problem was that I was enabling index time extractions (via INDEXED_EXTRACTIONS) but not disabling search time extractions that happen by default (due to KV_MODE and AUTO_KV_JSON options). So both were occurring and generating duplicated extractions. 😞

This is what finally worked:

[mysourcetype]
INDEXED_EXTRACTIONS = JSON
TIMESTAMP_FIELDS = date
TIME_FORMAT = %Y%m%d
TZ = UTC
detect_trailing_nulls = auto
SHOULD_LINEMERGE = false
KV_MODE = none
AUTO_KV_JSON = false

Thanks everyone for their help.



Explorer

For the above Accepted Answer, I would point out:
I put the above configuration in my etc/system/local/props.conf for my Universal Forwarder installation.
I also needed to ensure that on my Splunk Cloud Light instance, for the source type "mysourcetype", the following properties were set (under "Advanced"):

INDEXED_EXTRACTIONS = json

KV_MODE = none

Explorer

In fact, the aforementioned two properties on the Splunk Cloud Light source type definition solved my duplication problem even without the addition of AUTO_KV_JSON on the forwarder side (had KV_MODE = none already in forwarder's config).


I am having a similar issue. However, I only see duplicates when I run a raw search and expand an event to look at all of its fields; when I print the field using the table command, I don't see any duplicate values. Is anyone aware of this behaviour and why it happens?


Motivator

Try it like this (the monitor settings belong in inputs.conf, the source type settings in props.conf):

# inputs.conf
[monitor://<path to JSON>/*.JSON]
sourcetype = JSON
index = name_your_index
disabled = false
crcSalt = <SOURCE>

# props.conf
[JSON]
INDEXED_EXTRACTIONS = JSON
TIMESTAMP_FIELDS = date
TIME_FORMAT = %Y%m%d
TZ = UTC
detect_trailing_nulls = auto
SHOULD_LINEMERGE = false
description = JSON
pulldown_type = true
KV_MODE = JSON

If that doesn't work, you can use the dedup command at search time to eliminate the duplicate values, and the mvexpand command to expand the multivalued fields.
For example:

your_base_search_JSON| spath | eval temp=mvzip(college,mvzip(mark,studentname,"#"),"#") | mvexpand temp |......

Path Finder

Thank you for mentioning the dedup, it's a valid workaround. But I'd rather import the data correctly in the first place.

However, if you keep both INDEXED_EXTRACTIONS and KV_MODE set to JSON I would expect to get duplicated values since Splunk would be extracting the fields both at index and at search time.

I too have this problem. Using Splunk Cloud, if I upload a JSON file with the following settings:

INDEXED_EXTRACTIONS = json
KV_MODE = none
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = true
TIMESTAMP_FIELDS = time
category = Structured
description = JavaScript Object Notation
disabled = false
pulldown_type = true

The data is imported correctly, no duplicate values. If I upload a file via a monitor on a Universal Forwarder with the following settings:

INDEXED_EXTRACTIONS = json
KV_MODE = none
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = true
TIMESTAMP_FIELDS = time

The value for each event is duplicated. If I change to have KV_MODE = json and reindex, it makes no difference for me, the values are still duplicated.

Path Finder

Interesting how you set KV_MODE = none, it hadn't occurred to me to do that. Reading http://docs.splunk.com/Documentation/Splunk/6.2.2/admin/Propsconf I noticed that KV_MODE defaults to auto and more importantly that AUTO_KV_JSON defaults to true.

In that case, it would make sense that Splunk would extract the fields both during index time and during search time, thus duplicating the values.
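
To make that reasoning concrete, here is a sketch of what the original stanza effectively amounts to once the two defaults quoted from the documentation are written out. The commented lines are not in the original props.conf; they only spell out the defaults mentioned above, and the remaining settings from the question are omitted for brevity.

```
[mysourcetype]
# Index-time extraction, set explicitly in the original stanza
INDEXED_EXTRACTIONS = JSON

# Search-time extraction, not set in the original stanza,
# so the documented defaults apply:
# KV_MODE = auto
# AUTO_KV_JSON = true
```

With both in effect, the same JSON fields would be extracted once at index time and again at search time, which matches the two copies of each value seen in the question.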

So maybe if I add both KV_MODE = none and AUTO_KV_JSON = false to the original props.conf file things will work as intended. I'll try this later, and if you could try it on your end as well we could confirm whether that is the problem.


Unfortunately, having both KV_MODE=none and AUTO_KV_JSON=false together in my props.conf did not fix the issue for me.

I will do some tests to ensure the props.conf on the Universal Forwarder is definitely being applied.

Fixed it.

In my case, I had to make sure that on the Splunk Cloud instance the same sourcetype was defined and also had KV_MODE = none .

I had defined the source type on my Universal Forwarder, but had not appreciated that some of the properties, like KV_MODE, are search-time properties, and hence have to be defined on the search instance (not just on the forwarder).

I didn't have to use the AUTO_KV_JSON = false setting in the end.
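
A minimal sketch of how the settings described in this thread could be split between the two tiers, assuming the source type is named mysourcetype and uses the timestamp settings from the accepted answer. The comments describing where each stanza lives are assumptions for illustration; on Splunk Cloud the search-time part was set through the source type's "Advanced" settings rather than by editing a file, and another poster also set INDEXED_EXTRACTIONS = json there.

```
# props.conf on the Universal Forwarder
# (structured, index-time extraction of the JSON fields)
[mysourcetype]
INDEXED_EXTRACTIONS = JSON
TIMESTAMP_FIELDS = date
TIME_FORMAT = %Y%m%d
TZ = UTC

# Source type definition on the search instance
# (turns off the search-time extraction that would duplicate the fields)
[mysourcetype]
KV_MODE = none
```

The point of the split is that KV_MODE is only consulted at search time, so setting it on the forwarder alone has no effect on what the search instance extracts.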

You put me on the right path though with the index-time vs. search-time double extraction - thanks!

Path Finder

Don't mention it. Actually thank you for guiding me to the right path by posting your example with KV_MODE = none in the first place. 🙂


In case it's important, I'm using Splunk Universal Forwarder 6.2.2 (build 255606)
