We've been acquiring data for some time now via manual CSV imports. We're finishing up the process of automating that by importing JSON on a cron schedule. So far, it's been going great. Today, we hit a snag.
We have a source with multiple date or date-time fields in it, so to make sure the right field gets used as the timestamp, we created a new sourcetype called dateTimeJSON that sets TIMESTAMP_FIELDS to "DateDeleted", the field we're looking for.
If we search the index, we see the data and the correct event count (13 in this test case). However, when we look at the data in a table, each field has two values in it, which doubles the results in all of our searches and dashboards. Here's what we see from a search as simple as "index=koha_dcards":
What the heck can even cause something like this? How do we rectify it? None of our other indices and data inputs have ever had a problem like this, and we've had to specify fields in a sourcetype before without issue.
You can certainly try setting INDEXED_EXTRACTIONS to none. That may affect other things you're configuring, though, like TIMESTAMP_FIELDS: if those fields aren't extracted when the data is ingested into Splunk, you probably can't reference DateDeleted as the timestamp field the way you're doing now. INDEXED_EXTRACTIONS is sort of an easy button for parsing structured data as it comes into Splunk.
The other option would be to set AUTO_KV_JSON to false for this sourcetype in props.conf. That just turns off the JSON extraction that tries to run at search time.
[dateTimeJSON]
AUTO_KV_JSON = false
If you don't yet understand the difference between parse time and search time, I'd suggest reading this wiki article.
Not sure if something is different with 8.x, but this typically means that KV_MODE=json is somehow set for this sourcetype. The fields get indexed via the INDEXED_EXTRACTIONS setting, but then also extracted again at search time by the KV_MODE setting.
Maybe at least rule it out with btool on your search head?
splunk btool props list dateTimeJSON --debug
And if not set on the sourcetype, maybe make sure it's not set in props for the source/host either?
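To check the source- and host-based stanzas in one pass, one approach (a sketch; paths and output format depend on your install) is to dump all resolved props settings and filter for the setting in question:

```
splunk btool props list --debug | grep -i kv_mode
```

Any line that comes back from a local or app-level props.conf, whether under a [sourcetype], [source::...], or [host::...] stanza, would be a candidate for the duplicate search-time extraction.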
Here's the results of the btool command; I don't know how to parse this, hope you can let me know if there's anything pertinent:
/Applications/Splunk/etc/apps/search/local/props.conf    [dateTimeJSON]
/Applications/Splunk/etc/system/default/props.conf       ADD_EXTRA_TIME_FIELDS = True
/Applications/Splunk/etc/system/default/props.conf       ANNOTATE_PUNCT = True
/Applications/Splunk/etc/system/default/props.conf       AUTO_KV_JSON = true
/Applications/Splunk/etc/system/default/props.conf       BREAK_ONLY_BEFORE =
/Applications/Splunk/etc/system/default/props.conf       BREAK_ONLY_BEFORE_DATE = True
/Applications/Splunk/etc/system/default/props.conf       CHARSET = UTF-8
/Applications/Splunk/etc/apps/search/local/props.conf    DATETIME_CONFIG =
/Applications/Splunk/etc/system/default/props.conf       DEPTH_LIMIT = 1000
/Applications/Splunk/etc/system/default/props.conf       HEADER_MODE =
/Applications/Splunk/etc/apps/search/local/props.conf    INDEXED_EXTRACTIONS = json
/Applications/Splunk/etc/system/default/props.conf       LEARN_MODEL = true
/Applications/Splunk/etc/system/default/props.conf       LEARN_SOURCETYPE = true
/Applications/Splunk/etc/apps/search/local/props.conf    LINE_BREAKER = ([\r\n]+)
/Applications/Splunk/etc/system/default/props.conf       LINE_BREAKER_LOOKBEHIND = 100
/Applications/Splunk/etc/system/default/props.conf       MATCH_LIMIT = 100000
/Applications/Splunk/etc/system/local/props.conf         MAX_DAYS_AGO = 5000
/Applications/Splunk/etc/system/default/props.conf       MAX_DAYS_HENCE = 2
/Applications/Splunk/etc/system/default/props.conf       MAX_DIFF_SECS_AGO = 3600
/Applications/Splunk/etc/system/default/props.conf       MAX_DIFF_SECS_HENCE = 604800
/Applications/Splunk/etc/system/default/props.conf       MAX_EVENTS = 256
/Applications/Splunk/etc/system/default/props.conf       MAX_TIMESTAMP_LOOKAHEAD = 128
/Applications/Splunk/etc/system/default/props.conf       MUST_BREAK_AFTER =
/Applications/Splunk/etc/system/default/props.conf       MUST_NOT_BREAK_AFTER =
/Applications/Splunk/etc/system/default/props.conf       MUST_NOT_BREAK_BEFORE =
/Applications/Splunk/etc/apps/search/local/props.conf    NO_BINARY_CHECK = true
/Applications/Splunk/etc/system/default/props.conf       SEGMENTATION = indexing
/Applications/Splunk/etc/system/default/props.conf       SEGMENTATION-all = full
/Applications/Splunk/etc/system/default/props.conf       SEGMENTATION-inner = inner
/Applications/Splunk/etc/system/default/props.conf       SEGMENTATION-outer = outer
/Applications/Splunk/etc/system/default/props.conf       SEGMENTATION-raw = none
/Applications/Splunk/etc/system/default/props.conf       SEGMENTATION-standard = standard
/Applications/Splunk/etc/system/default/props.conf       SHOULD_LINEMERGE = True
/Applications/Splunk/etc/apps/search/local/props.conf    TIMESTAMP_FIELDS = DateDeleted
/Applications/Splunk/etc/system/default/props.conf       TRANSFORMS =
/Applications/Splunk/etc/system/default/props.conf       TRUNCATE = 10000
/Applications/Splunk/etc/apps/search/local/props.conf    category = Structured
/Applications/Splunk/etc/apps/search/local/props.conf    description = Get the right date from record with multiple dates included.
/Applications/Splunk/etc/system/default/props.conf       detect_trailing_nulls = false
/Applications/Splunk/etc/apps/search/local/props.conf    disabled = false
/Applications/Splunk/etc/system/default/props.conf       maxDist = 100
/Applications/Splunk/etc/system/default/props.conf       priority =
/Applications/Splunk/etc/apps/search/local/props.conf    pulldown_type = 1
/Applications/Splunk/etc/system/default/props.conf       sourcetype =
Yeah, I would suggest setting AUTO_KV_JSON = false for your sourcetype as well.
From the docs....
AUTO_KV_JSON = <boolean>
* Used for search-time field extractions only.
* Specifies whether to try json extraction automatically.
* Default: true
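Putting that together with the local settings visible in your btool output, the stanza in etc/apps/search/local/props.conf would look something like this (a sketch based on the thread, not a tested config; since AUTO_KV_JSON is search-time only, it takes effect on the search head without re-indexing):

```
[dateTimeJSON]
AUTO_KV_JSON = false
DATETIME_CONFIG =
INDEXED_EXTRACTIONS = json
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
TIMESTAMP_FIELDS = DateDeleted
category = Structured
description = Get the right date from record with multiple dates included.
disabled = false
pulldown_type = 1
```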
Does the raw data in the source CSV file (assuming you have access to it) have single values? If yes, then it looks like field extraction is being done twice for that sourcetype. As @richgalloway suggested, find all props.conf stanzas set up for that sourcetype, so we can see if there are any duplicate field-extraction configurations (it could be the case that both index-time and search-time field extraction are set up).
Yes, all the CSV data, which goes back a year, has single entries. Only the JSON we input today is duplicated. I was only able to find the one props.conf with [dateTimeJSON] in it, in etc/apps/search/local/.
I should also mention: this is Splunk 8.0.1.
I have the same issue when it comes to JSON from our Azure blob.
_raw shows only one value per field, but the extracted fields have double values for a single event. Only when I eval the field to a renamed one do my reports drop the duplicate values. I only have one stanza per sourcetype as well.
[dateTimeJSON]
DATETIME_CONFIG =
INDEXED_EXTRACTIONS = json
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
TIMESTAMP_FIELDS = DateDeleted
category = Structured
description = Get the right date from record with multiple dates included.
disabled = false
pulldown_type = 1
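Until the config itself is fixed, the eval/rename trick can be sketched in SPL with the multivalue functions: mvdedup collapses a field whose two values are identical duplicates back to one. The field name here is the one from this thread; adjust to your own:

```
index=koha_dcards
| eval DateDeleted=mvdedup(DateDeleted)
```

If many fields are affected, a foreach template along the lines of `| foreach * [ eval <<FIELD>> = mvdedup('<<FIELD>>') ]` applies the same fix to every field, though fixing the duplicate extraction in props.conf is the cleaner solution.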