I'm looking for a way to split a JSON array into multiple events, but it keeps getting indexed as a single event.
I've tried using various parameters in props.conf, but none of them seem to work.
Does anyone know how to split the array into separate events based on my condition? I want it to be indexed as two separate events.
JSON string:
Splunk Search Head:
Hey, you can try these settings:
[<SOURCETYPE NAME>]
CHARSET = UTF-8
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)\s*{(?=\s*"attribute":\s*{)
TRUNCATE = 0
INDEXED_EXTRACTIONS = JSON
TIME_PREFIX = "date":\s*"
NOTE:
* When 'INDEXED_EXTRACTIONS = JSON' for a particular source type, do not also set 'KV_MODE = json' for that source type. This causes the Splunk software to extract the JSON fields twice: once at index time, and again at search time.
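If you'd rather extract at search time only, a minimal sketch would be to leave INDEXED_EXTRACTIONS out of the stanza above and instead set, on the search head:
[<SOURCETYPE NAME>]
KV_MODE = json
Just don't use both at once, for the reason in the note above.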
Trying to fiddle with structured data by means of simple regexes is doomed to cause problems sooner or later. You have a single JSON array. If you want to split it into separate items, you should use an external tool (or force your source to log separate events).
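For example, a minimal sketch of such an external step (the file names are just placeholders) could be a few lines of Python that rewrite the array as newline-delimited JSON, one object per line:

#!/usr/bin/env python3
# Minimal sketch: rewrite a JSON array as newline-delimited JSON so that
# each array element becomes its own line (and therefore its own event).
import json

with open("cases.json", encoding="utf-8") as src:
    records = json.load(src)                      # the whole array

with open("cases.ndjson", "w", encoding="utf-8") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")      # one compact object per line

With one object per line you no longer need any lookahead regex; LINE_BREAKER = ([\r\n]+) plus KV_MODE = json (or INDEXED_EXTRACTIONS = json) is enough.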
Hi @ws
You need to set up the line breaker to distinguish between different events starting with the "attribute" key.
== props.conf ==
[yourSourcetype]
SHOULD_LINEMERGE=false
TRUNCATE = 100000
LINE_BREAKER=([\r\n]+)\s*{(?=\s*"attribute":\s*{)
Note: TRUNCATE can be a high number but should ideally NOT be 0!
Hi @kiran_panchavat,
I noticed that the sample provided aligns with what I’m trying to achieve. However, after applying the same settings for testing, I’m still not getting the same results as you.
I've attached a screenshot for your reference; please help me spot any mistakes or adjustments that may be needed.
I don’t believe the issue lies with the transforms.conf configuration.
JSON file:
[
  {
    "attribute":{
      "type": "case"
    },
    "Id": "I0000005",
    "name": "ws",
    "email": "ws@gmail.com",
    "case type__c": "Service Case",
    "date": "17/4/2025",
    "time": "16:15",
    "account":{
      "attribute": {
        "type": "account"
      },
      "Id": "I0000005"
    }
  },
  {
    "attribute":{
      "type": "case"
    },
    "Id": "I0000006",
    "name": "thomas",
    "email": "thomas@gmail.com",
    "case type__c": "Transaction Case",
    "date": "17/4/2025",
    "time": "16:15",
    "account":{
      "attribute": {
        "type": "account"
      },
      "Id": "I0000006"
    }
  }
]
Search Head:
Props.conf
Transforms.conf
Hi @ws
Can you confirm where you applied those props/transforms and what your architecture looks like? They need to be applied to either the HF or the indexers, depending on where the data lands.
Hi @livehybrid,
For testing purposes, my architecture is an all-in-one setup.
For my actual deployment, to my understanding the props.conf and transforms.conf will be on my HF, right? Since the pulled JSON file lands on my HF's local server.
Hi @ws
For an all-in-one setup you are good, and yes, once you're ready to deploy you would put this on your HF.
One thing I've just noticed, which I missed before, is that you are changing the sourcetype. The second set of props probably isn't applying to the new sourcetype name (you can't have two bites of the same cherry...), so try applying the event-breaking props to the original sourcetype in the [preprocess_case] stanza, along the lines of the sketch below.
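Something like this rough sketch is what I mean (the transform name set_case_sourcetype is just an example; use whatever your transforms.conf stanza is actually called):
== props.conf ==
[preprocess_case]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)\s*{(?=\s*"attribute":\s*{)
TRUNCATE = 100000
TRANSFORMS-sourcetype = set_case_sourcetype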
@livehybrid, just to add on after getting the data in: if the JSON file in the monitored folder is updated, for example with appended records, do you happen to know why it indexes the whole JSON file again rather than just the newly appended records, since the name of the JSON file remains the same?
Hi @ws
Is the full path for the JSON file the same each time you have indexed it? If the path is different, this might explain why it has been indexed twice instead of just continuing from where you last ingested.
I'm pleased that you were able to get your events split out!
Thanks
Will
Hi @livehybrid,
Previously, I was monitoring a folder path. However, after making some adjustments, I’ve switched to monitoring a specific file instead, since the name, type, and path will always remain consistent.
Now, I'm encountering an issue where the same data gets indexed multiple times whenever the JSON file is pulled from the FTP server.
Each time the JSON file is retrieved and placed on my local Splunk server, it overwrites the existing file.
I’ve tried using initCrcLength and crcSalt, but they don’t seem to prevent the duplication, as Splunk still indexes it as new data.
Additionally, I checked the sha256sum of the JSON file after it's pulled onto my local Splunk server. The hash value changes before and after the new data overwrites the file. I'm not entirely sure how Splunk computes the CRC of the file's first 256 bytes for comparison.
1:
2217ee097b7d77ed4b2eabc695b89e5f30d4e8b85c8cbd261613ce65cda0b851 /home/ws/logs/cpf_case_final.json
2:
45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd /home/ws/logs/cpf_case_final.json
Salting the CRC is very rarely the way to go. Usually it's about the initCrcLength value: if your files contain a long "header" which is constant between files, you need to raise it.
But.
Problems with CRC-based deduplication manifest themselves as the opposite of what you're getting: data _not_ being indexed at all because Splunk considers two files the same, not data being indexed multiple times.
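If you do want to experiment with it anyway, it is a single setting on the monitor stanza in inputs.conf, roughly like this (the path is taken from your sha256sum output; the value 1024 is just an illustration, the default is 256 bytes):
[monitor:///home/ws/logs/cpf_case_final.json]
initCrcLength = 1024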
Is there any solution to what I'm facing?
Here's what I’ve tested so far.
1: WinSCP uploads file.json to the FTP server → Splunk local server retrieves the file to a local directory → Splunk reads and indexes the data.
sha256sum /splunk_local/file.json
45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd
2: Deleted file.json from the FTP server → Used WinSCP to re-upload the same file.json → Splunk local server pulled the file to the local directory → Splunk did not index the file.json
sha256sum /splunk_local/file.json
45b01fabce6f2a75742c192143055d33e5aa28be3d2c3ad324dd2e0af5adf8dd
3: WinSCP overwrote file.json on the FTP server with a version containing both new and existing entries → Splunk local server pulled the updated file to the local directory → Splunk re-read and re-indexed the entire file, including previously indexed data
sha256sum /splunk_local/file.json
2217ee097b7d77ed4b2eabc695b89e5f30d4e8b85c8cbd261613ce65cda0b851
I noticed that the SHA value only changes when a new entry is added to the file, as seen in scenario 3. However, in scenarios 1 and 2, the SHA value remains the same—even if I delete and re-upload the exact same file to the FTP server and pull it into my local Splunk server.
And yes, I'm pulling the file from the FTP server into my local Splunk server, where the file is being monitored.
Splunk (monitor input to be precise) doesn't care about the checksum of the whole file. It is obvious that the hash of the whole file will change as soon as _anything_ changes within the file. Whether it is a complete rewrite of the whole file contents or just adding a single byte at the end - the hash will change.
The monitor input stores some values regarding the state of the file. It stores the initCrc value, which will obviously change if the beginning of the file is overwritten (and whose length can be adjusted in settings). But it also stores the seekCrc, which is a checksum of the last-read 256 bytes (along with the position of those 256 bytes within the file). I suppose in your case the file ends by closing the JSON array, but after a subsequent "append" the array itself is extended: its closing bracket is removed, another JSON structure is added, and the array is closed in a new place.
Unfortunately, you can't do much about it. As I said before, you'd be best off scripting some external solution to read that array and dump its contents in a sane manner to another file for Splunk to read.
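As a rough sketch of what I mean (the .ndjson and state file names, and using "Id" as the uniqueness key, are assumptions for illustration only), something along these lines could run after each FTP pull, with Splunk monitoring the .ndjson file instead of the raw array:

#!/usr/bin/env python3
# Rough sketch: read the pulled array and append only records not seen
# before to a newline-delimited file that Splunk monitors instead.
import json
from pathlib import Path

SOURCE = Path("/home/ws/logs/cpf_case_final.json")    # file pulled from FTP
OUTPUT = Path("/home/ws/logs/cpf_case_final.ndjson")  # file Splunk monitors
STATE = Path("/home/ws/logs/seen_ids.txt")            # Ids already written

records = json.loads(SOURCE.read_text(encoding="utf-8"))
seen = set(STATE.read_text().splitlines()) if STATE.exists() else set()

new_ids = []
with OUTPUT.open("a", encoding="utf-8") as out:
    for record in records:
        record_id = record.get("Id")
        if record_id is None or record_id in seen:
            continue                                   # already indexed earlier
        out.write(json.dumps(record) + "\n")           # one object = one event
        new_ids.append(record_id)

if new_ids:
    with STATE.open("a", encoding="utf-8") as state:
        state.write("\n".join(new_ids) + "\n")

Because the output file only ever grows by appending, the monitor input just tails the new lines, so you sidestep both the re-indexing and the array-splitting problems at once.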
@kiran_panchavat, after several attempts, I tried using the following settings for JSON. While it was able to read the data, each record/value ended up having duplicated values. I tried setting the relevant KV options, but it still didn't resolve the issue. For now, I've decided to proceed without using INDEXED_EXTRACTIONS. It still works, but it treats the [ as a single event. I'm still unsure how to fully resolve this.
*Just a heads up. I'm also using transforms.conf, though I'm not entirely sure if that's what's causing the duplicate values*
INDEXED_EXTRACTIONS = JSON
either with or without the following:
KV_MODE = none
AUTO_KV_JSON = false
@livehybrid, great! What you mentioned was part of the reason why two entries kept getting indexed together. After updating the configuration and removing the other stanza, I was able to index the JSON array as multiple events. I also noticed that it might have been due to my use of transforms.conf to assign the sourcetype.
Hi @kiran_panchavat, in my case I don't use INDEXED_EXTRACTIONS = JSON, which I believe would automatically handle and ignore the square brackets [] based on detecting the JSON format.
Since I'm using transforms.conf to assign a sourcetype, every time the file is ingested the indexer treats the [ character as a separate event.
Do you know if there's any way to ignore the square brackets if I do not use INDEXED_EXTRACTIONS = JSON?
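For example, would routing the bracket-only lines to the nullQueue be the right direction? This is just a rough, untested sketch on my side (the drop_json_brackets name is a placeholder):
== transforms.conf ==
[drop_json_brackets]
REGEX = ^\s*[\[\]],?\s*$
DEST_KEY = queue
FORMAT = nullQueue
== props.conf ==
[preprocess_case]
TRANSFORMS-dropbrackets = drop_json_brackets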
Additionally, I've noticed another issue: whenever the JSON file gets overwritten with new content, whether it contains previously indexed data or new data, my script pulls it again and the indexer re-indexes the whole file, resulting in duplicate entries in the index.