JSON - Breaking single string into multiple events

burras
Communicator

I know there's a ton of these questions out here, but I've got one of my own. I've looked at the other questions out there, and between them and some initial help from Bert I've gotten a good start, but I can't seem to get this to work right.

We have a single JSON package being received via HEC - this package contains anywhere from 1 to 500 events. Our users would like those events broken out into individual events within Splunk. Here's what the initial package looks like:

{"batch_id":"0-39-1490386204359","sampling_rate":1,"n":500,"events":[{"earlyAccess":null,"profileId":"d037d7da-bd83-11e6-80d6-0a2c5cb56663","ip":"74.000.000.238","deviceVersion":[{"component":"app","build":"oncue#1.19.37.108208#mallard2_GA.0","otherDetail":null},{"component":"service","build":"qa-a.stb.fios.tv","otherDetail":null},{"component":"location","build":"1000.00,1000.00","otherDetail":null},{"component":"Client","build":"108208#release_signed","otherDetail":null}],"__lane":"prod","__host":"unknown","navigationStack":[{"name":"Playback","index":0,"type":"LINEAR"}],"platformVersion":"108208#release_signed","assetType":"LinearAsset","assetId":"50281370fd523ecbaafd8f4d8145e006-adf51f1b5b403b6daa88b62d9c8567fa-2017-03-24-1","restrictedBy":null,"programStartTime":1490380200000,"attributionId":"8cca77ff-1480-b037-b0b2-015b01e09338","assetSessionId":"1490386121819","__sourceType":"device","serviceTimestamp":null,"appVersion":"1.19.37.108208","playType":"tuneIn","collectionId":"","deviceType":"501","accountId":"1004249","playRate":1000,"maestro":{"vhoId":"","host":"qa-a-aws.stb.fios.tv","userAgent":"Mozilla/5.0 (STB; CPU 501 OS 108208) 
OnCue/1.19.37","version":"4.4.3851","inHome":true,"ipAddress":"74.000.000.238"},"encoderDelay":124000,"__eventId":"s9a98hR/sMGKvAFbAe90tQ==","__timestamp":1490386121909,"sessionId":"617F6EA8-1490385232","__eventName":"1","programEndTime":1490390100000,"programId":"adf51f1b5b403b6daa88b62d9c8567fa","__source":"unknown","liveTuneType":"live","deviceTimestamp":1490386121819,"deviceId":"617F6EA8","__eventVersion":16,"recordingId":null,"channelId":"50281370fd523ecbaafd8f4d8145e006","timeZone":"America/New_York","eventProgramPoint":1490386121714},{"earlyAccess":null,"profileId":"45737761-ac2b-11e6-80d6-0a2c5cb56663","ip":"68.000.000.133","deviceVersion":[{"component":"app","build":"oncue#1.19.40.108253#mallard2_GA.0","otherDetail":null},{"component":"service","build":"qa-a.stb.fios.tv","otherDetail":null},{"component":"location","build":"1000.00,1000.00","otherDetail":null},{"component":"Client","build":"108253#release","otherDetail":null}],"__lane":"prod","__host":"unknown","navigationStack":[{"name":"","index":0,"type":""}],"platformVersion":"108253#release","assetType":"LinearAsset","assetId":"2732f41bdecc33aca2a23146eabd0954-5e4c3aaa6ef7312b8104c94c842d6a3f-2017-03-24-1","restrictedBy":null,"programStartTime":1490385600000,"attributionId":"ffffffff-ffff-ffff-ffff-fffffffffff","assetSessionId":"1490386010685","__sourceType":"device","serviceTimestamp":null,"appVersion":"1.19.40.108253","playType":"tuneOut","collectionId":"","deviceType":"501","accountId":"1003469","playRate":0,"maestro":{"vhoId":"","host":"qa-a-aws.stb.fios.tv","userAgent":"Mozilla/5.0 (STB; CPU 501 OS 108253) 
OnCue/1.19.40","version":"4.4.3851","inHome":true,"ipAddress":"68.000.000.133"},"encoderDelay":49000,"__eventId":"uXvQcxR/sMGKvAFbAe6R2w==","__timestamp":1490386063835,"sessionId":"617F7743-1490378565","__eventName":"1","programEndTime":1490387400000,"programId":"5e4c3aaa6ef7312b8104c94c842d6a3f","__source":"unknown","liveTuneType":"live","deviceTimestamp":1490386063730,"deviceId":"617F7743","__eventVersion":16,"recordingId":null,"channelId":"2732f41bdecc33aca2a23146eabd0954","timeZone":"America/New_York","eventProgramPoint":1490387400000}]}

So far, I'm applying the following props.conf to this data:

CHARSET=UTF-8
SHOULD_LINEMERGE=false
disabled=false
SEDCMD-removeheader=s/^(\{[\w\W]+\[{"earlyAccess":)/{"earlyAccess":/g
SEDCMD-removeeventcommas=s/},{"earlyAccess":/}{"earlyAccess":/g
SEDCMD-fixfooter=s/\]\}//g
LINE_BREAKER={"earlyAccess
TRUNCATE=0
TIME_PREFIX="deviceTimestamp":
TIME_FORMAT=%s%3N
KV_MODE=json

That gives me this output but doesn't break between events:

{"earlyAccess":null,"profileId":"d037d7da-bd83-11e6-80d6-0a2c5cb56663","ip":"74.000.000.238","deviceVersion":[{"component":"app","build":"oncue#1.19.37.108208#mallard2_GA.0","otherDetail":null},{"component":"service","build":"qa-a.stb.fios.tv","otherDetail":null},{"component":"location","build":"1000.00,1000.00","otherDetail":null},{"component":"Client","build":"108208#release_signed","otherDetail":null}],"__lane":"prod","__host":"unknown","navigationStack":[{"name":"Playback","index":0,"type":"LINEAR"}],"platformVersion":"108208#release_signed","assetType":"LinearAsset","assetId":"50281370fd523ecbaafd8f4d8145e006-adf51f1b5b403b6daa88b62d9c8567fa-2017-03-24-1","restrictedBy":null,"programStartTime":1490380200000,"attributionId":"8cca77ff-1480-b037-b0b2-015b01e09338","assetSessionId":"1490386121819","__sourceType":"device","serviceTimestamp":null,"appVersion":"1.19.37.108208","playType":"tuneIn","collectionId":"","deviceType":"501","accountId":"1004249","playRate":1000,"maestro":{"vhoId":"","host":"qa-a-aws.stb.fios.tv","userAgent":"Mozilla/5.0 (STB; CPU 501 OS 108208) 
OnCue/1.19.37","version":"4.4.3851","inHome":true,"ipAddress":"74.000.000.238"},"encoderDelay":124000,"__eventId":"s9a98hR/sMGKvAFbAe90tQ==","__timestamp":1490386121909,"sessionId":"617F6EA8-1490385232","__eventName":"1","programEndTime":1490390100000,"programId":"adf51f1b5b403b6daa88b62d9c8567fa","__source":"unknown","liveTuneType":"live","deviceTimestamp":1490386121819,"deviceId":"617F6EA8","__eventVersion":16,"recordingId":null,"channelId":"50281370fd523ecbaafd8f4d8145e006","timeZone":"America/New_York","eventProgramPoint":1490386121714}{"earlyAccess":null,"profileId":"45737761-ac2b-11e6-80d6-0a2c5cb56663","ip":"68.000.000.133","deviceVersion":[{"component":"app","build":"oncue#1.19.40.108253#mallard2_GA.0","otherDetail":null},{"component":"service","build":"qa-a.stb.fios.tv","otherDetail":null},{"component":"location","build":"1000.00,1000.00","otherDetail":null},{"component":"Client","build":"108253#release","otherDetail":null}],"__lane":"prod","__host":"unknown","navigationStack":[{"name":"","index":0,"type":""}],"platformVersion":"108253#release","assetType":"LinearAsset","assetId":"2732f41bdecc33aca2a23146eabd0954-5e4c3aaa6ef7312b8104c94c842d6a3f-2017-03-24-1","restrictedBy":null,"programStartTime":1490385600000,"attributionId":"ffffffff-ffff-ffff-ffff-fffffffffff","assetSessionId":"1490386010685","__sourceType":"device","serviceTimestamp":null,"appVersion":"1.19.40.108253","playType":"tuneOut","collectionId":"","deviceType":"501","accountId":"1003469","playRate":0,"maestro":{"vhoId":"","host":"qa-a-aws.stb.fios.tv","userAgent":"Mozilla/5.0 (STB; CPU 501 OS 108253) 
OnCue/1.19.40","version":"4.4.3851","inHome":true,"ipAddress":"68.000.000.133"},"encoderDelay":49000,"__eventId":"uXvQcxR/sMGKvAFbAe6R2w==","__timestamp":1490386063835,"sessionId":"617F7743-1490378565","__eventName":"1","programEndTime":1490387400000,"programId":"5e4c3aaa6ef7312b8104c94c842d6a3f","__source":"unknown","liveTuneType":"live","deviceTimestamp":1490386063730,"deviceId":"617F7743","__eventVersion":16,"recordingId":null,"channelId":"2732f41bdecc33aca2a23146eabd0954","timeZone":"America/New_York","eventProgramPoint":1490387400000}

The actual event break should be taking place at:

{"earlyAccess":

I've tried LINE_BREAKER in various formats as well as trying combinations of BREAK_ONLY_BEFORE and MUST_BREAK_AFTER but haven't had any luck getting the breaks to happen - Splunk still processes it all as a single event. Everything else is working fine with it - it's just not breaking. Any assistance on how to get these darn things to break right would be greatly appreciated...

1 Solution

hkeyser
Splunk Employee

I'm the tech who worked with burras on this case and in looking at the props, we identified the issue was due to a character that had not been escaped properly.

LINE_BREAKER=([\r\n,]*(?:{[^[{]+[)?){"earlyAccess

vs

LINE_BREAKER=([\r\n,]*(?:{[^[{]+\[)?){"earlyAccess

it's subtle, but the "[" bracket closest to the end of the line had not been escaped with "\"

after testing the ingestion of the data again, this worked for the case.
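
The difference between the two patterns can be reproduced outside Splunk. The sketch below is only an approximation: it uses Python's re module as a stand-in for Splunk's PCRE engine (the two are similar but not identical), and the sample string is a shortened, made-up batch with the same shape as the one in the question. In Python, the unescaped "[" opens a character class that is never closed, so the first pattern is rejected outright, while the escaped version compiles and matches.

```python
import re

BROKEN = r'([\r\n,]*(?:{[^[{]+[)?){"earlyAccess'   # "[" near the end is unescaped
FIXED = r'([\r\n,]*(?:{[^[{]+\[)?){"earlyAccess'   # "\[" matches a literal bracket


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Shortened, made-up sample shaped like the batch in the question.
sample = '{"batch_id":"b1","n":2,"events":[{"earlyAccess":null},{"earlyAccess":null}]}'

print(compiles(BROKEN))                      # False: unterminated character class
print(compiles(FIXED))                       # True
print(re.search(FIXED, sample) is not None)  # True: the breaker finds a match
```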

burras
Communicator

I mentioned this in my other comment above but want it here attached to the accepted answer as well - we were only able to get this working after we moved the props.conf from the indexer cluster to the HF running HEC itself. When running on the indexer cluster it was as if the props.conf didn't even exist.

burras
Communicator

Thanks to Support I was able to get this working. We found a couple of different things:

1) The props.conf was really close - we had to make 1 change to get it working properly (the [ in +[)? was missing its escape). Here's the final props.conf that worked:

[asset_play]
CHARSET=UTF-8
SHOULD_LINEMERGE=false
disabled=false
SEDCMD-fixfooters=s/]}//g
LINE_BREAKER=([\r\n,]*(?:{[^[{]+\[)?){"earlyAccess
TRUNCATE=0
TIME_PREFIX="deviceTimestamp":
MAX_TIMESTAMP_LOOKAHEAD=30
TIME_FORMAT=%s%3N
KV_MODE=json

2) We also discovered that putting props.conf on the indexer cluster did not work. Anything that was brought in through HEC was essentially untouched by anything in props.conf on an indexer. We had to specify this props.conf on the HF on which HEC was running.
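
The props.conf in item 1 can be sanity-checked offline before deploying it. What follows is a rough Python emulation, not Splunk itself: it assumes that LINE_BREAKER cuts the stream at each match and discards capture group 1, that SEDCMD then runs on each resulting event, and that TIME_FORMAT=%s%3N means epoch milliseconds. Python's re stands in for PCRE, MAX_TIMESTAMP_LOOKAHEAD is not modelled, and the helper names (break_events, event_time) are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Settings from the props.conf above; Python's re stands in for Splunk's PCRE.
LINE_BREAKER = re.compile(r'([\r\n,]*(?:{[^[{]+\[)?){"earlyAccess')
SEDCMD_FIXFOOTERS = re.compile(r"]}")
TIME_PREFIX = re.compile(r'"deviceTimestamp":')
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def break_events(raw: str) -> list[str]:
    """Cut raw at every LINE_BREAKER match and drop the text in capture group 1."""
    events, start = [], 0
    for match in LINE_BREAKER.finditer(raw):
        if match.start(1) > start:
            events.append(raw[start:match.start(1)])
        start = match.end(1)
    events.append(raw[start:])
    # Emulates SEDCMD-fixfooters=s/]}//g, which strips the batch's closing "]}".
    return [SEDCMD_FIXFOOTERS.sub("", event) for event in events]


def event_time(event: str) -> datetime:
    """Parse the epoch-millisecond value after TIME_PREFIX (TIME_FORMAT=%s%3N)."""
    prefix = TIME_PREFIX.search(event)
    millis = int(re.match(r"\d+", event[prefix.end():]).group())
    return EPOCH + timedelta(milliseconds=millis)


# Shortened batch: deviceTimestamp and deviceId values come from the question,
# everything else is made up.
raw = ('{"batch_id":"b1","n":2,"events":['
       '{"earlyAccess":null,"deviceTimestamp":1490386121819,"deviceId":"617F6EA8"},'
       '{"earlyAccess":null,"deviceTimestamp":1490386063730,"deviceId":"617F7743"}]}')

for event in break_events(raw):
    print(event_time(event).isoformat(timespec="milliseconds"), event)
```

Each event comes out as its own JSON object with its own timestamp, which is the behaviour the per-event deviceTimestamp requirement depends on.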

Thanks everyone for your help with this!

Masa
Splunk Employee

I did a quick test:
inputs.conf

[http://test_json_batch]
disabled = 0
sourcetype = test_json_batch
token = __removed__

props.conf

[test_json_batch]
LINE_BREAKER = ([\[,\r\n]+)\{"(?:earlyAccess|batch_id)":
SHOULD_LINEMERGE = false
SEDCMD-remove_end = s/]}$//g

And a curl to ingest the event above:

curl -k https://10..1.100:8088/services/collector/raw -H ... -d '{"batch_id....,"eventProgramPoint":1490387400000}]}'

I'm also counting on the default "AUTO_KV_JSON = true" for search-time JSON field extraction.

In my test above, the data was broken into events that each start with {"earlyAccess", just like in @beatus's screenshot.

beatus
Communicator

You should have capture groups in "LINE_BREAKER". It tells Splunk what to throw out in between events. You also don't need two of the SEDCMDs as they can be done with the LINE_BREAKER alone. This worked on my end.

[json:test]
CHARSET=UTF-8
SHOULD_LINEMERGE=false
disabled=false
SEDCMD-fixfooter=s/]}//g
LINE_BREAKER=([\r\n,]*(?:{[^[{]+[)?){"earlyAccess
TRUNCATE=0
TIME_PREFIX="deviceTimestamp":
MAX_TIMESTAMP_LOOKAHEAD = 30
TIME_FORMAT=%s%3N
KV_MODE=json
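
To make the capture-group point concrete: the first capture group of each LINE_BREAKER match is the text that gets discarded between events. The sketch below lists that text for a shortened, made-up batch. It uses the escaped form of the pattern (see the accepted answer) and Python's re module as an approximation of Splunk's PCRE, so treat it as an illustration rather than a Splunk test.

```python
import re

# Escaped form of the pattern; Python's re stands in for Splunk's PCRE.
LINE_BREAKER = re.compile(r'([\r\n,]*(?:{[^[{]+\[)?){"earlyAccess')

# Shortened, made-up batch with the same shape as the one in the question.
raw = '{"batch_id":"b1","n":2,"events":[{"earlyAccess":null,"id":1},{"earlyAccess":null,"id":2}]}'

# Group 1 of each match is the text thrown away between events.
discarded = [match.group(1) for match in LINE_BREAKER.finditer(raw)]
for text in discarded:
    print(repr(text))
```

The first match swallows the whole batch header up to the opening "[", and the second swallows the comma between events, which is why the separate header and comma SEDCMDs are no longer needed.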

burras
Communicator

I tried with this props.conf but it's still showing as a single event for me. I can see all of the individual event data in the fields prefaced by event{}. but my customer really needs it broken into individual events because of the different timestamps in each event. The transactional data they're looking at is very time sensitive so just using that first deviceTimestamp field isn't close enough.

beatus
Communicator

Yup, those props give me that exact result. See screenshot.

beatus
Communicator

Doh, I missed the part about HEC. Sorry, apparently my reading comprehension could use some work.

How high volume is the data source? You could potentially use the old-style HTTP POST inputs, which would let this data go through the typical data pipeline.

Masa
Splunk Employee

Regarding event separation itself, @beatus's solution should basically work for HEC "raw" inputs. It will not work for the JSON endpoint "collector/event".

burras
Communicator

We're definitely using the "raw" input for HEC. Any chance that there's a disconnect between what would be seen in the "Add Data" section breaking and what would actually be shown in production? Where it might work in one but not the other?

beatus
Communicator

I'd go back to basics if that's the case. Ensure the props are present & no other props may be interfering. Check with splunk btool props list --debug.

If you look at Masa's answer, you can see these props work on HEC + raw so they should be working for you.

burras
Communicator

I definitely agree - it sounds like it should be working. Here's what I'm seeing in the btool output:
/opt/splunk/etc/apps/search/local/props.conf [asset_play]
/opt/splunk/etc/system/default/props.conf ANNOTATE_PUNCT = True
/opt/splunk/etc/system/default/props.conf AUTO_KV_JSON = true
/opt/splunk/etc/system/default/props.conf BREAK_ONLY_BEFORE =
/opt/splunk/etc/system/default/props.conf BREAK_ONLY_BEFORE_DATE = True
/opt/splunk/etc/system/local/props.conf CHARSET = UTF-8
/opt/splunk/etc/system/default/props.conf DATETIME_CONFIG = /etc/datetime.xml
/opt/splunk/etc/system/default/props.conf HEADER_MODE =
/opt/splunk/etc/system/local/props.conf KV_MODE = json
/opt/splunk/etc/system/default/props.conf LEARN_MODEL = true
/opt/splunk/etc/system/default/props.conf LEARN_SOURCETYPE = true
/opt/splunk/etc/system/local/props.conf LINE_BREAKER = ([\r\n,]*(?:{[^[{]+[)?){"earlyAccess
/opt/splunk/etc/system/default/props.conf LINE_BREAKER_LOOKBEHIND = 100
/opt/splunk/etc/system/default/props.conf MATCH_LIMIT = 100000
/opt/splunk/etc/system/default/props.conf MAX_DAYS_AGO = 2000
/opt/splunk/etc/system/default/props.conf MAX_DAYS_HENCE = 2
/opt/splunk/etc/system/default/props.conf MAX_DIFF_SECS_AGO = 3600
/opt/splunk/etc/system/default/props.conf MAX_DIFF_SECS_HENCE = 604800
/opt/splunk/etc/system/default/props.conf MAX_EVENTS = 256
/opt/splunk/etc/system/local/props.conf MAX_TIMESTAMP_LOOKAHEAD = 30
/opt/splunk/etc/system/default/props.conf MUST_BREAK_AFTER =
/opt/splunk/etc/system/default/props.conf MUST_NOT_BREAK_AFTER =
/opt/splunk/etc/system/default/props.conf MUST_NOT_BREAK_BEFORE =
/opt/splunk/etc/system/local/props.conf SEDCMD-fixfooters = s/]}//g
/opt/splunk/etc/system/default/props.conf SEGMENTATION = indexing
/opt/splunk/etc/system/default/props.conf SEGMENTATION-all = full
/opt/splunk/etc/system/default/props.conf SEGMENTATION-inner = inner
/opt/splunk/etc/system/default/props.conf SEGMENTATION-outer = outer
/opt/splunk/etc/system/default/props.conf SEGMENTATION-raw = none
/opt/splunk/etc/system/default/props.conf SEGMENTATION-standard = standard
/opt/splunk/etc/system/local/props.conf SHOULD_LINEMERGE = false
/opt/splunk/etc/system/local/props.conf TIME_FORMAT = %s%3N
/opt/splunk/etc/system/local/props.conf TIME_PREFIX = "deviceTimestamp":
/opt/splunk/etc/system/default/props.conf TRANSFORMS =
/opt/splunk/etc/system/local/props.conf TRUNCATE = 0
/opt/splunk/etc/apps/search/local/props.conf TZ = UTC
/opt/splunk/etc/apps/search/local/props.conf category = Custom
/opt/splunk/etc/apps/search/local/props.conf description = dena asset_play
/opt/splunk/etc/system/default/props.conf detect_trailing_nulls = false
/opt/splunk/etc/system/local/props.conf disabled = false
/opt/splunk/etc/system/default/props.conf maxDist = 100
/opt/splunk/etc/system/default/props.conf priority =
/opt/splunk/etc/apps/search/local/props.conf pulldown_type = 1
/opt/splunk/etc/system/default/props.conf sourcetype =

I'm not seeing anything obvious that looks like it would be causing a problem. But running with this configuration gives me an output that's still just a single event so there's gotta be something going on somewhere...
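
A long btool listing is easier to scan once the shipped defaults are filtered out. Here is a rough Python sketch, assuming each line has the form "path key = value" with no spaces in the path, as in the output above. The function name non_default_settings is made up for illustration, and the sample is a handful of lines copied from that listing.

```python
def non_default_settings(btool_output: str) -> dict[str, tuple[str, str]]:
    """Map each setting to (value, file), skipping lines from a default/ directory."""
    found = {}
    for line in btool_output.splitlines():
        path, _, rest = line.strip().partition(" ")
        if "=" not in rest or "/default/" in path:
            continue  # stanza header, blank line, or shipped default
        key, _, value = rest.partition("=")
        found[key.strip()] = (value.strip(), path)
    return found


# A few lines copied from the listing above.
sample = r"""/opt/splunk/etc/apps/search/local/props.conf [asset_play]
/opt/splunk/etc/system/default/props.conf ANNOTATE_PUNCT = True
/opt/splunk/etc/system/local/props.conf LINE_BREAKER = ([\r\n,]*(?:{[^[{]+[)?){"earlyAccess
/opt/splunk/etc/system/default/props.conf MAX_EVENTS = 256
/opt/splunk/etc/system/local/props.conf SHOULD_LINEMERGE = false
/opt/splunk/etc/apps/search/local/props.conf TZ = UTC
"""

for key, (value, path) in non_default_settings(sample).items():
    print(f"{key} = {value}    ({path})")
```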

woodcock
Esteemed Legend

@mmodestino_splunk, this is another jq thing, right?

mattymo
Splunk Employee

Ha! @woodcock knows I thought about answering this one. Tricky part here is the use of HEC which doesn't give us the chance to put the json to disk so we can pre-parse. Would need jq integrated into the indexing pipeline...here's hoping my enhancement request gets some eyes one day.

@burras what is sending the json to you? I can tell you about another method that may work for you...but it would include some changes in your solution. The problem here is that, in order for Splunk to see these as individual events, yet keep the json format, we need to unwrap the array...something the indexing pipeline doesn't handle all that well today...check out jq and if you can catch these json events and put them to disk, you can pre-parse them into single events then ingest...
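
For anyone who can catch the batches before they reach Splunk, the unwrapping step itself is small. With jq it would be roughly the filter .events[] with compact output; the sketch below shows the same idea in Python instead, as an illustration only. The function name unwrap_batch and the sample batch are made up, and the only assumption about the data is that the events sit in a top-level "events" array as in the question.

```python
import json


def unwrap_batch(batch_text: str) -> list[str]:
    """Return each element of the batch's "events" array as one compact JSON line."""
    batch = json.loads(batch_text)
    return [json.dumps(event, separators=(",", ":")) for event in batch["events"]]


# Shortened, made-up batch shaped like the one in the question.
sample = ('{"batch_id":"b1","n":2,"events":['
          '{"earlyAccess":null,"deviceId":"617F6EA8"},'
          '{"earlyAccess":null,"deviceId":"617F7743"}]}')

for line in unwrap_batch(sample):
    print(line)
```

Written one event per line like this, the data no longer needs a custom LINE_BREAKER at all, at the cost of the extra hop described above.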

- MattyMo

burras
Communicator

We're getting the data from an external DENA forwarder that's just doing a HTTP push to our receiver. I'll investigate jq and take a look but I'm not confident that rearranging the architecture is an option - we've got some significant limitations in the environment that might make that sort of solution unfeasible (lack of storage, exponential growth expectations for this type of data, etc.).

mattymo
Splunk Employee

totally understood and is why i wish it was something avail in our indexing pipeline. fingers crossed.

- MattyMo

burras
Communicator

I've even gotten to the point now where instead of just removing the commas between events I actually introduced a newline between events - and it still doesn't want to break. I don't know if it's something that I'm doing wrong or a problem with the data itself...

SEDCMD-removeeventcommas=s/},{"earlyAccess":/}\n{"earlyAccess":/g
LINE_BREAKER=(}\n){"earlyAccess
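
One likely reason this pair cannot work, separate from the other findings in this thread: as far as I know, Splunk applies LINE_BREAKER in the parsing pipeline and SEDCMD later, in the typing pipeline, so the breaker runs on the text as received, before any newline has been inserted. The Python sketch below (re as a stand-in for PCRE, made-up sample) checks the breaker against the text before and after the substitution.

```python
import re

LINE_BREAKER = re.compile(r'(}\n){"earlyAccess')

# Made-up fragment shaped like two adjacent events as they arrive.
raw = '{"earlyAccess":null,"id":1},{"earlyAccess":null,"id":2}'

# What SEDCMD-removeeventcommas would produce once it runs.
after_sed = re.sub(r'},{"earlyAccess":', '}\n{"earlyAccess":', raw)

print(LINE_BREAKER.search(raw) is not None)        # False: no newline exists yet
print(LINE_BREAKER.search(after_sed) is not None)  # True, but only after the sed
```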
