
JSON events sent to HEC lose nested JSON fields when copied with collect

mburgoon
New Member

I'm struggling to figure this one out. We have JSON data coming in via an HEC endpoint, and the endpoint sets the sourcetype to _json. This is Splunk Cloud.

Minor bit of background on our data: everything we send to Splunk has an "event" field, a number that indicates the specific type of thing that happened in our system. All of this data goes into one index with a 45-day retention period. Some of it we want to keep around longer, so we use collect to copy it to another index with longer retention.

We have a scheduled search that runs regularly: "index=ourIndex event IN (1,2,3,4,5,6) | collect index=longTerm output_format=hec"

We use output_format=hec because without it the data isn't searchable: "index=longTerm event=3" never shows anything. There's a bunch of _raw, but that's it.

Also, for the sake of completeness, this data is being sent by Cribl. Our application normally logs CSV-style data in which the first 15 or so columns have a fixed meaning (every record has those common fields). The 16th column contains a description followed, in parentheses, by a semicolon-separated list of additional parameter/field names, and each additional CSV column after it holds the value for the corresponding name in that list. Sometimes that value is JSON data logged as a string. To avoid sending JSON as a string inside an actual JSON payload, we have Cribl detect that case, expand the JSON field, and construct it as a native part of the payload. So:
1,2024-03-01 00:00:00,user1,...12 other columns ...,User did something (didClick;details),1,{"where":"submit"%2c"page":"home"}
gets sent to the HEC endpoint as:
{"event":1,"_time":"2024-03-01 00:00:00","userID":"user1",... other stuff ..., "didClick":1,"details":{"where":"submit","page":"home"}}
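To make the transformation concrete, here is a minimal Python sketch of the expansion described above. It is not the actual Cribl pipeline: the function names, the shortened three-column FIXED_COLUMNS list (the real data has about 15 fixed columns), and the choice to leave the description text out of the payload are all assumptions made for illustration.

```python
import json
import re
from urllib.parse import unquote

# Hypothetical stand-in for the ~15 fixed leading columns.
FIXED_COLUMNS = ["event", "_time", "userID"]


def parse_value(text):
    """Return parsed JSON when the cell holds valid JSON, else the raw string."""
    try:
        return json.loads(text)
    except ValueError:
        return text


def csv_line_to_payload(line, fixed_columns=FIXED_COLUMNS):
    """Build a nested payload from one CSV log line.

    Assumes commas inside values are percent-encoded as %2c, so a plain
    split on "," is enough to separate the columns.
    """
    cells = line.split(",")
    n = len(fixed_columns)
    payload = dict(zip(fixed_columns, cells[:n]))
    payload["event"] = int(payload["event"])

    # The description column ends with "(name1;name2;...)".
    description = cells[n]
    match = re.search(r"\(([^()]*)\)\s*$", description)
    names = match.group(1).split(";") if match else []

    # Each column after the description holds the value for the matching name.
    for name, raw in zip(names, cells[n + 1:]):
        payload[name] = parse_value(unquote(raw))
    return payload
```

With the sample line above (minus the elided columns), `details` comes out as a nested object rather than a string, which is the part of the event that later goes missing after collect.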

The data that ends up missing is always the expanded (nested) JSON data. Anything that is part of the base JSON document always seems to be fine.

Now, here's the weird part. If I run the collect search for ONLY one specific event, things seem fine and data is never lost. When I add more events to the search, the nested fields go missing for some, but not all, of the collected events. The more events I add to the IN() clause, the more often those fields go missing for events that contain expanded JSON. For each event that has missing fields, all of its expanded JSON fields are missing.

When I run spath on the _raw field and then pipe the result to collect, that seems to work reliably, but it also seems like an unnecessary hack.

There are dozens of these events, so breaking them out into their own discrete searches isn't something I'm particularly keen on.

Any ideas or suggestions?
