I'm struggling to figure this one out. We have JSON-based data coming in via an HEC endpoint, with the HEC endpoint setting sourcetype to _json. This is Splunk Cloud.
A minor bit of background on our data: all of the data we send to Splunk has an "event" field, which is a number that indicates a specific type of thing that happened in our system. This data goes into one index with a 45-day retention period. Some of this data we want to keep around longer, so we use collect to copy it over for longer retention.
We have a scheduled search that runs regularly:

index=ourIndex event IN (1,2,3,4,5,6)
| collect index=longTerm output_format=hec
We use output_format=hec because without it the data isn't searchable: "index=longTerm event=3" never returns anything. The _raw is all there, but that's it - no extracted fields.
Also, for the sake of completeness: this data is being sent by Cribl. Our application normally logs CSV-style data, with the first 15 or so columns fixed in their meaning (everything has those common fields). The 16th column contains a description ending in a parenthesized, semicolon-separated list of additional parameter/field names, and each additional CSV column holds the value corresponding to a field name in that list. Sometimes that value is JSON data logged as a string. To avoid sending JSON data as a string inside an actual JSON payload, we have Cribl detect that, expand the JSON field, and construct it as a native part of the payload. So:
1,2024-03-01 00:00:00,user1,...12 other columns ...,User did something (didClick;details),1,{"where":"submit"%2c"page":"home"}
gets sent to the HEC endpoint as:
{"event":1,"_time":"2024-03-01 00:00:00","userID":"user1",... other stuff ..., "didClick":1,"details":{"where":"submit","page":"home"}}
The data that ends up missing is always the expanded JSON data. Anything that is part of the base JSON document always seems to be fine.
Now, here's the weird part. If I restrict the search that does the collect to ONLY a single specific event, things actually seem fine - data is never lost. When I introduce additional events into the IN() clause, those fields go missing for some, but not all, of those events. The more events I add to the IN() clause, the more often the fields go missing for events that have expanded JSON in them. And for any event that has missing fields, all of its expanded JSON fields are missing.
When I take the _raw field, run spath on it, and then pipe that to collect, it seems to work reliably - but it also seems like an unnecessary hack.
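Roughly, that hack looks like this (a sketch using the same names as the scheduled search above; spath with no arguments re-extracts every field from the JSON in _raw before the collect):

index=ourIndex event IN (1,2,3,4,5,6)
| spath
| collect index=longTerm output_format=hec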
There are dozens of these events, so breaking them out into their own discrete searches isn't something I'm particularly keen on.
Any ideas or suggestions?
Hi there,
I worked around that problem by using `tojson` before the `collect`:
| tojson
| collect index=schnafu
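Applied to the scheduled search from the question, that would look something like this (an untested sketch - tojson writes the event back into _raw as JSON by default, so the default raw-mode collect should keep the fields):

index=ourIndex event IN (1,2,3,4,5,6)
| tojson
| collect index=longTerm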
Hope this helps ...
cheers, MuS
LOL... so you formatted the data as JSON and then used | collect output_format=raw 😛
I ended up just editing limits.conf to enable multivalue formatting for raw-mode collect, and didn't end up using the JSON at all.
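If I remember right, the setting lives in the [collect] stanza of limits.conf, something like the following (setting name from memory - verify it against the limits.conf.spec for your version):

[collect]
format_multivalue_collect = true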
Oddly - no. In the other (non-orig) index, '...| table myField,_raw' shows nothing for myField, while the _raw data is there, represented as full JSON, including myField with the expected value.
Is your | collect output_format=hec also showing an empty _raw {} in your summary index? Mine is 😕
index=orig
| collect index=summary output_format=hec
| table _raw
displays {some stuff in here}
index=summary
| table _raw
displays {} with nothing inside (but all the fields are present at search time... just not the original _raw JSON {})
😕