Getting Data In

How to split a large combined JSON array into individual events during index time

BTrust
Explorer

Hi All,

I have this compressed example (a reduced version of a large structure), which is a combination of plain text and JSON:

 

2024-07-10 07:27:28 +02:00 LiveEvent: {"data":{"time_span_seconds":300,
	"active":17519,
	"total":17519,
	"unique":4208,
	"total_prepared":16684,
	"unique_prepared":3703,
	"created":594,
	"updated":0,
	"deleted":0,"ports":[
		{"stock_id":49,
			"goods_in":0,
			"picks":2,
			"inspection_or_adhoc":0,
			"waste_time":1,
			"wait_bin":214,
			"wait_user":66,
			"stock_open_seconds":281,
			"stock_closed_seconds":19,
			"bins_above":0,
			"completed":[43757746,43756193],
			"content_codes":[],
			"category_codes":[{"category_code":4,"count":2}]},
		{"stock_id":46,
			"goods_in":0,
			"picks":1,
			"inspection_or_adhoc":0,
			"waste_time":0,
			"wait_bin":2,
			"wait_user":298,
			"stock_open_seconds":300,
			"stock_closed_seconds":0,
			"bins_above":0,
			"completed":[43769715],
			"content_codes":[],
			"category_codes":[{"category_code":4,"count":1}]},
		{"stock_id":1,
			"goods_in":0,
			"picks":3,
			"inspection_or_adhoc":0,
			"waste_time":0,
			"wait_bin":191,
			"wait_user":40,
			"stock_open_seconds":231,
			"stock_closed_seconds":69,
			"bins_above":0,
			"completed":[43823628,43823659,43823660],
			"content_codes":[],
			"category_codes":[{"category_code":1,"count":3}]}
	]},
	"uuid":"8711336c-ddcd-432f-b388-8b3940ce151a",
	"session_id":"d14fbee3-0a7a-4026-9fbf-d90eb62d0e73",
	"session_sequence_number":5113,
	"version":"2.0.0",
	"installation_id":"a031v00001Bex7fAAB",
	"local_installation_timestamp":"2024-07-10T07:35:00.0000000+02:00",
	"date":"2024-07-10",
	"app_server_timestamp":"2024-07-10T07:27:28.8839856+02:00",
	"event_type":"STOCK_AND_PILE"}

 

I eventually need each “stock_id” object to end up as an individual event, keeping the common information with it: timestamp, uuid, session_id, session_sequence_number and event_type.

Can someone guide me on how to use props and transforms to achieve this?

PS. I have read through several great posts on how to split JSON arrays into events, but none about how to keep common fields in each of them.

Many thanks in advance.

Best Regards,
Bjarne

0 Karma

richgalloway
SplunkTrust

I'm not sure it can be done reliably using props and transforms.  I'd use a scripted input to parse the data and re-format it.

---
If this reply helps you, Karma would be appreciated.
0 Karma

BTrust
Explorer

Hi @richgalloway,

Thanks for your input.
Do you happen to have any scripting ideas for this?

0 Karma

richgalloway
SplunkTrust

I have nothing specific to offer.  In a previous job, I used a Python script to parse data and then restructure it so it was easier for Splunk to ingest.  It wasn't JSON (I think it was XML), but JSON should still be pretty straightforward.
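
Something along these lines, perhaps - a rough, untested sketch based on your sample (the field names come from your example, and it assumes each LiveEvent arrives as a single line; if the JSON is pretty-printed across lines you'd first have to buffer until the braces balance):

#!/usr/bin/env python3
"""Split combined "timestamp LiveEvent: {json}" lines into one JSON event
per ports[] element, carrying the common fields along with each one."""
import json
import re
import sys

# Common fields to copy into every per-stock event (names from the sample).
COMMON = ["uuid", "session_id", "session_sequence_number", "event_type"]

LINE_RE = re.compile(r'^(?P<ts>\S+ \S+ \S+) LiveEvent: (?P<body>\{.*\})\s*$')

def split_event(line):
    m = LINE_RE.match(line)
    if not m:
        return  # not a LiveEvent line; skip it
    outer = json.loads(m.group("body"))
    common = {k: outer.get(k) for k in COMMON}
    common["timestamp"] = m.group("ts")
    for port in outer.get("data", {}).get("ports", []):
        yield {**common, **port}

if __name__ == "__main__":
    for line in sys.stdin:
        for event in split_event(line):
            print(json.dumps(event))

Feed the output back to Splunk as a one-event-per-line JSON sourcetype and each stock_id becomes its own event with the common fields attached.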

---
If this reply helps you, Karma would be appreciated.
0 Karma

BTrust
Explorer
0 Karma

PickleRick
SplunkTrust

That one relies on the fact that it was a simple array that could be cut into pieces with regexes. The splitting mechanism would break if the data changed - for example, if another field besides the "local" one were added to the "outer" JSON.

0 Karma

BTrust
Explorer

Hi @PickleRick,

The JSON structure is very stable and doesn’t change, except that there can be many (1000+) or few (4) “stock_id” entries.

You mentioned scripted inputs as well - do you have any suggestions/examples?

0 Karma

PickleRick
SplunkTrust

Your case is completely different because you want to keep some of the "outer" information shared between the separate events (which actually isn't that good an idea, because your license usage will be multiplied across those events).

As for the scripted input - see these resources for the technicalities on the Splunk side. Of course the internals - splitting the event - are entirely up to you.

https://docs.splunk.com/Documentation/Splunk/latest/AdvancedDev/ScriptSetup

https://dev.splunk.com/enterprise/docs/developapps/manageknowledge/custominputs
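
Conceptually the Splunk side is simple, though: Splunk runs your script on the interval you configure and indexes whatever the script writes to stdout (one event per line, with default line breaking). A bare-bones skeleton - the paths are made up, and it reuses the split_event() parser sketched earlier in this thread:

#!/usr/bin/env python3
# Bare-bones scripted input: Splunk indexes whatever we print to stdout.
import json

from split_liveevents import split_event  # the parser sketched above

SOURCE = "/var/log/liveevents.log"           # hypothetical source file
STATE = "/opt/splunk/var/liveevents.offset"  # remember where we stopped

def read_offset():
    try:
        with open(STATE) as f:
            return int(f.read())
    except (OSError, ValueError):
        return 0

with open(SOURCE) as f:
    f.seek(read_offset())
    for line in f:
        for event in split_event(line):
            print(json.dumps(event), flush=True)
    with open(STATE, "w") as state:
        state.write(str(f.tell()))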

 

0 Karma

BTrust
Explorer

The thing is, if we don’t split them at index time, the indexers will have even more work to do, as the structures can be huge.

PS. I’m aware of the extra license usage here as well.

0 Karma

BTrust
Explorer

Hi @PickleRick,
Thanks for your feedback, though I’m surprised by the answer, as I’ve seen other clear indications and solutions for splitting JSON arrays into individual events, like: How to parse a JSON array delimited by "," into separate events with their unique timestamps?

0 Karma

PickleRick
SplunkTrust

1. Please, don't post links butchered by some external "protection" service.

2. You've got this wrong 😉 Those articles don't describe splitting JSON events. They describe breaking the input data stream so that it breaks on the "inner" JSON boundaries instead of the "outer" ones. It doesn't have anything to do with manipulating a single event that has already been broken out of the input stream. It's similar to telling Splunk not to break the stream into lines but rather to ingest things delimited by whitespace separately. But your case is completely different because you want to carry over some common part (some common metadata, I assume) from the outer JSON structure to each part extracted from the inner JSON array. This is way above the simple string-based manipulation that Splunk can do in the ingestion pipeline.

0 Karma

BTrust
Explorer
  1. Thanks for the advice.
  2. Well, after working with Splunk for 10+ years, I frankly don’t agree with the “simple string-based manipulation that Splunk can do in the ingestion pipeline” - I’d say I’ve seen amazing (to the extent of crazy) things done with props and transforms.
    That said, Splunk might not be able to do exactly what I’m after here, but I’m willing to spend time trying anyway, as this will have a major impact on performance at search time.

Yes, there is some metadata that needs to stay with each event to be able to find them again.
I have some ideas in my head on how to twist this, but right now I’m on vacation and can’t test them for the next week or so, so I’m just “warming up” and looking for / listening in on others’ crazy ideas about what they have achieved in Splunk 🙂

0 Karma

PickleRick
SplunkTrust

It's not about "whose is longer". And yes, I've seen many interesting hacks but the fact remains - Splunk works one event at a time. So you can't "carry over" any info from one event to another using just props and transforms (except for that very very ugly and unmaintainable trick with actually cloning the event and separately modifying each copy). Also you cannot split an event (or merge it) after it's been through the line breaking/merging phase.

So you can't turn

{"whatever": ["a","b","c"], "something":"something"}

into

{"whatever": "a", "something":"something"}
{"whatever": "b", "something":"something"}
{"whatever": "c", "something":"something"}

using props and transforms alone. The ingestion pipeline doesn't deal with structured data (with the exception of indexed extractions on a UF, but that's a different story).
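
Outside the ingestion pipeline that transformation is trivial, which is exactly why pre-processing is the way to go. For illustration, in Python:

import json

outer = {"whatever": ["a", "b", "c"], "something": "something"}
# Fan the array out into one record per element, keeping the common field.
for item in outer["whatever"]:
    print(json.dumps({"whatever": item, "something": outer["something"]}))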

 

BTrust
Explorer

Longer than yesterday helps though 🙂

Ok - here are some thoughts I had getting around this, without having a chance to play with it yet.
SEDCMD looks like a possibility, though I know it’s not going to be a newbie kind of thing. There is support for back references, so I thought of copying a core meta field as an addition into each stock_id element, and then splitting the structure into events on each stock_id.
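
Roughly the idea, expressed outside Splunk for clarity (untested, and done here in two steps - I realise a single SEDCMD expression would have to manage the copy with back references in one substitution):

import re

raw = ('2024-07-10 07:27:28 +02:00 LiveEvent: {"data":{"ports":'
       '[{"stock_id":49},{"stock_id":46}]},'
       '"uuid":"8711336c-ddcd-432f-b388-8b3940ce151a"}')
# Step 1: grab a core meta field from the outer structure...
uuid = re.search(r'"uuid":"([^"]+)"', raw).group(1)
# Step 2: ...copy it into every ports[] element, so each piece stays
# findable once the structure is later split apart on stock_id boundaries.
tagged = re.sub(r'\{"stock_id":', '{"uuid":"%s","stock_id":' % uuid, raw)
print(tagged)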

0 Karma

PickleRick
SplunkTrust

You're thinking in the wrong order. That's why I'm saying it's not possible with Splunk alone.

If you don't know this one, it's one of the mainstays of understanding the Splunk indexing process: https://community.splunk.com/t5/Getting-Data-In/Diagrams-of-how-indexing-works-in-the-Splunk-platfor...

As you can see, line breaking is one of the absolute first things happening with the input stream. You can't "backtrack" your way within the ingestion pipeline to do SEDCMD before line breaking.

And, as I wrote already, it's really a very bad idea to tackle structured data with regexes.

0 Karma

PickleRick
SplunkTrust

TL;DR - you can't split events within Splunk itself during ingestion.

Longer explanation - each event is processed as a single entity. You could try to make a copy of the event using CLONE_SOURCETYPE and then process each of those instances separately (for example, cut one part from one copy and another part from the other copy), but it's not something that can be reasonably implemented, it's unmaintainable in the long run, and you can't do it dynamically (like splitting a JSON into however many items an array has). Oh, and of course structured data manipulation at ingest time is a relatively big no-no.

So your best bet would be to pre-process your data with a third-party tool (or at least write a scripted input that does the heavy lifting of splitting the data).
