Getting Data In

Split JSON array into individual events on HF

nmohammed
Builder

We have logs coming in to HEC as nested JSON in chunks, and we're trying to break them down into individual events at the HEC level before they are indexed in Splunk. I had some success removing the header/footer and breaking the events with props.conf, but it doesn't work completely: most of the logs are not broken into individual events.

Sample events:

{
  "logs": [
    {
      "type": "https",
      "timestamp": "2025-03-17T23:55:54.626915Z",
      "elb": "someELB",
      "client_ip": "10.xx.xx.xx",
      "client_port": 123456,
      "target_ip": "10.xx.xx.xx",
      "target_port": 123456,
      "request_processing_time": 0,
      "target_processing_time": 0.003,
      "response_processing_time": 0,
      "elb_status_code": 200,
      "target_status_code": 200,
      "received_bytes": 69,
      "sent_bytes": 3222,
      "request": "GET https://xyz.com",
      "user_agent": "-",
      "ssl_cipher": "ECDHE-RSA-AE",
      "ssl_protocol": "TLSv1.2",
      "target_group_arn": "arn:aws:elasticloadbalancing:us-west-2:XXXXX:targetgroup/XXXXX",
      "trace_id": "Root=XXXX"
    },
    {
      "type": "https",
      "timestamp": "2025-03-17T23:56:00.285547Z",
      "elb": "someELB",
      "client_ip": "10.xx.xx.xx",
      "client_port": 123456,
      "target_ip": "10.xx.xx.xx",
      "target_port": 123456,
      "request_processing_time": 0,
      "target_processing_time": 0.003,
      "response_processing_time": 0,
      "elb_status_code": 200,
      "target_status_code": 200,
      "received_bytes": 69,
      "sent_bytes": 3222,
      "request": "GET https://xyz.com",
      "user_agent": "-",
      "ssl_cipher": "ECDHE-RSA-AE",
      "ssl_protocol": "TLSv1.2",
      "target_group_arn": "arn:aws:elasticloadbalancing:us-west-2:XXXXX:targetgroup/XXXXX",
      "trace_id": "Root=XXXX"
    },
    {
      "type": "https",
      "timestamp": "2025-03-17T23:57:39.574741Z",
      "elb": "someELB",
      "client_ip": "10.xx.xx.xx",
      "client_port": 123456,
      "target_ip": "10.xx.xx.xx",
      "target_port": 123456,
      "request_processing_time": 0,
      "target_processing_time": 0.003,
      "response_processing_time": 0,
      "elb_status_code": 200,
      "target_status_code": 200,
      "received_bytes": 69,
      "sent_bytes": 3222,
      "request": "GET https://xyz.com",
      "user_agent": "-",
      "ssl_cipher": "ECDHE-RSA-AE",
      "ssl_protocol": "TLSv1.2",
      "target_group_arn": "arn:aws:elasticloadbalancing:us-west-2:XXXXX:targetgroup/XXXXX",
      "trace_id": "XXXX"
    }
  ]
}

I am trying to get:

{
      "type": "https",
      "timestamp": "2025-03-17T23:55:54.626915Z",
      "elb": "someELB",
      "client_ip": "10.xx.xx.xx",
      "client_port": 123456,
      "target_ip": "10.xx.xx.xx",
      "target_port": 123456,
      "request_processing_time": 0,
      "target_processing_time": 0.003,
      "response_processing_time": 0,
      "elb_status_code": 200,
      "target_status_code": 200,
      "received_bytes": 69,
      "sent_bytes": 3222,
      "request": "GET https://xyz.com",
      "user_agent": "-",
      "ssl_cipher": "ECDHE-RSA-AE",
      "ssl_protocol": "TLSv1.2",
      "target_group_arn": "arn:aws:elasticloadbalancing:us-west-2:XXXXX:targetgroup/XXXXX",
      "trace_id": "Root=XXXX"
    }

{
      "type": "https",
      "timestamp": "2025-03-17T23:56:00.285547Z",
      "elb": "someELB",
      "client_ip": "10.xx.xx.xx",
      "client_port": 123456,
      "target_ip": "10.xx.xx.xx",
      "target_port": 123456,
      "request_processing_time": 0,
      "target_processing_time": 0.003,
      "response_processing_time": 0,
      "elb_status_code": 200,
      "target_status_code": 200,
      "received_bytes": 69,
      "sent_bytes": 3222,
      "request": "GET https://xyz.com",
      "user_agent": "-",
      "ssl_cipher": "ECDHE-RSA-AE",
      "ssl_protocol": "TLSv1.2",
      "target_group_arn": "arn:aws:elasticloadbalancing:us-west-2:XXXXX:targetgroup/XXXXX",
      "trace_id": "Root=XXXX"
    }
{
      "type": "https",
      "timestamp": "2025-03-17T23:57:39.574741Z",
      "elb": "someELB",
      "client_ip": "10.xx.xx.xx",
      "client_port": 123456,
      "target_ip": "10.xx.xx.xx",
      "target_port": 123456,
      "request_processing_time": 0,
      "target_processing_time": 0.003,
      "response_processing_time": 0,
      "elb_status_code": 200,
      "target_status_code": 200,
      "received_bytes": 69,
      "sent_bytes": 3222,
      "request": "GET https://xyz.com",
      "user_agent": "-",
      "ssl_cipher": "ECDHE-RSA-AE",
      "ssl_protocol": "TLSv1.2",
      "target_group_arn": "arn:aws:elasticloadbalancing:us-west-2:XXXXX:targetgroup/XXXXX",
      "trace_id": "XXXX"
    }

 
props.conf

[source::http:lblogs]
SHOULD_LINEMERGE = false
SEDCMD-remove_prefix = s/^\{\s*\"logs\"\:\s+\[//g
SEDCMD-remove_suffix = s/\]\}$//g
LINE_BREAKER = \}(,\s+)\{
NO_BINARY_CHECK = true
TIME_PREFIX = \"timestamp\":\s+\"
pulldown_type = true
MAX_TIMESTAMP_LOOKAHEAD = 100
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%6N
TRUNCATE = 1000000

 
The current result in Splunk is shown in the attached screenshot. The header ({ "logs": [) and footer are removed from the events, but the split (line break) seems to work for only one event in each chunk; the others are ignored.

(screenshot: nmohammed_1-1746135993302.png)

 


marycordova
SplunkTrust

Pretty sure your \s needs to be a newline, which, as @PickleRick said, is not necessarily the same thing as whitespace.

This regex will get your breaks; it will only leave the footer on the last event, and it will break the header into its own event, which you can just ignore.

All of this works as long as your data format doesn't change, LOL.

[\[\}]+([,\s\r\n]+){
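For reference, dropped into your existing stanza that would look like this (a sketch; the rest of the stanza stays as you have it):

[source::http:lblogs]
SHOULD_LINEMERGE = false
LINE_BREAKER = [\[\}]+([,\s\r\n]+){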

 


PickleRick
SplunkTrust

1. Line breaking happens before SEDCMD kicks in, so you must adjust your line breaker to work on the "non-trimmed" data and then remove the header/footer from the already-broken events.

2. Your line breaker doesn't account for newline characters.

3. You are aware that if you simply break at curly braces, you can't have any more levels in your JSON structure, because it will break your breaking, right?

So you should first break at

([\r\n\s]+){

Then you can use sedcmd to remove the dangling

[\s\n\r]*\][\s\n\r]*}

part.

And use a transform to drop (send to nullqueue) the header event.
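
Put together, a sketch of the whole thing (the stanza and transform names are examples, and the header regex assumes the chunks look exactly like your sample):

props.conf:

[source::http:lblogs]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n\s]+){
SEDCMD-remove_footer = s/[\s\n\r]*\][\s\n\r]*}$//g
TRANSFORMS-drop_header = drop_logs_header
# keep your existing TIME_PREFIX / TIME_FORMAT settings

transforms.conf:

[drop_logs_header]
# send the leftover { "logs": [ header event to the nullQueue
REGEX = ^\{\s*"logs"\s*:\s*\[\s*$
DEST_KEY = queue
FORMAT = nullQueue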

PickleRick
SplunkTrust

There is one more thing I missed. You said that you're receiving the events via a HEC input. Which endpoint are you using? I'd have to check about the /raw endpoint, but the /event endpoint bypasses line breaking completely. So regardless of what you set as the line breaker, the events come in as they are received.


isoutamo
SplunkTrust
Here is a description of the indexing pipelines used by the various HEC endpoints: https://www.aplura.com/assets/pdf/hec_pipelines.pdf

nmohammed
Builder

Hi @PickleRick, we're using the /event HEC endpoint, but even with that some of the events are getting transformed (split, as shared in the earlier screenshot).


PickleRick
SplunkTrust

No, they are not being broken. The SEDCMDs are being applied, and apparently they remove part of the event data, so the remaining data sometimes happens to be valid JSON and sometimes isn't. But it has nothing to do with event breaking.


nmohammed
Builder

Thanks.

I will try updating the HEC URL to /raw instead and test with the new line breaker configuration.


PickleRick
SplunkTrust

Remember that the /raw endpoint accepts just raw data, whereas the /event endpoint requires a specific format which it then "unpacks". So simply posting the same data to the /raw endpoint will result in differently represented events.
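
For the same event the two payloads would look roughly like this (hypothetical, abbreviated):

# POST to .../services/collector/event - the envelope is required,
# and the "event" field becomes the event body:
{"time": 1742255754.626915, "event": {"type": "https", "elb": "someELB", ...}}

# POST to .../services/collector/raw - the body is taken verbatim
# and goes through the line-breaking pipeline:
{"type": "https", "elb": "someELB", ...}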


nmohammed
Builder

Yes, I realized that. The data is processed by a Lambda to extract only the relevant information and is then sent over HEC.

We did try /raw, and it instead sent the log encapsulated in a root event field, as in the screenshot below (some fields masked):

(screenshot: nmohammed_0-1746752716310.png)

I tried the following based on the suggestion:

 

[source::http:lblogs]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n\s]+){
SEDCMD-remove-extras = s/[\s\n\r]*\][\s\n\r]*}//g
NO_BINARY_CHECK = true
TIME_PREFIX = \"timestamp\":\s+\"
pulldown_type = true
MAX_TIMESTAMP_LOOKAHEAD = 100
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%6N
TRUNCATE = 1000000

  


PickleRick
SplunkTrust

What puzzles me here is why you are trying to split the data on the receiving end when you have control over the sending solution, and it would be far easier and more maintainable to simply split the array at the source and send each array member as a separate event.

Also remember that when sending to the /event endpoint you're bypassing timestamp recognition (unless you append that parameter, which I always forget, to the URI), so you should send an explicit epoch-based timestamp along with your event. The upside is that you don't have to worry about date parsing in Splunk.
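
As an illustration only, a minimal Python sketch of the sending side, assuming the Lambda can use the requests library; the URL and token are placeholders:

import json
from datetime import datetime, timezone

import requests  # assumed to be packaged with the Lambda

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder

def send_logs(chunk):
    """Split a {"logs": [...]} chunk and send one HEC envelope per item."""
    headers = {"Authorization": "Splunk " + HEC_TOKEN}
    for item in chunk["logs"]:
        # /event skips timestamp recognition, so convert the ISO-8601
        # timestamp to an explicit epoch value and send it in "time".
        ts = datetime.strptime(item["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        epoch = ts.replace(tzinfo=timezone.utc).timestamp()
        requests.post(HEC_URL, headers=headers,
                      data=json.dumps({"time": epoch, "event": item}))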


nmohammed
Builder

There are a lot of events, and they're being sent in chunks to save on Lambda processing costs.


PickleRick
SplunkTrust

Honestly, that tells me nothing. If sending a JSON array is so much cheaper than sending the individual items from that array... there's something strange going on here.

BTW, you are aware that you can simply send your events in batches, and that that's how it's usually done in high-volume setups? That way you don't have to use a separate HTTP request for each event.
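
Continuing the hypothetical sketch from above: the /event endpoint accepts multiple envelopes concatenated in a single request body, so the whole chunk can go out in one POST:

import json
import requests  # assumed to be packaged with the Lambda

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder

def send_batch(items):
    """One HTTP request for many events: concatenated HEC envelopes."""
    body = "".join(json.dumps({"event": item}) for item in items)
    requests.post(HEC_URL,
                  headers={"Authorization": "Splunk " + HEC_TOKEN},
                  data=body)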
