Need help with linebreaker for array of json objects

lyndac
Contributor

I am indexing json files. Each file contains an array of around 1,000 json objects (with nested arrays/objects). I need to extract each object as a single event. (See sample json source and props.conf below).

I used the "Add Data" button in the UI to index the file, and it looks like it gets all the events. If I search for all the events, the first json object does show up. However, it looks like KV_MODE=json stumbles on the initial [ and is unable to extract the fields for that event: if I search for one of the fields in the data (index=foo coach="matt"), the event is not returned, but if I search for just the value of the field (index=foo matt), the event is returned.

How do I modify my props.conf to correctly handle the first object in the array?

[
    {    
        "team" : "spirit",        
        "coach": "matt",
        "regDate": "2016-07-31T12:23:34Z",
        "players": [
          {
            "name":"Marissa",
            "positions": ["2B", "P", "C", "RF"]
          },
          {
            "name":"Sierra",
            "positions": ["SS","LF"]
          }
        ]
    },
    {    
        "team" : "chill",        
        "coach": "bob",
        "regDate": "2016-08-01T12:15:19Z",
        "players": [
          {
            "name":"Rhi",
            "positions": ["3B", "CF","1B"]
          }
        ]
    }
]

This is my props.conf:

 [json_linebreaker]
 JSON_TRIM_BRACES_IN_ARRAY_NAMES=true
 KV_MODE=json
 LINE_BREAKER=\s{4}\},(,[\n\r])\s{4}\{(.*)
 MAX_TIMESTAMP_LOOKAHEAD=30
 NO_BINARY_CHECK=true
 SHOULD_LINEMERGE=true
 TIME_FORMAT=%Y-%m-%dT%H:%M:%S%Z
 TIME_PREFIX=regDate\"\s*:\s*\"
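
To illustrate the suspicion above that the leading [ is what defeats field extraction, here is a minimal Python sketch. It is only an analogy using Python's strict json module, not Splunk's own KV_MODE=json extractor, and the shortened event text and the try_parse helper are assumptions made up for the illustration.

```python
import json

# Hypothetical first event as it might look when the array's opening
# bracket stays attached to the first object.
first_event = '[\n    {"team": "spirit", "coach": "matt"}'


def try_parse(text):
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# With the stray bracket the text is an unterminated array, so parsing fails.
with_bracket = try_parse(first_event)

# Once the bracket and surrounding whitespace are removed, the object parses.
without_bracket = try_parse(first_event.lstrip("[ \n"))

print(with_bracket, without_bracket)
```

A raw-text search such as index=foo matt would still match an event like this, because it does not depend on the fields having been extracted.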

lyndac
Contributor

Finally got this working by using a PREAMBLE_REGEX to discard the opening array bracket. Posting the props.conf here for completeness (in case someone else has this issue).

[json_linebreaker]
JSON_TRIM_BRACES_IN_ARRAY_NAMES=true
KV_MODE=json
PREAMBLE_REGEX=^\s{0,2}\[
LINE_BREAKER=\s{4}},(,[\n\r])\s{4}({.)
MAX_TIMESTAMP_LOOKAHEAD=30
NO_BINARY_CHECK=true
SHOULD_LINEMERGE=false
TIME_FORMAT=%Y-%m-%dT%H:%M:%S%Z
TIME_PREFIX=regDate\"\s*:\s*\"


aliakseidzianis
Path Finder

Of course, I only have a small sample of your data, but this seems to be working. The main challenge is the line breaking, as you mentioned. Assuming that the first element of each json object is always the same (in your case, it starts with "team"), this regex should work:

LINE_BREAKER = (,*\s+){\s+"team"

Once you have events breaking properly, the only thing left is to clean up the opening and closing square brackets with SEDCMD. The finished props.conf looks like this:

[answers]
LINE_BREAKER = (,*\s+){\s+"team"
TIME_PREFIX = regDate":\s"
MAX_TIMESTAMP_LOOKAHEAD = 30
NO_BINARY_CHECK = true
disabled = false
KV_MODE = json
SEDCMD-remove_opening = s/^\[//g
SEDCMD-remove_closing = s/\]$//g
JSON_TRIM_BRACES_IN_ARRAY_NAMES = true
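
The effect of the LINE_BREAKER and SEDCMD settings above can be approximated outside Splunk. The following Python sketch assumes the documented LINE_BREAKER behaviour (the text matched by the first capture group is discarded and marks the event boundary) and uses re.sub as a rough stand-in for the two SEDCMD rules. The function names are hypothetical, the sample is a trimmed copy of the data from the question with the missing comma after "bob" added, and dropping events that end up empty is an assumption of the sketch rather than a statement about Splunk's pipeline.

```python
import json
import re

# Trimmed copy of the sample from the question (comma after "bob" added
# so that each object is valid JSON).
RAW = """[
    {
        "team" : "spirit",
        "coach": "matt",
        "regDate": "2016-07-31T12:23:34Z",
        "players": [
          {
            "name":"Marissa",
            "positions": ["2B", "P", "C", "RF"]
          }
        ]
    },
    {
        "team" : "chill",
        "coach": "bob",
        "regDate": "2016-08-01T12:15:19Z",
        "players": [
          {
            "name":"Rhi",
            "positions": ["3B", "CF","1B"]
          }
        ]
    }
]
"""

# Same pattern as the LINE_BREAKER above; the brace is escaped for Python's re.
LINE_BREAKER = re.compile(r'(,*\s+)\{\s+"team"')


def break_events(raw):
    """Split raw text at each LINE_BREAKER match, dropping capture group 1."""
    events, start = [], 0
    for match in LINE_BREAKER.finditer(raw):
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    events.append(raw[start:])
    return events


def sed_cleanup(event):
    """Stand-in for the SEDCMD rules: strip a leading [ and a trailing ]."""
    event = re.sub(r'^\[', '', event)
    return re.sub(r'\]$', '', event)


# Assumption of this sketch: events left empty after cleanup are skipped.
cleaned = [sed_cleanup(e) for e in break_events(RAW)]
parsed = [json.loads(e) for e in cleaned if e.strip()]
coaches = [obj["coach"] for obj in parsed]
print(coaches)
```

Because the pattern requires "team" right after the opening brace, the nested objects inside "players" do not trigger a break, which is why anchoring on the first field of each object is enough here.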

I had a similar issue, but my json objects were nested inside yet another json array. The same solution worked there too. As long as you can line break on the first field of the object, you should be fine.

{
  "Records": [
    {
        "team" : "spirit",
        "coach": "matt",
        "regDate": "2016-07-31T12:23:34Z"
    },
    {
        "team" : "chill",
        "coach": "bob",
        "regDate": "2016-08-01T12:15:19Z"
    }
  ]
}

I also spoke with someone from Splunk. They do realize that json arrays are a common data structure nowadays, and they have an internal Jira task for this as a feature request.

I hope it helps!

kkrishnan_splun
Splunk Employee

Thank you so much. This helped a ton !!


lyndac
Contributor

The events are breaking correctly; it's just that pesky initial square bracket. I changed SHOULD_LINEMERGE to false, and it didn't seem to change anything.


lyndac
Contributor

I've been playing with the regex all day today. The most recent incantation is:

LINE_BREAKER=(^[[\n\r]+)|\s{4}},(,[\n\r])\s{4}{(.*)

My thinking was if I could break the [ into its own event, then I could throw away that event using a transform. However, it is still keeping the [ with the first object and now is splitting the event at random spots.


lquinn
Contributor

Are your events breaking correctly? If you have set LINE_BREAKER, then SHOULD_LINEMERGE should be set to false, not true. For some reason, setting this through the UI does not work: Splunk reverts it to true and adds a BREAK_ONLY_BEFORE setting alongside the line breaker. This could be causing part of the problem that you are seeing.
