Getting Data In

Why is my LINE_BREAKER parameter not breaking properly with multiple capture groups?

andrewtrobec
Motivator
 

Hello, I need help perfecting a sourcetype that doesn't index my JSON files correctly when I define multiple capture groups in the LINE_BREAKER parameter.

I'm using this other question to try to figure out how to make it work: https://community.splunk.com/t5/Getting-Data-In/How-to-handle-LINE-BREAKER-regex-for-multiple-captur... 

In my case my JSON looks like this:

[{"Field 1": "Value 1", "Field N": "Value N"}, {"Field 1": "Value 1", "Field N": "Value N"}, {"Field 1": "Value 1", "Field N": "Value N"}]

Initially I tried:

LINE_BREAKER = }(,\s){

This split the events, except for the first and last records, which were not indexed correctly because of the "[" and "]" characters that lead and trail the payload.

After many attempts I have been unable to make it work, but based on what I've read this seems to be the most intuitive solution for defining the capture groups:

LINE_BREAKER = ^([){|}(,\s){|}(])$

It doesn't work: the entire payload is indexed as one event, formatted correctly but unusable.

Could somebody please suggest how to correctly define the LINE_BREAKER parameter for the sourcetype?  Here is the full version I'm using:

[area:prd:json]
SHOULD_LINEMERGE = false
TRUNCATE = 8388608
TIME_PREFIX = \"Updated\sdate\"\:\s\"
TIME_FORMAT = %Y-%m-%d %H:%M:%S
TZ = Europe/Paris
MAX_TIMESTAMP_LOOKAHEAD = -1
KV_MODE = json
LINE_BREAKER = ^([){|}(,\s){|}(])$

Other resolutions to my problem are welcome as well!

Best regards,

Andrew


lucacaldiero
Path Finder

Dear splunk user,

using this sample data

[{"Field 859": "Value aaaaa", "Field 2": "Value bbbbb"}, {"Field 1": "Value ccccc", "Field 2": "Value ddddd"}, {"Field 1": "Value eeeee", "Field 2": "Value fffff"}]
[{"Field 759": "Value ggggg", "Field 2": "Value hhhhh"}, {"Field 1": "Value iiiii", "Field 2": "Value jjjjj"}, {"Field 1": "Value kkkkk", "Field 2": "Value lllll"}]

with this props.conf

[trbndrw_temp]
DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false
LINE_BREAKER = (?:\}(\s*,\s*)\{)|(\][\r\n]+\[)
TRANSFORMS-getrid = getridht

and this transforms.conf

[getridht]
INGEST_EVAL = _raw=replace(_raw, "(\[|\])","")

you may be able to achieve what you want

Happy splunking
Luca (aka "one DASH is always better")
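
For anyone who wants to see what this configuration does before deploying it, here is a rough Python approximation. It is not Splunk code: `break_events` and `strip_brackets` are hypothetical helpers that mimic LINE_BREAKER as I understand the props.conf documentation (the text matched by the capture group is discarded and becomes the event boundary; with top-level alternation, the leftmost group that took part in the match is used) and the INGEST_EVAL replace. Python's `re` module stands in for Splunk's regex engine, and the sample is the data above with every record written as valid JSON.

```python
import json
import re

# Sample data from the post, with every record written as valid JSON.
raw = (
    '[{"Field 859": "Value aaaaa", "Field 2": "Value bbbbb"}, '
    '{"Field 1": "Value ccccc", "Field 2": "Value ddddd"}, '
    '{"Field 1": "Value eeeee", "Field 2": "Value fffff"}]\n'
    '[{"Field 759": "Value ggggg", "Field 2": "Value hhhhh"}, '
    '{"Field 1": "Value iiiii", "Field 2": "Value jjjjj"}, '
    '{"Field 1": "Value kkkkk", "Field 2": "Value lllll"}]'
)

LINE_BREAKER = r"(?:\}(\s*,\s*)\{)|(\][\r\n]+\[)"


def break_events(text, pattern):
    """Hypothetical helper: cut the text wherever a capture group matched."""
    events, start = [], 0
    for m in re.finditer(pattern, text):
        # Leftmost capture group that actually took part in this match.
        group = next(i for i in range(1, m.re.groups + 1) if m.group(i) is not None)
        events.append(text[start:m.start(group)])  # event ends where the group starts
        start = m.end(group)                       # next event starts where it ends
    events.append(text[start:])
    return events


def strip_brackets(event):
    """Hypothetical stand-in for the INGEST_EVAL replace in transforms.conf."""
    return re.sub(r"(\[|\])", "", event)


events = [strip_brackets(e) for e in break_events(raw, LINE_BREAKER)]
for e in events:
    print(json.loads(e))
```

Because the braces sit outside the capture groups, each event keeps its own { and } and still parses as a JSON object in this sketch. One thing to keep in mind: the replace removes every [ and ] in the event, including any that occur inside field values.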

andrewtrobec
Motivator

Thanks Luca, this works!  Appreciated!


isoutamo
SplunkTrust
Hi
Based on your TIME_PREFIX, your example is not a complete sample. If you want us to help, we really need the whole example JSON file.
r. Ismo

andrewtrobec
Motivator

Thank you @isoutamo for the response. Here is a more accurate version of the payload:

[
    {
        "Assigned to": "Jones, Francis",
        "Cost": 3,
        "Created date": "2024-02-28 12:52:18",
        "Extraction date": "2024-03-02 13:51:00",
        "ID": 12345,
        "Initial Cost": 3,
        "Location": "Sites",
        "Path": "Sites\\FY1\\S3",
        "Priority": 1,
        "State": "In Progress",
        "Status Change date": "2024-03-05 16:33:23",
        "Tags": "Europe; Finance",
        "Title": "Ensure correct routing of orders",
        "Updated date": "2024-03-05 16:33:23",
        "Warranty": false,
        "Wave Quarter": "Q2 22",
        "Work Item Type": "Request"
    },
    {
        "Assigned to": "Jones, Francis",
        "Cost": 3,
        "Created date": "2024-02-28 18:59:18",
        "Extraction date": "2024-03-05 16:31:00",
        "ID": 12345,
        "Initial Cost": 3,
        "Location": "Sites",
        "Path": "Sites\\FY1\\S3",
        "Priority": 1,
        "State": "In Progress",
        "Status Change date": "2024-03-05 16:33:23",
        "Tags": "Europe; Finance",
        "Title": "Ensure correct routing of orders",
        "Updated date": "2024-03-05 16:33:23",
        "Warranty": false,
        "Wave Quarter": "Q2 22",
        "Work Item Type": "Request"
    },
    {
        "Assigned to": "Jones, Francis",
        "Cost": 3,
        "Created date": "2023-01-28 18:59:18",
        "Extraction date": "2023-02-05 16:31:00",
        "ID": 12345,
        "Initial Cost": 3,
        "Location": "Sites",
        "Path": "Sites\\FY1\\S3",
        "Priority": 1,
        "State": "In Progress",
        "Status Change date": "2023-02-05 16:33:23",
        "Tags": "Europe; Finance",
        "Title": "Ensure correct routing of orders",
        "Updated date": "2024-03-05 16:33:23",
        "Warranty": false,
        "Wave Quarter": "Q2 22",
        "Work Item Type": "Request"
    }
]

isoutamo
SplunkTrust

Thanks.

This seems to work:

LINE_BREAKER = (\[[\s\n\r]*\{|\},[\s\n\r]+\{|\}[\s\n\r]*)

Why doesn't your regex work?

Splunk needs only one capture group for the line break. You have three separate groups, even though you tried to make them alternatives with |. You also need to escape some of those characters (such as [ { ] }) so they are recognised as literal characters. You can test this with https://regex101.com/r/IGQHd7/1
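
The escaping point can be checked outside Splunk as well. The sketch below uses Python's `re` module, which is not Splunk's regex engine but treats an unescaped [ the same way: it opens a character class that runs to the first ]. The two-record `payload` is a shortened copy of the original sample, and the `escaped` pattern is only an illustration of what the original attempt was presumably meant to say, not a recommended LINE_BREAKER.

```python
import re

payload = '[{"Field 1": "Value 1", "Field N": "Value N"}, {"Field 1": "Value 1", "Field N": "Value N"}]'

# Original attempt: the unescaped "[" opens a character class that runs to the first "]",
# so the engine sees one group holding a single-character class, not three alternatives.
attempt = re.compile(r"^([){|}(,\s){|}(])$")
print(attempt.groups)           # number of capture groups the engine sees
print(attempt.search(payload))  # no match: the input would have to be one character from that class

# The same idea with brackets and braces escaped: three alternatives, three groups.
escaped = re.compile(r"^(\[)\{|\}(,\s)\{|\}(\])$")
print(escaped.groups)
print([m.group(0) for m in escaped.finditer(payload)])
```

Since the original pattern never matches the payload, no break point is ever found, which fits the observation that the whole payload was indexed as a single event.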

When I test these, I use regex101.com and/or the Splunk GUI (Settings -> Add Data -> Upload) with an example file on my own laptop/workstation/dev server. That way it's easy to change the values and check how they affect the result.

You should also change

MAX_TIMESTAMP_LOOKAHEAD = 20

Since you define TIME_PREFIX, there is no reason to use -1 as the lookahead value. Splunk starts looking for the timestamp after the defined prefix and, as you can see, the correct timestamp is within 20 characters after it.
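
To make the 20-character figure concrete, here is a simplified model of the lookahead in plain Python. The variable names mirror the props.conf settings, but this is only an illustration under that assumption, not Splunk's timestamp processor; the `event` fragment is taken from the payload posted above.

```python
import re
from datetime import datetime

# Fragment of one record from the payload posted above.
event = (
    '"Title": "Ensure correct routing of orders",\n'
    '        "Updated date": "2024-03-05 16:33:23",\n'
    '        "Warranty": false'
)

TIME_PREFIX = r'\"Updated\sdate\"\:\s\"'
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
MAX_TIMESTAMP_LOOKAHEAD = 20

prefix = re.search(TIME_PREFIX, event)
# Only the characters right after the prefix are inspected.
window = event[prefix.end():prefix.end() + MAX_TIMESTAMP_LOOKAHEAD]
stamp_text = re.match(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", window).group(0)
stamp = datetime.strptime(stamp_text, TIME_FORMAT)
print(repr(window), len(stamp_text), stamp)
```

The timestamp text is 19 characters long, so a 20-character window after the prefix covers it completely.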

Why have you set KV_MODE=json? Once you have broken this JSON into separate events, they are no longer in JSON format. Each one is now just a regular text-based event.
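
A rough way to see this is to apply the LINE_BREAKER above to a shortened payload in Python and check whether the resulting pieces still parse as JSON. `break_events` and `is_json` are hypothetical helpers, Python's `re` stands in for Splunk's regex engine, and the pieces are stripped of surrounding whitespace only to keep the output readable.

```python
import json
import re

# Shortened version of the payload posted above (two records, two fields each).
payload = (
    '[\n'
    '    {\n'
    '        "ID": 12345,\n'
    '        "State": "In Progress"\n'
    '    },\n'
    '    {\n'
    '        "ID": 12345,\n'
    '        "State": "In Progress"\n'
    '    }\n'
    ']'
)

LINE_BREAKER = r"(\[[\s\n\r]*\{|\},[\s\n\r]+\{|\}[\s\n\r]*)"


def break_events(text, pattern):
    """Hypothetical helper: drop whatever group 1 matched and cut the text there."""
    events, start = [], 0
    for m in re.finditer(pattern, text):
        events.append(text[start:m.start(1)])
        start = m.end(1)
    events.append(text[start:])
    return [e.strip() for e in events if e.strip()]


def is_json(text):
    """Return True if the text parses as a JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


events = break_events(payload, LINE_BREAKER)
for e in events:
    print(repr(e), is_json(e))
```

Because the braces are inside the capture group, they are discarded together with the separators, and each remaining piece is a bare list of key/value pairs. In this approximation a trailing "]" fragment is also left over; whether Splunk indexes it as its own event is worth checking in the upload preview.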

 


andrewtrobec
Motivator

Thank you for the feedback!  I will take your suggestions into consideration!
