Hello, I need help perfecting a sourcetype that doesn't index my JSON files correctly when I define multiple capture groups in the LINE_BREAKER parameter.
I'm using this other question to try to figure out how to make it work: https://community.splunk.com/t5/Getting-Data-In/How-to-handle-LINE-BREAKER-regex-for-multiple-captur...
In my case, my JSON looks like this:
[{"Field 1": "Value 1", "Field N": "Value N"}, {"Field 1": "Value 1", "Field N": "Value N"}, {"Field 1": "Value 1", "Field N": "Value N"}]
Initially I tried:
LINE_BREAKER = }(,\s){
This split the events correctly, with the exception of the first and last records, which were not indexed properly because of the leading "[" and trailing "]" characters around the payload.
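To illustrate, the break behaviour can be roughly simulated in Python. The `splunk_break` helper below is a hypothetical sketch approximating the documented LINE_BREAKER semantics (the text matched by the capture group is discarded); it is not a Splunk API:

```python
import re

def splunk_break(raw, breaker):
    """Rough simulation of Splunk's LINE_BREAKER: the text matched by
    the capture group is discarded, everything else stays attached to
    the neighbouring events."""
    events, pos = [], 0
    for m in re.finditer(breaker, raw):
        events.append(raw[pos:m.start(1)])  # event ends where the group starts
        pos = m.end(1)                      # next event starts where it ends
    events.append(raw[pos:])
    return events

# A shortened stand-in for the payload above
payload = '[{"a": 1}, {"b": 2}, {"c": 3}]'

# LINE_BREAKER = }(,\s){
for event in splunk_break(payload, r'\}(,\s)\{'):
    print(event)
# [{"a": 1}   <- leading "[" sticks to the first record
# {"b": 2}
# {"c": 3}]   <- trailing "]" sticks to the last record
```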
After many attempts I have been unable to make it work, but based on what I've read, this seemed like the most intuitive way to define the capture groups:
LINE_BREAKER = ^([){|}(,\s){|}(])$
It doesn't work; instead, it indexes the entire payload as one event, formatted correctly but unusable.
Could somebody please suggest how to correctly define the LINE_BREAKER parameter for the sourcetype? Here is the full version I'm using:
[area:prd:json]
SHOULD_LINEMERGE = false
TRUNCATE = 8388608
TIME_PREFIX = \"Updated\sdate\"\:\s\"
TIME_FORMAT = %Y-%m-%d %H:%M:%S
TZ = Europe/Paris
MAX_TIMESTAMP_LOOKAHEAD = -1
KV_MODE = json
LINE_BREAKER = ^([){|}(,\s){|}(])$
Other resolutions to my problem are welcome as well!
Best regards,
Andrew
Dear Splunk user,
using this sample data
[{"Field 859": "Value aaaaa", "Field 2": "Value bbbbb"}, {"Field 1": "Value ccccc", "Field 2": "Value ddddd"}, {"Field 1": "Value eeeee", "Field 2": "Value fffff"}]
[{"Field 759": "Value ggggg", "Field 2": "Value hhhhh"}, {"Field 1": "Value iiiii", "Field 2": "Value jjjjj"}, {"Field 1": "Value kkkkk", "Field 2": "Value lllll"}]
with this props.conf
[trbndrw_temp]
DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false
LINE_BREAKER = (?:\}(\s*,\s*)\{)|(\][\r\n]+\[)
TRANSFORMS-getrid = getridht
and this transforms.conf
[getridht]
INGEST_EVAL = _raw=replace(_raw, "(\[|\])","")
you may be able to achieve what you want
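For anyone who wants to sanity-check the INGEST_EVAL outside Splunk, the replace() call corresponds roughly to this Python sketch (an assumed equivalence for illustration, not an official mapping):

```python
import re

# Events roughly as LINE_BREAKER would leave them, with the stray
# leading "[" and trailing "]" still attached
events = ['[{"Field 1": "Value a"}', '{"Field 2": "Value b"}]']

# Equivalent of: INGEST_EVAL = _raw=replace(_raw, "(\[|\])","")
cleaned = [re.sub(r'(\[|\])', '', e) for e in events]
print(cleaned)  # ['{"Field 1": "Value a"}', '{"Field 2": "Value b"}']
```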
Happy splunking
Luca (aka "one DASH is always better")
Thanks Luca, this works! Appreciated!
Thank you @isoutamo for the response. Here is a more accurate version of the payload:
[
{
"Assigned to": "Jones, Francis",
"Cost": 3,
"Created date": "2024-02-28 12:52:18",
"Extraction date": "2024-03-02 13:51:00",
"ID": 12345,
"Initial Cost": 3,
"Location": "Sites",
"Path": "Sites\\FY1\\S3",
"Priority": 1,
"State": "In Progress",
"Status Change date": "2024-03-05 16:33:23",
"Tags": "Europe; Finance",
"Title": "Ensure correct routing of orders",
"Updated date": "2024-03-05 16:33:23",
"Warranty": false,
"Wave Quarter": "Q2 22",
"Work Item Type": "Request"
},
{
"Assigned to": "Jones, Francis",
"Cost": 3,
"Created date": "2024-02-28 18:59:18",
"Extraction date": "2024-03-05 16:31:00",
"ID": 12345,
"Initial Cost": 3,
"Location": "Sites",
"Path": "Sites\\FY1\\S3",
"Priority": 1,
"State": "In Progress",
"Status Change date": "2024-03-05 16:33:23",
"Tags": "Europe; Finance",
"Title": "Ensure correct routing of orders",
"Updated date": "2024-03-05 16:33:23",
"Warranty": false,
"Wave Quarter": "Q2 22",
"Work Item Type": "Request"
},
{
"Assigned to": "Jones, Francis",
"Cost": 3,
"Created date": "2023-01-28 18:59:18",
"Extraction date": "2023-02-05 16:31:00",
"ID": 12345,
"Initial Cost": 3,
"Location": "Sites",
"Path": "Sites\\FY1\\S3",
"Priority": 1,
"State": "In Progress",
"Status Change date": "2023-02-05 16:33:23",
"Tags": "Europe; Finance",
"Title": "Ensure correct routing of orders",
"Updated date": "2024-03-05 16:33:23",
"Warranty": false,
"Wave Quarter": "Q2 22",
"Work Item Type": "Request"
}
]
Thanks.
This seems to work
LINE_BREAKER = (\[[\s\n\r]*\{|\},[\s\n\r]+\{|\}[\s\n\r]*)
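Run against a shortened version of the pretty-printed payload, Python's re.split shows how this single capture group breaks the records apart (a rough approximation of Splunk's behaviour, which discards the matched delimiter):

```python
import re

# Shortened stand-in for the pretty-printed payload
raw = '[\n{\n"ID": 12345,\n"State": "In Progress"\n},\n{\n"ID": 12346,\n"State": "New"\n}\n]'

breaker = r'(\[[\s\n\r]*\{|\},[\s\n\r]+\{|\}[\s\n\r]*)'
# re.split keeps the captured delimiter at odd indices; Splunk
# discards it, so keep only the even-indexed pieces.
pieces = re.split(breaker, raw)[::2]
events = [p.strip() for p in pieces if p.strip()]
for e in events:
    print(e)
# Note: in this rough simulation the closing "]" survives as a stray
# fragment, and the braces around each record are consumed, so the
# resulting events are no longer complete JSON objects.
```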
Why doesn't your regex work?
Splunk needs exactly one capture group for the line break. You have three separate groups, even though you tried to make them alternatives with |. You also need to escape some of those characters (like [ { ] }) for them to be recognised as literals. You can test this with https://regex101.com/r/IGQHd7/1
When I test these, I just use regex101.com and/or the Splunk GUI -> Settings -> Add Data -> Upload with an example file on my own laptop/workstation/dev server. That way it's easy to change the values and see how they affect the result.
You should also change
MAX_TIMESTAMP_LOOKAHEAD = 20
Since you define TIME_PREFIX, there is no reason to use -1 as the lookahead value. Splunk starts looking for the timestamp after the defined prefix, and as you can see, the complete timestamp falls within 20 characters after it.
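As a quick check, here is a rough Python sketch of what the timestamp extractor does with the TIME_PREFIX regex from the original props.conf (an illustration, not Splunk's actual implementation):

```python
import re

event = '"Updated date": "2024-03-05 16:33:23",'

# TIME_PREFIX from the original props.conf
prefix = re.compile(r'\"Updated\sdate\"\:\s\"')

m = prefix.search(event)
# MAX_TIMESTAMP_LOOKAHEAD = 20: only this window after the prefix is scanned
window = event[m.end():m.end() + 20]
print(window)  # the 19-character timestamp fits inside the 20-character window
```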
Why have you set KV_MODE=json? Since you have broken this JSON into separate events, the data is no longer JSON as a format. Now each is just a regular text-based event.
Thank you for the feedback! I will take your suggestions into consideration!