LINE_BREAKER in single line printed JSON doc

Contributor

I have a JSON doc that prints events like so:

{"id":72,"stationName":"W 52 St & 11 Ave","availableDocks":1,"totalDocks":39,"latitude":40.76727216,"longitude":-73.99392888,"statusValue":"In Service","statusKey":1,"availableBikes":35,"stAddress1":"W 52 St & 11 Ave","stAddress2":"","city":"","postalCode":"","location":"","altitude":"","testStation":false,"lastCommunicationTime":"2015-08-12 04:30:26 AM","landMark":""},{"id":79,"stationName":"Franklin St & W Broadway","availableDocks":32,"totalDocks":33,"latitude":40.71911552,"longitude":-74.00666661,"statusValue":"In Service","statusKey":1,"availableBikes":0,"stAddress1":"Franklin St & W Broadway","stAddress2":"","city":"","postalCode":"","location":"","altitude":"","testStation":false,"lastCommunicationTime":"2015-08-12 04:33:44 AM","landMark":""},.........

Each new event starts immediately before: {"id"

Given that this doc is just one line, I believe BREAK_ONLY_BEFORE = \{"id" in props.conf won't work. Can someone confirm?

So I'm now wrestling with LINE_BREAKER, with SHOULD_LINEMERGE = false.

I tried the approach from this answer but cannot get it to work:

LINE_BREAKER = \{\"id

Having read other threads and the docs, I think this is incorrect, though. From the docs:

"Wherever the regex matches, Splunk considers the start of the first matching group to be the end of the previous event, and considers the end of the first matching group to be the start of the next event."

"You are telling Splunk that this text comes between lines."
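The capturing-group rule quoted above can be sketched in Python. This is only an emulation under stated assumptions, not Splunk's implementation: `break_events` is a hypothetical helper, the sample string is a shortened stand-in for the feed, and the pattern `(,)\{"id"` is an illustrative choice. Note that `\{\"id` contains no capturing group at all, which the props.conf spec requires; the sketch raises an error in that case as its own design choice.

```python
import json
import re

def break_events(raw, line_breaker):
    """Emulate the documented LINE_BREAKER rule (a sketch, not Splunk's code)."""
    pattern = re.compile(line_breaker)
    if pattern.groups < 1:
        # The spec asks for a capturing group; raising here is this sketch's choice.
        raise ValueError("LINE_BREAKER needs a capturing group")
    events, start = [], 0
    for match in pattern.finditer(raw):
        events.append(raw[start:match.start(1)])  # previous event ends where group 1 starts
        start = match.end(1)                      # next event starts where group 1 ends
    events.append(raw[start:])
    return events

# Shortened, hypothetical stand-in for the single-line feed in the question.
raw = '{"id":72,"statusKey":1},{"id":79,"statusKey":1}'

# Only the comma sits inside the group, so only the comma is discarded.
events = break_events(raw, r'(,)\{"id"')
for event in events:
    print(event, "->", json.loads(event)["id"])
```

Because the group holds just the separator, both braces stay in place and each event remains parseable JSON.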

I have tried a number of capturing groups, but still haven't cracked it. Does anyone have any ideas?

In addition, in what order are LINE_BREAKER and SEDCMD applied in the processing pipeline? If I modify an event using SEDCMD, can I base my LINE_BREAKER on the transformed event?

1 Solution

Splunk Employee

You can use the following to break the records.

# props.conf
[sourcetype]
BREAK_ONLY_BEFORE = \{"id":\d+,

If you go this route, you will also need to remove the trailing comma from each record. Why? If you leave the trailing comma, the record is no longer valid JSON and Splunk will treat it as plain text. The result is not catastrophic, but you would then need to extract the fields manually, through regular expressions in the props and transforms configuration.

SEDCMD-remove_trailing_comma = s/\},/}/g
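To see why the trailing comma matters, here is a minimal Python check using the standard `json` module as a stand-in for any strict JSON parser. The record is a shortened copy of the first station from the question, and `is_valid_json` is a hypothetical helper name.

```python
import json

# Shortened copy of the first record from the question.
record = '{"id":72,"stationName":"W 52 St & 11 Ave","availableDocks":1}'

def is_valid_json(text):
    """Return True if text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json(record))        # the bare object
print(is_valid_json(record + ","))  # the same object with a trailing comma
```

A strict parser rejects the second form because of the extra data after the closing brace, which is the situation the SEDCMD above is meant to avoid.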

--

The alternative is to use a little trick where we synthetically insert a throw-away token before the line breaker. For instance:

# props.conf
[sourcetype]
SEDCMD-add_throw_away = s/(,\{|\{)/br\1/g
SHOULD_LINEMERGE = false
LINE_BREAKER = (br,|br)

Either one of these should work.


EDIT 1: I tried a mix of the recipes above, and this works:

[sourcetype]
LINE_BREAKER = (\},)\{
SHOULD_LINEMERGE = false
SEDCMD-add_closing_bracket = s/\"$/"}/g
TIME_PREFIX = lastCommunicationTime\":\"

This is what it produces:

[screenshot]
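The EDIT 1 recipe can be traced in Python as a rough sketch. It assumes that the first capturing group is discarded at each break and that the SEDCMD is then applied to each resulting event; `break_events` is a hypothetical helper and the two records are trimmed copies of the sample data.

```python
import json
import re

def break_events(raw, line_breaker):
    """Split raw at each match, discarding the first capturing group (sketch only)."""
    events, start = [], 0
    for match in re.finditer(line_breaker, raw):
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    events.append(raw[start:])
    return events

# Trimmed copies of the two sample records, on one line as in the feed.
raw = ('{"id":72,"lastCommunicationTime":"2015-08-12 04:30:26 AM","landMark":""},'
       '{"id":79,"lastCommunicationTime":"2015-08-12 04:33:44 AM","landMark":""}')

# LINE_BREAKER = (\},)\{  discards "}," so every event but the last loses its brace.
events = break_events(raw, r'(\},)\{')

# SEDCMD-add_closing_bracket = s/\"$/"}/g  puts the brace back where it is missing.
repaired = [re.sub(r'\"$', '"}', event) for event in events]

ids = [json.loads(event)["id"] for event in repaired]
print(ids)
```

The last event keeps its original closing brace, so the substitution leaves it untouched.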

--gc

EDIT 2: Thanks for sharing the data. I modified the props entry slightly to deal with the array notation. This worked on my end with the data you supplied. We end up with 507 valid JSON message strings.

[answers-1439488145]
LINE_BREAKER = (\},)\{|(\[)\{|(\])\}
SHOULD_LINEMERGE = false
TIME_PREFIX = lastCommunicationTime\":\"
SEDCMD-add_closing_bracket = s/\"$/"}/g
SEDCMD-remove_header = s/\{\"executionTime.+?$//g

I don't know what that means on your side. I've attached another picture.

[screenshot]


Path Finder

Thanks Gilberto Castillo and himynamesdave, this was very helpful.

I thought I would post what worked for me; I had to tweak the above solution slightly.

The "SEDCMD-add_closing_bracket" didn't work in my scenario for some reason - I always ended up with a single event with "}" as its content. I guess it was adding to the end of the original event, not to each broken line.

My JSON data was in the same format except there was a longer "header" section before the actual events I wanted.

And in my data, "tts" is the timestamp in epoch milliseconds.

Here is some example data:

{"header":{"pulldata":"graphdata" ... *etc* ... "graph_object":[
{"sname":"measurement Tokyo","id":91820056,"tts":1441065717000,"tux":2704,"count":1},
{"sname":"measurement New York","id":91820060,"tts":1441065488000,"tux":6047,"count":1},
{"sname":"measurement London","id":91820090,"tts":1441065646000,"tux":6817,"count":1}
]}

So, instead of pulling out the end bracket and re-adding it as in the example above, I only pulled out the comma with the first capturing group.

LINE_BREAKER = \}(,)\{

From props.conf:
"The contents of the first capturing group are discarded, and will not be present in any event. You are telling Splunk that this text comes between lines."

That was sufficient to delimit each event and only removed the comma, not either bracket.
Then I added a SEDCMD to remove the long header and a SEDCMD to remove the two-character footer, specified no merging of events, and added my time prefix.

SEDCMD-remove_header = s/\{\"header.+?graph_object\":\[//g
SEDCMD-remove_footer = s/\]\}//g
SHOULD_LINEMERGE = false
TIME_PREFIX = \"tts\":

Those five lines in the custom sourcetype stanza of ../local/props.conf worked for me, and I ended up with individual events as JSON message strings, with "tts" as the _time value.

In my case, there was no need for extra props.conf entries like TIME_FORMAT, TRUNCATE, KV_MODE, MAX_TIMESTAMP_LOOKAHEAD, NO_BINARY_CHECK, etc.
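The five settings above can be walked through in Python as a sketch. Assumptions: the data arrives on a single line, the SEDCMDs run on each event after breaking, and the header below is a hypothetical stand-in for the elided one in the example. `break_events` is a hypothetical helper, and the timestamp conversion only illustrates what the epoch-millisecond "tts" values represent.

```python
import json
import re
from datetime import datetime, timezone

def break_events(raw, line_breaker):
    """Split raw at each match, discarding the first capturing group (sketch only)."""
    events, start = [], 0
    for match in re.finditer(line_breaker, raw):
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    events.append(raw[start:])
    return events

# Hypothetical short header; the real one is longer and elided in the post.
raw = ('{"header":{"pulldata":"graphdata"},"graph_object":['
       '{"sname":"measurement Tokyo","id":91820056,"tts":1441065717000,"tux":2704,"count":1},'
       '{"sname":"measurement New York","id":91820060,"tts":1441065488000,"tux":6047,"count":1},'
       '{"sname":"measurement London","id":91820090,"tts":1441065646000,"tux":6817,"count":1}'
       ']}')

events = break_events(raw, r'\}(,)\{')  # only the comma is discarded

cleaned = []
for event in events:
    event = re.sub(r'\{\"header.+?graph_object\":\[', '', event)  # SEDCMD-remove_header
    event = re.sub(r'\]\}', '', event)                            # SEDCMD-remove_footer
    cleaned.append(json.loads(event))

# "tts" is epoch milliseconds; convert to UTC to show the event times.
times = [datetime.fromtimestamp(doc["tts"] / 1000, tz=timezone.utc) for doc in cleaned]
for doc, when in zip(cleaned, times):
    print(doc["sname"], when.isoformat())
```

Keeping both braces out of the capturing group is what lets every event parse without a repair step.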

Revered Legend

Try this for your props.conf

[YourSourceType]
LINE_BREAKER=(\},\{)
MAX_TIMESTAMP_LOOKAHEAD=150
MUST_BREAK_AFTER=^\{\s*\"id\"
NO_BINARY_CHECK=1
SEDCMD-addheader=s/^(.*)/{\1}/g
SEDCMD-removejunk1=s/\{\{/{/g
SEDCMD-removejunk2=s/\}\}/}/g
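As a rough Python trace of the settings above (a sketch under assumptions, not Splunk itself): the whole `},{` separator is discarded at each break, then the three SEDCMDs are applied to each event in turn. `break_events` is a hypothetical helper, the records are trimmed stand-ins, and MUST_BREAK_AFTER, MAX_TIMESTAMP_LOOKAHEAD and NO_BINARY_CHECK are not emulated.

```python
import json
import re

def break_events(raw, line_breaker):
    """Split raw at each match, discarding the first capturing group (sketch only)."""
    events, start = [], 0
    for match in re.finditer(line_breaker, raw):
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    events.append(raw[start:])
    return events

# Trimmed, hypothetical stand-in for three records on one line.
raw = '{"id":72,"landMark":""},{"id":79,"landMark":""},{"id":82,"landMark":""}'

# LINE_BREAKER=(\},\{) removes both braces around every boundary.
events = break_events(raw, r'(\},\{)')

fixed = []
for event in events:
    event = re.sub(r'^(.*)', r'{\1}', event)  # SEDCMD-addheader: wrap in braces
    event = re.sub(r'\{\{', '{', event)       # SEDCMD-removejunk1: collapse "{{"
    event = re.sub(r'\}\}', '}', event)       # SEDCMD-removejunk2: collapse "}}"
    fixed.append(event)

ids = [json.loads(event)["id"] for event in fixed]
print(ids)
```

One caveat of this approach: records that legitimately contain `}}`, such as nested objects, would be altered by the last substitution.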

Contributor

Hey Gilberto, this works for my sample dataset - thanks!

However, on my actual dataset this fails. Here is an example of a full response I am working with: http://www.citibikenyc.com/stations/json

And my props.conf is set accordingly:

[_json_nycbikes]
CHARSET=UTF-8
INDEXED_EXTRACTIONS=json
KV_MODE=json
NO_BINARY_CHECK=true
SEDCMD-removetrailingcomma=s/\},/}/g
SEDCMD-removeclosingtag=s/]\}//g
SEDCMD-removeopeningtag=s/\{"executionTime":"\d+\-\d+\-\d+\s+\d+:\d+:\d+\s+\w+","\w+":\[//g
LINE_BREAKER=(\})\{
SHOULD_LINEMERGE=false
TRUNCATE=0
DATETIME_CONFIG = CURRENT

(The SEDCMDs strip all the information we don't need to form nice JSON objects).

Though this props.conf still does not break events. Any ideas?
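One likely explanation, sketched in Python: assuming line breaking happens before SEDCMD is applied (as Splunk's indexing pipeline documentation describes), LINE_BREAKER sees the raw text, where records are still separated by `},{`, so a pattern that expects `}{` never matches. The sample string is a hypothetical, shortened imitation of the feed; its key name and timestamp are illustrative assumptions.

```python
import re

# Hypothetical, shortened imitation of the full response (key name and date assumed).
raw = ('{"executionTime":"2015-08-13 01:48:34 PM","stationBeanList":['
       '{"id":72,"landMark":""},{"id":79,"landMark":""}]}')

line_breaker = r'(\})\{'  # the pattern from the stanza above

# What the line breaker sees: commas are still present, so nothing matches.
match_on_raw = re.search(line_breaker, raw)

# What the pattern seems to expect: text after SEDCMD-removetrailingcomma has run.
after_sed = re.sub(r'\},', '}', raw)
match_after_sed = re.search(line_breaker, after_sed)

# A pattern written against the raw text, keeping only the comma in the group.
match_with_comma = re.search(r'\}(,)\{', raw)

print(match_on_raw is None, match_after_sed is not None, match_with_comma is not None)
```

Under that assumption, the breaker has to be written against the untransformed data, and the SEDCMDs then only clean up the individual events.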
