
Sed command - Large XML values in JSON events make replacement execution fail

markconlin
Path Finder

Objective
My objective is to remove the value of the "xml" key from my JSON events.
I believe I have stumbled upon a size/resource restriction of some kind with SEDCMD.

Issue
My SEDCMD does NOT work when very large xml values are present in the event.
My SEDCMD does work correctly with small values.

Test Log Files
fake_log.json - with small xml

{ "key1": "value1", "key2": "value2", "msg": "log line I do not care about", "key3": "value3", "xml": "<smallxml>.....</smallxml>" }
{ "key1": "value1", "key2": "value2", "msg": "log line I care about", "key3": "value3", "xml": "<smallxml>.....</smallxml>" }

fake_log_big.json - with BIG xml

{ "key1": "value1", "key2": "value2", "msg": "log line I do not care about", "key3": "value3", "xml": "<smallxml>.....</smallxml>" }
{ "key1": "value1", "key2": "value2", "msg": "log line I care about", "key3": "value3", "xml": "<REDACTED BUT TRUST ME ITS BIG>" }

props.conf

....
[mecst]
DATETIME_CONFIG =
INDEXED_EXTRACTIONS = json
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
disabled = false
SEDCMD-faketest = s/("msg": "log line I care about")(.*)"xml": ".*>"/\1\2"xml":null/

Proof that the sed command works from the Linux command line
Yes, the formatting is slightly different (the parentheses must be escaped on the command line).

root@host:/opt/splunk/bin# cat fake_log.json | sed -e 's/\("msg": "log line I care about"\)\(.*\)"xml":.*>"/\1\2"xml":null/'
{ "key1": "value1", "key2": "value2", "msg": "log line I do not care about", "key3": "value3", "xml": "<smallxml>.....</smallxml>" }
{ "key1": "value1", "key2": "value2", "msg": "log line I care about", "key3": "value3", "xml":null }

root@host:/opt/splunk/bin# cat fake_log_big.json | sed -e 's/\("msg": "log line I care about"\)\(.*\)"xml":.*>"/\1\2"xml":null/'
{ "key1": "value1", "key2": "value2", "msg": "log line I do not care about", "key3": "value3", "xml": "<smallxml>.....</smallxml>" }
{ "key1": "value1", "key2": "value2", "msg": "log line I care about", "key3": "value3", "xml":null }

What I tried
I used oneshot to load each of these test files with my custom sourcetype.

root@host:/opt/splunk/bin# ./splunk add oneshot fake_log_big.json -sourcetype mecst -index faketest7
root@host:/opt/splunk/bin# ./splunk add oneshot fake_log.json -sourcetype mecst -index faketest8

Results (pics attached).
Events in faketest7 (fake_log_big.json) do NOT have the value of the "xml" key removed.
Events in faketest8 (fake_log.json) DO have the value of the "xml" key removed.


1 Solution

markconlin
Path Finder

Yep, fixing the backtracking in the regex fixed it.

SEDCMD-faketest = s/("msg": "log line I care about")(.*?)"xml": ".*>"/\1\2"xml":null/



jkat54
SplunkTrust

Have you tried setting a LINE_BREAKER and a really high TRUNCATE value?


markconlin
Path Finder

It is not clear to me how this will help. Can you explain further? Each raw event, including the entire XML, is a single line. My TRUNCATE value is big enough to ingest all the events with no issue.


jkat54
SplunkTrust

Bumping my comment


cpetterborg
SplunkTrust

Is your XML multi-line, as it appears to be? If so, you can try two things:

SEDCMD-faketest = s/(?ms)("msg": "log line I care about")(.*)"xml": ".*>"/\1\2"xml":null/

or

SEDCMD-faketest = s/("msg": "log line I care about")(.*)"xml": "[\s\S]*>"/\1\2"xml":null/

The first should do the substitution across multiple lines, and the second should work past the newlines, since . does not otherwise match a newline.
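A quick illustration of that point (a sketch in Python, whose inline flags behave like PCRE's here; the sample text is made up):

import re

sample = '"msg": "log line I care about", "xml": "<a>\n  <b>text</b>\n</a>"'

# By default, . does not match a newline, so the pattern dies at the first line break:
print(re.search(r'"xml": ".*>"', sample))        # -> None

# (?s) -- the "dotall" part of the (?ms) suggestion above -- lets . cross newlines:
print(re.search(r'(?s)"xml": ".*>"', sample))    # -> matches the whole xml value

# [\s\S] matches any character, newline included, with no flag needed:
print(re.search(r'"xml": "[\s\S]*>"', sample))   # -> also matches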

markconlin
Path Finder

Although your answer also works from the command line, just like mine does, it still does not work as a SEDCMD.
My concern is that this is a bug caused by backtracking limits.

Look at the difference between the exact same regex run against a large amount of data and against a small amount:

Large XML creates a "catastrophic backtrack" error.
https://regex101.com/r/3dAK7O/1/
vs.
Small XML, no error.
https://regex101.com/r/0bm9OS/1

My assumption is that the same issue is occurring in the Splunk internals... and I have no visibility into it.
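To put rough numbers on that assumption (a sketch only; the payload is synthetic, Python is just used as a calculator, and whether Splunk's internal regex limits are actually what trips here is still a guess, given the lack of visibility):

import re

big_xml = "<bigxml>" + "<node id=1>data</node>" * 200_000 + "</bigxml>"
event = ('{ "key1": "value1", "msg": "log line I care about", '
         '"key3": "value3", "xml": "' + big_xml + '" }')

msg = '"msg": "log line I care about"'
msg_end = event.index(msg) + len(msg)
xml_key = event.index('"xml": "')

# Greedy (.*) consumes to the end of the event, then gives characters back one
# at a time until '"xml": "' can match: roughly one step per byte of XML.
greedy_giveback = len(event) - xml_key
# Lazy (.*?) expands forward one character at a time until '"xml": "' matches.
lazy_expansion = xml_key - msg_end

print(f"event length:           {len(event):>10,}")
print(f"greedy give-back steps: {greedy_giveback:>10,}")  # millions
print(f"lazy expansion steps:   {lazy_expansion:>10,}")   # ~20

# Both variants still find the same match on this event; the difference is
# purely how much backtracking it takes to get there.
m_greedy = re.search(r'("msg": "log line I care about")(.*)"xml": ".*>"', event)
m_lazy = re.search(r'("msg": "log line I care about")(.*?)"xml": ".*>"', event)
assert m_greedy and m_lazy and m_greedy.span() == m_lazy.span()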


cpetterborg
SplunkTrust

Is the xml always the last field in the JSON string? If so, then try a SEDCMD that is simpler, like:

SEDCMD-faketest = s/(?ms)("msg": "log line I care about".*"xml": ").*"/\1null }/

Without an example of your BIG xml string, it's hard to test that, but hopefully a simplified regular expression will prevent the backtracking error. If it doesn't, you may have to find out the max string length that is allowed in the Splunk implementation of the sed function to see if that is the problem. Open a case with Splunk support to do that.
