<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Sed command - Large XML values in JSON events makes replacement execution fail in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370671#M67317</link>
    <description>&lt;P&gt;Yep, removing the excessive backtracking from the regex fixed it: making the first capture group lazy, (.*?) instead of (.*), was enough. &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;SEDCMD-faketest = s/("msg": "log line I care about")(.*?)"xml": ".*&amp;gt;"/\1\2"xml":null/
&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Thu, 17 Aug 2017 18:43:50 GMT</pubDate>
    <dc:creator>markconlin</dc:creator>
    <dc:date>2017-08-17T18:43:50Z</dc:date>
    <item>
      <title>Sed command - Large XML values in JSON events makes replacement execution fail</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370664#M67310</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Objective&lt;/STRONG&gt;&lt;BR /&gt;
My objective is to remove the value of the "xml" key from my JSON events.&lt;BR /&gt;
I believe I have stumbled upon a size/resource restriction of some kind with SEDCMD.&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Issue&lt;/STRONG&gt;&lt;BR /&gt;
My SEDCMD does NOT work when very large xml values are present in the event.&lt;BR /&gt;
My SEDCMD does work correctly with small values.&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Test Log Files&lt;/STRONG&gt; &lt;BR /&gt;
fake_log.json - with small xml&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;{ "key1": "value1", "key2": "value2", "msg": "log line I do not care about", "key3": "value3", "xml": "&amp;lt;smallxml&amp;gt;.....&amp;lt;/smallxml&amp;gt;" }
{ "key1": "value1", "key2": "value2", "msg": "log line I care about", "key3": "value3", "xml": "&amp;lt;smallxml&amp;gt;.....&amp;lt;/smallxml&amp;gt;" }
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Test Log File (fake_log_big.json) - with BIG xml&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;{ "key1": "value1", "key2": "value2", "msg": "log line I do not care about", "key3": "value3", "xml": "&amp;lt;smallxml&amp;gt;.....&amp;lt;/smallxml&amp;gt;" }
{ "key1": "value1", "key2": "value2", "msg": "log line I care about", "key3": "value3", "xml": "&amp;lt;REDACTED BUT TRUST ME ITS BIG&amp;gt;" }
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;&lt;STRONG&gt;props.conf&lt;/STRONG&gt;&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;....
[mecst]
DATETIME_CONFIG =
INDEXED_EXTRACTIONS = json
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
disabled = false
SEDCMD-faketest = s/("msg": "log line I care about")(.*)"xml": ".*&amp;gt;"/\1\2"xml":null/
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;&lt;STRONG&gt;Proof the SED command works from the Linux command line&lt;/STRONG&gt;&lt;BR /&gt;
The syntax differs slightly: on the command line, sed's basic regular expressions require the capture-group parentheses to be escaped. &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;root@host:/opt/splunk/bin# cat fake_log.json | sed -e 's/\("msg": "log line I care about"\)\(.*\)"xml":.*&amp;gt;"/\1\2"xml":null/'
{ "key1": "value1", "key2": "value2", "msg": "log line I do not care about", "key3": "value3", "xml": "&amp;lt;smallxml&amp;gt;.....&amp;lt;/smallxml&amp;gt;" }
{ "key1": "value1", "key2": "value2", "msg": "log line I care about", "key3": "value3", "xml":null }

root@ip-10-70-2-102:/opt/splunk/bin# cat fake_log_big.json | sed -e 's/\("msg": "log line I care about"\)\(.*\)"xml":.*&amp;gt;"/\1\2"xml":null/'
{ "key1": "value1", "key2": "value2", "msg": "log line I do not care about", "key3": "value3", "xml": "&amp;lt;smallxml&amp;gt;.....&amp;lt;/smallxml&amp;gt;" }
{ "key1": "value1", "key2": "value2", "msg": "log line I care about", "key3": "value3", "xml":null }
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;&lt;STRONG&gt;What I tried&lt;/STRONG&gt;&lt;BR /&gt;
I used oneshot to load each of these test files with my custom sourcetype.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;root@host:/opt/splunk/bin# ./splunk add oneshot fake_log_big.json -sourcetype mecst -index faketest7
root@host:/opt/splunk/bin# ./splunk add oneshot fake_log.json -sourcetype mecst -index faketest8
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;&lt;STRONG&gt;Results (pics attached).&lt;/STRONG&gt;&lt;BR /&gt;
Events in faketest8 (small XML) have the value of the "xml" key removed.&lt;BR /&gt;
Events in faketest7 (big XML) do NOT have the value of the "xml" key removed.&lt;/P&gt;

&lt;P&gt;&lt;IMG src="https://community.splunk.com/storage/temp/209582-screen-shot-2017-08-15-at-34822-pm.png" alt="alt text" /&gt;&lt;BR /&gt;
&lt;IMG src="https://community.splunk.com/storage/temp/209584-screen-shot-2017-08-15-at-35515-pm.png" alt="alt text" /&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 15:21:59 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370664#M67310</guid>
      <dc:creator>markconlin</dc:creator>
      <dc:date>2020-09-29T15:21:59Z</dc:date>
    </item>
    <item>
      <title>Re: Sed command - Large XML values in JSON events makes replacement execution fail</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370665#M67311</link>
      <description>&lt;P&gt;Is your XML multi-line, as it appears to be? If so, you can try two things:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;SEDCMD-faketest = s/(?ms)("msg": "log line I care about")(.*)"xml": ".*&amp;gt;"/\1\2"xml":null/
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;or&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;SEDCMD-faketest = s/("msg": "log line I care about")(.*)"xml": "[\s\S]*&amp;gt;"/\1\2"xml":null/
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The first sets the (?ms) flags so that . also matches newlines; the second replaces . with [\s\S], which matches any character including a newline, something . does not do by default.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Aug 2017 22:15:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370665#M67311</guid>
      <dc:creator>cpetterborg</dc:creator>
      <dc:date>2017-08-15T22:15:31Z</dc:date>
    </item>
    <item>
      <title>Re: Sed command - Large XML values in JSON events makes replacement execution fail</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370666#M67312</link>
      <description>&lt;P&gt;Have you tried setting a LINE_BREAKER and a really high TRUNCATE value?&lt;/P&gt;</description>
      <pubDate>Tue, 15 Aug 2017 22:15:53 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370666#M67312</guid>
      <dc:creator>jkat54</dc:creator>
      <dc:date>2017-08-15T22:15:53Z</dc:date>
    </item>
    <item>
      <title>Re: Sed command - Large XML values in JSON events makes replacement execution fail</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370667#M67313</link>
      <description>&lt;P&gt;Although your answer also works from the command line, just like mine does, it still does not work as a SEDCMD. &lt;BR /&gt;
My concern is that this is a bug caused by backtracking limits. &lt;/P&gt;

&lt;P&gt;Look at the difference when the exact same regex runs against a large amount of data and against a small amount:&lt;/P&gt;

&lt;P&gt;Large XML, creates a "catastrophic backtrack" error.&lt;BR /&gt;
&lt;A href="https://regex101.com/r/3dAK7O/1/"&gt;https://regex101.com/r/3dAK7O/1/&lt;/A&gt;&lt;BR /&gt;
vs.&lt;BR /&gt;
Small XML, no error.&lt;BR /&gt;
&lt;A href="https://regex101.com/r/0bm9OS/1"&gt;https://regex101.com/r/0bm9OS/1&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;My assumption is that the same issue is occurring in the Splunk internals, and I have no visibility into it. &lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2017 02:07:24 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370667#M67313</guid>
      <dc:creator>markconlin</dc:creator>
      <dc:date>2017-08-17T02:07:24Z</dc:date>
    </item>
    <item>
      <title>Re: Sed command - Large XML values in JSON events makes replacement execution fail</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370668#M67314</link>
      <description>&lt;P&gt;Bumping my comment&lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2017 11:07:42 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370668#M67314</guid>
      <dc:creator>jkat54</dc:creator>
      <dc:date>2017-08-17T11:07:42Z</dc:date>
    </item>
    <item>
      <title>Re: Sed command - Large XML values in JSON events makes replacement execution fail</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370669#M67315</link>
      <description>&lt;P&gt;It is not clear to me how this will help. Can you explain further? Each raw event, including the entire XML, is a single line. My TRUNCATE value is big enough to ingest all the events with no issue. &lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2017 14:46:58 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370669#M67315</guid>
      <dc:creator>markconlin</dc:creator>
      <dc:date>2017-08-17T14:46:58Z</dc:date>
    </item>
    <item>
      <title>Re: Sed command - Large XML values in JSON events makes replacement execution fail</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370670#M67316</link>
      <description>&lt;P&gt;Is the &lt;CODE&gt;xml&lt;/CODE&gt; &lt;STRONG&gt;always&lt;/STRONG&gt; the last field in the JSON string? If so, then try a SEDCMD that is simpler, like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;SEDCMD-faketest = s/(?ms)("msg": "log line I care about".*"xml": ").*"/\1null }/
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Without an example of your BIG xml string, it's hard to test that, but hopefully a simplified regular expression will prevent the backtrack error. If it doesn't, you may have to find out the maximum string length allowed by the Splunk implementation of the &lt;CODE&gt;sed&lt;/CODE&gt; function to see whether that is the problem. Open a case with Splunk support to do that.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2017 15:38:47 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370670#M67316</guid>
      <dc:creator>cpetterborg</dc:creator>
      <dc:date>2017-08-17T15:38:47Z</dc:date>
    </item>
    <item>
      <title>Re: Sed command - Large XML values in JSON events makes replacement execution fail</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370671#M67317</link>
      <description>&lt;P&gt;Yep, removing the excessive backtracking from the regex fixed it: making the first capture group lazy, (.*?) instead of (.*), was enough. &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;SEDCMD-faketest = s/("msg": "log line I care about")(.*?)"xml": ".*&amp;gt;"/\1\2"xml":null/
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 17 Aug 2017 18:43:50 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Sed-command-Large-XML-values-in-JSON-events-makes-replacement/m-p/370671#M67317</guid>
      <dc:creator>markconlin</dc:creator>
      <dc:date>2017-08-17T18:43:50Z</dc:date>
    </item>
  </channel>
</rss>

