Getting Data In

Removing line breaks within XML log data

cparham
Explorer

This is related to http://answers.splunk.com/questions/2141/xml-log-source-type

How would I remove line breaks found inside elements <Sql_Text>...</Sql_Text>. Many of the SQL statements inside <Sql_Text> tags have hard line breaks often at very undesirable places.

For example:

<Sql_Text>UPDATE APP_USR SET CONTACT_ID = 2 WHERE APP_  
ID = 64  
</Sql_Text>

I need it to look like this to make searching possible:

<Sql_Text>  
UPDATE APP_USR SET CONTACT_ID = 2 WHERE APP_ID = 64  
</Sql_Text>  

or even this

<Sql_Text>UPDATE APP_USR SET CONTACT_ID = 2 WHERE APP_ID = 64</Sql_Text>  

Thanks.

Tags (1)
1 Solution

Lowell
Super Champion

I don't think this is possible to do within a single regular expression. So you'll have to either:

  1. Fix this externally using a scripted input. (This is the most complicated approach and the only one that will yield correct index-time results.)
  2. Fix your data a search time using a sequence of commands (which could be consolidated using a macro for simplicity.

Here is a search-time solution that would get reformat your text the way you are looking for:

| rex "(?s)^(?<_sqltxt_pre>.*?<Sql_Text>)(?<Sql_Text>.*?)(?<_sqltxt_post></Sql_Text>.*)$" | rex field=Sql_Text mode=sed "s/[\r\n]//g" | eval _raw=_sqltxt_pre . trim(Sql_Text) . _sqltxt_post

So for your example above, if you want to search for APP_ID 64, then you would need a search like so:

sourcetype=oracle_audit_xml | rex "(?s)^(?<_sqltxt_pre>.*?<Sql_Text>)(?<Sql_Text>.*?)(?<_sqltxt_post></Sql_Text>.*)$" | rex field=Sql_Text mode=sed "s/[\r\n]//g" | eval _raw=_sqltxt_pre . trim(Sql_Text) . _sqltxt_post | search "APP_ID = 64"

View solution in original post

Lowell
Super Champion

I don't think this is possible to do within a single regular expression. So you'll have to either:

  1. Fix this externally using a scripted input. (This is the most complicated approach and the only one that will yield correct index-time results.)
  2. Fix your data a search time using a sequence of commands (which could be consolidated using a macro for simplicity.

Here is a search-time solution that would get reformat your text the way you are looking for:

| rex "(?s)^(?<_sqltxt_pre>.*?<Sql_Text>)(?<Sql_Text>.*?)(?<_sqltxt_post></Sql_Text>.*)$" | rex field=Sql_Text mode=sed "s/[\r\n]//g" | eval _raw=_sqltxt_pre . trim(Sql_Text) . _sqltxt_post

So for your example above, if you want to search for APP_ID 64, then you would need a search like so:

sourcetype=oracle_audit_xml | rex "(?s)^(?<_sqltxt_pre>.*?<Sql_Text>)(?<Sql_Text>.*?)(?<_sqltxt_post></Sql_Text>.*)$" | rex field=Sql_Text mode=sed "s/[\r\n]//g" | eval _raw=_sqltxt_pre . trim(Sql_Text) . _sqltxt_post | search "APP_ID = 64"

Lowell
Super Champion

(?s) tells the regular expression engine that . (dot) can match anything. (Without this the expression .* will not match across the end of a line, which we need in this case since your event spans multiple lines.) Basically we are breaking each event into 3 pieces, modifying the middle (e.g. your "Sql_Text") and then joining the 3 pieces back into one big event. I thought I added a basic explanation, but would appearer that I forgot. 😉 Sorry about that. Best of luck! I recommend http://www.regular-expressions.info/ for better understanding regular expressions.

0 Karma

cparham
Explorer

Thanks. The search-time solution appears to work.
[although I'm a little confused by the first rex expression 🙂 I think I follow most of it except the first group match (?s)]

0 Karma

Lowell
Super Champion

Is there a constant-width wrapping going on here? The problem with completely stripping out end-of-line characters like this is that there could be times where there are legitimate line breaks in the actual SQL statement; so you could end up with values lines being mashed together that shouldn't be.

0 Karma
Get Updates on the Splunk Community!

Webinar Recap | Revolutionizing IT Operations: The Transformative Power of AI and ML ...

The Transformative Power of AI and ML in Enhancing Observability   In the realm of IT operations, the ...

.conf24 | Registration Open!

Hello, hello! I come bearing good news: Registration for .conf24 is now open!   conf is Splunk’s rad annual ...

ICYMI - Check out the latest releases of Splunk Edge Processor

Splunk is pleased to announce the latest enhancements to Splunk Edge Processor.  HEC Receiver authorization ...