Getting Data In

Removing line breaks within XML log data

Explorer

This is related to http://answers.splunk.com/questions/2141/xml-log-source-type

How would I remove line breaks found inside elements <Sql_Text>...</Sql_Text>. Many of the SQL statements inside <Sql_Text> tags have hard line breaks often at very undesirable places.

For example:

<Sql_Text>UPDATE APP_USR SET CONTACT_ID = 2 WHERE APP_  
ID = 64  
</Sql_Text>

I need it to look like this to make searching possible:

<Sql_Text>  
UPDATE APP_USR SET CONTACT_ID = 2 WHERE APP_ID = 64  
</Sql_Text>  

or even this

<Sql_Text>UPDATE APP_USR SET CONTACT_ID = 2 WHERE APP_ID = 64</Sql_Text>  

Thanks.

Tags (1)
1 Solution

Super Champion

I don't think this is possible to do within a single regular expression. So you'll have to either:

  1. Fix this externally using a scripted input. (This is the most complicated approach and the only one that will yield correct index-time results.)
  2. Fix your data a search time using a sequence of commands (which could be consolidated using a macro for simplicity.

Here is a search-time solution that would get reformat your text the way you are looking for:

| rex "(?s)^(?<_sqltxt_pre>.*?<Sql_Text>)(?<Sql_Text>.*?)(?<_sqltxt_post></Sql_Text>.*)$" | rex field=Sql_Text mode=sed "s/[\r\n]//g" | eval _raw=_sqltxt_pre . trim(Sql_Text) . _sqltxt_post

So for your example above, if you want to search for APP_ID 64, then you would need a search like so:

sourcetype=oracle_audit_xml | rex "(?s)^(?<_sqltxt_pre>.*?<Sql_Text>)(?<Sql_Text>.*?)(?<_sqltxt_post></Sql_Text>.*)$" | rex field=Sql_Text mode=sed "s/[\r\n]//g" | eval _raw=_sqltxt_pre . trim(Sql_Text) . _sqltxt_post | search "APP_ID = 64"

View solution in original post

Super Champion

I don't think this is possible to do within a single regular expression. So you'll have to either:

  1. Fix this externally using a scripted input. (This is the most complicated approach and the only one that will yield correct index-time results.)
  2. Fix your data a search time using a sequence of commands (which could be consolidated using a macro for simplicity.

Here is a search-time solution that would get reformat your text the way you are looking for:

| rex "(?s)^(?<_sqltxt_pre>.*?<Sql_Text>)(?<Sql_Text>.*?)(?<_sqltxt_post></Sql_Text>.*)$" | rex field=Sql_Text mode=sed "s/[\r\n]//g" | eval _raw=_sqltxt_pre . trim(Sql_Text) . _sqltxt_post

So for your example above, if you want to search for APP_ID 64, then you would need a search like so:

sourcetype=oracle_audit_xml | rex "(?s)^(?<_sqltxt_pre>.*?<Sql_Text>)(?<Sql_Text>.*?)(?<_sqltxt_post></Sql_Text>.*)$" | rex field=Sql_Text mode=sed "s/[\r\n]//g" | eval _raw=_sqltxt_pre . trim(Sql_Text) . _sqltxt_post | search "APP_ID = 64"

View solution in original post

Super Champion

(?s) tells the regular expression engine that . (dot) can match anything. (Without this the expression .* will not match across the end of a line, which we need in this case since your event spans multiple lines.) Basically we are breaking each event into 3 pieces, modifying the middle (e.g. your "Sql_Text") and then joining the 3 pieces back into one big event. I thought I added a basic explanation, but would appearer that I forgot. 😉 Sorry about that. Best of luck! I recommend http://www.regular-expressions.info/ for better understanding regular expressions.

0 Karma

Explorer

Thanks. The search-time solution appears to work.
[although I'm a little confused by the first rex expression 🙂 I think I follow most of it except the first group match (?s)]

0 Karma

Super Champion

Is there a constant-width wrapping going on here? The problem with completely stripping out end-of-line characters like this is that there could be times where there are legitimate line breaks in the actual SQL statement; so you could end up with values lines being mashed together that shouldn't be.

0 Karma
State of Splunk Careers

Access the Splunk Careers Report to see real data that shows how Splunk mastery increases your value and job satisfaction.

Find out what your skills are worth!