
Escaping the double-quotes when ingesting data?

marrette
Path Finder

Hi,

I'm attempting to deal with data coming from a query run by the Splunk DB Connector. It pulls all the data in fine, but one part of its ingestion is really getting me: one of the fields is an XML document which contains double-quote (") marks. What Splunk seems to be doing is capturing the data in that field only up to the first double quote and then deciding that's the end of the value. How can I get the whole XML as it is?

For example:
XML data:

<ShowPlanXML xmlns="http://schemas.example.com" Version="1.2" Build="11.0.6615.2"><BatchSequence><Batch><Statements>Much much much more content here</Statements></Batch></BatchSequence></ShowPlanXML>

However, when this makes it to Splunk it has a field name but the value stops after the equals symbol:

query_plan = <ShowPlanXML xmlns=

I know all the data is in Splunk as I can see it; it's just not being properly captured into the field.

I assume the solution is to tell Splunk to escape the double quotes as the data comes in, but where/how do I do this?

Thanks
Eddie

1 Solution

marrette
Path Finder

Well, after a bit more research I came up with a fix for this. Possibly this is basic Splunk 101: if I'd known what a "search-time field extraction" was beforehand, I would probably have fixed it immediately. Anyway, here is my solution and the process I used to get to it.

Step 1 - Work out how to match the data with a regex

I knew the data was coming through from the DB query but getting lost when Splunk split it into keys and values because of the quotes in the data, so I took a look at the raw data to work out how to match it. To help with this, write a Splunk search that brings up the data and pipe the _raw version of the events into a table to show exactly what the data looks like:

index=myindex sourcetype=SQLEXTRACT source=mysource | table _raw

From this I could determine what regex would match the full value. In the raw data the field and value looked like this:

query_plan="<ShowPlanXML xmlns="http://schemas.example.com" Version="1.2" Build="11.0.6615.2"><BatchSequence><Batch><Statements>Much much much more content here</Statements></Batch></BatchSequence></ShowPlanXML>"

Using the Splunk search interface and the rex command I was able to develop a regex to match the XML data in the value above. Here is an example of my finished and working search with regex:

index=myindex sourcetype=SQLEXTRACT source=mysource | rex field=_raw "query_plan=\"(?<query_plan_xml>.+<\/ShowPlanXML>)" | table query_plan_xml

Note the use of a pipe and the table command after the rex command to get Splunk to display the value it had just captured with the regex.

Step 2 - How to make this a Search Time Field Extraction
This second part took me ages to get working. It turned out the configuration needs to be made on the search head, which in hindsight seems obvious because I want a search-time extraction.

Simple extractions like this can just be added to the props.conf file on your search head (most probably in %splunk%/etc/apps/search/local). In props.conf you add a new stanza (the part enclosed in square brackets) named to match the sourcetype of your data, in this example 'SQLEXTRACT', and then use the 'EXTRACT-' keyword to tell Splunk which regex to use to extract the field. The syntax is EXTRACT-<class name> = <regex with a named capture group>.

So in the props.conf on my search head I added the following at the end of the file:

[SQLEXTRACT]
EXTRACT-queryplanXML_class = query_plan=\"(?<query_plan>.+<\/ShowPlanXML>)

This stanza simply tells Splunk that when it searches data with a sourcetype of 'SQLEXTRACT' it should apply the extraction below. The extraction matches the XML data and returns it with a field name of 'query_plan', which overwrites the existing, incorrectly extracted value. I'm not really sure what the class name part does or where else it might be used - I just picked something unlikely to be used anywhere else.
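
If you would rather not overwrite the original field, the same stanza could capture into a differently named field instead (here reusing the query_plan_xml name from my rex test above - treat this as a variant sketch, not something from the docs):

[SQLEXTRACT]
# Same regex, but the capture group writes to query_plan_xml so the original (truncated) query_plan is left alone
EXTRACT-queryplanXML_class = query_plan=\"(?<query_plan_xml>.+<\/ShowPlanXML>)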

Once you have updated the props.conf file on your search head, make sure Splunk re-reads it. You can do this by restarting Splunk, but because this is a search-time configuration you can instead have Splunk re-read props.conf by running the following search:

| extract reload=t

This final command won't give you any results, but that's okay: Splunk will have reloaded the props.conf file. Just re-run your original search and you should find the previously truncated field now returns all the data as it should!
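
As a quick check (using the same index, sourcetype and source names as above), tabling the extracted field should now show the full XML:

index=myindex sourcetype=SQLEXTRACT source=mysource | table query_plan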


santosh_sshanbh
Path Finder

The workaround I used for this problem is to replace the double quotes with an empty string using the SQL REPLACE function. I made the change in the DB input configuration only.
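
Something along these lines in the DB input's SQL (the table name is just a placeholder, and you may need a CAST first if your column is an XML type):

SELECT REPLACE(query_plan, '"', '') AS query_plan
FROM my_query_plan_table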


nbarbato
Engager

@santosh_sshanbh Yup, doing the same thing. DB Connect should address this issue. Free-text fields should be able to be ingested and indexed "as is" without breaking the indexing.


marrette
Path Finder

Through further digging around I've discovered that running the SQL query directly from DB Connect's SQL Explorer captures all the fields perfectly, including the query_plan field.

So the incorrect handling of the quotes must be happening after the data has passed through the DB Connect app and into Splunk proper. It appears I'm hitting a known thing (feature/bug) with DB Connect: http://docs.splunk.com/Documentation/DBX/3.1.3/DeployDBX/Troubleshooting#Incomplete_field_values_are...

The documentation seems to recommend using an extraction to fix this, but I'm stuck as to where to even begin: reading the extract documentation doesn't seem to give any way of matching this field as a whole either.

grittonc
Contributor

I've often had to escape the double-quotes inside a field by adding escape characters in SQL. Ugly, but seems to work well.
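
For example, something like this (placeholder table name, and treat it as a sketch - whether the backslash-escaped quotes survive depends on how your field extraction is set up):

SELECT REPLACE(query_plan, '"', '\"') AS query_plan
FROM my_query_plan_table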


cmerriman
Super Champion

Have you tried using xpath? http://docs.splunk.com/Documentation/Splunk/7.1.0/SearchReference/Xpath
It would be a search-time extraction, but I think it might do the trick.
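
For example, once the full XML is in a field, something roughly like this might pull out a node (the field names here are just guesses, and local-name() is used because your sample XML declares a default namespace):

... | xpath outfield=statements "//*[local-name()='Statements']" field=query_plan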


marrette
Path Finder

I don't think that's exactly what I want, or at least I can't get it to work. I say this because the data comes into Splunk as a series of field names and values, with each value enclosed in double quotes and the pairs separated by commas:
fieldname="value", (field name, equals, double quote, value, double quote, comma)

This means the XML part is just a large value (enclosed in quotes) with a field name of query_plan; however, Splunk ends that field early because it thinks it has the whole value when it sees the first double quote inside the XML.
