Getting Data In

How can I delete all double quotes in input data before it import to splunk?

tamduong16
Contributor

I have a csv file in which it contains random double quotes and I want to remove all these quotes before it actually get import into splunk. Could someone show me how to do this at the add data state? Or is there any splunk app that could help me do this? I'm new to splunk. Please help!

0 Karma
1 Solution

DalJeanis
SplunkTrust
SplunkTrust

Updated ...

@tamduong16 -
(1) Did you put the code into props.conf on your indexer/heavy forwarder?

(2) Did you make it the FIRST line for that sourcetype, and make sure the sourcetype is capitalized and spelled exactly the way it should be?
(3) Did you make sure that the sourcetype stanza for that sourcetype would be executed before any other applicable stanzas?

(4) Did you restart splunk afterwards?

(5) if you did all of those, then let's try this one. Some places in splunk require escapes to be double-escaped. That should not be the case here, but it's something that I'd try...

[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(?<!,)\\"([^\\"]*)\\"/\1/g

Looks like this will work...

[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(?<!,)\"([^\"]*)\"/\1/g

This translates as - if you find a quote \" that is not preceded by a comma (?<!,) then collect everything as a block (_) that is not a quote [^\"]* until the next quote \".

Replace all that s/, including the quotes /(?<!,)\"([^\"]*)\"/ with the block you just grabbed, which did not include the quotes /\1/ and do that as many times as you find it /g.

This code will run off the rails if there is a single extra double-quote, because there is no principled way to differentiate between a single double-quote in the middle of a field, and a double-quote in the middle of a field that then has commas inside the quotes before it gets to the end-quote.

The only way to differentiate, if any, isn't going to be susceptible to any regex that you'd want to maintain.


You are going to need something more complicated than that.

For your data, you are going to need a regex that says, if there is a double-quote that is not preceded by a comma, then take everything that is after that doublequote, up to and including the next double-quote, and replace it with everything that is between the double-quotes.

It will take a little while to work that regex syntax out.

View solution in original post

DalJeanis
SplunkTrust
SplunkTrust

Updated ...

@tamduong16 -
(1) Did you put the code into props.conf on your indexer/heavy forwarder?

(2) Did you make it the FIRST line for that sourcetype, and make sure the sourcetype is capitalized and spelled exactly the way it should be?
(3) Did you make sure that the sourcetype stanza for that sourcetype would be executed before any other applicable stanzas?

(4) Did you restart splunk afterwards?

(5) if you did all of those, then let's try this one. Some places in splunk require escapes to be double-escaped. That should not be the case here, but it's something that I'd try...

[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(?<!,)\\"([^\\"]*)\\"/\1/g

Looks like this will work...

[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(?<!,)\"([^\"]*)\"/\1/g

This translates as - if you find a quote \" that is not preceded by a comma (?<!,) then collect everything as a block (_) that is not a quote [^\"]* until the next quote \".

Replace all that s/, including the quotes /(?<!,)\"([^\"]*)\"/ with the block you just grabbed, which did not include the quotes /\1/ and do that as many times as you find it /g.

This code will run off the rails if there is a single extra double-quote, because there is no principled way to differentiate between a single double-quote in the middle of a field, and a double-quote in the middle of a field that then has commas inside the quotes before it gets to the end-quote.

The only way to differentiate, if any, isn't going to be susceptible to any regex that you'd want to maintain.


You are going to need something more complicated than that.

For your data, you are going to need a regex that says, if there is a double-quote that is not preceded by a comma, then take everything that is after that doublequote, up to and including the next double-quote, and replace it with everything that is between the double-quotes.

It will take a little while to work that regex syntax out.

View solution in original post

DalJeanis
SplunkTrust
SplunkTrust

@tamduong16 - for future people who might encounter similar issues, please let us know what worked. Thanks, Dal.

0 Karma

Chandras11
Communicator

@DalJeans : Thanks for the solution here. I encountered the same issue but I looked in all the wrong places. Finally, I tried to remove double quotes with a batch file (JEPRL) and it worked perfectly.
The Following code can remove double quotes in all csvs in a folder:

for /f "delims=" %%a in ('dir *.csv /b /a-d ') do (
   echo processing "%%a" 
   call jrepl "\x22" "" /x /f "%%a" /O -
)

DalJeanis
SplunkTrust
SplunkTrust

Hey, @Chandras11 - thanks for helping the community by letting us know what worked for you!

0 Karma

DalJeanis
SplunkTrust
SplunkTrust

Hmmm. You asked "please help" and then you downvoted the posts rather than just telling us what exactly you did that didn't seem to work? That's not appropriate, @tamduong16, at least until someone has determined WHY the post did not work. Around here, you usually only downvote when something is obviously bad advice, and neither of these posts qualified.

Most of the time, when a solution does not work for the original poster, it is because either they did not apply the solution exactly the way that it was intended, they missed giving some significant information regarding their situation, or they missed some underlying principle that no one thought to tell them.

In this case, @sbbadri's first solution should have done what you originally asked, and mine should have corrected some of the issues where you did NOT ask the right question. Although I didn't post the run-anywhere code, it was tested fully and worked fine.

Here's that run-anywhere code...

| makeresults 
| eval mydata="one,,,four,five get rid of the \"quotes around this\",,seven,\"eight and keep the quotes\",,,eleven"
| rex field=mydata mode=sed "s/(?<!,)\"([^\"]*)\"/\1/g"
0 Karma

sbbadri
Motivator

In Indexer,

props.conf

[your sourcetype]
SEDCMD-removeDoubleQuotes= s/\"//g
or
[your sourcetype]
SEDCMD-removeDoubleQuotes= s/"//g
or
[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(")//g

.conf21 CFS Extended through 5/20!

Don't miss your chance
to share your Splunk
wisdom in-person or
virtually at .conf21!

Call for Speakers has
been extended through
Thursday, 5/20!