I have a csv file in which it contains random double quotes and I want to remove all these quotes before it actually get import into splunk. Could someone show me how to do this at the add data state? Or is there any splunk app that could help me do this? I'm new to splunk. Please help!
Updated ...
@tamduong16 -
(1) Did you put the code into props.conf on your indexer/heavy forwarder?
(2) Did you make it the FIRST line for that sourcetype, and make sure the sourcetype is capitalized and spelled exactly the way it should be?
(3) Did you make sure that the sourcetype stanza for that sourcetype would be executed before any other applicable stanzas?
(4) Did you restart splunk afterwards?
(5) if you did all of those, then let's try this one. Some places in splunk require escapes to be double-escaped. That should not be the case here, but it's something that I'd try...
[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(?<!,)\\"([^\\"]*)\\"/\1/g
Looks like this will work...
[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(?<!,)\"([^\"]*)\"/\1/g
This translates as - if you find a quote \"
that is not preceded by a comma (?<!,)
then collect everything as a block (_)
that is not a quote [^\"]*
until the next quote \"
.
Replace all that s/
, including the quotes /(?<!,)\"([^\"]*)\"/
with the block you just grabbed, which did not include the quotes /\1/
and do that as many times as you find it /g
.
This code will run off the rails if there is a single extra double-quote, because there is no principled way to differentiate between a single double-quote in the middle of a field, and a double-quote in the middle of a field that then has commas inside the quotes before it gets to the end-quote.
The only way to differentiate, if any, isn't going to be susceptible to any regex that you'd want to maintain.
You are going to need something more complicated than that.
For your data, you are going to need a regex that says, if there is a double-quote that is not preceded by a comma, then take everything that is after that doublequote, up to and including the next double-quote, and replace it with everything that is between the double-quotes.
It will take a little while to work that regex syntax out.
Updated ...
@tamduong16 -
(1) Did you put the code into props.conf on your indexer/heavy forwarder?
(2) Did you make it the FIRST line for that sourcetype, and make sure the sourcetype is capitalized and spelled exactly the way it should be?
(3) Did you make sure that the sourcetype stanza for that sourcetype would be executed before any other applicable stanzas?
(4) Did you restart splunk afterwards?
(5) if you did all of those, then let's try this one. Some places in splunk require escapes to be double-escaped. That should not be the case here, but it's something that I'd try...
[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(?<!,)\\"([^\\"]*)\\"/\1/g
Looks like this will work...
[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(?<!,)\"([^\"]*)\"/\1/g
This translates as - if you find a quote \"
that is not preceded by a comma (?<!,)
then collect everything as a block (_)
that is not a quote [^\"]*
until the next quote \"
.
Replace all that s/
, including the quotes /(?<!,)\"([^\"]*)\"/
with the block you just grabbed, which did not include the quotes /\1/
and do that as many times as you find it /g
.
This code will run off the rails if there is a single extra double-quote, because there is no principled way to differentiate between a single double-quote in the middle of a field, and a double-quote in the middle of a field that then has commas inside the quotes before it gets to the end-quote.
The only way to differentiate, if any, isn't going to be susceptible to any regex that you'd want to maintain.
You are going to need something more complicated than that.
For your data, you are going to need a regex that says, if there is a double-quote that is not preceded by a comma, then take everything that is after that doublequote, up to and including the next double-quote, and replace it with everything that is between the double-quotes.
It will take a little while to work that regex syntax out.
@tamduong16 - for future people who might encounter similar issues, please let us know what worked. Thanks, Dal.
@DalJeans : Thanks for the solution here. I encountered the same issue but I looked in all the wrong places. Finally, I tried to remove double quotes with a batch file (JEPRL) and it worked perfectly.
The Following code can remove double quotes in all csvs in a folder:
for /f "delims=" %%a in ('dir *.csv /b /a-d ') do (
echo processing "%%a"
call jrepl "\x22" "" /x /f "%%a" /O -
)
Hey, @Chandras11 - thanks for helping the community by letting us know what worked for you!
Hmmm. You asked "please help" and then you downvoted the posts rather than just telling us what exactly you did that didn't seem to work? That's not appropriate, @tamduong16, at least until someone has determined WHY the post did not work. Around here, you usually only downvote when something is obviously bad advice, and neither of these posts qualified.
Most of the time, when a solution does not work for the original poster, it is because either they did not apply the solution exactly the way that it was intended, they missed giving some significant information regarding their situation, or they missed some underlying principle that no one thought to tell them.
In this case, @sbbadri's first solution should have done what you originally asked, and mine should have corrected some of the issues where you did NOT ask the right question. Although I didn't post the run-anywhere code, it was tested fully and worked fine.
Here's that run-anywhere code...
| makeresults
| eval mydata="one,,,four,five get rid of the \"quotes around this\",,seven,\"eight and keep the quotes\",,,eleven"
| rex field=mydata mode=sed "s/(?<!,)\"([^\"]*)\"/\1/g"
In Indexer,
props.conf
[your sourcetype]
SEDCMD-removeDoubleQuotes= s/\"//g
or
[your sourcetype]
SEDCMD-removeDoubleQuotes= s/"//g
or
[your sourcetype]
SEDCMD-removeDoubleQuotes= s/(")//g