Solved: Re: regex expression works sometimes and not at ot...

bsizemore · ‎06-11-2014

Hello,

I am working with Omniture Data contained in a csv in a scripted data input. I fetch a file using curl, chop the header of the file off using sed. The data looks like this before it is sent to the indexers: http://pastebin.com/HfjuT0Nq

I use the following query to regex out the data into the following fields:

index=entapps sourcetype=seoAdHocAnalysis | rex "^(?P<Dimension>[^,]+)[^,\n]*,(?P<Item>[^,]+),(?P<Visits>[^x]+),(?P<ChannelImageSearch>.+)[^,\n]*,(?P<ChannelOrganicSearch>.+)" | table _time Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

Some of the results are totally correct, while others are wrong (denoted with pink highlighting):
https://tmpdmp.com/bdd93316f0fc85f7/a63faaad56b830b1

I know why: some csv rows have numbers with quotations around them that contains commas as a part of the value. This situation makes it impossible to use the field extraction feature of 6.x in my experience thusfar. I assume there is a better way, and I am looking at sed/awk as a method for grooming k/v pairs before ingestion, but that has its own bugaboos.

bsizemore · ‎06-11-2014

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

View solution in original post

bsizemore · ‎06-11-2014

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

regex expression works sometimes and not at others

Extending Observability Content to Splunk Cloud

More Control Over Your Monitoring Costs with Archived Metrics!

New in Observability Cloud - Explicit Bucket Histograms