Splunk Search

regex expression works sometimes and not at others

bsizemore
Path Finder

Hello,

I am working with Omniture Data contained in a csv in a scripted data input. I fetch a file using curl, chop the header of the file off using sed. The data looks like this before it is sent to the indexers: http://pastebin.com/HfjuT0Nq

I use the following query to regex out the data into the following fields:

index=entapps sourcetype=seoAdHocAnalysis | rex "^(?P<Dimension>[^,]+)[^,\n]*,(?P<Item>[^,]+),(?P<Visits>[^x]+),(?P<ChannelImageSearch>.+)[^,\n]*,(?P<ChannelOrganicSearch>.+)" | table _time Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

Some of the results are totally correct, while others are wrong (denoted with pink highlighting):
https://tmpdmp.com/bdd93316f0fc85f7/a63faaad56b830b1

I know why: some csv rows have numbers with quotations around them that contains commas as a part of the value. This situation makes it impossible to use the field extraction feature of 6.x in my experience thusfar. I assume there is a better way, and I am looking at sed/awk as a method for grooming k/v pairs before ingestion, but that has its own bugaboos.

Tags (3)
0 Karma
1 Solution

bsizemore
Path Finder

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

View solution in original post

bsizemore
Path Finder

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

Get Updates on the Splunk Community!

[Puzzles] Solve, Learn, Repeat: Dynamic formatting from XML events

This challenge was first posted on Slack #puzzles channelFor a previous puzzle, I needed a set of fixed-length ...

Enter the Agentic Era with Splunk AI Assistant for SPL 1.4

  &#x1f680; Your data just got a serious AI upgrade — are you ready? Say hello to the Agentic Era with the ...

Stronger Security with Federated Search for S3, GCP SQL & Australian Threat ...

Splunk Lantern is a Splunk customer success center that provides advice from Splunk experts on valuable data ...