Splunk Search

regex expression works sometimes and not at others

bsizemore
Path Finder

Hello,

I am working with Omniture Data contained in a csv in a scripted data input. I fetch a file using curl, chop the header of the file off using sed. The data looks like this before it is sent to the indexers: http://pastebin.com/HfjuT0Nq

I use the following query to regex out the data into the following fields:

index=entapps sourcetype=seoAdHocAnalysis | rex "^(?P<Dimension>[^,]+)[^,\n]*,(?P<Item>[^,]+),(?P<Visits>[^x]+),(?P<ChannelImageSearch>.+)[^,\n]*,(?P<ChannelOrganicSearch>.+)" | table _time Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

Some of the results are totally correct, while others are wrong (denoted with pink highlighting):
https://tmpdmp.com/bdd93316f0fc85f7/a63faaad56b830b1

I know why: some csv rows have numbers with quotations around them that contains commas as a part of the value. This situation makes it impossible to use the field extraction feature of 6.x in my experience thusfar. I assume there is a better way, and I am looking at sed/awk as a method for grooming k/v pairs before ingestion, but that has its own bugaboos.

Tags (3)
0 Karma
1 Solution

bsizemore
Path Finder

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

View solution in original post

bsizemore
Path Finder

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

Get Updates on the Splunk Community!

September Community Champions: A Shoutout to Our Contributors!

As we close the books on another fantastic month, we want to take a moment to celebrate the people who are the ...

Splunk Decoded: Service Maps vs Service Analyzer Tree View vs Flow Maps

It’s Monday morning, and your phone is buzzing with alert escalations – your customer-facing portal is running ...

What’s New in Splunk Observability – September 2025

What's NewWe are excited to announce the latest enhancements to Splunk Observability, designed to help ITOps ...