Splunk Search

CSV Field Extraction with spaces in field name

burras
Communicator

I have a syslog feed coming into our Splunk system that is essentially a CSV file. It's a conglomeration of the results of different health check runs on a platform. Depending on the type of health check run, the fields are not always the same. However, each field in the syslog feed contains its own field name. Time and host fields are added automatically on syslog ingest. Examples of some different health results are here:

<date>,server1,Start Time Local=Thu Oct 13 09:45:00 EDT 2016,End Time Local=Thu Oct 13 10:00:00 EDT 2016,Site=site1,Group=HTTP,ContextRoot=/xmlapi,HTTPMethod=POST,SourceIP=127.0.0.1,Status=200,Count=360,Rate=0.400,Average Latency=1.2666

<date>,server2,Start Time Local=Thu Oct 13 09:45:00 EDT 2016,End Time Local=Thu Oct 13 10:00:00 EDT 2016,site=site2,Group=System,Disk=/dev/sdd1,Average Utilization=10.0

<date>,server3,Start Time Local=Thu Oct 13 09:45:00 EDT 2016,End Time Local=Thu Oct 13 10:00:00 EDT 2016,site=site3, Group=System,CPU ID=0,CPU Type=Idle,Average Utilization=96.4666

For the most part, Splunk does a great job of automatically pulling out these fields and accounting for spaces in the field values. The problem we're running into is where the actual field name contains spaces (e.g. Average Utilization, CPU Type, Start Time Local). These just show up in the field view as Utilization, Type, and Local, and end up overlapping other values. Is there a way to have Splunk identify everything between the "," and the "=" as the field name, rather than just the portion right before the "="?

I know that I could conceivably turn off the automatic field extraction and extract the field names myself, but with the different numbers and types of fields included in the events, that becomes an incredibly complex and time-consuming task to manage, especially since they're already talking about bringing in additional health checks that would add new values to the results.

So far, I've tried updating props.conf to handle these as an indexed CSV extraction, but that didn't make a difference in how they were processed. Other tinkering hasn't been effective either. Any help or ideas would be greatly appreciated...

0 Karma
1 Solution

lukejadamec
Super Champion

This may seem too easy, but using your logs it appears to work.
Use SEDCMD in props.conf to replace the spaces with underscores.

inputs.conf
[monitor://c:\temp\sed-csv.log]
disabled=false
index=sedcsv
sourcetype=sedcsv

props.conf
[sedcsv]
SEDCMD-replacespace = s/ /_/g

This replaces all spaces in your events with underscores. Without any other specialized config Splunk appears to be picking out all of the fields with the correct = value.
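For reference, here is roughly what the second sample event from the question would look like after the replacement (reconstructed by hand from the sample, not actual Splunk output):

<date>,server2,Start_Time_Local=Thu_Oct_13_09:45:00_EDT_2016,End_Time_Local=Thu_Oct_13_10:00:00_EDT_2016,site=site2,Group=System,Disk=/dev/sdd1,Average_Utilization=10.0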
Since you did not specify a date/time format, I just put one in at the beginning of each event.
The only field that is not automatically extracted is the server field, but that is because it is not a field=value combination. However, pulling that out at search time should not be hard.
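For example, something along these lines should do it at search time (a sketch - the field name "server" and the assumption that the server is always the second comma-separated column are mine):

index=sedcsv sourcetype=sedcsv | rex "^[^,]+,(?<server>[^,]+),"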

If I understand the pipeline correctly, this config will need to be added wherever the data is parsed (the indexer, or a heavy forwarder), and it will only affect newly ingested logs after a restart.

0 Karma

burras
Communicator

Thanks - this worked. I tried the initial response by lguinn2, but that one never seemed to work quite correctly. The SEDCMD option worked great. Appreciate the help everyone!

0 Karma

lguinn2
Legend

Isn't this also going to put underscores in the timestamp, for example: Local=Thu_Oct_13_09:45:00_EDT_2016?

0 Karma

lukejadamec
Super Champion

Yes. But I don't think those are the times used for _time. In the example logs there is "date" at the beginning which I assumed was the timestamp for _time - not sure why examples of the timestamp were left out.
If those are the timestamps, then it should be fairly easy to construct a time format so that they can be read.
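For example, a props.conf sketch along these lines should read them (assuming the Start Time Local value should drive _time, and that timestamp extraction sees the original spaces - as far as I know it runs before the SEDCMD replacement):

[sedcsv]
TIME_PREFIX = Start Time Local=
TIME_FORMAT = %a %b %d %H:%M:%S %Z %Y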
Regardless, I did not see your answer until I posted mine, and I came up with my answer after giving up on trying to figure out how to do it the way you illustrated. Your answer is probably the way it should be done.

0 Karma

burras
Communicator

Yeah, we do timestamps based on actual receive time at the forwarder, not based on any specific field in the log.

0 Karma

lguinn2
Legend

Put the following entries in props.conf and transforms.conf on your search head. This turns off the automatic field extraction and instead tells Splunk that the key-value pairs are separated by commas, and that an equals sign appears between the key and the value. Note that this is still search-time field extraction.

props.conf

[yoursourcetypehere]
REPORT-ecf = extract_csv_fields
KV_MODE = none

transforms.conf

[extract_csv_fields]
DELIMS = ",", "="
MV_ADD = true
CLEAN_KEYS = true
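As a quick sanity check before deploying the .conf changes, the same delimiter logic can be tried ad hoc with the extract command (a sketch - the cleaned field names like Average_Utilization are my assumption of what key cleaning produces from names containing spaces):

sourcetype=yoursourcetypehere | extract pairdelim="," kvdelim="=" | table Group Average_Utilization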

Second, you said that you can't break up the incoming data. Well, you can, but you may not want to. If this is the only problem that you have with your syslog feed, then I wouldn't bother to break it up.

burras
Communicator

I should also mention - all of this data is coming in as a single sourcetype. The syslog feed in question doesn't have the capability to break out the different event types into different facilities so no easy way to break them out into different sourcetypes where it'd be easier to work with 😞

0 Karma

sk314
Builder

Can you enclose your field values in quotes (" ")?

0 Karma

burras
Communicator

I wish we could, but the vendor isn't willing to update their code without some exorbitant fees...

0 Karma