Getting Data In

splunk field extraction csv

jbanda
Path Finder

Ultimately I'm trying to get meaningful data out of Exchange message tracking logs (which are in single-line-record, comma-separated-field form) using splunk... but since there's still a lot that's hazy to me about how splunk works, I thought I'd try baby steps first.

First, we're using version 4.1.3 on our splunk server. Reading around, I've seen a few posts where people fight with csv files and field extractions (manual or auto). I've been trying to piece together the different bits of information, and I attempted to get a simple test going. I have a file that currently looks like this:

#Fields: username,firstname,initial,lastname,emailaddress,company
jdoe1,John,D,Doe1,jdoe1@company1.com,company1
jdoe2,Jane,J,Doe2,jdoe2@company1.com,company1

I left the first line there to mimic what our current exchange message tracking log looks like (we are running Exchange 2010).

It was hard to find a good concrete example that multiple sources seemed to reference. I found many sources that all seemed to do things a little differently. The following files are what I managed to piece together. I figured I'd try a manual field specification first, and went this route:

On the client (also 4.1.3), I have this in inputs.conf:

[monitor:///path/to/test.log]
source = testcsv
sourcetype = testcsvlog
crcSalt = <SOURCE>
disabled = 0

in props.conf, I have this:

[testcsvlog]
REPORT-testcsvlogextract = testcsvlog_extractions

and in transforms.conf, I have this:

[source::/path/to/test.log]
sourcetype = testcsvlog
priority = 100

[testcsvlog_extractions]
DELMIS=","
FIELDS="username","firstname","initial","lastname","emailaddress","company"

Oddly enough, the first unusual thing I noticed is that when I let splunk do its thing with test.log, it put both records in the same event when I looked up events with the sourcetype "testcsvlog". I was expecting that since the records are on separate lines, it would pick them up as separate events. Adding additional lines to test.log, though, seemed to have them show up as separate entries, so at least part of this was working. I'm not sure why it did that, but if anyone can explain it, I'd appreciate it.

The second thing I noticed is that, unfortunately, I saw no reference to any of the fields I specified. Depending on the source I read, some claimed that splunk just cannot work with "csv-style" input, while others claimed to get it to work. I found a reference to putting "KV_MODE = none" in props.conf to disable some checks splunk supposedly does when it detects a csv file, but I tried that too with no different results.

I am obviously missing something (or several things) if I cannot get even this simple example to work. Can anyone point me in the right direction as to what I need to do?

1 Solution

Lowell
Super Champion

I don't think any of this will fully "answer" your question, but perhaps I can help clear some things up for you. It looks like you've done a lot of research and digging around before posting a question here; you've done well, and I think you are really close to a working config. Hopefully some of my ramblings will help you out:


First, you said that you have this in your transforms.conf; this stanza belongs in props.conf:

[source::/path/to/test.log]
sourcetype = testcsvlog
priority = 100

That said, this probably isn't your problem, since all it would do (if it were in the correct config file) is set the sourcetype of your input to "testcsvlog", which you are already forcing in your inputs.conf, and that appears to be done correctly in your example.


Do your events have any timestamps? Time is very important to Splunk; finding a timestamp in your event is one way splunk knows how to split up your events. So if your events don't have timestamps, and you want just one event per line of your input file, then you can simply add "SHOULD_LINEMERGE = false" to your props.conf stanza and that will force your events to be one line each.


I would suggest that you not use crcSalt = <SOURCE> unless you fully understand all of its implications. That "feature" has come up a number of times on this site, and while it is definitely useful in certain situations, in most situations it's not really needed and can cause more problems than it solves. If you simply want to reload the same file over and over again for testing purposes, then I suggest either dumping your indexes and reindexing all your data (assuming you can purge your data because you're still in the early setup phase of using splunk), or using the "oneshot" input mode to re-index the same file as many times as you want. You can use oneshot mode by running the following command:

$SPLUNK_HOME/bin/splunk add oneshot /path/to/test.log


On the topic of 'auto' vs. 'manual' csv handling: I think what you are referring to is the automatic header extraction feature. Personally, I've never been able to make the "auto" thing work right, and it isn't supported when you start using forwarders anyway (which I use), so I've given up on it for the time being (perhaps I'll reconsider if this feature is improved in a newer release). I know it works for some people, so you can try it if you would like...

That said, your given transforms.conf example looks good.

See if this config works for you. (You may want to archive off your other config and just try one at a time if there is any possibility that you could have overlapping stanzas configured in different places, just FYI. The btool utility is very helpful here, and there are some other helpful debugging techniques here: What's the best way to track down props.conf problems?)

inputs.conf:

[monitor:///path/to/test.log]

props.conf:

[source::/path/to/test.log]
sourcetype = testcsvlog
# You really shouldn't have to bump the priority here.

[testcsvlog]
SHOULD_LINEMERGE=false
KV_MODE = none
REPORT-testcsvlogextract = testcsvlog_extractions

transforms.conf:

[testcsvlog_extractions]
DELIMS = ","
FIELDS = "username","firstname","initial","lastname","emailaddress","company"

Then restart splunkd, and run the "oneshot" command line thing I posted above.


agthurber
Explorer

Excellent question; I have a similar task at hand. After attending the 2011 conf I am going to download the new App for Exchange and see if I can tweak it to analyze logs coming in from Port25 PowerMTA, but if that does not work I will be using your example to start my own app. Thanks for documenting!



Lowell
Super Champion

Glad you have it working. Getting all the splunk concepts straight (especially where config files go in different situations) can be tricky at first, but it sounds like you are well on your way. Happy splunking!


jbanda
Path Finder

Lowell, now I "see the light". It seems to be working with the regex version, but I'll probably try the field extraction method with another sourcetype. I think my entire problem was that I didn't realize the nature of a light forwarder. Now I have the inputs.conf on my light forwarder, specifying a file, index and sourcetype... and nothing else. On my splunk server, I have the appropriate props.conf and transforms.conf in an active app. I think (at least I hope) I have the relationship correct now. Thanks again for your extremely helpful input!


Lowell
Super Champion

Yes, "| file" is a search command that you can type into the UI search bar; or run via the search command line interface or any of the other ways you can run a search (it's pretty flexible). I updated the other page to hopefully make the most obvious case more clear.


Lowell
Super Champion

About the light forwarder thing... keep in mind that we are talking about field-extractions, and by their very definition such fields are only extracted at search time and therefore are only relevant on the splunk instance where you are searching. This is preferred, because you can go back after the fact and make corrections (since it's often difficult to get field extractions 100% right the first time around.) There is no need to deploy your field extractions to your forwarders, but it doesn't hurt anything if you do (they are just not used.) Does that help?


Lowell
Super Champion

Commas within your data can be a pain. But I'm pretty sure splunk handles them properly with delimiter-based extractions as long as the value is within double quotes, so ...,"My Co, Inc.",... should be fine; there are some corner cases I've run into, though, mostly with \ characters.
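
As an aside, that quoting convention is just the standard CSV rule, so you can sanity-check what a delimiter-based parse should produce with Python's csv module (this is only an illustration of the convention, not Splunk's own parser, and the sample line is made up):

```python
import csv
import io

# A record whose last field contains a comma, protected by double quotes.
line = 'jdoe1,John,D,Doe1,jdoe1@company1.com,"My Co, Inc."'
row = next(csv.reader(io.StringIO(line)))

# The quoted comma stays inside the final field instead of splitting it.
print(row[-1])  # → My Co, Inc.
```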


jbanda
Path Finder

Also, I'd like some clarification on whether this can indeed be done on the light forwarder (client end). I could not get this to work there at all, only on the splunk server (I would get the data, but my fields wouldn't show up). Is this just another typo on my part, or is the limitation real?


jbanda
Path Finder

Ha, ok, I should have refreshed my screen earlier, but I see Lowell replied with an obvious typo on my part! I'm guessing that would have worked, but since I was ultimately going to use this to parse Exchange message tracking logs, I think I would have had to go the regex route anyway, only because several fields contain commas (like the subject of a message). The csv regex (which I found on www.kimgentes.com) is a "master csv parser" that handles commas in fields, provided they are enclosed between double quotes. This is the regex that needs to be included per field: ("(?:[^"]|"")*"|[^,]*)


jbanda
Path Finder

[testcsvlog_extractions]
REGEX=^("(?:[^"]|"")*"|[^,]*),("(?:[^"]|"")*"|[^,]*),("(?:[^"]|"")*"|[^,]*),("(?:[^"]|"")*"|[^,]*),("(?:[^"]|"")*"|[^,]*),("(?:[^"]|"")*"|[^,]*)
FORMAT=username::$1 firstname::$2 initials::$3 lastname::$4 email::$5 company::$6
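
As a sanity check outside Splunk, the same per-field regex can be exercised in Python; the field names and sample line below are only illustrative:

```python
import re

# One capture group per CSV field: either a double-quoted value
# (where "" is an escaped quote) or a run of non-comma characters.
FIELD = r'("(?:[^"]|"")*"|[^,]*)'
LINE_RE = re.compile("^" + ",".join([FIELD] * 6) + "$")

names = ["username", "firstname", "initials", "lastname", "email", "company"]
line = 'jdoe1,John,D,Doe1,jdoe1@company1.com,"My Co, Inc."'

match = LINE_RE.match(line)
fields = dict(zip(names, match.groups()))

# The quoted comma does not split the company field in two.
print(fields["company"])  # → "My Co, Inc." (quotes included)
```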


jbanda
Path Finder

props.conf:

[testcsvlog]
SHOULD_LINEMERGE=false
KV_MODE = none
REPORT-testcsvlogextract = testcsvlog_extractions


jbanda
Path Finder

Now that I've managed to do more testing, I found the following: based on what you said about the "auto thing" not being supported with forwarders, I decided to try this locally on the splunk server. I still got no results, but I decided to forgo the "DELIMS" route and try using a regex. I finally got something to work, but ONLY from the splunk server itself. I must be missing something, because the light forwarder seems horribly limited if that cannot be done there. Below are my props.conf and transforms.conf:


jamesar
Explorer

I know this post is old, but auto header extraction should now work on a light forwarder...

E.g: props.conf

[my_source_type]
SHOULD_LINEMERGE=false
HEADER_FIELD_LINE_NUMBER=3
HEADER_FIELD_DELIMITER=;
FIELD_DELIMITER=;
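
For reference, a config like that would presumably match an input file shaped something like this (a made-up example, with two preamble lines, the header on line 3, and ; as the delimiter):

```
#Software: Example MTA
#Version: 1.0
username;firstname;initial;lastname;emailaddress;company
jdoe1;John;D;Doe1;jdoe1@company1.com;company1
```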

Lowell
Super Champion

Doh! I found it. "DELMIS" should be "DELIMS". I totally missed that. (Splunk 4.2 will include a config file syntax checker.) These things can take a long time to track down by hand! (I've updated my answer.)


Lowell
Super Champion

btool is a command line tool. You should be able to run $SPLUNK_HOME/bin/btool props list testcsvlog to confirm your stanza. Also try the command splunk test sourcetype /path/to/test.log to make sure your props.conf settings are actually being applied to your input path. With the combination of these tools, you should be able to find various config issues; for example, if your config file is named 'transform.conf' (instead of 'transforms.conf'), btool will show empty output. I'm guessing you have a simple silly mistake somewhere in your config; hang in there.


jbanda
Path Finder

Also, thanks for the oneshot recommendation; I like this method a lot better. I saw some references to btool and "| file", but I wasn't sure where I was supposed to run them. Should they be run from the splunk client that has the logs, or from the splunk server? I'm also not exactly sure what the "| file" command is. Is that supposed to be entered in the search bar of the web interface? Again, is that on the client or the server?


jbanda
Path Finder

Lowell,
thanks for the detailed answer! I know I put some of those "questionable" attributes in the config files because I found other examples where people claimed they helped them overcome their problems.

I also see now how not having a date would make the records appear as one. I did try your example, though, including "SHOULD_LINEMERGE=false" in props.conf, but that didn't seem to make a difference... nor did it make my fields show up.

I'm about ready to give up on this and preprocess the logs into a field=value format... which I KNOW works...
