Forgive me, I believe this has been asked and answered in other forms, but I'm unable to figure out how to work this out based on the answers provided. I'm not that great at the RegEx stuff.
I am trying to extract data from the file names that I am inputting into Splunk. Ultimately I think that I would like to have this information available as an index. All the file names are formatted mostly the same way.
File names appear as follows:
12345_filename_78910.csv OR
1234567_file_name_8910.csv
I want to capture the items so that they map to ID1, ID2, and filename, and output them as indexes. That is, each name follows the pattern ID1_filename_ID2.csv.
Basically the length of the number and filename string could vary, and there can be another underscore in the filename which can just stay put in the string (no need to throw it away).
Ideally, I would like this to happen as the files are coming into Splunk, and not as part of a search. But a search method would be useful for testing out the RegEx. However, would this ultimately be best as part of a config file then?
By index, do you mean you want to create indexed fields from the filename? So that in your example you have:
12345_filename_78910.csv
You want the following fields?
ID1=12345
filename=filename OR 12345_filename_78910.csv (?)
ID2=78910
A search time extraction would look something like this:
| stats count | eval source="/some/path/1234567_file_na_454054_me_8910.csv" | rex field=source "(?<ID1>\d+)_(?<filename>.*)_(?<ID2>\d+)\.csv"
You could add this to your config, so the extraction is done automatically. In props.conf add:
[your_sourcetype]
EXTRACT-file_properties = (?<ID1>\d+)_(?<filename>.*)_(?<ID2>\d+)\.csv in source
That should get you an automated extraction at search time. You can convert this to an index-time extraction if you need to, but you might want to test first and see whether search performance is really that bad. Keep in mind that you are searching in the source field, which is itself an indexed field. See this page for the caveats about index-time fields:
http://docs.splunk.com/Documentation/Splunk/6.3.3/Data/Configureindex-timefieldextraction
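If you want to sanity-check the pattern outside Splunk, here is a minimal sketch in Python. It is only an illustration, not part of any Splunk config: the helper name parse_source is made up, and Python spells named groups (?P<name>...) where the rex example above uses (?<name>...). It shows that the greedy .* keeps any extra underscores inside filename, because the regex backtracks only as far as the last _digits.csv.

```python
import re

# Same pattern as the rex example, rewritten with Python's (?P<name>...) syntax.
PATTERN = re.compile(r"(?P<ID1>\d+)_(?P<filename>.*)_(?P<ID2>\d+)\.csv")


def parse_source(source):
    """Return a dict with ID1, filename and ID2, or None if nothing matches.

    parse_source is a hypothetical helper name used only for this sketch.
    """
    match = PATTERN.search(source)
    return match.groupdict() if match else None


samples = (
    "12345_filename_78910.csv",
    "1234567_file_name_8910.csv",
    "/some/path/1234567_file_na_454054_me_8910.csv",
)

for sample in samples:
    print(sample, "->", parse_source(sample))
```

One thing to watch for: the pattern is not anchored to the file name, so if a directory in the path contains digits followed by an underscore, the match can start there instead. Testing against a few of your real source paths should show whether that matters for you.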
Thanks Jeramiah, I tested out the RegEx:
| stats count | eval source="/some/path/1234567_file_na_454054_me_8910.csv" | rex field=source "(?<ID1>\d+)_(?<filename>.*)_(?<ID2>\d+)\.csv"
And it returned what I wanted. You're awesome!
I suppose the IDs and file name can be extracted at search time. I'm new to Splunk and still evaluating its usefulness, so my intuition about what would work best is probably getting the better of me. I think the reason I was considering an index-time extraction is that our data will be coming in as sets of four files, like this:
123_file1_456.csv
123_file2_456.csv
123_file3_456.csv
123_file4_456.csv
789_file1_345.csv
789_file2_345.csv
789_file3_345.csv
789_file4_345.csv
The IDs tie these files together and contain important data, and the filenames contain important test info. Ultimately, I'm creating dashboards where users will search for the file(s) and graph pertinent data based on those cross-referenced IDs.
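To make the grouping idea concrete, here is a rough sketch in Python of how the two IDs tie each set of four files together. It runs outside Splunk and is only an illustration: the function name group_by_ids is made up, and the pattern is the one from the answer rewritten with Python's (?P<name>...) group syntax.

```python
import re
from collections import defaultdict

# Pattern from the answer above, in Python's named-group syntax.
PATTERN = re.compile(r"(?P<ID1>\d+)_(?P<filename>.*)_(?P<ID2>\d+)\.csv")


def group_by_ids(names):
    """Map each (ID1, ID2) pair to the list of filename parts that share it.

    group_by_ids is a hypothetical helper name used only for this sketch.
    """
    groups = defaultdict(list)
    for name in names:
        match = PATTERN.search(name)
        if match:  # names that do not fit the pattern are skipped
            key = (match.group("ID1"), match.group("ID2"))
            groups[key].append(match.group("filename"))
    return dict(groups)


names = [
    "123_file1_456.csv", "123_file2_456.csv",
    "123_file3_456.csv", "123_file4_456.csv",
    "789_file1_345.csv", "789_file2_345.csv",
    "789_file3_345.csv", "789_file4_345.csv",
]

for ids, files in group_by_ids(names).items():
    print(ids, files)
```

Once ID1, ID2, and filename exist as fields in Splunk, I assume the dashboards would do the same thing by grouping on ID1 and ID2 in the search itself.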
For now, I'll try editing props.conf and see how that works for me.
Thanks again!