Splunk Search

Extract indexes from source filenames

HLVarian
Path Finder

Forgive me, I believe this has been asked and answered in other forms, but I'm unable to figure out how to work this out based on the answers provided. I'm not that great at the RegEx stuff.

I am trying to extract data from the file names that I am inputting into Splunk. Ultimately I think that I would like to have this information available as an index. All the file names are formatted mostly the same way.

File names appear as follows:
12345_filename_78910.csv OR
1234567_file_name_8910.csv

I want to capture the items so that they map to ID1, filename, and ID2 and are output as indexes, following the pattern ID1_filename_ID2.csv

Basically the length of the number and filename string could vary, and there can be another underscore in the filename which can just stay put in the string (no need to throw it away).

Ideally, I would like this to happen as the files are coming into Splunk, and not as part of a search. But a search method would be useful for testing out the RegEx. However, would this ultimately be best as part of a config file then?

Jeremiah
Motivator

By index, do you mean you want to create indexed fields from the filename? So that in your example you have:

12345_filename_78910.csv

You want the following fields?
ID1=12345
filename=filename OR 12345_filename_78910.csv (?)
ID2=78910

A search time extraction would look something like this:

| stats count | eval source="/some/path/1234567_file_na_454054_me_8910.csv" | rex field=source "(?<ID1>\d+)_(?<filename>.*)_(?<ID2>\d+)\.csv"

You could add this to your config, so the extraction is done automatically. In props.conf add:

[your_sourcetype]
EXTRACT-file_properties = (?<ID1>\d+)_(?<filename>.*)_(?<ID2>\d+)\.csv in source

That should get you an automated extraction at search time. You can convert this to an index-time extraction if you need to, but you might want to test first and see whether search performance is really that bad. Keep in mind that you are extracting from the source field, which is itself an indexed field. Check this page for caveats about index-time fields:

http://docs.splunk.com/Documentation/Splunk/6.3.3/Data/Configureindex-timefieldextraction
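If you do end up wanting the index-time version, a rough sketch of the conversion might look like the following. The stanza name file_ids_from_source and the class name file_ids are made up for illustration, and the regex is the same one as above with numbered groups instead of named ones. This assumes the files go on the indexer or heavy forwarder that parses the data, that Splunk is restarted afterwards, and that only newly indexed events get the fields. Treat it as a starting point to test, not a verified config.

```
# transforms.conf (stanza name is hypothetical)
[file_ids_from_source]
# Read the source metadata instead of the raw event text
SOURCE_KEY = MetaData:Source
REGEX = (\d+)_(.*)_(\d+)\.csv
# Write the three captures as indexed fields
FORMAT = ID1::$1 filename::$2 ID2::$3
WRITE_META = true

# props.conf
[your_sourcetype]
TRANSFORMS-file_ids = file_ids_from_source

# fields.conf (tells search that these fields are indexed)
[ID1]
INDEXED = true

[filename]
INDEXED = true

[ID2]
INDEXED = true
```

If you go this route, drop the EXTRACT line for the same field names so the values are not extracted twice.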

HLVarian
Path Finder

Thanks Jeremiah, I tested out the RegEx

| stats count | eval source="/some/path/1234567_file_na_454054_me_8910.csv" | rex field=source "(?<ID1>\d+)_(?<filename>.*)_(?<ID2>\d+)\.csv"

And it returned what I wanted. You're awesome!

I suppose the IDs and file name can be extracted at search time. I'm new to Splunk and still evaluating its usefulness, so my intuition about what would work best is probably getting the better of me. I was considering an index-time extraction because our data will come in packets of four, like this:

123_file1_456.csv
123_file2_456.csv
123_file3_456.csv
123_file4_456.csv

789_file1_345.csv
789_file2_345.csv
789_file3_345.csv
789_file4_345.csv

The IDs tie these files together and contain important data, and the filenames contain important test info. Ultimately, I'm creating dashboards where users will search for the file(s) and graph pertinent data based on those cross-referenced IDs.

For now, I'll try editing props.conf and see how that works for me.
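To illustrate the kind of cross-referencing described here, a minimal search-time sketch that groups files by their ID pair could look like this. It builds a few sample source values with makeresults (so it needs a Splunk version that has that command) and reuses the rex from the answer above. In a real dashboard the first three lines would be replaced by a search over the actual data.

```
| makeresults
| eval source=split("123_file1_456.csv,123_file2_456.csv,789_file1_345.csv", ",")
| mvexpand source
| rex field=source "(?<ID1>\d+)_(?<filename>.*)_(?<ID2>\d+)\.csv"
| stats values(filename) AS filenames count BY ID1 ID2
```

This should produce one row per ID1/ID2 pair, listing the filenames that share those IDs.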

Thanks again!
