Getting Data In

Creating custom sourcetype from part of log filename results in sourcetype=$1

Explorer

First, apologies in advance: my attempt to explain this problem will most likely show my inexperience with Splunk.

To start off - I'm running Splunk 6 on RedHat Enterprise Linux 5.

I'm attempting to ingest many application log files into Splunk, where each filename contains the application subsystem, a date and time string, and a process ID. Each log is written once by a job triggered from the application and is never reused by any subsequent job.

I've based my research on suggestions from blogs and other posts in this forum such as:

http://blogs.splunk.com/2010/02/11/sourcetypes-gone-wild/

http://answers.splunk.com/answers/25560/field-names-from-file-including-source-and-host.html

http://answers.splunk.com/answers/83619/source-sourcetype-defined-by-folder-names.html

A sample of the log files I want to ingest:

/app/prod/app_1/logs/ABCDE123_20141013163738_24772.log
/app/prod/app_1/logs/XYZABC456_20141013093007_16799.log
/app/prod/app_1/logs/EFGHIJK789_20141013093007_16799.log
/app/prod/app_1/logs/123ABC_20141013093007_16799.log

In my universal forwarder I have an inputs.conf file with the following entry:

[monitor:///app/prod/app_1/logs/*.log]
disabled = false
followTail = 1
index = app_index

In my indexer I have a props.conf file with the following entry:

[source::/app/prod/app_1/logs/*.log]
TRANSFORMS-set_sourcetype_app_logs = set_sourcetype_app_logs

Also in my indexer a transforms.conf file with the following entry:

[set_sourcetype_app_logs]
DEST_KEY=MetaData:Sourcetype
SOURCE_KEY=MetaData:Source
REGEX=\w+(?=_\w+_\w+\.log$)
FORMAT=sourcetype::$1

My expectation is that indexed logs should have a source like "/app/prod/app_1/logs/ABCDE123_20141013163738_24772.log" and a sourcetype like "ABCDE123".

However, once the logs are ingested and indexed, a search reveals that all of the data has the literal sourcetype '$1' instead of the value extracted from the filename.

Do I have a problem with my transforms.conf regex or is my configuration completely off the mark?

Any help would be greatly appreciated.

Thanks,
Bobby

0 Karma
1 Solution

Splunk Employee

Since you want to use the first part of the filename, you need to change your regex. As written, it has no capturing group, so there is nothing for $1 to refer to:

REGEX=\w+(?=_\w+_\w+\.log$)

You want the \w characters that precede the _number_number.log, so you have to make them a capturing group, like so:

REGEX=(\w+)(?=_\w+_\w+\.log$)

Now $1 is ABCDE123 for the first sample file, and likewise for the others.
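
The difference between the two patterns can be checked outside Splunk. The sketch below uses Python's re module as a stand-in, which is an assumption: Splunk uses PCRE, although capturing groups and lookaheads behave the same way in both for this pattern. The helper name extract_sourcetype is made up for illustration, and the paths are the samples from the question.

```python
import re

# Sample paths from the question.
PATHS = [
    "/app/prod/app_1/logs/ABCDE123_20141013163738_24772.log",
    "/app/prod/app_1/logs/XYZABC456_20141013093007_16799.log",
    "/app/prod/app_1/logs/EFGHIJK789_20141013093007_16799.log",
    "/app/prod/app_1/logs/123ABC_20141013093007_16799.log",
]

# Original pattern: matches, but defines no capturing group.
WITHOUT_GROUP = re.compile(r"\w+(?=_\w+_\w+\.log$)")
# Corrected pattern: the leading \w+ is wrapped in a capturing group.
WITH_GROUP = re.compile(r"(\w+)(?=_\w+_\w+\.log$)")


def extract_sourcetype(path):
    """Hypothetical helper: return capture group 1, or None if nothing matches."""
    match = WITH_GROUP.search(path)
    return match.group(1) if match else None


# Number of capturing groups each pattern defines.
print("groups without parentheses:", WITHOUT_GROUP.groups)
print("groups with parentheses:", WITH_GROUP.groups)

for path in PATHS:
    print(path, "->", extract_sourcetype(path))
```

With no capturing group defined, a reference to group 1 has nothing to resolve to, which is consistent with the literal '$1' seen in the indexed data.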

Explorer

Perfect! Worked like a charm. Thanks for your help!

0 Karma

Splunk Employee

Incidentally, if you wanted to use the entire regex match, you could have used $0, but I encourage the explicit capture group approach.

0 Karma

Splunk Employee

Are you sure (?=) acts as a capturing group? I think it's just a zero-width assertion that doesn't capture any text. Why not drop the ?=?

I'm a little wary of a sourcetype like 20141013093007_16799.log, though. That doesn't sound like a data format, which is what sourcetypes are. It sounds like the time of day at which the data was produced. I would typically want to call this data something like "app_1".

Aside: you probably want to disable followTail. It's not really reasonable or safe, and Splunk only gets the new data anyway. followTail is only useful when first setting up a forwarder.

Explorer

Hi jrodman,

Thanks for your feedback.
Just to be clear, the sourcetype I wanted was "XYZABC456" and not "20141013093007_16799.log".

Regarding the regex, I tested this on www.regexr.com and it seemed to work fine there. According to regexr, "(?=)" is a positive lookahead, so "(?=_\w+_\w+\.log$)" looks for the pattern "_word_word.log" and the preceding "\w+" matches the word before the lookahead pattern.
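
To make the lookahead behaviour concrete, here is a minimal sketch, again in Python's re module rather than Splunk's PCRE engine (an assumption, though lookaheads work the same way in both for this pattern). It compares the lookahead form with the same pattern after dropping "?=", and treats Python's group(0) as the analogue of the whole-match $0 mentioned earlier in the thread.

```python
import re

# One of the sample paths from the question.
path = "/app/prod/app_1/logs/XYZABC456_20141013093007_16799.log"

# Lookahead form: the suffix must follow, but it is not consumed or captured.
lookahead = re.search(r"(\w+)(?=_\w+_\w+\.log$)", path)

# Same pattern with "?=" dropped: the suffix becomes an ordinary second group
# and is now part of the overall match.
plain = re.search(r"(\w+)(_\w+_\w+\.log$)", path)

print("lookahead, whole match:", lookahead.group(0))
print("lookahead, group 1:    ", lookahead.group(1))
print("plain, whole match:    ", plain.group(0))
print("plain, group 1:        ", plain.group(1))
```

In both forms group 1 holds the job name. Only the whole match differs: the lookahead keeps the date and PID suffix out of it, while the plain form includes the suffix.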

I did consider sticking with "app_1" at one point - but we have so many different job types (the examples are only a very small subset) - that extracting job name as the sourcetype from the filename would be more useful.

Cheers,
Bobby

0 Karma

Splunk Employee

Oh sorry, reading comprehension fail on my part.

0 Karma

Splunk Employee

Make that an answer and I'll upvote it.

0 Karma

Splunk Employee

Are you sure it's the answer to the question? I still can't tell.

0 Karma