Creating custom sourcetype from part of log filename results in field sourcetype=$1

bobmacks
Explorer

First, apologies in advance: my attempt to explain my problem will most likely show my inexperience with Splunk.

To start off - I'm running Splunk 6 on RedHat Enterprise Linux 5.

I'm attempting to ingest many application log files into Splunk where part of the filename contains the application subsystem, a date and time string, and a process ID. Suffice to say, these logs are only created once by a job triggered from the application - and never used in any subsequent jobs.

I've based my research on suggestions from blogs and other posts in this forum such as:

http://blogs.splunk.com/2010/02/11/sourcetypes-gone-wild/

http://answers.splunk.com/answers/25560/field-names-from-file-including-source-and-host.html

http://answers.splunk.com/answers/83619/source-sourcetype-defined-by-folder-names.html

A sample of the log files I want to ingest:

/app/prod/app_1/logs/ABCDE123_20141013163738_24772.log
/app/prod/app_1/logs/XYZABC456_20141013093007_16799.log
/app/prod/app_1/logs/EFGHIJK789_20141013093007_16799.log
/app/prod/app_1/logs/123ABC_20141013093007_16799.log

In my universal forwarder I have an inputs.conf file with the following entry:

[monitor:///app/prod/app_1/logs/*.log]
disabled = false
followTail = 1
index = app_index

In my indexer I have a props.conf file with the following entry:

[source::/app/prod/app_1/logs/*.log]
TRANSFORMS-set_sourcetype_app_logs = set_sourcetype_app_logs

Also in my indexer a transforms.conf file with the following entry:

[set_sourcetype_app_logs]
DEST_KEY=MetaData:Sourcetype
SOURCE_KEY=MetaData:Source
REGEX=\w+(?=_\w+_\w+\.log$)
FORMAT=sourcetype::$1

My expectation is that indexed logs should have a source like "/app/prod/app_1/logs/ABCDE123_20141013163738_24772.log" and a sourcetype like "ABCDE123".

However, once the logs are ingested and indexed, a search reveals that all of the data has the literal sourcetype '$1' instead of the value the regex was meant to extract from the filename.

Do I have a problem with my transforms.conf regex or is my configuration completely off the mark?

Any help would be greatly appreciated.

Thanks,
Bobby

jrodman
Splunk Employee

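Your FORMAT refers to $1, but your regex has no capturing group, so there is nothing for $1 to expand to. Since you want to use the first part of the filename, you need to change your regex.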

REGEX=\w+(?=_\w+_\w+\.log$)

You want the \w characters that precede the _number_number.log, so you have to make them a capturing group, like so:

REGEX=(\w+)(?=_\w+_\w+\.log$)

Now $1 is ABCDE123 and similar.
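
To check the difference outside Splunk, here is a minimal sketch in Python. It assumes Python's `re` module treats these two patterns the same way Splunk's PCRE engine does (the syntax used here is common to both), and `job_name` is a hypothetical helper, not anything Splunk provides. The sample paths are the ones from the question.

```python
import re

# Sample source paths copied from the question.
SAMPLE_SOURCES = [
    "/app/prod/app_1/logs/ABCDE123_20141013163738_24772.log",
    "/app/prod/app_1/logs/XYZABC456_20141013093007_16799.log",
    "/app/prod/app_1/logs/EFGHIJK789_20141013093007_16799.log",
    "/app/prod/app_1/logs/123ABC_20141013093007_16799.log",
]

ORIGINAL = re.compile(r"\w+(?=_\w+_\w+\.log$)")   # no capturing group
FIXED = re.compile(r"(\w+)(?=_\w+_\w+\.log$)")    # group 1 holds the job name


def job_name(source):
    """Hypothetical helper: the text group 1 captures, or None if no match."""
    match = FIXED.search(source)
    return match.group(1) if match else None


names = [job_name(s) for s in SAMPLE_SOURCES]

print("groups in original regex:", ORIGINAL.groups)
print("groups in fixed regex:", FIXED.groups)
print(names)
```

The original pattern reports zero capturing groups, which is consistent with `$1` being left unexpanded, while the fixed pattern captures the leading part of each filename.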

bobmacks
Explorer

Perfect! Worked like a charm. Thanks for your help!

jrodman
Splunk Employee

Incidentally, if you wanted to use the entire regex match, you could have used $0, but I encourage the explicit capture group approach.
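
As a rough illustration of the `$0` point, the sketch below runs the original, unmodified regex in Python, where `group(0)` is the counterpart of the whole match. This assumes Python's `re` behaves like Splunk's regex engine for this pattern. The variable names are illustrative only.

```python
import re

source = "/app/prod/app_1/logs/XYZABC456_20141013093007_16799.log"

# The regex from the question, left without a capturing group.
original = re.compile(r"\w+(?=_\w+_\w+\.log$)")
match = original.search(source)

# group(0) is the entire match, the Python counterpart of $0.
whole_match = match.group(0)

# Asking for group 1 fails because the pattern defines no such group.
try:
    match.group(1)
    has_group_one = True
except IndexError:
    has_group_one = False

print(whole_match, has_group_one)
```

The whole match is already just the job name, because the lookahead contributes no text to it. An explicit group still makes the intent clearer if the pattern is later changed.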

jrodman
Splunk Employee

Are you sure (?=) acts as a capturing group? I think it's just a zero-width assertion that doesn't capture any text. Why not drop the ?=?

I'm a little skittish about a sourcetype like 20141013093007_16799.log though. That doesn't sound like a data format, which is what sourcetypes are. It sounds like the time of day at which the data was produced. I would typically want to call this data something like "app_1".

Aside: you probably want to disable followTail. It's not really reasonable or safe to leave on, and Splunk only gets the new data anyway. followTail is only useful when first setting up a forwarder.

bobmacks
Explorer

Hi jrodman,

Thanks for your feedback.
Just to be clear, the sourcetype I wanted was "XYZABC456" and not "20141013093007_16799.log".

Regarding the regex, I tested it on www.regexr.com and it seemed to work fine there. According to regexr, "(?=)" is a positive lookahead, so "(?=_\w+_\w+\.log$)" looks for the pattern "_word_word.log", and the preceding "\w+" matches the word before the lookahead pattern.
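
The lookahead behavior described above can be compared with a variant that drops the `?=`, as suggested earlier in the thread. This is a sketch in Python, assuming its `re` module handles lookaheads the same way as regexr and Splunk for these patterns. The variable names are made up for the example.

```python
import re

source = "/app/prod/app_1/logs/XYZABC456_20141013093007_16799.log"

# Lookahead version: the "_date_pid.log" tail is checked but not consumed.
with_lookahead = re.search(r"(\w+)(?=_\w+_\w+\.log$)", source)

# Same pattern with the ?= dropped: the tail becomes part of the match.
without_lookahead = re.search(r"(\w+)_\w+_\w+\.log$", source)

lookahead_whole = with_lookahead.group(0)
consuming_whole = without_lookahead.group(0)

print("whole match with lookahead:   ", lookahead_whole)
print("whole match without lookahead:", consuming_whole)
print("group 1 in both cases:        ",
      with_lookahead.group(1), without_lookahead.group(1))
```

Group 1 is the same either way. Only the whole match differs, so the lookahead matters when relying on the entire match rather than on an explicit group.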

I did consider sticking with "app_1" at one point - but we have so many different job types (the examples are only a very small subset) - that extracting the job name from the filename as the sourcetype would be more useful.

Cheers,
Bobby

jrodman
Splunk Employee

Oh sorry, reading comprehension fail on my part.

sowings
Splunk Employee

Make that an answer and I'll upvote it.

jrodman
Splunk Employee

Are you sure it's the answer to the question? I still can't tell.
