First, apologies in advance: my attempt to explain the problem will most likely show my inexperience with Splunk.
To start off - I'm running Splunk 6 on RedHat Enterprise Linux 5.
I'm attempting to ingest many application log files into Splunk where part of the filename contains the application subsystem, a date and time string, and a process ID. Suffice it to say, each log is created once by a job triggered from the application and is never written to by any subsequent job.
I've based my research on suggestions from blogs and other posts in this forum such as:
http://blogs.splunk.com/2010/02/11/sourcetypes-gone-wild/
http://answers.splunk.com/answers/25560/field-names-from-file-including-source-and-host.html
http://answers.splunk.com/answers/83619/source-sourcetype-defined-by-folder-names.html
A sample of the log files I want to ingest:
/app/prod/app_1/logs/ABCDE123_20141013163738_24772.log
/app/prod/app_1/logs/XYZABC456_20141013093007_16799.log
/app/prod/app_1/logs/EFGHIJK789_20141013093007_16799.log
/app/prod/app_1/logs/123ABC_20141013093007_16799.log
In my universal forwarder I have an inputs.conf file with the following entry:
[monitor:///app/prod/app_1/logs/*.log]
disabled = false
followTail = 1
index = app_index
In my indexer I have a props.conf file with the following entry:
[source::/app/prod/app_1/logs/*.log]
TRANSFORMS-set_sourcetype_app_logs = set_sourcetype_app_logs
Also in my indexer a transforms.conf file with the following entry:
[set_sourcetype_app_logs]
DEST_KEY=MetaData:Sourcetype
SOURCE_KEY=MetaData:Source
REGEX=\w+(?=_\w+_\w+\.log$)
FORMAT=sourcetype::$1
My expectation is that indexed logs should have a source like "/app/prod/app_1/logs/ABCDE123_20141013163738_24772.log" and a sourcetype like "ABCDE123".
However, once the logs are ingested and indexed, a search reveals that all the data appears with a literal sourcetype of '$1' instead of the value the regex was meant to extract from the filename.
Do I have a problem with my transforms.conf regex or is my configuration completely off the mark?
Any help would be greatly appreciated.
Thanks,
Bobby
Since you want to use the first part of the filename, you need to change your regex.
REGEX=\w+(?=_\w+_\w+\.log$)
You want the \w characters that precede the _number_number.log, so you have to make them a capturing group, like so:
REGEX=(\w+)(?=_\w+_\w+\.log$)
Now $1 is ABCDE123 and similar.
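For anyone who wants to check this outside Splunk, here is a minimal Python sketch. It assumes Python's re module treats this particular pattern the same way Splunk's PCRE engine does (the \w and lookahead syntax are common to both). The helper name first_group is made up for illustration, and the paths are the samples from the question.

```python
import re

# Sample paths taken from the question.
paths = [
    "/app/prod/app_1/logs/ABCDE123_20141013163738_24772.log",
    "/app/prod/app_1/logs/123ABC_20141013093007_16799.log",
]

# Original pattern: it matches the job name but defines no capturing group.
no_group = re.compile(r"\w+(?=_\w+_\w+\.log$)")
# Suggested pattern: the job name is wrapped in parentheses, so it becomes group 1.
with_group = re.compile(r"(\w+)(?=_\w+_\w+\.log$)")


def first_group(pattern, path):
    """Return the text of capture group 1, or None if there is no match or no group."""
    match = pattern.search(path)
    if match is None or pattern.groups < 1:
        return None
    return match.group(1)


for path in paths:
    print(path)
    print("  whole match, no group  :", no_group.search(path).group(0))
    print("  group 1, no group      :", first_group(no_group, path))
    print("  group 1, with group    :", first_group(with_group, path))
```

Both patterns match the same text, but only the second one defines a group 1 for FORMAT to substitute, which fits the literal '$1' seen in the search results.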
Perfect! Worked like a charm. Thanks for your help!
Incidentally if you wanted to use the entire regex match, you could have used $0, but I encourage the explicit capture group approach.
Are you sure (?=) acts as a capturing group? I think it's just a zero-width assertion that doesn't capture any text. Why not drop the ?=
I'm a little skittish of a sourcetype like 20141013093007_16799.log though. That doesn't sound like a data format, which is what sourcetypes are. It sounds like the time of day at which the data was produced. I would typically want to call this data something like "app_1".
Aside: you probably want to disable followTail. It's not really reasonable/safe, and Splunk only gets the new data anyway. followTail is just useful when first setting up a forwarder.
Hi jrodman,
Thanks for your feedback.
Just to be clear, the sourcetype I wanted was "XYZABC456" and not "20141013093007_16799.log".
Regarding the regex, I tested this on www.regexr.com and it seemed to work fine there. According to regexr, "(?=)" is a positive lookahead, so "(?=_\w+_\w+\.log$)" looks for the pattern "_word_word.log" and the preceding "\w" matches the word before the lookahead pattern.
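The lookahead behaviour described above can be checked with a small Python sketch. It assumes Python's re module and Splunk's PCRE engine agree on this pattern, and the variable names are illustrative only. It shows that the lookahead consumes no text, and what the parentheses would capture if the ?= were dropped and they became an ordinary group.

```python
import re

path = "/app/prod/app_1/logs/XYZABC456_20141013093007_16799.log"

# Positive lookahead: the suffix must follow, but it is not part of the match.
lookahead = re.search(r"\w+(?=_\w+_\w+\.log$)", path)
matched_text = lookahead.group(0)
remaining_text = path[lookahead.end():]  # text left unconsumed after the match

# Same pattern with ?= dropped: the parentheses now capture the suffix itself.
plain_group = re.search(r"\w+(_\w+_\w+\.log$)", path)
captured_suffix = plain_group.group(1)

print("lookahead match      :", matched_text)
print("left after the match :", remaining_text)
print("group 1 without ?=   :", captured_suffix)
```

So with the ?= removed, group 1 would be the date and process ID suffix rather than the job name.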
I did consider sticking with "app_1" at one point - but we have so many different job types (the examples are only a very small subset) - that extracting job name as the sourcetype from the filename would be more useful.
Cheers,
Bobby
Oh sorry, reading comprehension fail on my part.
Make that an answer and I'll upvote it.
Are you sure it's the answer to the question? I still can't tell.