Splunk Search

How to index a single data source and apply multiple sourcetypes based on the format of the log line?

Builder

I want to index our Apache error logs. There's just one nasty problem: there are multiple formats for events in the logs.

For example: PHP errors are formatted one way; Smarty errors are formatted another way. (There are 15 formats, each with 1-4 variants, in our logs.)

What I would like to do is send the entire file to a single index (e.g., Apache_error), but apply different sourcetypes based on the format of the log line.

I think I need to do something like the following in inputs.conf. Can someone confirm whether this is the right approach, and which parameter I need to specify the regex for each log line type?

[monitor:///apache/log/location/error_log]
{some sort of regex/transforms.conf/props.conf to extract only one format of line}
index=Apache_error
source=Apache_error
sourcetype={name of sourcetype to match log line format}  

{additional monitor stanzas as needed to cover all log line formats}

Or if this isn't the way to do it, what IS the right way to do it?

EDIT: Adding a sample of data (there are 15 total log line formats) and the search I wrote that extracts all the fields from the sample (using "rex"). It's UGLY, and I'm not sure it will translate cleanly to props.conf/transforms.conf, because these rex commands are somewhat nested: if you don't run them in the EXACT order shown in the search (or as near to exact as to make no nevermind), they stomp on each other, and I'm not sure how inheritance works with props.conf/transforms.conf.

Regex is NOT my strong point, so if anyone has suggestions on how to make the rex commands better, PLEASE tell me! 🙂 NOTE: For the formats that are identical except for the trailing ", referer: " bit (which is 100% optional on every log line), I tried every combination I could think of, and for an unknown reason (, referer: (?<error_referer>\S+)|\n) just does NOT work for them; I ended up having to write separate rex commands for each.
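One pattern that may handle the optional ", referer: " tail in a single rex (untested against the full data set, so treat it as a sketch): make error_message non-greedy and anchor an optional non-capturing group at end-of-line, e.g.:

```
rex field=_raw "\[client (?<error_client>\d+\.\d+\.\d+\.\d+)\] (?<error_message>.*?)(?:, referer: (?<error_referrer>\S+))?$"
```

The lazy `.*?` stops growing as soon as the optional referer group (or bare end-of-line) can match, so lines with and without the trailing referer are both captured by one expression.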

Sample (genericized) data:

TABULAR DATA NAME
Trying to open /export/sites/bondbuyer_05/data/import/tables/filename.txt
Error message at /export/sites/bondbuyer_05/bin/custom/tabular_data_converters/filexml line ###.
sh: -c: line 0: unexpected EOF while looking for matching `''
sh: -c: line 1: syntax error: unexpected end of file
ls: *[some-string]*: No such file or directory
[Mon Jan 19 12:23:38 2015] [notice] Apache configured -- resuming normal operations
[Mon Jan 19 12:23:38 2015] [notice] Digest: done
[Mon Jan 19 12:23:38 2015] [notice] Digest: generating secret for digest authentication ...
[Mon Jan 19 12:23:38 2015] [notice] Graceful restart requested, doing restart
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] Invalid Type in request (GET or POST or Secure or garbage) (relative or absolute URI or *) Secure-HTTP/(version) 200 OK
[Mon Jan 19 12:23:38 2015] [error] [client 10.110.70.254] client sent (error message)
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] script '/export/sites/requested/script/location' not found or unable to stat, referer: http://requested.script/referrer
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] script not found or unable to stat: /export/sites/requested/script/location, referer: http://requested.script/referrer
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] File does not exist: /export/sites/requested/object/location, referer: http://requested.object/referrer
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] user  not found: /relative/requested/URI
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] user username: authentication for "/relative/requested/URI": Password Mismatch, referer: http://requested.URI/referrer
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] Directory index forbidden by Options directive: /export/sites/requested/directory/location, referer: http://requested.directory/referrer
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] PHP Warning:  error message in /export/sites/requested/php/location.php on line ####, referer: http://referrer.url
[Mon Jan 19 12:23:38 2015] [error] [client 10.110.70.254] request failed: error message, referer: http://referrer.url
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] client denied by server configuration: /export/sites/path/to/requested/file, referer: http://referrer.url
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] (36)File name too long: access to  /relative/path/to/requested/file failed, referer: http://referrer.url
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] (13)Permission denied: access to  /relative/path/to/requested/file denied, referer: http://referrer.url
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] [Mon Jan 19 12:23:38 2015] [ZkError:Type] "Error message which may include portions wrapped in "double quotes" in /absolute/URI : eval()'d code at line ####" URI: http:///relative/URI/to.referrer APACHE: (Apache.cookie.value|--unset/empty--), referer: http://requested.object/referrer
[Mon Jan 19 12:23:38 2015] [error] [client 174.35.32.146] [Mon Jan 19 12:23:38 2015] [IPS_PHP:Type] "Error message in /absolute/URI at line ###" URI: http://internal.system.domain.com/full/URL/path APACHE: --unset/empty--

Search command to extract fields from all data (I've added comments referencing line numbers from the sample data above to explain which rex goes with which data):

#Lines 1-6
rex field=_raw "(?<error_message>.*)" | 
#Lines 7 and 10
rex field=_raw "\[(?<error_time>\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2} \d{4})\] \[(?<error_type>\w+)\] (?<error_message>.*)" | 
#Lines 8-9
rex field=_raw "\[(?<error_time>\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2} \d{4})\] \[(?<error_type>\w+)\] (?<error_class>[\w\s\d]+): (?<error_message>.*)" | 
#Lines 11-12
rex field=_raw "\[(?<error_time>\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2} \d{4})\] \[(?<error_type>\w+)\] \[client (?<error_client>\d+\.\d+\.\d+\.\d+)\] (?<error_message>.*)" | 
#Line 13
rex field=_raw "\[(?<error_time>\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2} \d{4})\] \[(?<error_type>\w+)\] \[client (?<error_client>\d+\.\d+\.\d+\.\d+)\] (?<error_message>.*)(, referer: (?<error_referrer>.*))" | 
#Lines 14-23
rex field=_raw "\[(?<error_time>\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2} \d{4})\] \[(?<error_type>\w+)\] \[client (?<error_client>\d+\.\d+\.\d+\.\d+)\] (|\(\d+\))(?<error_class>[\w\s\d]+): (?<error_message>.*)(, referer: (?<error_referrer>.*)|\n)" | 
#Line 24
rex field=_raw "\[(?<error_time>\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2} \d{4})\] \[(?<error_type>\w+)\] \[client (?<error_client>\d+\.\d+\.\d+\.\d+)\] \[\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2} \d{4}\] \[(?<error_class>ZkError:\S+)\] \"(?<error_message>.*)\" URI: (?<error_uri>\S+) APACHE: (?<error_apache>\S+), referer: (?<error_referrer>.*)" | 
#Line 25
rex field=_raw "\[(?<error_time>\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2} \d{4})\] \[(?<error_type>\w+)\] \[client (?<error_client>\d+\.\d+\.\d+\.\d+)\] \[\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2} \d{4}\] \[(?<error_class>IPS_PHP:\S+)\] \"(?<error_message>.*)\" URI: (?<error_uri>\S+) APACHE: (?<error_apache>\S+)" | 
table _raw, error_time, error_type, error_client, error_class, error_message, error_uri, error_apache, error_referrer

90% of our errors are in the format matching line 24.


Legend

You are sort of on the right track, but that's not the way it works.

On the forwarder, you will have an inputs.conf file that specifies the monitor stanza. Assign a sourcetype at this point and specify the index. Personally, I would assign the most common sourcetype in inputs.conf. But on the forwarder, inputs are managed as data blocks, not individual events. So your regex doesn't go in inputs.conf...

[monitor:///apache/log/location/error_log]
index=Apache_error

Once the data has arrived at the indexer, it can be parsed and the sourcetypes can be assigned based on a regex. The following props.conf and transforms.conf must be on the indexer. Assuming that you assigned the common sourcetype access_combined, you can do the assignment this way:

props.conf

[access_combined]
TRANSFORMS-assignSourcetype = assignPHP, assignSmarty, assignX

transforms.conf

[assignPHP]
REGEX = a regular expression to match PHP events
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::PHP_error

[assignSmarty]
REGEX = a regular expression to match Smarty events
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::Smarty_error
...

If an event does not match any of the regular expressions, the sourcetype will remain unchanged. Note that I removed the source= from your inputs.conf - I prefer that the source remain as the original filename, as it makes auditing and tracing the data sources easier.
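Given the sample data in the question, the ZkError format (the 90% case) could be keyed on its distinctive bracketed tag; a sketch, with illustrative stanza and sourcetype names:

```
[assignZkError]
REGEX = \[ZkError:[^\]]+\]
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::zk_error
```

Anchoring on a literal token like this is usually cheaper and less fragile than matching the whole line.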

Here is a link to Override sourcetypes on a per-event basis in the documentation.

Builder

I just updated the original post with additional detail...hope you guys will take another look.


Builder

I'd be satisfied (for the short-term, at least) with the props.conf/transforms.conf solution if I can figure out the inheritance/order of execution so that the sourcetypes are assigned correctly.

Is it assigned by the order of the stanzas in transforms.conf, or by the order that the stanzas are called in the props.conf TRANSFORMS parameter?


Builder

If I understand the question, this section of the documentation describes transform ordering as follows:

If you have a set of transforms that must be run in a specific order and which belong to the same host, source, or source type, you can place them in a comma-separated list within the same props.conf stanza. Splunk Enterprise will apply them in the specified order.
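In other words, the order is set by the TRANSFORMS list in props.conf, not by stanza order in transforms.conf. Applied to the earlier answer, a sketch (stanza names are illustrative):

```
# props.conf -- transforms run left to right in this list;
# for DEST_KEY rewrites, the last matching transform wins,
# so list more-specific patterns where you want them to take precedence
[access_combined]
TRANSFORMS-assignSourcetype = assignZkError, assignPHP, assignSmarty
```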


Builder

Maybe I'm not following this correctly, but this sounds like it'll apply the field extractions (from props.conf/transforms.conf) at search time rather than at index time. I'd much rather have the fields extracted at index time because search-time extractions are SLOW.

Am I reading this wrong?

P.S. Sorry it took so long to reply...other projects came up.


Builder

IMHO search-time extractions are the standard, and index-time extractions generally need an incredibly strong case for justification. If you are really experiencing slowness with basic search-time field extractions, perhaps other aspects of your searches, hardware, or deployment type are holding Splunk back. The fact that Splunk data is stored in a somewhat unstructured format, with the magic happening at search time, is one of its real strengths.


Splunk Employee

If search-time extractions are slow, then memory adjustments can help here.
However, DEST_KEY = MetaData:Sourcetype operations happen at index time, so you will want to put these props and transforms on your forwarder and indexing tiers.
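For reference, true index-time field extraction is configured on the parsing tier with WRITE_META plus a fields.conf entry; a minimal sketch with illustrative names (and note that Splunk's documentation generally recommends search-time extraction instead):

```
# transforms.conf (indexer or heavy forwarder)
[extract_error_client]
REGEX = \[client (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\]
FORMAT = error_client::$1
WRITE_META = true

# props.conf (same tier)
[apache_error]
TRANSFORMS-indexed = extract_error_client

# fields.conf (indexers and search heads)
[error_client]
INDEXED = true
```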


Builder

If I understand you correctly, putting these props.conf and transforms.conf settings on the forwarder (rather than the indexer) will do what I'm looking for, then?

From my experience as a sysadmin, "throw more hardware at it" is never an appropriate answer to performance problems. That's the answer for capacity problems. 🙂

The server that I was experiencing that pain on had 32GB of memory, and it would take in excess of 11,000ms to extract fields when searching more than 1,000,000 events (and we generate about 350,000 of those events per hour). We've since switched to index-time extractions and can search 50,000,000 of those events without any significant performance degradation, using about 10% of the memory that was required to search 1,000,000 events before.


Splunk Employee

transforms.conf
Safe to deploy on indexers, search heads, and heavy forwarders.
Contains the regex routines that props.conf references to operate on data and metadata.


example to change the index of the data based on regex in data

[rewriteindex]
REGEX = ^Message\sfrom\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
DEST_KEY = _MetaData:Index
FORMAT = myindex

example to change the sourcetype of the data based on regex in data

[rewritesourcetype]
REGEX = ^Message\sfrom\sIP\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::mysourcetype

example to change the host of the data based on regex in data

[rewritehost]
REGEX = ^Message\sfrom\sIP\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
DEST_KEY = MetaData:Host
FORMAT = host::$1

example to remove lines that start with a # from files
[TrashComments]
REGEX = ^\s*#
DEST_KEY = queue
FORMAT = nullQueue

example to drop data that matches a regex, like Event Codes
see also Windows Black List with Universal forwarder ver. 6 and up.

[dropwindowssecuritycodes]
REGEX = EventCode=(1111|4444|3333)
DEST_KEY = queue
FORMAT = nullQueue
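Note that none of these transforms fire until props.conf references them on the parsing tier; a hypothetical wiring for the examples above (sourcetype name is illustrative):

```
# props.conf -- attach the transforms to the sourcetype they should act on
[mysourcetype]
TRANSFORMS-routing = rewriteindex, rewritesourcetype, rewritehost, TrashComments
```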


Builder

Drat. We're using the Universal Forwarder, not the heavy forwarder.
