Is Splunk's "syslog-host" REGEX in $SPLUNK_HOME/etc/system/default/transforms.conf broken?

Esteemed Legend

In $SPLUNK_HOME/etc/system/default/ we find this troublesome configuration in transforms.conf:

[syslog-host]
DEST_KEY = MetaData:Host
REGEX = :\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s
FORMAT = host::$1

It is referenced by these sourcetype stanzas in props.conf:

########## EMAIL ##########

[postfix_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS-host = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Email
description = Output produced by the Postfix email server

[sendmail_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS = syslog-host
REPORT-syslog = sendmail-extractions
category = Email
description = Output produced by the Sendmail email server

########## OSs ##########

[linux_messages_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Operating System
description = Format found within the Linux log file /var/log/messages

[windows_snare_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
TIME_FORMAT = %b %d %H:%M:%S
category = Operating System
description = Output produced by the Snare syslog server on Windows

[syslog]
pulldown_type = true
maxDist = 3
TIME_FORMAT = %b %d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 32
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Operating System
description = Output produced by many syslog daemons, as described in RFC3164 by the IETF

########## ROUTERS AND FIREWALLS ##########

[cisco_syslog]
pulldown_type = 0
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions

This RegEx is far too permissive: the non-capturing group is quantified with *, so it may match zero times, and at its simplest the pattern reduces to this:

:\d\d\s+\[?(?<capture>\w[\w\.\-]{2,})\]?\s
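To check that reduction, here is a minimal Python sketch that runs both the shipped pattern and the reduced one against the two GET events quoted later in this post. It assumes Python's re module behaves like Splunk's PCRE engine for this particular pattern (the only syntax change is the named group), and the helper name first_group is made up for illustration.

```python
import re

# Shipped pattern, copied from the [syslog-host] stanza.
FULL = re.compile(
    r":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s"
)
# Reduced pattern. Python spells the named group (?P<capture>...)
# where PCRE also accepts (?<capture>...).
REDUCED = re.compile(r":\d\d\s+\[?(?P<capture>\w[\w\.\-]{2,})\]?\s")

# Two of the events quoted in this post, copied verbatim.
EVENTS = [
    "<13>2019-07-18T20:31:20.854753+00:00 GET login?hsgid=00000000-0000-0000-0 HTTP/1.1#015",
    "<13>2019-07-18T16:49:09.691477+00:00 GET / HTTP/1.1#015",
]


def first_group(pattern, event):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(event)
    return match.group(1) if match else None


for event in EVENTS:
    # Both patterns anchor on the ":00 " of the UTC offset and capture "GET".
    print(first_group(FULL, event), first_group(REDUCED, event))
```

In both events the only place where a colon, two digits, and whitespace line up is the end of the +00:00 offset, so the next word is what gets captured as the host.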

I am seeing it match logs like these and set the host value to the nonsensical GET and GGG:

<13>2019-07-18T20:31:20.854753+00:00 GET login?hsgid=00000000-0000-0000-0 HTTP/1.1#015
<13>2019-07-18T16:49:09.691477+00:00 GET / HTTP/1.1#015
<13>2019-07-17T20:28:52.087901+00:00 GGG

The problem is that I do not have any representative logs to see what it is really supposed to be doing. I suspect that the fix is to change the * to a + so it would be this:

:\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)+\[?(\w[\w\.\-]{2,})\]?\s
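As a rough way to compare the two quantifiers without representative logs, the sketch below runs the * and + variants against one event from this post and three made-up syslog lines. The three "hypothetical" samples (host myhost, the sshd text, and the layouts with a year or a facility.level token after the time) are assumptions for illustration, not real data, and Python's re module is assumed to match Splunk's PCRE behaviour for these patterns.

```python
import re

TAIL = r"\[?(\w[\w\.\-]{2,})\]?\s"
GROUP = r"(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)"

STAR = re.compile(r":\d\d\s+" + GROUP + "*" + TAIL)  # shipped default
PLUS = re.compile(r":\d\d\s+" + GROUP + "+" + TAIL)  # proposed change

SAMPLES = {
    # Copied from this post.
    "post event": "<13>2019-07-18T16:49:09.691477+00:00 GET / HTTP/1.1#015",
    # Hypothetical lines, invented for this comparison.
    "plain": "Jul 18 20:31:20 myhost sshd[1234]: session opened",
    "with year": "Jul 18 20:31:20 2019 myhost sshd[1234]: session opened",
    "with facility": "Jul 18 20:31:20 local0.info myhost sshd[1234]: session opened",
}


def host(pattern, event):
    """Return the captured host, or None when the pattern does not match."""
    match = pattern.search(event)
    return match.group(1) if match else None


results = {name: (host(STAR, e), host(PLUS, e)) for name, e in SAMPLES.items()}
for name, (star, plus) in results.items():
    print(f"{name:14} star={star!s:8} plus={plus}")
```

On these samples the + variant no longer captures GET, but it also stops matching the plain "timestamp host" layout, because nothing sits between the time and the host for the group to consume. If that layout is what the default is meant to handle, the * may be deliberate, and the change would need testing against real syslog data before being relied on.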

I do realize that the heart of the problem is that we should NOT be using the sourcetype value syslog. We are working to correct this, but you would not believe how many different things are in that sourcetype, so it is taking a long time.

Re: Is Splunk's "syslog-host" REGEX in $SPLUNK_HOME/etc/system/default/transforms.conf broken?

Communicator

I don't have an answer, but a question: the logs you quoted seem to be HTTP server logs rather than syslog messages. Is that the case?

I assume that syslog field extraction/transformation rules would not work to parse HTTP server logs.

Re: Is Splunk's "syslog-host" REGEX in $SPLUNK_HOME/etc/system/default/transforms.conf broken?

Esteemed Legend

Yes, as I said, the logs should not be there, but they are. That isn't really the point. The point is that this RegEx is so absurdly permissive that it cannot be correct.
