Getting Data In

Is Splunk's "syslog-host" REGEX in $SPLUNK_HOME/etc/system/default/transforms.conf broken?

woodcock
Esteemed Legend

In $SPLUNK_HOME/etc/system/default/ we find this troublesome configuration in transforms.conf:

[syslog-host]
DEST_KEY = MetaData:Host
REGEX = :\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s
FORMAT = host::$1

It is referenced by these stanzas in props.conf:

########## EMAIL ##########

[postfix_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS-host = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Email
description = Output produced by the Postfix email server

[sendmail_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS = syslog-host
REPORT-syslog = sendmail-extractions
category = Email
description = Output produced by the Sendmail email server

########## OSs ##########

[linux_messages_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Operating System
description = Format found within the Linux log file /var/log/messages

[windows_snare_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
TIME_FORMAT = %b %d %H:%M:%S
category = Operating System
description = Output produced by the Snare syslog server on Windows

[syslog]
pulldown_type = true
maxDist = 3
TIME_FORMAT = %b %d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 32
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Operating System
description = Output produced by many syslog daemons, as described in RFC3164 by the IETF

########## ROUTERS AND FIREWALLS ##########

[cisco_syslog]
pulldown_type = 0
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions

This RegEx is far too permissive because the middle group is quantified with * and can match zero times, so at its simplest it reduces to this:

:\d\d\s+\[?(?<capture>\w[\w\.\-]{2,})\]?\s

I am seeing it match logs like these and set the host value to the nonsensical GET and GGG:

<13>2019-07-18T20:31:20.854753+00:00 GET login?hsgid=00000000-0000-0000-0 HTTP/1.1#015
<13>2019-07-18T16:49:09.691477+00:00 GET / HTTP/1.1#015
<13>2019-07-17T20:28:52.087901+00:00 GGG
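To reproduce this outside Splunk, here is a minimal Python sketch that runs the stock REGEX against the first two sample events above. It assumes Python's re module treats this particular pattern the same way Splunk's PCRE engine does, and extract_host is just a helper name made up for the illustration.

```python
import re

# REGEX copied verbatim from the [syslog-host] stanza quoted above.
SYSLOG_HOST = re.compile(
    r":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s"
)


def extract_host(event):
    """Return the first capture group, or None when the pattern does not match."""
    match = SYSLOG_HOST.search(event)  # unanchored search, like a Splunk transform
    return match.group(1) if match else None


# The first two sample events from this post, copied as-is.
samples = [
    "<13>2019-07-18T20:31:20.854753+00:00 GET login?hsgid=00000000-0000-0000-0 HTTP/1.1#015",
    "<13>2019-07-18T16:49:09.691477+00:00 GET / HTTP/1.1#015",
]

hosts = [extract_host(s) for s in samples]
print(hosts)
```

In both events the trailing ":00 " of the timezone offset satisfies the leading :\d\d\s+, the optional group matches nothing, and the next three-or-more-character token (GET) becomes the capture.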

The problem is that I do not have any representative logs to see what it is really supposed to match. I suspect that the fix is to change the * after the optional group to a +, so that it would be this:

:\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)+\[?(\w[\w\.\-]{2,})\]?\s
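Lacking representative logs, one way to sanity-check the + variant is to compare it with the stock pattern on a few made-up lines. The sketch below does that in Python. Only the first event is real (it is one of the samples above); the other three are hypothetical syslog-style lines invented for this comparison, and the names STAR, PLUS and host_from are likewise made up. It also assumes Python's re module behaves like Splunk's PCRE for these patterns.

```python
import re

# Stock pattern (optional group quantified with *).
STAR = re.compile(
    r":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s"
)
# Proposed variant (same group quantified with +).
PLUS = re.compile(
    r":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)+\[?(\w[\w\.\-]{2,})\]?\s"
)


def host_from(pattern, event):
    """Return the captured host, or None when the pattern does not match."""
    match = pattern.search(event)
    return match.group(1) if match else None


events = {
    # Real sample from this post.
    "http": "<13>2019-07-18T16:49:09.691477+00:00 GET / HTTP/1.1#015",
    # Hypothetical lines, NOT taken from real logs.
    "plain": "Jul 18 16:49:09 myhost sshd[123]: session opened",
    "year": "Jul 18 16:49:09 2019 myhost sshd[123]: session opened",
    "facility": "Jul 18 16:49:09 local0.info myhost sshd[123]: session opened",
}

# Map each label to (host from stock pattern, host from + variant).
results = {
    name: (host_from(STAR, event), host_from(PLUS, event))
    for name, event in events.items()
}
for name, pair in results.items():
    print(name, pair)
```

If these hypothetical lines resemble what the transform is meant to handle, the + variant does stop the bogus GET capture, but it also stops matching a line where the host directly follows the timestamp, so it may be stricter than intended.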

I do realize that the heart of the problem is that we should NOT be using the sourcetype value syslog for this data, and we are working to correct this, but you would not believe how many different things are in that sourcetype, so it is taking a long time.

ww9rivers
Contributor

I don't have an answer, but a question: the logs you quoted seem to be HTTP server logs rather than syslog messages. Is that the case?

I assume that syslog field extraction/transformation rules would not work to parse HTTP server logs.

woodcock
Esteemed Legend

Yes, as I said, the logs should not be there, but they are. That isn't really the point. The point is that this RegEx is so absurdly permissive that it cannot be correct.
