In $SPLUNK_HOME/etc/system/default/ we find this troublesome configuration in transforms.conf:
[syslog-host]
DEST_KEY = MetaData:Host
REGEX = :\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s
FORMAT = host::$1
It is referenced by these stanzas in props.conf:
########## EMAIL ##########
[postfix_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS-host = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Email
description = Output produced by the Postfix email server
[sendmail_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS = syslog-host
REPORT-syslog = sendmail-extractions
category = Email
description = Output produced by the Sendmail email server
########## OSs ##########
[linux_messages_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Operating System
description = Format found within the Linux log file /var/log/messages
[windows_snare_syslog]
pulldown_type = 1
MAX_TIMESTAMP_LOOKAHEAD = 32
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
TIME_FORMAT = %b %d %H:%M:%S
category = Operating System
description = Output produced by the Snare syslog server on Windows
[syslog]
pulldown_type = true
maxDist = 3
TIME_FORMAT = %b %d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 32
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Operating System
description = Output produced by many syslog daemons, as described in RFC3164 by the IETF
########## ROUTERS AND FIREWALLS ##########
[cisco_syslog]
pulldown_type = 0
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
TIME_FORMAT = %b %d %H:%M:%S
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
This RegEx is far too permissive because, when the optional group matches zero times, it reduces to this:
:\d\d\s+\[?(?<capture>\w[\w\.\-]{2,})\]?\s
I am seeing that it matches logs like these, setting the host value to the nonsensical GET and GGG:
<13>2019-07-18T20:31:20.854753+00:00 GET login?hsgid=00000000-0000-0000-0 HTTP/1.1#015
<13>2019-07-18T16:49:09.691477+00:00 GET / HTTP/1.1#015
<13>2019-07-17T20:28:52.087901+00:00 GGG
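The behavior can be reproduced outside Splunk with a small Python sketch. It runs the REGEX from the [syslog-host] stanza against two of the events quoted above. Python's re module is not the PCRE engine Splunk uses, but this pattern uses only constructs the two share, so the result should carry over. The helper name extract_host is made up for illustration.

```python
import re

# The REGEX from the [syslog-host] stanza, copied verbatim.
SYSLOG_HOST = re.compile(
    r":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s"
)

# Two of the events quoted above.
events = [
    "<13>2019-07-18T20:31:20.854753+00:00 GET login?hsgid=00000000-0000-0000-0 HTTP/1.1#015",
    "<13>2019-07-18T16:49:09.691477+00:00 GET / HTTP/1.1#015",
]


def extract_host(event):
    """Return the first capture group, or None when the pattern does not match."""
    match = SYSLOG_HOST.search(event)
    return match.group(1) if match else None


hosts = [extract_host(event) for event in events]
for event, host in zip(events, hosts):
    print(host, "<-", event)
```

Both events yield GET. The leading :\d\d\s+ looks like it was written for the seconds field of an HH:MM:SS timestamp, but in these events it latches onto the :00 at the end of the +00:00 timezone offset, and the next word-like token is then taken as the host.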
The problem is that I do not have any representative logs to see what it is really supposed to be doing. I suspect that the fix is to change the * to a + so it would be this:
:\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)+\[?(\w[\w\.\-]{2,})\]?\s
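As a rough check of what that change would do, here is a Python sketch comparing the original pattern with the + variant. Only the "http" sample comes from this thread. The "plain" and "facility" lines are hypothetical, invented to resemble classic syslog output, since no representative logs are available. The helper name host_for is also made up.

```python
import re

# Original pattern from transforms.conf (optional group repeated with *).
ORIGINAL = re.compile(
    r":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s"
)
# Proposed variant (same group, but required at least once with +).
PROPOSED = re.compile(
    r":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)+\[?(\w[\w\.\-]{2,})\]?\s"
)

samples = {
    # Quoted in this thread.
    "http": "<13>2019-07-18T16:49:09.691477+00:00 GET / HTTP/1.1#015",
    # Hypothetical: timestamp followed directly by the host name.
    "plain": "Jul 18 16:49:09 myhost sshd[123]: session opened",
    # Hypothetical: a facility.priority token between timestamp and host.
    "facility": "Jul 18 16:49:09 local7.info myhost sshd[123]: session opened",
}


def host_for(pattern, line):
    """Return the captured host, or None when the pattern does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None


results = {
    name: (host_for(ORIGINAL, line), host_for(PROPOSED, line))
    for name, line in samples.items()
}
for name, (before, after) in results.items():
    print(f"{name:9} original={before!r} proposed={after!r}")
```

Under these assumptions the + variant stops extracting GET from the HTTP line, but it also stops matching the "plain" line where the host follows the timestamp directly. If real syslog events in the sourcetype look like that, the + change would silently disable host extraction for them, so it is worth testing against actual data before deploying an override.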
I do realize that the heart of the problem is that we should NOT be using a sourcetype value of syslog, and we are working to correct this, but you would not believe how many different things are in that sourcetype, so it is taking a long time.
I don't have an answer, but a question: the logs you quoted seem to be HTTP server logs rather than syslog messages. Is that the case?
I assume that syslog field extraction/transformation rules would not work to parse HTTP server logs.
Yes, as I said, the logs should not be there, but they are. That isn't really the point. The point is that this RegEx is so absurdly permissive that it cannot be correct.