While there is always more Splunk work to be done than I will ever have time to actually accomplish, I have several dashboards that I look at to be a bit proactive about our environment, and I have been slowly working through the issues they surface. One of those dashboards looks for logs whose hostname is one of our central syslog servers, excluding the index where those servers' own system logs belong; this lets me find instances where the hostname extraction is not working. Today's example was a two-character ASCII hostname.
The regular expression in question is the syslog-host extraction in $SPLUNK_HOME/etc/system/default/transforms.conf
:\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s
The syslog-host-full extraction appears to be affected as well.
Just to make this easier, an example log message (anonymized) might be:
Sep 10 08:22:21 xx arpwatch: bogon 10.0.0.12 0:AA:BB:CC:DD:EE
The hostname should be xx, but the extraction isn't working.
The portion of the regex that actually extracts the hostname is (\w[\w\.\-]{2,})
In English, that requires one word character followed by at least two more characters, each of which is a word character, a period, or a dash. In other words, it only matches hostnames of three or more characters.
My first guess as to why it was written that way is that they were trying to impose a minimum length (Debian's installer enforces (enforced?) a two-character minimum, and I know that at least at one point Ubuntu enforced a three-character minimum).
My best guess, however, is that they could possibly be trying to require that a period is always followed by another character. The problem is that without comments or example log messages, I have no way of knowing why it was written this way.
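To make the failure concrete, here is a minimal Python sketch that runs the regex quoted above against the example log line. It assumes Python's re module behaves the same as Splunk's regex engine for this particular pattern, and the second log line (with the hostname web01) is made up purely for comparison.

```python
import re

# syslog-host regex as quoted above from system/default/transforms.conf
SYSLOG_HOST = re.compile(
    r":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]{2,})\]?\s"
)


def extract_host(line):
    """Return the captured host, or None when the regex does not match."""
    m = SYSLOG_HOST.search(line)
    return m.group(1) if m else None


two_char = "Sep 10 08:22:21 xx arpwatch: bogon 10.0.0.12 0:AA:BB:CC:DD:EE"
# Hypothetical line with a longer hostname, for comparison only.
longer = "Sep 10 08:22:21 web01 arpwatch: bogon 10.0.0.12 0:AA:BB:CC:DD:EE"

print(extract_host(two_char))  # None
print(extract_host(longer))    # web01
```

The two-character host produces no match at all, which is consistent with the events keeping the syslog server's name as their host.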
My assumption towards rules is:
Thus I think I'd change the host portion to [a-zA-Z0-9\-.]*[a-zA-Z0-9]
(so the full regex would be :\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?([a-zA-Z0-9\-.]*[a-zA-Z0-9])\]?\s)
Under the robustness principle, you could argue that Splunk should err on the side of being liberal, enforcing the rules just enough to help limit false positives, and thus the regex should be the more liberal [\w.\-]+
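A rough way to compare the three host patterns discussed so far is to test each one in isolation against a few hostnames. This is only a sketch: it uses re.fullmatch on the host portion alone rather than the full syslog regex, and the sample hostnames are made up to hit the edge cases.

```python
import re

# Host portions discussed above, tested in isolation with fullmatch.
CANDIDATES = {
    "current": r"\w[\w\.\-]{2,}",
    "proposed": r"[a-zA-Z0-9\-.]*[a-zA-Z0-9]",
    "liberal": r"[\w.\-]+",
}

# Made-up hostnames chosen to exercise the edge cases.
SAMPLES = ["a", "xx", "web01", "host.", "my_host"]


def accepted(pattern):
    """Return the sample hostnames that the pattern matches in full."""
    return [h for h in SAMPLES if re.fullmatch(pattern, h)]


results = {name: accepted(p) for name, p in CANDIDATES.items()}
for name, hosts in results.items():
    print(name, hosts)
# current ['web01', 'host.', 'my_host']
# proposed ['a', 'xx', 'web01']
# liberal ['a', 'xx', 'web01', 'host.', 'my_host']
```

The proposed pattern is the only one that rejects a trailing period, and it also rejects underscores, which the current pattern accepts through \w. Whether that last difference matters depends on what hostnames show up in your data.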
I ran a query to look at our syslog data and display where host != new_host and the results looked correct.
index=* a_syslog_filter
| rex ":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(?<new_host>[a-zA-Z0-9\-.]*[a-zA-Z0-9])\]?\s"
| dedup host new_host | sort host new_host
| table index sourcetype host new_host _raw
| where host!=new_host
The other verification query I ran was:
| metadata type=hosts | regex host!="[a-zA-Z0-9\-.]*[a-zA-Z0-9]"
The thought here was to look for hostnames which would not match using the regular expression.
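One thing worth double-checking with this query: as far as I know, the regex command matches anywhere in the field rather than against the whole value, so an unanchored pattern would only flag hosts that contain no letter or digit at all. The Python sketch below illustrates the difference under that assumption, using re.search for the unanchored case and re.fullmatch for the anchored one. The host values are made up.

```python
import re

PATTERN = r"[a-zA-Z0-9\-.]*[a-zA-Z0-9]"

# Made-up host values, some of which should not pass as hostnames.
HOSTS = ["xx", "web01.example.com", "bad_host", "trailing.", "---"]

# Unanchored: a host is flagged only if the pattern matches nowhere in it.
flagged_unanchored = [h for h in HOSTS if not re.search(PATTERN, h)]

# Anchored: a host is flagged unless the whole value matches.
flagged_anchored = [h for h in HOSTS if not re.fullmatch(PATTERN, h)]

print(flagged_unanchored)  # ['---']
print(flagged_anchored)    # ['bad_host', 'trailing.', '---']
```

If the assumption holds, wrapping the pattern in ^ and $ in the metadata query should give the stricter, anchored behavior.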
If anyone runs those queries and sees an issue with the suggestion, I'd love to see the examples!
The reason Splunk should be defensive about the hostname is that syslog "formats" vary a good deal, and people end up with random strings extracted as the host which creates significant problems for users.
It makes perfect sense for you to specialize this extraction to what makes sense in your environment, but if anything there is a good deal more pressure to make the host extraction an explicit opt-in.
The implementation is not broken. It's an attempt to "work well out of the box" in the face of messy data, which reflects Splunk's history from back when we were adding many sourcetypes out of the box. It may not be the best solution in the current era, but shooting for 100% correct leads to shipping no sourcetypes out of the box. It's not as if even 90% of syslog data follows its respective RFC. There is value in being permissive, but there's a lot more value in not doing host extractions out of the box at all without admin opt-in.
RFC 1035 limits each label of a hostname to 1 to 63 octets (the zero-length label is reserved for the root), thus I think it is pretty clear the current implementation is broken. The change comes down to risk vs. reward. We operate more like a service provider than a customer, so we're a bit risk averse and thus are asking the community whether they see issues. While there's potentially more risk if Splunk makes the change, the alternative is shipping known-bad configurations, raising the effort required for a good install (especially since host extraction is done at index time).
Note that you could always override the extraction in a /local/transforms.conf using your own regex.
I fully intend to submit something via the support portal, assuming no one raises an objection. I wanted to vet this through the community first to see whether there is a better way to do it or whether I overlooked something. We have a support contract, and I've put in more than my fair share of tickets (to be fair, we have a VERY diverse environment, so if there's an issue, we're quite likely to hit it).
Beautifully argued. Please submit to Splunk Support, as this is the official way to get things fixed! (Submit a case at the bottom right of this page: http://www.splunk.com/support - it doesn't matter if you have a support contract or not.)
Also, I think there is a historical reason for this - up until about 2008, several Linux distributions enforced a minimum hostname length. Some may still do so. I'm not saying that is right, just saying that may be why.