I'm collecting all syslog messages from my datacenter on a central rsyslog server.
rsyslog splits the messages into the following directory structure:
/log/yyyy/mm/dd/host/service.log
The service is extracted from the syslog message, so messages from the same daemon are grouped in one log file.
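To make the layout concrete, here is a minimal Python sketch of how a message maps to a file under this scheme. It is not the rsyslog configuration itself; the function name, host name and date are made up for illustration.

```python
from datetime import datetime

def log_path(ts, host, service):
    # Mirrors /log/yyyy/mm/dd/host/service.log (illustrative helper, not rsyslog code)
    return f"/log/{ts:%Y/%m/%d}/{host}/{service}.log"

# Hypothetical host and date
path = log_path(datetime(2010, 2, 11), "web01", "postfix")
print(path)

# The host is the 5th path segment, which is what the monitor input extracts
print(path.split("/")[5])
```

Every distinct (day, host, service) combination produces its own file, so the number of files grows quickly.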
I have one monitor input that watches the whole tree. The host is extracted from the path. The sourcetype is left as "automatic", the idea being that Splunk can analyse every log file and work out whether it is a postfix, apache, snmp, cron, ... logfile.
It works quite well, but all sourcetypes come out as xxx-too_small
(e.g. postfix-too_small, snmpd-too_small, ...)
I suspect that because we start a new logfile for every host, service and day, a new file contains only one or two events at midnight. Splunk sees this new file, tries to work out what it is, gets it about right, but tags the sourcetype with "too_small" because there are fewer than 100 events.
My questions:
Many thanks in advance for any tips & tricks!
A couple of months later, I have learned some more.
The file split described above for the central syslog proved to be a disaster for Splunk. Somehow, it generated thousands of sourcetypes (because syslog generated thousands of different service names). This left the Splunk indexes completely fubar (any single search consumed all the CPU).
Fresh start: we now keep standard syslog messages in a separate tree (for archiving purposes) and dump everything else into one syslog file per host. These files are rotated regularly and discarded after two rotations (the data is in Splunk and in the separate archive).
This now looks much better. The sourcetype is fixed to syslog. Not as fun as automatic sourcetype detection, but hey, these really are syslog messages...
I've also just read the following blog entry: http://blogs.splunk.com/2010/02/11/sourcetypes-gone-wild/ which explains how I could now assign a different sourcetype per event from this single syslog stream, and probably reroute the events to different indexes...
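As I understand it, the idea is one regex match per event, with the result deciding the sourcetype and index. Below is a rough Python sketch of that logic only, not Splunk configuration. The sourcetype names, index names and sample lines are hypothetical, and it assumes classic "Mmm dd hh:mm:ss host program[pid]:" syslog lines.

```python
import re

# Classic BSD syslog line: "Mmm dd hh:mm:ss host program[pid]: message"
SYSLOG_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s(?P<host>\S+)\s"
    r"(?P<program>[^\s\[:]+)(?:\[\d+\])?:"
)

# Hypothetical mapping: daemon name -> (sourcetype, index)
RULES = {
    "postfix": ("postfix_syslog", "mail"),
    "snmpd": ("snmpd", "network"),
}
DEFAULT = ("syslog", "main")

def classify(line):
    """Return (sourcetype, index) for one syslog event."""
    m = SYSLOG_RE.match(line)
    if not m:
        return DEFAULT
    # "postfix/smtpd" and "postfix/qmgr" both collapse to "postfix"
    daemon = m.group("program").split("/")[0]
    return RULES.get(daemon, DEFAULT)

print(classify("Feb 11 00:00:01 web01 postfix/smtpd[1234]: connect from unknown"))
print(classify("Feb 11 00:00:02 web01 CRON[99]: (root) CMD (run-parts)"))
```

Using a short, fixed list of rules with a default keeps the number of sourcetypes bounded, unlike the one-sourcetype-per-service-name situation described above.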
Question: how expensive is it to run regexp on every event during indexing?
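For a rough feel of the order of magnitude, a single compiled regex can be timed against a sample event. This is only a sketch in Python: its regex engine is not the one Splunk uses, the pattern and the sample event are made up, and the numbers say nothing about the rest of the indexing pipeline.

```python
import re
import timeit

# Hypothetical pattern and sample event
PATTERN = re.compile(r"\spostfix/\w+\[\d+\]:")
EVENT = "Feb 11 00:00:01 web01 postfix/smtpd[1234]: connect from unknown[10.0.0.1]"

def per_event_seconds(pattern, event, n=100_000):
    """Average wall-clock time of one regex search over n repetitions."""
    total = timeit.timeit(lambda: pattern.search(event), number=n)
    return total / n

if __name__ == "__main__":
    cost = per_event_seconds(PATTERN, EVENT)
    print(f"{cost * 1e6:.2f} microseconds per event")
```

Multiplying the per-event figure by the number of rules and the event rate gives a crude upper bound to compare against the indexer's spare CPU.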