
sourcetype "too_small" and log rotation on rsyslog server

grundsch

I'm collecting all syslog messages from my datacenter on a central rsyslog server.
rsyslog splits the messages into the following directory structure:
/log/yyyy/mm/dd/host/service.log

service is extracted from the syslog message, so messages from the same daemon are grouped in one log file.
I have one monitor input covering the whole tree. The host is extracted from the path. The sourcetype is left as "automatic", the idea being that Splunk analyses every log file and works out whether it is a postfix, apache, snmp, cron, ... logfile.
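For illustration, a monitor input like the one described above might look roughly like this. It is only a sketch, assuming the tree is rooted at /log exactly as in the path shown earlier; the stanza is not copied from the actual configuration.

```ini
# inputs.conf (sketch, assuming the tree lives under /log)
[monitor:///log]
# Path layout is /log/yyyy/mm/dd/host/service.log,
# so the host name is the 5th path segment.
host_segment = 5
# No "sourcetype" line here: Splunk is left to guess it per file.
```

Leaving out the sourcetype line is what hands each file to the automatic sourcetype detection.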

It works quite well, but all sourcetypes are xxx-too_small

(e.g. postfix-too_small, snmpd-too_small, ...)

I suspect that, because we start a new logfile for every host, service and day, a new file contains only one or two events at midnight. Splunk sees this new file, tries to work out what it is, gets it roughly right, but tags the sourcetype with "too_small" because the file holds fewer than 100 events.

My questions:

  • How can I suppress this "too_small" suffix?
  • How are those of you with central syslog servers handling such a setup? (I suppose I'm not the only one indexing a central syslog server.) In particular, how do you handle the creation of new log files (i.e. new sources from Splunk's point of view) that contain only a few events?

Many thanks in advance for any tips & tricks!

grundsch

A couple of months later, I have learned some more.

  • The above file split for the central syslog proved to be a disaster for Splunk. It generated thousands of sourcetypes (because syslog produced thousands of different service names). This led to the Splunk indexes being completely fubar (any single search consumed all the CPU).

  • Fresh start: we now keep standard syslog messages in a separate tree (for archiving purposes) and dump everything else into one syslog file per host. These files are rotated regularly and discarded after two rotations (the data is in Splunk and in the separate archive).

This now looks much better. The sourcetype is fixed to syslog. Not as fun as automatic sourcetype detection, but hey, these really are syslog messages...
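As a rough sketch, an input for the per-host layout with a fixed sourcetype could look like the following. The directory /var/log/remote/<host>/ is a hypothetical path chosen for the example (the real location is not given here), and the blacklist line assumes rotated files get compressed.

```ini
# inputs.conf (sketch; /var/log/remote/<host>/... is a hypothetical layout)
[monitor:///var/log/remote]
# var=1, log=2, remote=3, <host>=4
host_segment = 4
# Fixed sourcetype: no automatic detection, so no "-too_small" suffix.
sourcetype = syslog
# Assumption: rotated files are gzipped and should not be read again.
blacklist = \.gz$
```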

I've also just read the following blog entry: http://blogs.splunk.com/2010/02/11/sourcetypes-gone-wild/ which explains how I could now assign a different sourcetype per event within this single syslog stream, and probably reroute the events to different indexes...
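A minimal sketch of that per-event approach, using index-time transforms, might look like this. It is not taken from the blog entry; the stanza names, the postfix_syslog sourcetype, the regex and the "mail" index are all assumptions made up for the example, and the target index would have to exist already.

```ini
# props.conf (sketch: applies to events arriving with sourcetype "syslog")
[syslog]
TRANSFORMS-retype = syslog_to_postfix
TRANSFORMS-route = postfix_to_mail_index

# transforms.conf (stanza names below are hypothetical)
[syslog_to_postfix]
# Matches e.g. " postfix/smtpd[1234]:" in the syslog line
REGEX = \spostfix/[\w-]+\[\d+\]:
FORMAT = sourcetype::postfix_syslog
DEST_KEY = MetaData:Sourcetype

[postfix_to_mail_index]
REGEX = \spostfix/[\w-]+\[\d+\]:
# "mail" is a hypothetical index name
FORMAT = mail
DEST_KEY = _MetaData:Index
```

Both transforms hang off the [syslog] stanza because the event still carries its original sourcetype when the index-time rules are looked up. Each extra transform is one more regex evaluated against every syslog event.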

Question: how expensive is it to run a regexp on every event during indexing?
