I need help troubleshooting an issue where events are going missing from a feed forwarded by a Linux syslog daemon to my heavy forwarders. Beginning the first day of each month, for three or four days, this feed drops from ~50,000 indexed events per hour to maybe ~150. Then, magically, the feed resumes at ~50,000 events per hour for the remainder of the month. Only this one index source is affected. All traffic is UDP.

To troubleshoot, I have:

- Removed the load balancer from the equation and sent directly to one heavy forwarder.
- Confirmed the syslog events are leaving the source server.
- Used tcpdump to confirm events from the source server are hitting port 514 on the heavy forwarder (capture command sketched below).
- Checked a dashboard that shows blocking on the aggregation, indexing, parsing, and typing queues. There is none (simplified searches below).
- Tested the regexes in my transforms against the actual events captured with tcpdump. All test correctly.
- Opened a support case, while the event was in progress, with diag logs from the HWF and one of my indexer servers (nothing yet).
- Checked the internal logs on the heavy forwarder used in this test: no errors or warnings.
- Looked at all the log channels on the HWF (1,236 of them), but I don't know which one(s) to elevate the logging level for (see below for what I've tried).
- Tried starting Splunk with --debug (splunk start --debug), but I do not see any additional internal logging. I may not have done this correctly.

Stranger still:

- There are two syslog feeds from the source server being used in this troubleshooting effort. The second feed is unaffected.
- There are 78 source servers in this group. All exhibit the same behavior, which makes it seem that Splunk is the common denominator.

I do use a props and transforms configuration for port 514 to parse out the index name and sourcetype for a multitude of incoming syslog feeds bound for different indexes (a rough sketch is below). This configuration has not changed for a very long time, and it certainly does not change at the first of each month and then change back a few days later.

Frankly, I'm lost. There must be a way to expose what is happening to these events, either on the heavy forwarder or on the indexers, but I'm out of ideas. Does anyone have a thought about how I might capture the information I need to diagnose whatever is happening?

At this time the feed has returned to normal, i.e. ~50,000 indexed events per hour.

Thank you in advance for any advice you have.
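Supporting details referenced above. The tcpdump check on the heavy forwarder was essentially the following (interface name, source IP, and packet count are placeholders, not my real values):

    # Watch raw syslog packets from one source server arriving on the HWF's UDP 514
    tcpdump -nn -i eth0 -A -c 200 'udp port 514 and src host 192.0.2.10'

    # Or write a capture to disk so the payloads can be tested against the transforms later
    tcpdump -nn -i eth0 -s 0 -w /tmp/syslog_514.pcap 'udp port 514 and src host 192.0.2.10'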
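The searches behind the volume and queue monitoring are roughly the following (index, sourcetype, and host names are placeholders, and the real dashboard is more elaborate). Indexed events per hour for the affected feed:

    index=my_syslog_index sourcetype=my_syslog_sourcetype
    | timechart span=1h count

Queue blocking on the HWF, from metrics.log:

    index=_internal host=my_hwf source=*metrics.log* group=queue blocked=true
        name IN (aggqueue, indexqueue, parsingqueue, typingqueue)
    | timechart span=10m count by name

And, if I'm reading metrics.log right, what the HWF itself reports passing through for that sourcetype:

    index=_internal host=my_hwf source=*metrics.log* group=per_sourcetype_thruput series=my_syslog_sourcetype
    | timechart span=1h sum(ev) AS events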
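A simplified sketch of the props/transforms routing on port 514 (stanza names, regexes, index names, and sourcetypes here are made up for illustration; the real config covers many feeds):

    # props.conf on the HWF
    [source::udp:514]
    TRANSFORMS-routing = route_fw_syslog_index, route_fw_syslog_sourcetype

    # transforms.conf
    # Send events whose host field matches this pattern to a specific index
    [route_fw_syslog_index]
    REGEX = ^\S+\s+\S+\s+\S+\s+fw\d+
    DEST_KEY = _MetaData:Index
    FORMAT = network_fw

    # ...and give them a specific sourcetype
    [route_fw_syslog_sourcetype]
    REGEX = ^\S+\s+\S+\s+\S+\s+fw\d+
    DEST_KEY = MetaData:Sourcetype
    FORMAT = sourcetype::fw_syslog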
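Finally, this is how I have been trying to raise logging on individual channels, which is where I could really use pointers. The channel name below is just an example I picked, not necessarily the right one:

    # List candidate channels and their current levels
    grep -iE 'udp|parser|aggregator|regex' $SPLUNK_HOME/etc/log.cfg

    # Temporarily raise one channel
    $SPLUNK_HOME/bin/splunk set log-level DateParserVerbose -level DEBUG

    # Or make it persistent in $SPLUNK_HOME/etc/log-local.cfg, then restart:
    #   category.DateParserVerbose=DEBUG

If anyone knows which channel(s) would show an event being dropped or rerouted between the UDP input and the indexing pipeline, that would help enormously.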