Hello there.
I'm having a performance problem. I have a "central UF" which is supposed to ingest MessageTracking logs from several Exchange servers. As you can guess from the "several Exchange servers" part, the logs are shared over CIFS shares (the hosts are in the same domain; and, to make things more complicated to debug, only the service account the UF runs under has access to those shares, while my administrator account doesn't :-)).
Anyway, since there are several Exchange instances and each of the directories contains quite a lot of files, the UF sometimes gets "clogged" and, especially after a restart, needs a lot of time to check all the logfiles, decide that it doesn't need to ingest most of them, and start forwarding real data. To make things more annoying, since the forwarder's own logs are ingested by the same monitor input mechanism, until this process completes I don't even get _internal entries from this host and have to check the physical log files on the forwarder machine to do any debugging or monitoring.
The Windows events, on the other hand, start getting forwarded right after the forwarder restarts.
So I'm wondering whether I can do anything to improve the efficiency of this ingestion process.
I know that the "officially recommended" way would be to install forwarders on each of the Exchange servers and ingest the files locally from there, but due to organizational limitations that's out of the question (at least at the moment). So I'm stuck with just this one UF.
I already raised the thruput limit, but judging from metrics.log it's not an issue of output throttling and queue blocking.
I raised the number of ingestion pipelines to 2, and my file descriptor limit is currently set to 2000.
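For reference, those settings live in limits.conf and server.conf on the forwarder; roughly like this (the maxKBps value below is just an example, since I didn't note down the exact number I used; the other two are my actual values):

```ini
# limits.conf on the forwarder
[thruput]
# example value only: 0 = unlimited; I raised this from the UF default of 256
maxKBps = 0

[inputproc]
# file descriptor limit for the tailing processor
max_fd = 2000

# server.conf on the forwarder
[general]
parallelIngestionPipelines = 2
```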
The typical single directory monitor input definition looks something like this:
[monitor://\\host1\mtrack$\]
disabled = 0
whitelist = \.LOG$
host = host1
sourcetype = MSExchange:2013:MessageTracking
index = exchange
ignoreOlderThan = 3d
_meta = site::site1
And I have around 14, maybe 16 of those stanzas to monitor. Which means that when I do splunk list inputstatus I'm seeing around 500k files (most of them get ignored in the end, but each one still has to be checked first for modification time and possibly for CRC)!
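Just to illustrate what each scan cycle has to do: every file in a monitored directory gets stat-ed before the whitelist can even discard it. A tiny local stand-in for one share (the directory and file names are made up; in reality it's an SMB path with tens of thousands of files):

```shell
# Stand-in demo: a temp directory playing the role of one mtrack$ share.
dir=$(mktemp -d)
touch "$dir/MSGTRK2024010100-1.LOG" "$dir/MSGTRK2024010200-1.LOG" "$dir/notes.txt"

# The whitelist (\.LOG$) means only the .LOG files are candidates for ingestion,
# but the monitor still has to stat every file in the directory first.
total=$(find "$dir" -type f | wc -l | tr -d ' ')
matching=$(find "$dir" -type f -name '*.LOG' | wc -l | tr -d ' ')
echo "total=$total matching=$matching"

rm -rf "$dir"
```

Multiply that per-file stat cost by ~500k files over CIFS and you get an idea of why the startup scan takes so long.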
I think I will have to tell the customer that this is simply beyond the performance limits of any single machine (especially when all this file stat-ing happens over the network), but I was wondering if there are any tweaks I could apply even in this situation.