Hi, we've hit a tough issue: one of our systems generates more than 10 MB/s of log data, forwarded to an index server, at certain peak times, and the data then arrives delayed by half an hour or more. In our scenario, the index server has 2 CPUs with 16 cores, 16 GB RAM, and 15,000 rpm SAS HDDs in RAID 5. I noticed its load is not heavy: CPU below 6 percent, load average around 0.1, disk at about 60 IOPS. The index server configuration: maxKBps = 0 in limits.conf, queue max size set to 2048 MB in server.conf. On the forwarder side I also configured the throughput limit with maxKBps = 0.
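For reference, the settings described above would look something like this (a sketch only; the stanza names are the standard Splunk ones, and the values are those quoted above):

```ini
# limits.conf, on both the indexer and the forwarder:
# 0 removes the per-pipeline throughput ceiling entirely.
[thruput]
maxKBps = 0

# server.conf on the indexer: enlarge the in-memory queues
# so bursts are buffered rather than blocking the forwarder.
[queue]
maxSize = 2048MB
```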
With this configuration, the maximum bandwidth between the forwarder and the indexer is about 2 MB/s.
So I have some questions:
1. Can Splunk run faster (push the data throughput well beyond 2 MB/s)? I don't see any bottleneck in the current hardware metrics. Is the indexing engine single-threaded? Is it possible to disable automatic field discovery on certain source types or indexes to speed up indexing throughput?
2. The Splunk process doesn't seem to consume many resources while indexing. Is throughput limited by the indexing process itself, or does the software deliberately spare resources for other routine jobs such as real-time searches and reports?
3. If I have a single log file of enormous size (3 MB/s × 3600 × 24 ≈ 253 GB per day), is it possible to make it searchable at near-real-time latency?
4. In your experience, what is the maximum data throughput per second a single indexer can reach?
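For context on question 3, the daily volume implied by a sustained 3 MB/s stream can be checked with a quick calculation (plain arithmetic, no Splunk specifics assumed):

```python
# A 3 MB/s log stream, sustained around the clock:
rate_mb_per_s = 3
seconds_per_day = 3600 * 24                   # 86,400 s
daily_mb = rate_mb_per_s * seconds_per_day    # 259,200 MB
daily_gb = daily_mb / 1024                    # ~253 GB/day

print(f"{daily_gb:.0f} GB/day")               # 253

# At the observed 2 MB/s indexing rate, the indexer can never
# catch up with a sustained 3 MB/s source -- it would need
# 36 hours to index each day's data, so the backlog only grows.
hours_to_index_one_day = daily_mb / 2 / 3600
print(f"{hours_to_index_one_day:.1f} hours to index one day of data")
```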
Interesting. You seem to have done your homework. Have you contacted Splunk Support?
Could there be other limits in the UF that you could overcome by switching to a Heavy Forwarder? This is not my strong area, just thinking out loud...
It's common to hit numbers like 10 MB/s on modern hardware with Splunk 5 or 6, though there are many variables, and with some data you might see lower numbers without anything being wrong. I have seen scenarios where 20 MB/s was achieved. 2 MB/s sounds like problem territory.
We don't have "Quality of Service"-style controls to prioritize one large file over all other data, so if the system can't handle the aggregate volume, the largest single data source may lag.
There are many potential bottlenecks in the system, and it's hard to diagnose this without a full support case. You have probably dealt with this problem long ago, but I'm answering anyway because the core question about expected throughput is important.
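On the question of disabling automatic field discovery: most of Splunk's field extraction happens at search time, so it usually doesn't affect indexing throughput, but the search-time behavior can be turned off per sourcetype in props.conf. A hedged sketch (the sourcetype name here is made up; KV_MODE and ANNOTATE_PUNCT are standard settings):

```ini
# props.conf -- "huge_app_log" is a hypothetical sourcetype name
[huge_app_log]
# Disable automatic key=value field discovery at search time.
KV_MODE = none
# Skip generating the index-time punct field for these events.
ANNOTATE_PUNCT = false
```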