I found out what the problem was. There is a Cribl server between the UF and the indexer, which I mistakenly ruled out as the source of the problem during troubleshooting. I bypassed Cribl for a while and the problem disappeared.
The rest went pretty fast. I found that a persistent queue was enabled for the Linux input/source in "Always On" mode. The persistent queue was not turned on for the Windows input/source, and Windows logs were OK the whole time. After turning it off for Linux data, the problem disappeared.
I don't understand why the persistent queue behaves this way, but I don't have time to investigate further. Maybe it's a Cribl bug or a misunderstanding of functionality. The input queue is not required in the project, so I can leave it off.
For me, it's currently resolved.
Thank you all for your help and your time.
Yeah, you're right. It was the other-way sawtooth. It looks strange. Are you sure you don't have any network-level issues? And don't you see any other interesting stuff in _internal (outside of the Metrics component) for this forwarder?
I have two weeks off, so I'll continue troubleshooting after that.
In my opinion there is nothing interesting in the _internal log. You can see it in the screenshot. I used the cluster command to reduce the number of events. There is component != metric in the SPL.
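For readers who want to reproduce this kind of overview, a search along the following lines might work. It is only a sketch: the component value `Metrics` and the `log_level` field are assumptions about how splunkd events are extracted in your environment, and `<your_forwarder>` is a placeholder.

```
index=_internal host=<your_forwarder> component!=Metrics
| cluster showcount=true
| table cluster_count, component, log_level, _raw
| sort - cluster_count
```

The `cluster` command groups similar events together, and `showcount=true` adds a `cluster_count` field, so a handful of rows can stand in for thousands of near-identical messages.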
Right. That was !=, not =.
You're mostly interested in
index=_internal component=AutoLoadBalancedConnectionStrategy host=<your_forwarder>
I looked at the events for the component you mentioned and found that there is only one type of log entry.
I also tried it for the "last 7 days" time range.
What kind of logs are you collecting? Is it possible that some log or input stalls the UF after it has been read, and the UF then just waits for free resources before reading the next one?
Do you have only one pipeline on your UF, or several?
Do you have any performance data from the OS level, and which OS and version are you running?
I am collecting logs from some files from /var/log and sysmon from journald.
last 90 minutes
/opt/splunkforwarder/var/log/splunk/audit.log | 41 |
/opt/splunkforwarder/var/log/splunk/health.log | 39 |
/opt/splunkforwarder/var/log/splunk/metrics.log | 8911 |
/opt/splunkforwarder/var/log/splunk/splunkd.log | 598 |
/var/log/audit/audit.log | 7 |
/var/log/messages | 936 |
/var/log/secure | 10 |
journald://sysmon | 919 |
inputs.conf
[monitor:///var/log/syslog]
I got direct access to the server again and checked the OS version. It is Red Hat Enterprise Linux release 9.4 (Plow).
I will try adding a pipeline and check whether it helps. I am also going to check whether the problem is connected with sysmon.
The count was right. There were only a few log entries in audit.log during that period; I checked it on the filesystem. After my ssh connection there are more log entries.
Last 90 minutes
/opt/splunkforwarder/var/log/splunk/audit.log | 2 |
/opt/splunkforwarder/var/log/splunk/conf.log | 1 |
/opt/splunkforwarder/var/log/splunk/configuration_change.log | 3 |
/opt/splunkforwarder/var/log/splunk/health.log | 26 |
/opt/splunkforwarder/var/log/splunk/metrics.log | 8975 |
/opt/splunkforwarder/var/log/splunk/splunkd-utility.log | 10 |
/opt/splunkforwarder/var/log/splunk/splunkd.log | 1055 |
/opt/splunkforwarder/var/log/watchdog/watchdog.log | 3 |
/var/log/audit/audit.log | 1337 |
/var/log/messages | 9418 |
/var/log/secure | 543 |
journald://sysmon | 6482 |
I found an interesting correlation. You can see a "gap", or change in behavior, in the graph. It starts after the UF is restarted. Before the UF restart there are messages "Found currently active indexer. Connected to idx=X.X.X.X:9992:0, reuse=1." About 20 minutes after the restart they are back.
I tried setting parallelIngestionPipelines = 2 in server.conf and the behavior did not change.
I also tried stopping the sysmon daemon and disabling the sysmon journald input. It had no effect on the above behavior.
Given the number of your log events, it would have been a surprise if that had helped.
Have you looked at the network interface stats to see if there is anything weird?
Does this same issue occur on all your Linux UF nodes? If so, that points strongly to a configuration issue!
Can you show your outputs.conf settings as exported by btool with the --debug option?
I did not find anything weird about the interface stats.
A similar problem occurs on all Linux nodes, but the period/delay differs.
Here is the btool output for the configuration:
Any errors on either side of the connection?
UF host, last 60 minutes, errors and warnings
IDX side
Still a problem here. This morning we had to reboot the Splunk servers because of an operating system security patch. You can see it at the beginning of the graph. This meant that the connection between the UF and the IDX had to be re-established. Whenever the IDX or the UF restarts, there is a period (about 20 minutes yesterday, 10 minutes today) without the delay or batch processing.
These errors are completely unrelated. You'd need to dig deeper to find something relevant regarding inputs on the receiving side or outputs on the sending side.
And the shape of your graph does look awfully close to a situation with periodic batch input which then unloads with a limited thruput connection.
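One way to make that sawtooth visible, sketched here under assumptions, is to chart the difference between index time and event time per source. `<your_index>` and `<your_forwarder>` are placeholders, and the one-minute span is an arbitrary choice:

```
index=<your_index> host=<your_forwarder>
| eval lag_seconds = _indextime - _time
| timechart span=1m max(lag_seconds) AS max_lag by source
```

If events arrive in batches, the lag should climb steadily and then drop back each time a batch is delivered. Sources with wrong timestamps or time zones would also show a large lag, so the result needs to be read with that in mind.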
I know that these errors are unrelated. I was trying to show that the internal logs are not full of "error" messages.
Situation is
Index time
SendQ
TCPout
Queues
internal messages (clustered)