Solved: Ingesting delay and batch data sending

emzed · ‎07-24-2024

Hello,

I have problem with Linux UFs. I seem it is sending data in batches. The period between batches is about 9 minutes. It means that for oldest messages in batch it creates 9 minutes delay on indexer.

It starts approximatly 21 minutes after its restart. During these 21 minutes is delay constant and low.

All Linux UFs behave in similar way. It start 21 minutes after UF restart, but period is different.

I use UF version are 9.2.0.1 and 9.2.1.

I have checked

- queues state in internal logs, it looks ok

- UF truhput is set to 10240

I have independently tested that after restarting the UF the data is coming in with a low and constant delay. After about 21 minutes it stops for about 9 minutes.

After 9 minutes, a batch of messages arrive and are indexed, creating a sawtooth progression in the graph.

It doesn't depend on the type of data. It behaves the same for internal UF logs and other logs.

I currently collect data using file monitor input and journald input.

I can't figure out what the problem is.

Thanks in advanced for help

Michal

emzed · ‎08-14-2024

I found out what the problem was. There is a Cribl server between UF and Indexer, which I mistakenly ruled out as the source of the problem during throubleshooting. I bypassed Cribl for a while and the problem disappeared.

The rest was already pretty fast. I found that there was a persistent queue enabled for Linux input/source in the "Alway On" mode. The persistent queue was not turned on for Windows Input/source. Windows logs were OK all the time. After turning it off for Linux data, the problem disappeared.

I don't understand why the persistent queue behaves this way, but I don't have time to investigate further. Maybe it's a Cribl bug or a misunderstanding of functionality. The input queue is not required in the project, so I can leave it off.

For me, it's currently resolved

Thank you all for your help and your time

View solution in original post

PickleRick · ‎07-26-2024

Yeah, you're right. It was the other-way sawtooth. It looks strange. Are you sure you don't have any network-level issues? And don't you see any other interesting stuff in _internal (outside of the Metrics component) for this forwarder?

emzed · ‎07-28-2024

I have two weeks off, so I'll continue troubleshooting after that.

In my opinion there are not any interesting stuff in _internal log. You can see it on the screenshot. I used cluster command to reduce log number. There is component != metric in SPL.

PickleRick · ‎07-28-2024

Right. That was !=, not =.

You're mostly interested in

index=_internal component=AutoLoadBalancedConnectionStrategy host=<your_forwarder>

emzed · ‎08-12-2024

I looked at the events for the component you mentioned and found that there is only one type of log entry.

I also tried it for the "last 7 days" time range.

isoutamo · ‎08-12-2024

Which kind of logs you are collecting? Is it possible that there is some log or input which stalled this after it has read and then UF just wait free resources to read next one?

Have you only one or several pipelines in your UF?

Any performance data from OS level and which OS, version you have?

emzed · ‎08-12-2024

I am collecting logs from some files from /var/log and sysmon from journald.

last 90 minutes

/opt/splunkforwarder/var/log/splunk/audit.log	41
/opt/splunkforwarder/var/log/splunk/health.log	39
/opt/splunkforwarder/var/log/splunk/metrics.log	8911
/opt/splunkforwarder/var/log/splunk/splunkd.log	598
/var/log/audit/audit.log	7
/var/log/messages	936
/var/log/secure	10
journald://sysmon	919

inputs.conf

[monitor:///var/log/syslog]

disabled = 0

sourcetype = syslog

index = linux

[monitor:///var/log/messages]

disabled = 0

sourcetype = syslog

index = linux

[monitor:///var/log/secure]

disabled = 0

sourcetype = linux_secure

index = linux

[monitor:///var/log/auth.log]

disabled = 0

sourcetype = linux_secure

index = linux

[monitor:///var/log/audit/audit.log]

disabled = 0

sourcetype = linux_audit

index = linux

[journald://sysmon]

interval = 5

journalctl-quiet = true

journalctl-include-fields = PRIORITY,_SYSTEMD_UNIT,_SYSTEMD_CGROUP,_TRANSPORT,_PID,_UID,_MACHINE_ID,_GID,_COMM,_EXE

journalctl-exclude-fields = __MONOTONIC_TIMESTAMP,__SOURCE_REALTIME_TIMESTAMP

journalctl-filter = _SYSTEMD_UNIT=sysmon.service

sourcetype = sysmon:linux

index = linux

I did not change number of pipelines. I thing that default count is 1.

I will find out the OS version later. I do not have direct access to the OS. I thing it is CentOS/Redhat 8 or 9, but I may be wrong.

isoutamo · ‎08-12-2024

Not so many or floody inputs. Maybe you still should add another pipeline and check if it helps?
Based on amount of entries from audit.log it is quite low. Can you check is there really so few entries on source?
If those are entries from one Linux node from 90 minutes period it’s really unused.

emzed · ‎08-13-2024

I got a direct access to the sever again and I checked OS version. It is Red Hat Enterprise Linux release 9.4 (Plow).

I will try to add pipeline and I will check if it helps. I am going to check if there is not something connected with sysmon.

It was right. There were only few log entries in audit.log during the period. I checked it on filesystem. After my ssh connection there is more log entrie.

Last 90 minuts

/opt/splunkforwarder/var/log/splunk/audit.log	2
/opt/splunkforwarder/var/log/splunk/conf.log	1
/opt/splunkforwarder/var/log/splunk/configuration_change.log	3
/opt/splunkforwarder/var/log/splunk/health.log	26
/opt/splunkforwarder/var/log/splunk/metrics.log	8975
/opt/splunkforwarder/var/log/splunk/splunkd-utility.log	10
/opt/splunkforwarder/var/log/splunk/splunkd.log	1055
/opt/splunkforwarder/var/log/watchdog/watchdog.log	3
/var/log/audit/audit.log	1337
/var/log/messages	9418
/var/log/secure	543
journald://sysmon	6482

I revealed an interesting correlation. You can see a "gap" or change in behavior in the graph. It starts after the UF is restarted. There are messages "Found currently active indexer. Connected to idx=X.X.X.X:9992:0, reuse=1." before UF restart. After 20 minutes from restart they are back.

emzed · ‎08-13-2024

I tried setting parallelIngestionPipelines = 2 in server.conf and the behavior did not change.

I also tried stopping sysmon deamon and disabling sysmon journald input. It had no effect on the above behavior.

isoutamo · ‎08-13-2024

Based on number of your log events it had been surprise if that was helped.

Have you look network interface stats, if there is something weird?

Was it so, that this same issue was in all your Linux uf nodes? If yes then it heavily pointed to some configuration issue!

Can you show your outputs.conf settings exported by btool with —debug option?

emzed · ‎08-13-2024

I did not find anything weird about the interface stats.

Similar problem occurs in all Linux nodes, but differs in period/delay.

There is btool output configuration

isoutamo · ‎08-13-2024

I cannot see anything special here.
Do you have UFs in other OS like windows or some Unix and if, have those the same issue?
Can you post your indexer’s relevant inputs.conf output from btool too?

emzed · ‎08-14-2024

I found out what the problem was. There is a Cribl server between UF and Indexer, which I mistakenly ruled out as the source of the problem during throubleshooting. I bypassed Cribl for a while and the problem disappeared.

The rest was already pretty fast. I found that there was a persistent queue enabled for Linux input/source in the "Alway On" mode. The persistent queue was not turned on for Windows Input/source. Windows logs were OK all the time. After turning it off for Linux data, the problem disappeared.

I don't understand why the persistent queue behaves this way, but I don't have time to investigate further. Maybe it's a Cribl bug or a misunderstanding of functionality. The input queue is not required in the project, so I can leave it off.

For me, it's currently resolved

Thank you all for your help and your time

PickleRick · ‎07-24-2024

Any errors on either side of the connection?

emzed · ‎07-25-2024

UF host for last 60 minutes with now errors and warnings

IDX side

Still a problem here. This morning we had to reboot from the Splunk servers due to a security patch of the operating system. You can see it at the beginning of the graph. This meant that the connection between UF and IDX had to be re-established, i.e. when IDX or UF restarts, about 20 minutes yesterday and today 10 minutes is not the delay or batch processing.

PickleRick · ‎07-25-2024

These errors are completely unrelated. You'd need to dig deeper to find something relevant regarding inputs on the receiving side or outputs on the sending site.

And the shape of your graph does look awfully close to a situation with periodic batch input which then unloads with a limited thruput connection.

emzed · ‎07-26-2024

I know that these errors are unrelated. I tried to show that internal log are not full of "error" messages.

Situation is

Thruput is not limited (thruput se to 10240)
Number of logs is low
logs in files are generated fluently, i checked by "tail -f"
during aprox 20 minutes after UF restart there is no problem
after this time problem appears
the problem is
- Data are buffered somewhere in front of indexer server, it is buffered aprox 9 minutes. After I restarted UF or droped TCP session, data were suddenly sent to the indexer.
- I belive It must be buffered on UF side. I saw no dat period and then data burst on indexer site.
- Shape of the grahp is saying the same thing. Data are somehere for some period of time and then are flushed to indexer. Older data are bigger diff a newer data are lower diff.