Getting Data In

Ingesting delay and batch data sending

emzed
Path Finder
Hello,
 
I have problem with Linux UFs. I seem it is sending data in batches. The period between batches is about 9 minutes. It means that for oldest messages in batch it creates 9 minutes delay on indexer. 
It starts approximatly 21 minutes after its restart. During these 21 minutes is delay constant and low.
 
 
All Linux UFs behave in similar way. It start 21 minutes after UF restart, but period is different. 
 
I use UF version are 9.2.0.1 and  9.2.1. 
 
I have checked
 
- queues state in internal logs, it looks ok
- UF truhput is set to 10240
 
I have independently tested that after restarting the UF the data is coming in with a low and constant delay. After about 21 minutes it stops for about 9 minutes. 
After 9 minutes, a batch of messages arrive and are indexed, creating a sawtooth progression in the graph.
indexing_delay_2.png
It doesn't depend on the type of data. It behaves the same for internal UF logs and other logs. 
 
I currently collect data using file monitor input and journald input.
 
I can't figure out what the problem is.
 
Thanks in advanced for help
 
Michal
Labels (1)
0 Karma
1 Solution

emzed
Path Finder

I found out what the problem was. There is a Cribl server between UF and Indexer, which I mistakenly ruled out as the source of the problem during throubleshooting. I bypassed Cribl for a while and the problem disappeared.

The rest was already pretty fast. I found that there was a persistent queue enabled for Linux input/source in the "Alway On" mode. The persistent queue was not turned on for Windows Input/source. Windows logs were OK all the time. After turning it off for Linux data, the problem disappeared.

I don't understand why the persistent queue behaves this way, but I don't have time to investigate further. Maybe it's a Cribl bug or a misunderstanding of functionality. The input queue is not required in the project, so I can leave it off.

For me, it's currently resolved

Thank you all for your help and your time

View solution in original post

0 Karma

PickleRick
SplunkTrust
SplunkTrust

Yeah, you're right. It was the other-way sawtooth. It looks strange. Are you sure you don't have any network-level issues? And don't you see any other interesting stuff in _internal (outside of the Metrics component) for this forwarder?

 

0 Karma

emzed
Path Finder

I have two weeks off, so I'll continue troubleshooting after that.

In my opinion there are not any interesting stuff in _internal log. You can see it on the screenshot. I used cluster command to reduce log number. There is component != metric in SPL.

 

 

0 Karma

PickleRick
SplunkTrust
SplunkTrust

Right. That was !=, not =.

You're mostly interested in

index=_internal component=AutoLoadBalancedConnectionStrategy host=<your_forwarder>
0 Karma

emzed
Path Finder

I looked at the events for the component you mentioned and found that there is only one type of log entry.

2024-08-12 16_04_55-Search _ Splunk 9.2.0.1_autoloadbalancedconnectionstrategy.png

I also tried it for the "last 7 days" time range.

 

0 Karma

isoutamo
SplunkTrust
SplunkTrust

Which kind of logs you are collecting? Is it possible that there is some log or input which stalled this after it has read and then UF just wait free resources to read next one?

Have you only one or several pipelines in your UF?

Any performance data from OS level and which OS, version you have?

0 Karma

emzed
Path Finder

I am collecting logs from some files from /var/log and sysmon from journald.

last 90 minutes

/opt/splunkforwarder/var/log/splunk/audit.log41
/opt/splunkforwarder/var/log/splunk/health.log39
/opt/splunkforwarder/var/log/splunk/metrics.log8911
/opt/splunkforwarder/var/log/splunk/splunkd.log598
/var/log/audit/audit.log7
/var/log/messages936
/var/log/secure10
journald://sysmon919

 

inputs.conf

[monitor:///var/log/syslog]

disabled = 0
sourcetype = syslog
index = linux

[monitor:///var/log/messages]
disabled = 0
sourcetype = syslog
index = linux

[monitor:///var/log/secure]
disabled = 0
sourcetype = linux_secure
index = linux

[monitor:///var/log/auth.log]
disabled = 0
sourcetype = linux_secure
index = linux

[monitor:///var/log/audit/audit.log]
disabled = 0
sourcetype = linux_audit
index = linux

[journald://sysmon]
interval = 5
journalctl-quiet = true
journalctl-include-fields = PRIORITY,_SYSTEMD_UNIT,_SYSTEMD_CGROUP,_TRANSPORT,_PID,_UID,_MACHINE_ID,_GID,_COMM,_EXE
journalctl-exclude-fields = __MONOTONIC_TIMESTAMP,__SOURCE_REALTIME_TIMESTAMP
journalctl-filter = _SYSTEMD_UNIT=sysmon.service
sourcetype = sysmon:linux
index = linux
 
I did not change number of pipelines. I thing that default count is 1.

I will find out the OS version later. I do not have direct access to the OS. I thing it is CentOS/Redhat 8 or 9, but I may be wrong.

 
0 Karma

isoutamo
SplunkTrust
SplunkTrust
Not so many or floody inputs. Maybe you still should add another pipeline and check if it helps?
Based on amount of entries from audit.log it is quite low. Can you check is there really so few entries on source?
If those are entries from one Linux node from 90 minutes period it’s really unused.
0 Karma

emzed
Path Finder

I got a direct access to the sever again and I checked OS version. It is Red Hat Enterprise Linux release 9.4 (Plow).

I will try to add pipeline and I will check if it helps. I am going to check if there is not something connected with sysmon. 

It was right. There were only few log entries in audit.log during the period. I checked it on filesystem. After my ssh connection there is more log entrie. 

Last 90 minuts

/opt/splunkforwarder/var/log/splunk/audit.log2
/opt/splunkforwarder/var/log/splunk/conf.log1
/opt/splunkforwarder/var/log/splunk/configuration_change.log3
/opt/splunkforwarder/var/log/splunk/health.log26
/opt/splunkforwarder/var/log/splunk/metrics.log8975
/opt/splunkforwarder/var/log/splunk/splunkd-utility.log10
/opt/splunkforwarder/var/log/splunk/splunkd.log1055
/opt/splunkforwarder/var/log/watchdog/watchdog.log3
/var/log/audit/audit.log1337
/var/log/messages9418
/var/log/secure543
journald://sysmon6482

 

I revealed an interesting correlation. You can see a "gap" or change in behavior in the graph. It starts after the UF is restarted. There are messages "Found currently active indexer. Connected to idx=X.X.X.X:9992:0, reuse=1." before UF restart. After 20 minutes from restart they are back.

2024-08-12 17_55_40-Search _ Splunk 9.2.0.1.png

0 Karma

emzed
Path Finder

I tried setting parallelIngestionPipelines = 2 in server.conf and the behavior did not change. 

I also tried stopping sysmon deamon and disabling sysmon journald input. It had no effect on the above behavior.

0 Karma

isoutamo
SplunkTrust
SplunkTrust

Based on number of your log events it had been surprise if that was helped.

Have you look network interface stats, if there is something weird?

Was it so, that this same issue was in all your Linux uf nodes? If yes then it heavily pointed to some configuration issue!

Can you show your outputs.conf settings exported by btool with —debug option?

0 Karma

emzed
Path Finder

I did not find anything weird about the interface stats.

2024-08-13 21_42_03-projekty [SSH_ rockyforwork] - Visual Studio Code.png

2024-08-13 21_43_14-projekty [SSH_ rockyforwork] - Visual Studio Code.png

Similar problem occurs in all Linux nodes, but differs in period/delay.

There is btool output configuration 

2024-08-13 21_56_32-projekty [SSH_ rockyforwork] - Visual Studio Code.png

 

 

0 Karma

isoutamo
SplunkTrust
SplunkTrust
I cannot see anything special here.
Do you have UFs in other OS like windows or some Unix and if, have those the same issue?
Can you post your indexer’s relevant inputs.conf output from btool too?
0 Karma

emzed
Path Finder

I found out what the problem was. There is a Cribl server between UF and Indexer, which I mistakenly ruled out as the source of the problem during throubleshooting. I bypassed Cribl for a while and the problem disappeared.

The rest was already pretty fast. I found that there was a persistent queue enabled for Linux input/source in the "Alway On" mode. The persistent queue was not turned on for Windows Input/source. Windows logs were OK all the time. After turning it off for Linux data, the problem disappeared.

I don't understand why the persistent queue behaves this way, but I don't have time to investigate further. Maybe it's a Cribl bug or a misunderstanding of functionality. The input queue is not required in the project, so I can leave it off.

For me, it's currently resolved

Thank you all for your help and your time

0 Karma

PickleRick
SplunkTrust
SplunkTrust

Any errors on either side of the connection?

0 Karma

emzed
Path Finder

UF host for last 60 minutes with now errors and warnings

2024-07-25 09_28_17-Search _ Splunk 9.2.0.1.png

 

IDX side 

2024-07-25 09_33_01-Search _ Splunk 9.2.0.1.png

 

Still a problem here. This morning we had to reboot from the Splunk servers due to a security patch of the operating system. You can see it at the beginning of the graph. This meant that the connection between UF and IDX had to be re-established, i.e. when IDX or UF restarts, about 20 minutes yesterday and today 10 minutes is not the delay or batch processing.

2024-07-25 09_45_34-Search _ Splunk 9.2.0.1.png

0 Karma

PickleRick
SplunkTrust
SplunkTrust

These errors are completely unrelated. You'd need to dig deeper to find something relevant regarding inputs on the receiving side or outputs on the sending site.

And the shape of your graph does look awfully close to a situation with periodic batch input which then unloads with a limited thruput connection.

0 Karma

emzed
Path Finder

I know that these errors are unrelated. I tried to show that internal log are not full of "error" messages.

Situation is

  • Thruput is not limited (thruput se to 10240)
  • Number of logs is low
  • logs in files are generated fluently, i checked by "tail -f"
  • during aprox 20 minutes after UF restart there is no problem
  • after this time  problem  appears
  • the problem is
    • Data are buffered somewhere in front of indexer server, it is buffered aprox 9 minutes. After I restarted UF or droped TCP session, data were suddenly sent to the indexer.
    • I belive It must be buffered on UF side. I saw no dat period and then data burst on indexer site.
    • Shape of the grahp is saying the same thing. Data are somehere for some period of time and then are flushed to indexer. Older data are bigger diff a newer data are lower diff.

2024-07-26 12_26_38-Search _ Splunk 9.2.0.1.png

Index time

2024-07-26 12_26_09-Search _ Splunk 9.2.0.1.png

 

SendQ

2024-07-26 10_08_01-Clipboard.png

TCPout

2024-07-26 10_18_32-Search _ Splunk 9.2.0.1.png

 

Queues2024-07-26 11_13_09-Search _ Splunk 9.2.0.1.png

internal messages (clustered)
2024-07-26 11_27_02-Search _ Splunk 9.2.0.1.png

0 Karma
Get Updates on the Splunk Community!

New This Month in Splunk Observability Cloud - Metrics Usage Analytics, Enhanced K8s ...

The latest enhancements across the Splunk Observability portfolio deliver greater flexibility, better data and ...

Alerting Best Practices: How to Create Good Detectors

At their best, detectors and the alerts they trigger notify teams when applications aren’t performing as ...

Discover Powerful New Features in Splunk Cloud Platform: Enhanced Analytics, ...

Hey Splunky people! We are excited to share the latest updates in Splunk Cloud Platform 9.3.2408. In this ...