Background: I have syslog collector systems which run UFs to put received syslog data into Splunk. We're in the process of migrating from an old Splunk infrastructure to a new Splunk infrastructure, redoing all our indexes and sourcetypes as we go, and I want a gradual transition period rather than a hard flag day.
Objective: have the UF simultaneously feed the same log files into two different Splunk environments (via two separate tcpout groups), using different index/sourcetype values for each.
The winning inputs.conf that seems to fit my needs is:
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
index = networktest
sourcetype = syslog
host_segment = 6
blacklist = \.gz$
crcSalt = <SOURCE>
_TCP_ROUTING = splunk_legacy
[monitor:///services/net-logs/logs2/*/(udp|tcp)_(1514|1515)/]
index = some_new_index
sourcetype = some_new_sourcetype
host_segment = 6
blacklist = \.gz$
crcSalt = <SOURCE>
_TCP_ROUTING = aws_ufwd_tier_splunk_cert
with a symlink from /services/net-logs/logs2 -> /services/net-logs/logs.
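For reference, the symlink itself is just:
$ ln -s /services/net-logs/logs /services/net-logs/logs2
and the two _TCP_ROUTING values map to tcpout groups in outputs.conf along these lines (the group names come from the stanzas above; the server values here are placeholders, not my real hosts):
[tcpout:splunk_legacy]
server = legacy-indexer.example.edu:9997
[tcpout:aws_ufwd_tier_splunk_cert]
server = new-indexer.example.edu:9997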
But before settling on that, I tried several other variations on path pairs (everything else being equivalent) which did not work nearly as well.
With this pair
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
[monitor:///./services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
the new monitor stanza (with dot) processed new data promptly as expected; however, the original (without dot) would take quite a long time -- sometimes approaching 5 minutes -- to notice when a new file appeared or when new lines appeared at the end of an existing file. I observed this in several different ways:
splunk list inputstatus (note 411 vs 137) shortly after sending new messages
splunk list inputstatus | grep -A6 tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-16
/./services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-16
file position = 411
file size = 411
parent = /./services/net-logs/logs/*/(udp|tcp)_(1514|1515)/
percent = 100.00
type = finished reading
--
/services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-16
file position = 137
file size = 137
parent = /services/net-logs/logs/*/(udp|tcp)_(1514|1515)/
percent = 100.00
type = finished reading
DEBUG logging (note 3+ minute gap between the first notification detected on each path)
06-27-2019 17:46:06.118 +0000 DEBUG TailingProcessor - File state notification for path='/./services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-17' (first time).
06-27-2019 17:49:21.235 +0000 DEBUG TailingProcessor - File state notification for path='/services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-17' (first time).
querying legacy Splunk with | eval latency_secs=(_indextime-_time)
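i.e. something along these lines (index/sourcetype from the legacy stanza above; the stats calls are just illustrative):
index=networktest sourcetype=syslog | eval latency_secs=(_indextime-_time) | stats avg(latency_secs) max(latency_secs)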
In all cases the delayed monitor did eventually notice the change and forward the data.
This pair
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
[monitor:///services/net-logs/./logs/*/(udp|tcp)_(1514|1515)/]
likewise resulted in high data latency for the original path without dot.
This pair
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
[monitor:///services/net-logs2/logs/*/(udp|tcp)_(1514|1515)/]
(with net-logs2 symlinked to net-logs) resulted in high data latency for net-logs2.
Hypothesis: the path that experiences the latency is whichever one comes later in an alpha-sort.
Which brings me, finally, to my questions:
Why in those other three cases does one monitor of the pair notice changes immediately while the other one takes several minutes? Note that it's definitely not due to load; all this was done on a test server which was receiving no outside log data except for the handful of test messages I fed it.
Why does my "winning" configuration with the symlink immediately before the wildcard not experience the same latency problem as all the others? In those tests, splunk list inputstatus
consistently shows the same file position for both paths, new log files/lines produce simultaneous File state notification DEBUG messages for both paths, and the actual latency in Splunk is a satisfying 0-2s.
Can I expect my logs2 solution to perform consistently under heavier production load, or are my positive test results just a fluke?
I suspect the difference has something to do with:
07-01-2019 23:29:10.811 +0000 INFO TailingProcessor - Adding watch on path: /services/net-logs/logs.
07-01-2019 23:29:10.811 +0000 INFO TailingProcessor - Adding watch on path: /services/net-logs/logs2.
07-01-2019 23:29:10.879 +0000 ERROR FilesystemChangeWatcher - Error setting up inotify on "/services/net-logs/logs2": Not a directory
Wild speculation: in the other cases, splunkd would set up inotify to watch the same logs directory twice (via two different pathnames, but I don't think that matters to inotify), and consequently splunkd either didn't receive all the notifications it was expecting or didn't understand how to interpret them as matching the distinct paths. Whereas with the logs2 symlink, it knew it couldn't use inotify, so it did... something else... instead?
I haven't found a shred of official Splunk documentation that explains how the monitor detects when files appear and change. It seems clear that there must be at least two such mechanisms, but what are they?
Wow, creative. I guess the usage of crcSalt is helping make this work. Personally, I would have just run 2 UFs, and then phase out the old one. 🙂
Which version of the UF are you on? It looks like this bug (failing to use inotify on a symlink) was fixed in 7.2.0 (bug: SPL-153035).
Whereas with the logs2 symlink, it knew it couldn't use inotify, so it did... something else... instead?
It fell back to the classic implementation of the Monitor Input. This actively polls all monitored files with a TCP-like backoff of idle files, such that it still scales reasonably well on all platforms (but obviously can't approach inotify-levels of performance).
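Conceptually, per monitored file, it's something like the following loop -- purely illustrative, not Splunk's actual code, intervals, or cap:
file=/services/net-logs/logs/somefile   # hypothetical example file
old_size=0
interval=1
while true; do
    sleep "$interval"
    new_size=$(stat -c %s "$file" 2>/dev/null || echo 0)
    if [ "$new_size" != "$old_size" ]; then
        old_size=$new_size
        interval=1                              # activity: poll again soon
    else
        interval=$(( interval * 2 ))            # idle: back off
        [ "$interval" -gt 64 ] && interval=64   # ...up to some cap
    fi
done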
If you can't upgrade your UF, you can ditch the symlink and use a bind mount[1]:
$ rm /services/net-logs/logs2
$ mkdir /services/net-logs/logs2
$ mount /services/net-logs/logs /services/net-logs/logs2 -o bind
This will be seen as a directory instead of a symlink (because it is).
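To make the bind mount persist across reboots, an /etc/fstab entry along these lines should also work:
/services/net-logs/logs  /services/net-logs/logs2  none  bind  0 0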
[1] Hope you're using Linux! 🙂
It fell back to the classic implementation of the Monitor Input.
Thanks for this explanation; that makes sense, and answers my questions #2 and #3.
It looks like this bug (failing to use inotify on a symlink) was fixed in 7.2.0 (bug: SPL-153035).
Interesting, but that actually hurts my use case.
To summarize and perhaps clarify my original results:
Using UF v6.5.0, the logs2 symlink case produces the good outcome specifically because it can't use inotify (AFAICT).
I just retested that case with 7.2.0, and it now produces the bad outcome just like my other three test cases (more consistent, but less useful).
So I still need an answer to my question #1 (also the post title): why does monitoring the same directory twice with inotify work so poorly (i.e. incur such high data latency)?
Also: is there some way in the configuration to ask it to use the classic implementation instead?
So I still need an answer to my question #1 (also the post title): why does monitoring the same directory twice with inotify work so poorly (i.e. incur such high data latency)?
It sounds like your tricky scenario is managing to trick our file monitoring code more than we originally thought. I suppose the short answer here is, "it's complicated". 2 separate UFs monitoring the same dir twice would likely not experience this bug, even with each using inotify.
Also: is there some way in the configuration to ask it to use the classic implementation instead?
You can set SPLUNKD_USE_KERNEL_FILE_NOTIFY=0 in splunk-launch.conf.
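Assuming the usual location, that means adding a line like this to $SPLUNK_HOME/etc/splunk-launch.conf and restarting the forwarder:
SPLUNKD_USE_KERNEL_FILE_NOTIFY=0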
Thanks, that's a big help!