Background: I have syslog collector systems which run UFs to put received syslog data into Splunk. We're in the process of migrating from an old Splunk infrastructure to a new Splunk infrastructure, redoing all our indexes and sourcetypes as we go, and I want a gradual transition period rather than a hard flag day.
Objective: have the UF simultaneously feed the same log files into two different Splunk environments (via two separate tcpout groups), using different index/sourcetype values for each.
The winning inputs.conf that seems to fit my needs is:
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
index = networktest
sourcetype = syslog
host_segment = 6
blacklist = \.gz$
crcSalt = <SOURCE>
_TCP_ROUTING = splunk_legacy
[monitor:///services/net-logs/logs2/*/(udp|tcp)_(1514|1515)/]
index = some_new_index
sourcetype = some_new_sourcetype
host_segment = 6
blacklist = \.gz$
crcSalt = <SOURCE>
_TCP_ROUTING = aws_ufwd_tier_splunk_cert
with a symlink from /services/net-logs/logs2 -> /services/net-logs/logs.
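For reference, the symlink itself is just:
$ ln -s /services/net-logs/logs /services/net-logs/logs2
and the two _TCP_ROUTING values map to tcpout groups in outputs.conf along these lines (the group names come from the stanzas above; the server values here are placeholders, not my real hosts):
[tcpout:splunk_legacy]
server = legacy-indexer.example.edu:9997
[tcpout:aws_ufwd_tier_splunk_cert]
server = new-indexer.example.edu:9997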
But before settling on that, I tried several other variations on path pairs (everything else being equivalent) which did not work nearly as well.
With this pair
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
[monitor:///./services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
the new monitor stanza (with dot) processed new data promptly as expected; however, the original (without dot) would take quite a long time -- sometimes approaching 5 minutes -- to notice when a new file appeared or when new lines appeared at the end of an existing file. I observed this in several different ways:
splunk list inputstatus (note 411 vs 137) shortly after sending new messages
splunk list inputstatus | grep -A6 tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-16
/./services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-16
file position = 411
file size = 411
parent = /./services/net-logs/logs/*/(udp|tcp)_(1514|1515)/
percent = 100.00
type = finished reading
--
/services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-16
file position = 137
file size = 137
parent = /services/net-logs/logs/*/(udp|tcp)_(1514|1515)/
percent = 100.00
type = finished reading
DEBUG logging (note 3+ minute gap between the first notification detected on each path)
06-27-2019 17:46:06.118 +0000 DEBUG TailingProcessor - File state notification for path='/./services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-17' (first time).
06-27-2019 17:49:21.235 +0000 DEBUG TailingProcessor - File state notification for path='/services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-17' (first time).
querying legacy Splunk with | eval latency_secs=(_indextime-_time)
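i.e. something along these lines (index/sourcetype from the legacy stanza above; the stats calls are just illustrative):
index=networktest sourcetype=syslog | eval latency_secs=(_indextime-_time) | stats avg(latency_secs) max(latency_secs)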
In all cases the delayed monitor did eventually notice the change and forward the data.
This pair
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
[monitor:///services/net-logs/./logs/*/(udp|tcp)_(1514|1515)/]
likewise resulted in high data latency for the original path without dot.
This pair
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
[monitor:///services/net-logs2/logs/*/(udp|tcp)_(1514|1515)/]
(with net-logs2 symlinked to net-logs) resulted in high data latency for net-logs2.
Hypothesis: the path that experiences the latency is whichever one comes later in an alpha-sort.
Which brings me, finally, to my questions:
Why in those other three cases does one monitor of the pair notice changes immediately while the other one takes several minutes? Note that it's definitely not due to load; all this was done on a test server which was receiving no outside log data except for the handful of test messages I fed it.
Why does my "winning" configuration with the symlink immediately before the wildcard not experience the same latency problem as all the others? In those tests, splunk list inputstatus
consistently shows the same file position for both paths, new log files/lines produce simultaneous File state notification DEBUG messages for both paths, and the actual latency in Splunk is a satisfying 0-2s.
Can I expect my logs2 solution to perform consistently under heavier production load, or are my positive test results just a fluke?
I suspect the difference has something to do with:
07-01-2019 23:29:10.811 +0000 INFO TailingProcessor - Adding watch on path: /services/net-logs/logs.
07-01-2019 23:29:10.811 +0000 INFO TailingProcessor - Adding watch on path: /services/net-logs/logs2.
07-01-2019 23:29:10.879 +0000 ERROR FilesystemChangeWatcher - Error setting up inotify on "/services/net-logs/logs2": Not a directory
Wild speculation: in the other cases, splunkd would set up inotify to watch the same logs directory twice (via two different pathnames, but I don't think that matters to inotify), and consequently splunkd either didn't receive all the notifications it was expecting or didn't understand how to interpret them as matching the distinct paths. Whereas with the logs2 symlink, it knew it couldn't use inotify, so it did... something else... instead?
I haven't found a shred of official Splunk documentation that explains how the monitor detects when files appear and change. It seems clear that there must be at least two such mechanisms, but what are they?
Wow, creative. I guess the usage of crcSalt is helping make this work. Personally, I would have just run 2 UFs, and then phase out the old one. 🙂
Which version of the UF are you on? It looks like this bug (failing to use inotify on a symlink) was fixed in 7.2.0 (bug: SPL-153035).
Whereas with the logs2 symlink, it knew it couldn't use inotify, so it did... something else... instead?
It fell back to the classic implementation of the Monitor Input. This actively polls all monitored files with a TCP-like backoff of idle files, such that it still scales reasonably well on all platforms (but obviously can't approach inotify-levels of performance).
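Conceptually, per monitored file, it's something like the following loop -- purely illustrative, not Splunk's actual code, intervals, or cap:
file=/services/net-logs/logs/somefile   # hypothetical example file
old_size=0
interval=1
while true; do
    sleep "$interval"
    new_size=$(stat -c %s "$file" 2>/dev/null || echo 0)
    if [ "$new_size" != "$old_size" ]; then
        old_size=$new_size
        interval=1                              # activity: poll again soon
    else
        interval=$(( interval * 2 ))            # idle: back off
        [ "$interval" -gt 64 ] && interval=64   # ...up to some cap
    fi
done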
If you can't upgrade your UF, you can ditch the symlink and use a bind mount[1]:
$ rm /services/net-logs/logs2
$ mkdir /services/net-logs/logs2
$ mount /services/net-logs/logs /services/net-logs/logs2 -o bind
This will be seen as a directory instead of a symlink (because it is).
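To make the bind mount persist across reboots, an /etc/fstab entry along these lines should also work:
/services/net-logs/logs  /services/net-logs/logs2  none  bind  0 0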
[1] Hope you're using Linux! 🙂
It fell back to the classic implementation of the Monitor Input.
Thanks for this explanation; that makes sense, and answers my questions #2 and #3.
It looks like this bug (failing to use inotify on a symlink) was fixed in 7.2.0 (bug: SPL-153035).
Interesting, but that actually hurts my use case.
To summarize and perhaps clarify my original results:
Using UF v6.5.0, the logs2 symlink case produces the good outcome specifically because it can't use inotify (AFAICT).
I just retested that case with 7.2.0, and it now produces the bad outcome just like my other three test cases (more consistent, but less useful).
So I still need an answer to my question #1 (also the post title): why does monitoring the same directory twice with inotify work so poorly (i.e. incur such high data latency)?
Also: is there some way in the configuration to ask it to use the classic implementation instead?
So I still need an answer to my question #1 (also the post title): why does monitoring the same directory twice with inotify work so poorly (i.e. incur such high data latency)?
It sounds like your tricky scenario is managing to trick our file monitoring code more than we originally thought. I suppose the short answer here is, "it's complicated". 2 separate UFs monitoring the same dir twice would likely not experience this bug, even with each using inotify.
Also: is there some way in the configuration to ask it to use the classic implementation instead?
You can set SPLUNKD_USE_KERNEL_FILE_NOTIFY=0 in splunk-launch.conf.
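Assuming the usual location, that means adding a line like this to $SPLUNK_HOME/etc/splunk-launch.conf and restarting the forwarder:
SPLUNKD_USE_KERNEL_FILE_NOTIFY=0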
Thanks, that's a big help!