Background: I have syslog collector systems which run universal forwarders (UFs) to feed received syslog data into Splunk. We're in the process of migrating from an old Splunk infrastructure to a new Splunk infrastructure, redoing all our indexes and sourcetypes as we go, and I want a gradual transition period rather than a hard flag day.
Objective: have the UF simultaneously forward the same log files to two different Splunk environments (via two separate tcpout groups) using different index/sourcetype values for each.
The winning inputs.conf that seems to fit my needs is:
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
index = networktest
sourcetype = syslog
host_segment = 6
blacklist = \.gz$
crcSalt = <SOURCE>
_TCP_ROUTING = splunk_legacy
[monitor:///services/net-logs/logs2/*/(udp|tcp)_(1514|1515)/]
index = some_new_index
sourcetype = some_new_sourcetype
host_segment = 6
blacklist = \.gz$
crcSalt = <SOURCE>
_TCP_ROUTING = aws_ufwd_tier_splunk_cert
with a symlink from /services/net-logs/logs2 -> /services/net-logs/logs.
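To make the layout concrete, here is a sketch of the symlink setup, using a scratch directory under /tmp as a stand-in for the real /services/net-logs tree:

```shell
# Recreate the symlink layout on a scratch copy of the tree.
# /tmp/net-logs-demo stands in for the real /services/net-logs.
mkdir -p /tmp/net-logs-demo/logs
ln -sfn /tmp/net-logs-demo/logs /tmp/net-logs-demo/logs2

# Both names now refer to the same directory contents:
ls -ld /tmp/net-logs-demo/logs /tmp/net-logs-demo/logs2
```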
Before settling on that, I tried several other path-pair variations (everything else being equal), none of which worked nearly as well.
With this pair
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
[monitor:///./services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
the new monitor stanza (with dot) processed new data promptly as expected; however, the original (without dot) would take quite a long time -- sometimes approaching 5 minutes -- to notice when a new file appeared or when new lines appeared at the end of an existing file. I observed this in several different ways:
splunk list inputstatus (note 411 vs 137) shortly after sending new messages
splunk list inputstatus | grep -A6 tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-16
/./services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-16
file position = 411
file size = 411
parent = /./services/net-logs/logs/*/(udp|tcp)_(1514|1515)/
percent = 100.00
type = finished reading
--
/services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-16
file position = 137
file size = 137
parent = /services/net-logs/logs/*/(udp|tcp)_(1514|1515)/
percent = 100.00
type = finished reading
DEBUG logging (note 3+ minute gap between the first notification detected on each path)
06-27-2019 17:46:06.118 +0000 DEBUG TailingProcessor - File state notification for path='/./services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-17' (first time).
06-27-2019 17:49:21.235 +0000 DEBUG TailingProcessor - File state notification for path='/services/net-logs/logs/whipsaw-dev-aws1.techservices.illinois.edu/tcp_1514/wirelessprv-10-192-167-133.near.illinois.edu/10.192.167.133/user-2019-06-27-17' (first time).
querying legacy Splunk with | eval latency_secs=(_indextime-_time)
In all cases the delayed monitor did eventually notice the change and forward the data.
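For reference, a sketch of the latency search run on the legacy Splunk side, built around the eval shown above (the stats line is my addition for summarizing; index and sourcetype names match the config earlier in the post):

```
index=networktest sourcetype=syslog
| eval latency_secs=(_indextime-_time)
| stats min(latency_secs) avg(latency_secs) max(latency_secs)
```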
This pair
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
[monitor:///services/net-logs/./logs/*/(udp|tcp)_(1514|1515)/]
likewise resulted in high data latency for the original path without dot.
This pair
[monitor:///services/net-logs/logs/*/(udp|tcp)_(1514|1515)/]
[monitor:///services/net-logs2/logs/*/(udp|tcp)_(1514|1515)/]
(with net-logs2 symlinked to net-logs) resulted in high data latency for net-logs2.
Hypothesis: the path that experiences the latency is whichever one comes later in an alpha-sort.
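The alpha-sort hypothesis can be checked mechanically against all three failing pairs (paths abbreviated to the directory portion; the "delayed" list records which member lagged in each test):

```python
# For each monitor-path pair, the lexicographically later string
# was the path that experienced the latency in my tests.
pairs = [
    ("/services/net-logs/logs/", "/./services/net-logs/logs/"),
    ("/services/net-logs/logs/", "/services/net-logs/./logs/"),
    ("/services/net-logs/logs/", "/services/net-logs2/logs/"),
]
delayed = [
    "/services/net-logs/logs/",   # no-dot path lagged
    "/services/net-logs/logs/",   # no-dot path lagged
    "/services/net-logs2/logs/",  # logs2 path lagged
]
for (a, b), slow in zip(pairs, delayed):
    later = max(a, b)          # lexicographically later of the two
    print(later == slow)       # True for all three pairs
```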
Which brings me, finally, to my questions:
Why in those other three cases does one monitor of the pair notice changes immediately while the other one takes several minutes? Note that it's definitely not due to load; all this was done on a test server which was receiving no outside log data except for the handful of test messages I fed it.
Why does my "winning" configuration with the symlink immediately before the wildcard not experience the same latency problem as all the others? In those tests, splunk list inputstatus consistently shows the same file position for both paths, new log files/lines produce simultaneous File state notification DEBUG messages for both paths, and the actual latency in Splunk is a satisfying 0-2s.
Can I expect my logs2 solution to perform consistently under heavier production load, or are my positive test results just a fluke?
I suspect the difference has something to do with:
07-01-2019 23:29:10.811 +0000 INFO TailingProcessor - Adding watch on path: /services/net-logs/logs.
07-01-2019 23:29:10.811 +0000 INFO TailingProcessor - Adding watch on path: /services/net-logs/logs2.
07-01-2019 23:29:10.879 +0000 ERROR FilesystemChangeWatcher - Error setting up inotify on "/services/net-logs/logs2": Not a directory
Wild speculation: in the other cases, splunkd would set up inotify to watch the same logs directory twice (via two different pathnames, but I don't think that matters to inotify), and consequently splunkd either didn't receive all the notifications it was expecting or didn't understand how to interpret them as matching the distinct paths. Whereas with the logs2 symlink, it knew it couldn't use inotify, so it did... something else... instead?
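One fact supporting that speculation: inotify identifies watch targets by inode, not by pathname, so two spellings of the same directory resolve to a single watch target. A small sketch (again using a scratch directory, not the real log tree) showing that the dotted pathname and the symlink both resolve to the same device/inode pair as the plain path:

```python
import os
import tempfile

# inotify keys watches on (device, inode), not on the path string,
# so differently-spelled paths to one directory are one watch target.
base = tempfile.mkdtemp()
logs = os.path.join(base, "logs")
os.mkdir(logs)
os.symlink(logs, os.path.join(base, "logs2"))

def dev_ino(path):
    st = os.stat(path)  # follows symlinks, as inotify_add_watch does by default
    return (st.st_dev, st.st_ino)

dotted = os.path.join(base, ".", "logs")      # analogous to /./services/.../logs
print(dev_ino(logs) == dev_ino(dotted))               # True: same inode
print(dev_ino(logs) == dev_ino(base + "/logs2"))      # True: symlink resolves too
```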
I haven't found a shred of official Splunk documentation that explains how the monitor detects when files appear and change. It seems clear that there must be at least two such mechanisms, but what are they?