Is there any way to identify which sources are not forwarding their logs?
For example: if we have such stanzas
I need to know which source's event forwarding is failing, whatever the reason.
Can I get this reliably from the UF's _internal logs?
We don't have access to the indexers or to the actual logs.
I think the only way is to create a lookup (called e.g. perimeter.csv) listing the sources you expect from each host (at least two fields: host and source), then run a search to find the missing ones, something like this:
| metasearch index=your_index | eval host=lower(host), source=lower(source) | stats count BY host source | append [ | inputlookup perimeter.csv | eval count=0, host=lower(host), source=lower(source) | fields host source count ] | stats sum(count) AS total BY host source | where total=0
This is a very common problem. So much so that I'll just link to some other solutions to it:
If you end up with specifics not covered by those, and which Google won't help much with, let us know and we can help!
Thanks @Richfez for the quick response.
I don't have any application/OS events in my indexer, only the UF internal logs.
I need to know whether each log file is being monitored properly and whether it is being sent to the other receiving indexer.
Suppose [monitor://xyz/.../logs.txt] is configured but events from logs.txt are not being forwarded; can I detect this from the UF's internal logs?
Basically I want to make sure all stanzas of inputs.conf are working properly. Can I know this from only the UF's internal logs? Is there anything about this in the splunkd logs or metrics?
There are quite a few answers to this, depending on exactly what it is you are trying to accomplish.
One search, which looks for systems that have had a problem connecting to the indexer they're supposed to be sending data to, might be
index=_internal host=<hostname> component=TcpOutputFd log_level=ERROR
Poke around with time frames and hosts (or set host=* to see them all) so you can see what a forwarder failing connection to the server looks like.
A similar view, but sort of looking at it from another direction, is
index=_internal host=<hostname> log_level=WARN component=TcpOutputProc
Both of those, though, look at mostly failed connections from the UF to the indexing tier. This may be what you are after, but maybe not.
Another thought is to try looking at "WatchedFile"
index=_internal host=<hostname> component=WatchedFile
I'm not exactly sure what all shows up in there or doesn't show up in there if things aren't working right, but maybe you can engineer a failure of a sort on your own desktop (after installing either splunk on it to test with, or a UF that points to your indexing tier) and see what happens.
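If it helps while you experiment, here is a rough sketch that widens the net a little. I'm assuming (check this against your own UF's splunkd.log) that file-monitoring problems are logged under the TailingProcessor and TailReader components as well as WatchedFile, and that component and log_level are extracted the same way as in the searches above:

```spl
index=_internal host=<hostname> sourcetype=splunkd (component=TailingProcessor OR component=TailReader OR component=WatchedFile) (log_level=WARN OR log_level=ERROR)
| stats count latest(_time) AS last_seen BY component log_level
| convert ctime(last_seen)
```

That only gives you a count per component and severity. Drop everything after the first pipe to read the raw messages and see which file each one is complaining about.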
Here's a fantastic (and probably fantastically useless!) search that's fun to run
index=_internal host=* component=WatchedFile "will begin reading" | timechart max(offset) by file
Though you'll probably want to only run the above search against a single host (host=X) and possibly only for non-splunk logs (in which case maybe it's slightly more useful), like
index=_internal host=<hostname> component=WatchedFile "will begin reading" file!="'/opt/splunk*" file!="'C:\\Program Files\\Splunk*" | timechart max(offset) by file
I wish there were a better place to get concrete answers for questions like this. As it is, Splunk has an entire class on troubleshooting, and there's lots in there that would be useful to you, but mostly it's all just variations on the above searches and data, so ... play around for a while and you might find all you need.
And yes, it's worth mentioning that all of this would work best if you could also see whether the actual data is coming in, either instead of or in tandem with this set of searches.
Because as it is, you are attempting to determine whether X is or is not happening by looking at logs that only say Y. Yes, failed connections to the indexing tier probably mean that monitored files aren't coming in.
But a broken connection to the indexing tier also means the forwarder's own errors won't show up until after it reconnects!
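One thing that gets a little closer to "X" using only the forwarder's internal logs is the per-source throughput data in metrics.log. A minimal sketch, assuming your UF writes group=per_source_thruput events with the usual series and kb fields:

```spl
index=_internal host=<hostname> source=*metrics.log* group=per_source_thruput
| stats latest(_time) AS last_seen sum(kb) AS total_kb BY series
| convert ctime(last_seen)
| sort last_seen
```

Here series is the source (the monitored file), so a file from inputs.conf that never appears, or whose last_seen is old, is worth a closer look. Treat it as a hint rather than proof: as far as I know metrics.log only records the busiest handful of series per interval (10 by default), so a quiet but healthy file can be missing from the list.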
And of course I'd think of some more ways, immediately after posting the previous thing.
Maybe the easiest meta-style search is
index=_internal component=Metrics earliest=-1d host=* | stats latest(_time) as _time, count by host | reltime
That looks back 1 day, then finds, for each host that has contacted the server in the past day, the latest time it did so. Reltime just builds a neat little field like '22 hours ago', which is nice for people who like to read English.
You could `| sort` that by _time, you could filter it to things older than two hours ago... here's one way of doing that.
index=_internal component=Metrics earliest=-1d host=* | stats latest(_time) as _time, count by host | eval threshold_time = relative_time(now(), "-2h") | where _time < threshold_time | reltime
That looks back 1 day for all the things that showed up anywhere.
THEN it trims out all the ones it's seen in the last 2 hours, so it's just returning the list of hosts that have contacted the server in the last 24 hours (well, technically, in the last 1 day), but which have NOT contacted it in the last 2 hours.
Adjust thresholds as you want!
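If that search is slow over a lot of forwarders, a tstats version of the same idea may be quicker. This is only a sketch: it can't filter on component=Metrics (that isn't an indexed field), so it counts any _internal event from the host as "contact":

```spl
| tstats latest(_time) AS _time count WHERE index=_internal earliest=-1d BY host
| where _time < relative_time(now(), "-2h")
| reltime
```

Same thresholds as before: hosts seen in the last day but not in the last 2 hours.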