I deployed a universal forwarder on a SUSE Linux server to monitor a log file. The forwarder sends its data to an indexer. We found that sometimes we can't search for some of the log entries that were added to the log file on the Linux server. For example, we added a log entry containing the keyword YWG_704740 to the log file and then ran this search on the indexer with the time range set to All time, but got no results:

    index=XXXX host=XXXX YWG_704740
I enabled indexer acknowledgment on the forwarder by setting the useACK attribute to true in outputs.conf. It helps, but we still can't find some logs on the indexer, although far fewer are missing than before.
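For reference, the relevant outputs.conf settings look roughly like this; the output group name and indexer address below are placeholders rather than our real values:

    [tcpout]
    defaultGroup = primary_indexers

    [tcpout:primary_indexers]
    # <indexer_host_or_ip> and port 9997 are placeholders; use your actual receiving indexer
    server = <indexer_host_or_ip>:9997
    # Indexer acknowledgment: the forwarder re-sends data the indexer never confirms
    useACK = true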
I would like to know whether there are methods to find out what happened, for example whether it is a connection problem, a forwarder problem, or an indexer problem.
Thanks a lot!
To clarify what bmacias84 said, on the forwarder, check splunkd.log and metrics.log.
Other places to look:
Search index=_internal and look for errors relating to the forwarder by IP address/hostname (see the example search after this list).
Are the search results all coming from the same source files? Are the missing events from just one or two source files on the forwarder? If so, check log file permissions on the forwarder. (But don't run the Splunk forwarder as root; that is a security risk.)
Do you have anything set up to route data to different indices? If so, double check that this input is not going to the wrong index.
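For the index=_internal check, a hedged example search; the forwarder hostname is a placeholder, and log_level and component are the fields Splunk normally extracts from its own splunkd.log events:

    index=_internal host=<forwarder_hostname> source=*splunkd.log* (log_level=ERROR OR log_level=WARN)
    | stats count BY component

Any component with a large error count (for example, the TCP output processor) points at where the pipeline is failing.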
Great, thanks!
We did some checking and troubleshooting, but there are still some problems. Please see the results of the checks below.
1. The search results all come from the same source.
2. The missing events come from a single source file.
3. I think the log file permissions are OK, because we receive most of the events from this log file.
4. We have not set up any routing of data to different indexes.
5. We checked splunkd.log and metrics.log.
6. We found a lot of "connection failed" errors in splunkd.log (see the sketch after this list).
7. We did not find any error or warning events in metrics.log.
8. We enabled indexer acknowledgment on the forwarder by setting the useACK attribute to true in outputs.conf.
9. It helps, but we still do not receive all events, although far fewer are missing than before. (With indexer acknowledgment enabled we lose fewer than 10 events per day; without it, we lose many more than that.)
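For anyone following along, a hedged sketch of the kind of check that can count those connection failures on the forwarder itself; it assumes $SPLUNK_HOME is set in the shell, and the exact message text may differ between Splunk versions:

    # How many connection-failure lines are in the forwarder's splunkd.log?
    grep -i "connection" $SPLUNK_HOME/var/log/splunk/splunkd.log | grep -ic "fail"
    # Show the most recent few so the target indexer host/port is visible
    grep -i "connection" $SPLUNK_HOME/var/log/splunk/splunkd.log | grep -i "fail" | tail -5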
Based on your trouble-shooting inside of Splunk ('connection failed'), I'd suggest:
Related to system performance: http://docs.splunk.com/Documentation/Splunk/6.3.0/ReleaseNotes/SplunkandTHP
Although this doesn't address the exact problem you are having, it may be helpful to see if there is an overall delay in indexing events (see the example search after these links): http://docs.splunk.com/Documentation/Splunk/6.3.0/Troubleshooting/Troubleshootingeventsindexingdelay
Fairly thorough discussion of system performance analysis wrt Splunk here: https://wiki.splunk.com/Community:PerformanceTroubleshooting
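As a companion to the indexing-delay doc, here is a hedged sketch of a search that measures the gap between event time and index time, reusing the placeholders from the original question:

    index=XXXX host=XXXX
    | eval index_delay_sec = _indextime - _time
    | stats avg(index_delay_sec) AS avg_delay_sec, max(index_delay_sec) AS max_delay_sec

A consistently large or growing delay suggests the events are arriving but queuing up, rather than being lost outright.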
Thanks!
We will check our network first, because we found that a lot of packets were being dropped.
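In case it is useful to others, a hedged sketch of the kind of checks we can run on the SUSE forwarder host; it assumes standard Linux tools (and netcat) are installed, and the indexer host and port 9997 are placeholders:

    # TCP retransmissions and interface errors/drops
    netstat -s | grep -i retrans
    ip -s link show
    # Confirm the indexer's receiving port is reachable from the forwarder
    nc -vz <indexer_host_or_ip> 9997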
Two places to check are $SPLUNK_HOME/var/log/splunk/splunkd.log and $SPLUNK_HOME/var/log/splunk/metrics.log. splunkd.log will contain information on where the forwarder is having problems connecting to the indexer. metrics.log shows how many bytes are being sent and what's happening in the queues.
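If it helps, a hedged example of pulling that queue activity out of metrics.log via the _internal index; the forwarder hostname is a placeholder, and the group/blocked/name fields assume the standard metrics.log key=value format:

    index=_internal host=<forwarder_hostname> source=*metrics.log* group=queue blocked=true
    | stats count BY name

Any queue that shows up here is blocking, which usually means the data can't get out to the indexer fast enough.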
Dumb question: have you configured your forwarder to send to the indexer?
Also run $SPLUNK_HOME/bin/splunk list monitor and see if your file is listed.
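For example (the path is a placeholder and the exact output formatting varies by Splunk version):

    $SPLUNK_HOME/bin/splunk list monitor
    # Expected to list the monitored inputs, something like:
    # Monitored Files:
    #         /var/log/myapp/app.log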