I spent hours trying to figure this out Friday, and it's been bugging me all weekend. So, I'm hoping the community can help me figure this out! The info below is all from memory, hopefully I don't miss anything.
First off, I'm completely new to Splunk... So if I butcher terminology or concepts, please understand! I am now trying to come in and fix something that appears to have never worked. Several months ago, the Splunk universal forwarder was pushed out to all of my Windows machines. I am fairly certain that it was pushed out using our patching solution "BigFix".
Fast forward to today. I am receiving data from about 150 hosts. Unfortunately, I should be receiving data from closer to 350. My domain controllers are included in the list of the systems that are not forwarding data. The guy before me decided to set up a heavy forwarder, something about blowing through our license. I haven't looked into the heavy forwarder too much, but I'm assuming that it's working since half of the hosts are getting through to the indexer.
1 - So far I've compared the local/inputs.conf and the local/server.conf on the working system and the not-working system. According to the guy who did the install, those are the only files that he touched after the install. On each of the systems both the local/inputs.conf and the local/server.conf files are basically identical.
2 - Also, on the not-working system and the heavy forwarder I've run netstat -an to verify that the two systems are establishing a connection with each other.
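On Windows, that check can be narrowed to the receiving port (9997 is Splunk's default forwarding port, assumed here; yours may differ):

```
REM On the universal forwarder host: look for an ESTABLISHED session
REM to the heavy forwarder/indexer on the receiving port (9997 assumed)
netstat -an | findstr ":9997"
```

An ESTABLISHED line here only proves the TCP session exists; it says nothing about whether data is actually being accepted on the other end.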
3 - I've dug through var/log/splunk/splunkd.log on both the working and the non-working system, and I didn't see anything obvious that would indicate what is wrong on the non-working system.
4 - I've spent hours making changes to the inputs.conf and the server.conf, then restarting the Splunk forwarder service, to no avail.
Where else can I look, and what else can I do, to figure out why only half of my systems are able to forward events to the indexer while the other half cannot?
Any and all help would be greatly appreciated.
This is one of those questions that could take days to solve with dedicated resources.
You need to ensure there are zero ERROR or WARN* messages occurring in the splunkd.log files on the non-working forwarders and your indexers. Something as inconspicuous as "warning: can't find saved search: blah" can stop Splunk in its tracks.
You could have networking issues. Telnet, Ping, Nslookup... all great tools for troubleshooting network issues.
You could have firewall issues. Telnet, Ping, Nslookup... all great tools for troubleshooting firewall issues.
./splunk cmd btool outputs list --debug <-- great for outputs.conf issues; run on forwarders (heavy, light, universal)
./splunk cmd btool server list --debug <-- great for server.conf issues; run on indexers
./splunk cmd btool inputs list --debug <-- great for inputs.conf issues; run on all Splunk machines. Indexers use this for SSL config; other machines use it to specify data inputs.
You could have a permissions issue. Check the service account the universal forwarder is running as. Does it have read permission on the data you're trying to read? Will Group Policy apply? If so, does it? etc.
Another common mistake is broken stanza names... copy and paste sometimes drops a square bracket and half a stanza name, etc.
If you're a victim of this, you'll usually find all your data in index=_internal, and everything after the broken stanza will have the same sourcetype, index, etc. because of that one malformed stanza name somewhere in inputs.conf.
[batch://path/to/file]
...
monitor://path/to/file]
...
For example, the above would make everything after the monitor stanza (note its missing opening bracket) end up in the wrong place.
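For contrast, the same two inputs with well-formed stanza headers (the paths here are just placeholders) would look like:

```
[batch://path/to/file]
...

[monitor://path/to/file]
...
```

With both brackets in place, each stanza's settings apply only to its own input.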
Post the results of the above and we may be able to help you further. Cheers.
Boy am I glad to hear you say it could take days of dedicated resources to solve this... Now I don't feel so bad dedicating days to troubleshooting!
Let me knock the easy ones down quick.
Network issues? Nope. Everything is working fine between them: ping, nslookup, RDP, BigFix, HBSS, etc.
Firewall issues? I double-checked that today. I got into my McAfee HB-IPS and looked through the logs. I see the traffic, and it's all being allowed through.
Corrupt universal forwarder installation? Nope. You didn't suggest it, but I was out of good ideas, so I tried uninstalling and reinstalling the universal forwarder... No change.
My host was pointing to my heavy forwarder. So I pointed it right to the indexer to take it out of the equation... No change.
A Co-worker googled for 5 minutes and found a document that addressed troubleshooting forwarders... http://docs.splunk.com/Documentation/Splunk/6.3.0/Troubleshooting/Cantfinddata#Are_you_using_forward...
I ran the following search string from the troubleshooting page on my indexer and my heavy forwarder:
index=_internal source=metrics.log* tcpin_connections sourceIp=*
I am seeing several matching entries twice every minute.
So, I feel like the host's universal forwarder is able to communicate with the heavy forwarder / indexer... Just need to figure out why the data is being ignored.
Oh yeah, two simple items that probably matter: yes, I am using Splunk Enterprise (not Free), and no, I have not exceeded my license.
I will run the commands you suggested first thing in the AM when I get to work, and I'll post the results!
Thanks for the pointers, I am looking forward to making Splunk work... properly!
Ok... So I'm fairly certain I've isolated the problem. Now I just need to figure out why it's happening, and fix it.
Searches were slow, so I bumped the CPU count up from 2 to 4 and the RAM from 4 GB to 8 GB. When I rebooted and logged back into Splunk Web, I noticed an error message: "received event for unconfigured/disabled/deleted index='wineventlog'." So I created a new index called 'wineventlog'... and called it a day.
Fast forward to today, after 48 hours there are 2.74 million events in the 'wineventlog' index. The events that are being logged are coming from at least 45 of the missing systems.
I'm not sure why some of my hosts are pointing their Windows event logs to the 'main' index and some are pointing them to 'wineventlog'...
Now I'm trying to figure out what *.conf file was modified to cause these events to be indexed in their own index.
The document linked below covers how to set up multiple indexes. Hopefully I can reverse that process to get all events indexing in the main index.
Ok, so it looks like the problem is on the individual hosts, in the universal forwarder.
Several of the stanzas in 'SplunkUniversalForwarder/etc/apps/SplunkTAWindows/default/inputs.conf' include an 'index = wineventlog' setting that is sending all those events to the index that didn't exist until a couple of days ago.
So... what's the best way to fix this? I've got a feeling the best solution will be to stand up a Splunk deployment server to manage the universal forwarder's configuration on all the hosts. Considering that the universal forwarders are not currently pointed to a deployment server, that seems like it could be a tedious task.
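If I do end up touching each host directly, one way to avoid editing the app's default/ directory is to override the setting in local/, which Splunk gives precedence over default/. A sketch, assuming the app path above and a typical Windows event log stanza name (the stanza names must match the ones actually present in the default inputs.conf):

```
# SplunkUniversalForwarder/etc/apps/SplunkTAWindows/local/inputs.conf
# Settings in local/ override default/ for matching stanza names.
[WinEventLog://Security]
index = main

[WinEventLog://System]
index = main
```

Overriding in local/ also survives app upgrades, since default/ gets replaced while local/ is left alone.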
Ah so you've caught the inputs.conf issue. Hopefully you used some form of this command I gave you to figure that out. No?
./splunk cmd btool inputs list --debug
Can you mark my answer as the correct one?
Your problem is now solved. You now know the inputs.conf files on 45 forwarders specify a different index, wineventlog. So now you must change those conf files and restart the forwarders. Note: if any are heavy forwarders, they can be "refreshed" via an API call instead of restarting the entire service (if you even care; if you do... just ask 😉).
If you have more than a handful of forwarders you absolutely need a deployment server... or at least you should... maybe you don't... but I would demand one if I had that many forwarders to babysit.
Check this link out and do mark my answer above as the most excellent please!
How are you validating that not all hosts are sending data? Do you see data being received on the internal logs from all hosts or just a limited subset?
index=_internal | stats count by host
Validate your connecting hosts.
Other than that, you need to work from a host that should be sending but is not. Troubleshoot that host, and most likely the fix will be applicable to all your "not-working" hosts. Do this methodically.
1) Check that Splunk is running and has valid system permissions
2) Check your outputs are pointing to the right HF / IDX (splunk btool outputs list --debug and splunk list forward-server)
3) Validate network connectivity to the Splunk ports from the host to its HF / IDX (telnet / nc to 9997)
4) Validate inputs on your box (splunk btool inputs list --debug)
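On a Linux-style forwarder, those four checks might look like the following (the heavy forwarder hostname and the 9997 port are assumptions; substitute your own):

```
# 1) Is splunkd running?
$SPLUNK_HOME/bin/splunk status

# 2) Where is this forwarder configured to send data?
$SPLUNK_HOME/bin/splunk btool outputs list --debug
$SPLUNK_HOME/bin/splunk list forward-server

# 3) Can we reach the receiver's listening port?
nc -vz my-heavy-forwarder 9997

# 4) Which inputs are in effect, and which index do they target?
$SPLUNK_HOME/bin/splunk btool inputs list --debug | grep -i "index ="
```

The --debug flag makes btool print which file each setting came from, which is exactly what you want when hunting a stray 'index =' line.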
Per best practices, with a large number of hosts you should use a deployment server; this will help ensure uniform configurations across your environment. If you have to edit system/local/* every time, that's a huge pain and leaves a lot of room for error.
Post your results from btool and network testing, we can recommend more options.