Hello everyone,
We have more than 300 hosts sending syslog messages to the indexer cluster.
The cluster runs on Windows Server
All inputs.conf settings across the indexer cluster that relate to syslog ingestion look like this:
[udp://port_number]
connection_host = dns
index = index_name
sourcetype = sourcetype_name
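For reference, the possible connection_host values as I understand them from the inputs.conf spec:
- ip: use the sender's IP address as the host
- dns: reverse-resolve the sender's IP via DNS (what we use)
- none: leave the host set to the input's default host value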
So I expected to see no IP addresses in the host field when I ran searches
I created an alert to catch any message that still has an IP address in the host field.
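The alert is basically something like this (a simplified sketch; the index name and the regex are just illustrative):
index=index_name sourcetype=*_syslog
| regex host="^\d{1,3}(\.\d{1,3}){3}$"
| stats count by host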
But a couple of hosts have this problem
I know that PTR records are required for this setting, but we verified that the records exist.
When I run "nslookup <host_ip> <dns_server_ip>", everything looks OK.
I also cleared the DNS cache across the indexer cluster, but I still see this problem
Does Splunk have some internal logs that can help me identify where the problem is?
Or is my only option to capture a network traffic dump of the DNS queries?
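What I have in mind is something along these lines, but I don't know which splunkd components (if any) log DNS resolution problems, so the keywords here are just guesses:
index=_internal sourcetype=splunkd log_level!=INFO (dns OR resolve OR hostname)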
I would expect Splunk to keep some form of cache (you can't expect it to query DNS for every single incoming UDP packet; that would be silly). I also wouldn't bet that it doesn't have its own resolver independent from the OS (like Java does, for example).
Having said that:
1. Identifying hosts by name is usually more error-prone than using IPs
2. With syslog sources you often have transforms overwriting the host field with a value parsed from within the event (and that might affect your case as well; see the sketch after this list)
3. It's not a good idea to receive syslog directly on your indexers (or even forwarders). It's better to use an intermediate syslog daemon writing to files or sending to HEC (sc4s, or a properly configured "raw" syslog-ng or rsyslog).
4. As you say you're sending syslog to an "indexer cluster", I suspect you have some kind of LB in front of those indexers. That's usually not a good idea. Typical load balancers don't handle syslog traffic (especially UDP) well.
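Regarding point 2, such a host-overriding transform usually looks roughly like this (a simplified sketch of the idea, not the exact stanzas shipped with Splunk; the names and the regex are illustrative):
props.conf
[your_syslog_sourcetype]
TRANSFORMS-syslog-host = syslog_host_override
transforms.conf
[syslog_host_override]
REGEX = \w{3}\s+\d+\s+\d+:\d+:\d+\s+([\w\.\-]+)\s
DEST_KEY = MetaData:Host
FORMAT = host::$1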
I agree with you and also suspect that Splunk has an internal resolver or cache, but I can't find any docs or Q&A that can help me find out more
1. I understand that, but we need to see hostnames instead of IPs because we are using Splunk as a log collector for different parts of our internal infrastructure. Using hostnames is more convenient because they are human-readable.
2. If I understand Splunk correctly, it has a pre-defined [syslog] stanza in props.conf and a related [syslog-host] stanza in transforms.conf. But in my particular situation, none of the sourcetypes match the built-in syslog pattern because they all have names like *_syslog. My transforms.conf also doesn't contain any entries related to host overriding.
3 and 4. I know, but we decided against using a dedicated syslog server for several reasons, such as fault tolerance and the desire to make the "log ingestion" system less complicated. Thank you for your advice.
Yes, the default syslog sourcetype calls the transform you mention, but as far as I remember there are more apps that bring similar extractions with them.
And I still advocate for an external syslog receiver. This way you can easily (compared to doing it with transforms) control what you're indexing from which source and so on. Also, "fault tolerance" in the case of a not-syslog-aware LB is... debatable. But hey, that's your environment 😉
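By the way, if you want to see which props/transforms settings actually apply to a given sourcetype on an indexer, btool should show you the merged configuration (run it on the indexer itself; the sourcetype name is a placeholder):
$SPLUNK_HOME/bin/splunk btool props list <your_sourcetype> --debug
$SPLUNK_HOME/bin/splunk btool transforms list --debug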
I also understand that apps can do similar extractions, but there are no apps related to the sourcetypes we are talking about.
As for an external syslog receiver, maybe in the future 😃
At present, we ingest and index literally everything because we don't know what information we will actually need to resolve a problem.
Can you tell me a little more about a "not-syslog-aware" LB? What do you mean?
Our LB does the following:
- monitors the indexers via the health API endpoint of each indexer
- if one or more are down for some reason, the LB selects another healthy instance
- spreads syslog messages across all IDXC members to avoid "data imbalance" - our approach is debatable but it works 😉
- for various reasons, we also override the source port and protocol (some systems don't support UDP, so we convert their traffic to UDP to avoid return TCP traffic)
The first sin is "monitor by health API" - it doesn't tell you anything about the availability of the syslog input.
But from your description it seems that your LB is at least a bit syslog-aware (if you're able to extract the payload and resend it as UDP, that's something). What is it, if you can share that information?
When we built our Splunk environment, I checked the Splunk docs for information on how to tell whether a single indexer is functioning properly.
I may be mistaken, but in this case I chose the indexer health (color) status.
The API endpoint is "bla bla bla/services/server/info/health_info"
If an indexer has a green or yellow status, the LB decides that the node is OK.
If an indexer has a red status, the LB decides that the node is not OK and selects another one.
What if someone mistakenly disables the UDP input? Just the first example off the top of my head.
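If you want a lower-level check of whether data is still arriving per host, the internal metrics can help; roughly something like this (a sketch from memory, the group and field names are worth double-checking):
index=_internal source=*metrics.log* group=per_host_thruput
| stats latest(_time) as last_seen by series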
For this situation, we have a weekly alert that shows "missing hosts":
| tstats latest(_time) as latest where NOT index=main AND NOT index="*-summary" earliest=-30d by index, host
| eval DeltaSeconds = now() - latest
| where DeltaSeconds>604800
| eval LastEventTime = strftime(latest,"%Y-%m-%d %H:%M:%S")
| eval DeltaHours = round(DeltaSeconds/3600)
| eval DeltaDays = round(DeltaHours/24)
| join index
[| inputlookup generated_file_with_admins_mails.csv]
| table index, host, LastEventTime, DeltaHours, DeltaDays, email_to
Using the sendresults app, this alert notifies the responsible employee(s) about these hosts.
Right now this search shows only hosts that haven't sent syslog for more than 7 days, and that's OK for us.
In most cases, this alert shows only hosts that we removed from our infrastructure 😉
But if necessary, I can run this alert more frequently or split it into several searches with different "missing" conditions.
I understand that this approach cannot catch, for example, intermittent network or software lags, but I have been using it for about a year and everything has been fine, except for some rare cases (like this topic).
Sure. Whatever floats your boat 🙂
But seriously - it's like ITIL - adopt and adapt. If something works for you and you are aware of your approach's limitations - go ahead.
I really appreciate your advice. Thank you for the discussion 🙂
Hello, you should check the DNS records on your server; I'm not sure the internal logs can help.
In the worst case, use this example:
props.conf
[host::<IP address>]
TRANSFORMS-<hostname>=<hostname>_override
transforms.conf
[<hostname>_override]
REGEX = (.*)
DEST_KEY = MetaData:Host
FORMAT = host::<FQDN>
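For example, with made-up values (the IP and the FQDN below are just placeholders):
props.conf
[host::10.0.0.5]
TRANSFORMS-force_hostname = myhost_override
transforms.conf
[myhost_override]
REGEX = (.*)
DEST_KEY = MetaData:Host
FORMAT = host::myhost.example.com
These settings have to go on the instances that parse the data (your indexers in this case) and, as far as I remember, need a restart to take effect.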
I have checked the DNS records many times.
Also, thank you for your advice, but it is not a solution, just a workaround 😃