Hi,
we have Splunk (v9.2) in a clustered environment that manages tons of different logs from a complex and varied network.
A few departments each have a Sophos firewall that sends its logs via syslog (we would have used a UF, but IT security can't touch those servers).
To split the inputs by source type, we pointed those Sophos logs at port 513 on one of our HFs and created an app that parses them with regexes.
The goal was to reduce the logs and save license usage.
So far, so good... Everything was working as intended... Until...
As it turns out, every night at exactly midnight the Heavy Forwarder stops collecting from those sources (only those) and nothing gets indexed until someone restarts the splunkd service (which might never happen), which brings the collector back to life.
Here's the odd part: while collection is down, tcpdump shows syslog data still arriving on port 513, so the firewall never stops sending to the HF, yet nothing gets indexed. Only after a restart do we see logs being indexed again.
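For reference, the kind of capture that shows the packets still arriving is something like this (interface and packet count are just examples):
sudo tcpdump -nn -i any udp port 513 -c 20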
The Heavy Forwarder in question runs on Ubuntu 22.04 LTS (minimized server edition).
Here are the app configuration files:
- inputs.conf
[udp:513]
sourcetype = syslog
no_appending_timestamp = true
index = generic_fw
- props.conf
[source::udp:513]
TRANSFORMS-null = nullQ
TRANSFORMS-soph = sophos_q_fw, sophos_w_fw, null_ip
- transforms.conf
[sophos_q_fw]
REGEX = hostname\sulogd\[\d+\]\:.*action=\"accept\".*initf=\"eth0\".*
DEST_KEY = queue
FORMAT = indexQueue
#
[sophos_w_fw]
REGEX = hostname\sulogd\[\d+\]\:.*action=\"accept\".*initf=\"eth0\".*
DEST_KEY = _MetaData:Index
FORMAT = custom_sophos
#
[null_ip]
REGEX = dstip=\"192\.168\.1\.122\"
DEST_KEY = queue
FORMAT = nullQueue
We didn't see anything out of the ordinary in the processes that start at midnight on the HF. At this point we have no clue what's happening. How can we troubleshoot this situation?
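The checks we ran for midnight-scheduled jobs were roughly along these lines (standard Ubuntu locations):
sudo crontab -l
ls /etc/cron.d /etc/cron.daily /etc/cron.hourly
systemctl list-timers --all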
Thanks
Stopping at exactly midnight every night is strange - sounds like gremlins are out to play!
A few things you can check that may give you some clues. (And as you have already noted, it's always better to use UFs/SC4S rather than sending syslog directly to Splunk; direct ingestion is really for small environments/POCs, etc.)
As the HF is a full instance that parses and forwards data, it's worth having a look at TcpOutputProc messages in splunkd.log - or, from the search bar: index=_internal sourcetype=splunkd host=<your_hf> (log_level=WARN OR log_level=ERROR) TcpOutputProc
Otherwise, check for any ERRORs on the HF in general.
You might find some clues around timeouts, queues filling up, or an invalid configuration.
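A quick check for full queues is the blocked=true events in metrics.log; something like this (adjust the host filter) will show whether any queue starts blocking right at midnight:
index=_internal source=*metrics.log* sourcetype=splunkd host=<your_hf> group=queue blocked=true
| timechart span=5m count by name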
Perhaps increase the log level for that component on the HF (this can also be done via the GUI):
/opt/splunk/bin/splunk set log-level TcpOutputProc -level DEBUG
Remember to turn it back down afterwards (again, also possible via the GUI):
/opt/splunk/bin/splunk set log-level TcpOutputProc -level INFO
You could also check performance - memory, CPU, and disk - and gather some stats (a few example commands follow the server.conf example below). I have seen HFs used as syslog receivers with large volumes of streaming data stop functioning under load, though in those cases it happened at varying times. Optionally, if you have enough memory and resources don't turn out to be the issue, you could try increasing the in-memory queue size in server.conf on the HF and see whether that helps.
example:
[queue]
maxSize = 5000MB
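(That stanza typically goes in $SPLUNK_HOME/etc/system/local/server.conf on the HF and needs a splunkd restart to take effect.)
For the resource stats, the usual Linux commands are enough to rule things out around midnight (iostat needs the sysstat package):
top -b -n 1 | head -20
free -m
df -h
iostat -x 5 3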
I have also seen a case where a vulnerability scanner stopped Splunk from responding at regular intervals.
Unfortunately, I tried all of the above but, frustratingly, got no result. The logs around midnight looked no different, so we never managed to find out what was wrong.
I finally found a (NOT)solution:
we revived an old ELK server and now send the logs through Logstash into Splunk.
This way there are no gaps in the logs, and it is working right now.
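For anyone curious, the Logstash pipeline is roughly of this shape - treat it as a sketch, since the HEC endpoint, the token, and even the use of HEC on the Splunk side are placeholders/assumptions rather than our exact config:
input {
  udp {
    port => 513
    type => "sophos"
  }
}
output {
  http {
    url => "https://<splunk_host>:8088/services/collector/raw"
    http_method => "post"
    headers => { "Authorization" => "Splunk <hec_token>" }
    format => "message"
    message => "%{message}"
  }
}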
We plan to come back to this whenever the other team installs the new version of Sophos, to see whether anything changes.