I have a cloud-based server sending events to the Indexer over my WAN link via Http Event Collector (HEC). We have limited bandwidth on the WAN link. I want to limit (blacklist) a number of event codes and reduce the transfer of log data over the WAN.
Q: Does a blacklist in inputs.conf for the HEC filter the events at the indexer, or does it stop those events from being transferred at the source?
Q: If I install a Universal Forwarder, am I able to stop the blacklisted events from being sent across the WAN?
Configuration elements take effect on the component where they are defined (but they may have additional impact on other functionality due to mutual dependencies - for example, lowering the output bandwidth on a forwarder can affect the input rate on some inputs; you can't slow down inputs working in "push" mode - you can only drop events if the queue is full).
So if you were to configure your HEC input to blacklist something, that would be working on the HEC input, not on other components.
Having said that - what do you mean by blacklisting on the HEC input? I don't recall any setting for filtering/blacklisting events on an HTTP input. The closest thing to any filtering on a HEC input would be the list of SANs allowed to connect, and that's it.
Even if you wanted to filter on the source forwarder, remember that filtering applies only to specific types of inputs - Windows Event Log inputs can filter and ingest only some events, and file monitor inputs can filter and ingest only certain files (still no event-level filtering).
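For illustration, the input-level filtering that does exist on a UF for a Windows Event Log input looks roughly like this in inputs.conf (a sketch; the event codes are just examples):

```ini
# inputs.conf on the UF - applies ONLY to Windows Event Log inputs.
# Blacklisted events are dropped on the host, before crossing the network.
[WinEventLog://Security]
disabled = 0
# Drop these event codes at the source (example codes, adjust to your noise)
blacklist = 4662,5156
```

There is no equivalent event-level setting for file monitor or HEC inputs.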
Maybe you could implement some form of filtering on the UF if you enabled additional processing on the UF itself, but that is not very well documented (hardly documented at all, to be honest) and turning that option on is not recommended.
So if you wanted to filter events before sending them downstream, you'd most probably need a heavy forwarder (HF), which would do the parsing locally, filter some events out, and then send the rest across your WAN link. But here we have two issues:
1) While it is called "http output", the forwarder doesn't use "normal" HEC to send events downstream; it uses S2S tunnelled over an HTTP connection. It's a completely different protocol.
2) An HF parses data locally and sends it on as parsed data, not just cooked. Unfortunately, that means it sends a whole lot more data than a UF normally does, because a UF sends the data merely cooked.
So "limiting" your bandwidth usage by installing an HF and filtering the data before sending might actually have the opposite effect: even though you might be sending fewer events (because some have been filtered out), you might be sending more data altogether (because you're sending parsed data instead of just a cooked stream).
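For completeness, event-level filtering on an HF (or on the indexer itself) is normally done with props and transforms routing unwanted events to nullQueue. A minimal sketch, assuming a hypothetical sourcetype and example event codes:

```ini
# props.conf - bind the transform to a sourcetype
[my:sourcetype]
TRANSFORMS-drop_noise = drop_noisy_eventcodes

# transforms.conf - matching events are sent to nullQueue (discarded)
[drop_noisy_eventcodes]
REGEX = EventCode=(4662|5156)
DEST_KEY = queue
FORMAT = nullQueue
```

Matching events are discarded during parsing and never reach the index.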
Depending on the data you want to ingest, you might consider other options on the source side - if the events come from syslog sources, you could set up a syslog receiver that filters the data before passing it to Splunk; if you have files, you could preprocess them with an external script. And so on.
What I can say is I have nowhere near your understanding of Splunk operations. I do appreciate your input.
I am taking my limited understanding of our wholly-UF-to-Indexer environment, and applying what I know to solve the issue of reducing cloud-to-on-prem traffic over the WAN link from our new SaaS solution. I keep a very low daily transfer rate (and licensing rate) in our on-prem environment by blacklisting noise, and whitelisting the key events we want to track.
I have no rights on the source machines, and I cannot install a UF, or anything for that matter. LogStash is the only option provided - which I assume requires HEC to receive the logs. I have read that HEC supports white/black listing - which is where my question came from.
HEC on its own doesn't have filtering abilities. You can filter events after receiving them (on any input) using props and transforms but that doesn't change what you're sending over your WAN link.
Your question is fairly generic and we don't have a lot of details about your environment so the answer is also only really generic in nature.
Anyway, ingesting events into Splunk using Logstash might prove complicated unless you properly prepare your data in Logstash to conform to the format normally produced by standard Splunk ingestion methods (otherwise you'd need to put in some work to properly extract fields from such Logstash-formatted events). But Logstash should give you the ability to filter the data before sending it over HTTP to the HEC input.
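As a sketch of what source-side filtering in Logstash could look like (the field name, event codes, URL and token are all placeholders, and the events would still need to be shaped into the HEC event envelope, e.g. via the http output's mapping option):

```
# logstash.conf (sketch)
filter {
  # Drop unwanted event codes before anything leaves the host
  if [event_id] in [4662, 5156] {
    drop { }
  }
}

output {
  http {
    url         => "https://splunk.example.com:8088/services/collector/event"
    http_method => "post"
    format      => "json"
    headers     => { "Authorization" => "Splunk 00000000-0000-0000-0000-000000000000" }
  }
}
```

Dropped events never cross the WAN, which is the bandwidth-saving behaviour you're after.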
I apologize for being vague. Was just trying to stick to the point.
The source (LogStash) cloud server is CentOS and we have zero access or control beyond the initial setup happening over the next few days. We are not permitted to install ANY software on this server as it is externally hosted and is locked down. I plan to try and force the issue of a UF install, but I expect to be unsuccessful. In which case, LogStash is all we have.
My entire environment is 60-70 UF's to an on-prem Indexer. I have no LogStash or HEC experience. I have a bad feeling about this...
The overall difficulty of this whole exercise will depend on your Logstash configuration and the use case. If you have just one sourcetype to ingest, you can probably do it relatively reasonably. But if you want to send multiple sourcetypes over a single connection, that can be tricky to separate on the receiving side. You could send multiple sourcetypes using multiple tokens so they are received into separate indexes/with separate sourcetypes, but it's getting complicated and - as I said before - needs proper configuration on the Logstash side.
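The multiple-token approach would look something like this in inputs.conf on the receiving side (a sketch; token values, index names and sourcetypes are placeholders):

```ini
# inputs.conf on the Splunk instance hosting HEC
[http]
disabled = 0
port = 8088

# One token per data stream, so each lands in its own index/sourcetype
[http://logstash_app]
token = 11111111-1111-1111-1111-111111111111
index = app_logs
sourcetype = app:json

[http://logstash_os]
token = 22222222-2222-2222-2222-222222222222
index = os_logs
sourcetype = linux:syslog
```

Logstash would then need a separate output (or conditional outputs) per token.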
Anyway - it's still up to logstash to filter events before sending.
@rob_gibson - You need to filter on the source which is generating the data, and not send that data to Splunk HEC at all.
Alternatively, not having the data source send to Splunk HEC directly would be a simple solution.
I hope this helps!!!
"You must use services/collector/raw endpoint of Splunk HEC for data filtering to work."
This is not entirely true. In fact it's not true at all 😉
But seriously, while the /event endpoint does skip some parts of the ingestion pipeline and you can't affect line breaking or timestamp recognition (with exceptions) this way, your normal routing and filtering by means of transforms modifying _TCP_ROUTING or queue works perfectly OK.
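For example, a transform like the following, bound to the relevant sourcetype in props.conf, works on events received via the /event endpoint (the stanza names, sourcetype, event code and output group here are all illustrative):

```ini
# transforms.conf - route matching events to a different output group
[route_to_audit_group]
REGEX = EventCode=4624
DEST_KEY = _TCP_ROUTING
FORMAT = audit_indexers

# props.conf
[my:sourcetype]
TRANSFORMS-route = route_to_audit_group
```

Here `audit_indexers` would be a tcpout group defined in outputs.conf; swapping DEST_KEY to `queue` with FORMAT `nullQueue` gives you filtering instead of routing.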
@PickleRick - You must be right. I know it's complicated to tell what will or won't execute for a given HEC endpoint, so I would avoid it altogether and filter early, directly at the source, when using HEC.
It ain't that bad 🙂