I have read in various places about "cooking" logs before sending them to a Splunk Enterprise instance. I'm curious to know if a Heavy Forwarder is an optimal solution for my team.
To give some background, my company has a department that handles the main Splunk environment. They have set up a deployment server that other departments can subscribe to in order to send their data to the environment; however, they limit users' actions and do not allow sensitive log information to be sent to them. They also don't easily allow HTTP Event Collection.
We are considering a heavy forwarder in order to transform the data AND handle extractions **AND** handle HTTP Event Collection before it is indexed by the Splunk environment; however, we have a few questions regarding this. I read that heavy forwarders perform "pre-indexing extractions," meaning they write the fields, instead of Splunk doing "post-index extractions" (reads). From my understanding, Splunk normally applies extractions at search time instead of modifying the logs themselves, but does the heavy forwarder modify the logs themselves?
How much overhead are my team and I realistically looking at if we wanted to configure a heavy forwarder to handle transformations and extractions? On another note, does the heavy forwarder allow us to use the "Regex Tool"?
For trial purposes, can I install a heavy forwarder on the same Windows Machine that my current demo enterprise is on?
Thank you everyone!
thomastaylor,
Let's break this down a bit...
HTTP Event Collector:
A Heavy Forwarder is a great option here. You can manage the token and receive HEC inputs on the HWF without the main Splunk install needing to do anything. As the data is JSON, you'll also get your field extracts "for free" from autokv.
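To make that concrete, here's a minimal sketch of posting a JSON event to a HEC token hosted on the HWF. The hostname and token are placeholders, and 8088 is just the default HEC port; because the event body is JSON, autokv will surface its keys as fields at search time:

```python
import requests

# Placeholders - substitute your HWF's hostname and the token you created on it.
HEC_URL = "https://my-heavy-forwarder:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

# JSON event body: autokv should give you fields like "user" and "action"
# at search time with no extra extraction config.
payload = {"event": {"user": "thomastaylor", "action": "login", "status": "success"}}

resp = requests.post(
    HEC_URL,
    headers={"Authorization": f"Splunk {HEC_TOKEN}"},
    json=payload,
    verify=False,  # only if the HWF is using a self-signed certificate
)
resp.raise_for_status()
print(resp.json())  # typically {"text": "Success", "code": 0}
```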
Transforming data:
Yes, you can use a Heavy Forwarder for this, but I must caution that there are a number of pitfalls that come with using a HWF to "pre-parse" data before it hits the indexers.
You are unable to create "search time" field extracts with a Heavy Forwarder. The vast majority of TAs you'll find on Splunkbase are search time. Additionally, creating "index time" field extracts comes with a whole list of caveats (NOTE THE CAUTION WARNING: http://docs.splunk.com/Documentation/Splunk/7.1.1/Data/Configureindex-timefieldextraction). While possible, you're opening yourself up to a massive list of potential issues.
Also, no, the HWF will not let you use the Regex Tool - that's for search-time field extracts. You'd have to stand up a dev search head / indexer to build the extracts, then lift them AND convert them to index time. DO NOT RECOMMEND.
TLDR:
For HEC, I think it's a great use case for you. For everything else, I'd advise against it. I'd recommend attempting to fix the relationship with whoever owns your Splunk install. Otherwise you're setting your team and the Splunk owners up for potential issues down the road (and a bunch of up-front work for yourself, as nothing on Splunkbase will be plug and play).
@Beatus, thank you so much for this detailed response. You single-handedly just answered so many lingering questions that my team and I had.
Just to get your opinion on one more aspect:
Our Splunk department does allow for HTTP Event Collection; however, they are reluctant to open several different ports for several different applications. We have several applications that we would like to add to the Splunk instance, but we do not want to request that several different ports be opened. That's why we were thinking about creating a HWF. We wanted the logs to already be associated with their sourcetypes BEFORE they arrived at the Splunk department, so that they would not have to configure anything other than one port on their end.
With that in mind, we were also considering the question, "Should we also go ahead and handle extractions?" You have already convinced us that a huge NO applies to that question; however, what is your opinion on our scenario?
Ideally, we are still thinking about using the HWF as an HTTP Event Collector, but we may not need it after all if there's a way that the Splunk department doesn't have to field several requests to open up a ton of data inputs. Maybe we could specify the sourcetype on the API call?
PS. We don't want to have to install a UF on each application's server either. We want to leave as minimal a footprint as we can (even though I know the UF is already minimal).
If all your data is coming in via HEC, I'd use a load balancer. The Splunk department can open up access to their HEC receivers (I suspect they have more than one) and you can load balance to them. The HWF would still work here as well, and that would allow you to manage your own tokens (which may come in handy, see below).
There are a number of things you can do in HEC to set the sourcetype / index.
As part of the post payload:
You can set the host, index, source, sourcetype in the post (detailed here: http://dev.splunk.com/view/event-collector/SP-CAAAE6P). Basically, as part of the HTTP payload, you spell out all the meta you'd like associated with the data. This is the route I'd take (see the sketch below, after these two options).
As part of the HEC token:
When you configure the token, you can force settings such as index and sourcetype (http://docs.splunk.com/Documentation/Splunk/7.1.1/Data/HECWalkthrough#Create_an_HEC_token)
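To illustrate the payload route, here's a rough sketch. The index and sourcetype names, hostname, and token below are made up, and 8088 is just the default HEC port:

```python
import time
import requests

HEC_URL = "https://my-heavy-forwarder:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

# host/source/sourcetype/index/time ride along per event in the payload itself,
# so the receiving side only needs one open port and one token.
payload = {
    "time": time.time(),                 # event timestamp (epoch seconds)
    "host": "app01.example.com",
    "source": "orders-service",
    "sourcetype": "myapp:orders:json",   # hypothetical sourcetype name
    "index": "myapp",                    # hypothetical index name
    "event": {"order_id": 12345, "status": "shipped"},
}

resp = requests.post(
    HEC_URL,
    headers={"Authorization": f"Splunk {HEC_TOKEN}"},
    json=payload,
    verify=False,
)
resp.raise_for_status()
```

Note that an index named in the payload generally has to be on the token's list of allowed indexes, so you'd still coordinate index names with whoever owns the tokens.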
If you're writing code to integrate with HEC, I'd check out the SDKs from Splunk and some of the community efforts.
Ultimately it's just an HTTP post, but you're responsible for handling a rejected post in whatever manner you'd like (cache it? drop it on the floor? etc.).
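As a minimal sketch of one way to handle that - the spool path and retry policy here are just illustrative choices, not anything Splunk prescribes:

```python
import json
import time
import requests

HEC_URL = "https://my-heavy-forwarder:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"
SPOOL_FILE = "/var/tmp/hec_spool.jsonl"  # illustrative fallback location


def send_event(event: dict, retries: int = 3) -> bool:
    """Post one event to HEC; retry briefly, then spool to disk for later replay."""
    payload = {"event": event}
    for attempt in range(retries):
        try:
            resp = requests.post(
                HEC_URL,
                headers={"Authorization": f"Splunk {HEC_TOKEN}"},
                json=payload,
                timeout=5,
                verify=False,
            )
            if resp.status_code == 200:
                return True
        except requests.RequestException:
            pass  # network error - fall through to the backoff below
        time.sleep(2 ** attempt)  # simple exponential backoff

    # Receiver kept rejecting the post (or was unreachable): cache it instead of dropping it.
    with open(SPOOL_FILE, "a") as spool:
        spool.write(json.dumps(payload) + "\n")
    return False
```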
Happy to help! Please accept the answer if you feel it's solved all of your issues.
When you mention "HEC receivers", do you mean creating a new input source by generating a new token? Or are you talking about something different? If that is what you're referring to, can 113 different applications use the same token? Or is this bad practice?
@beatus
Once again, thank you so much for the clarification. I am going to accept your answer. How does the Heavy Forwarder work with the Splunk DB Connect app installed on it? Will pitfalls come from that as well? Or does it just grab the data from the DB and send it over to Splunk?
Great responses so far. If I could, I would submit a survey or something haha!
DBX is a perfect candidate for use on a HWF. It's my go-to route for getting data in via DBX. In fact, it's one of the documented ways to install it (http://docs.splunk.com/Documentation/DBX/3.1.3/DeployDBX/Architectureandperformanceconsiderations#Di...).
It's not without its downsides: when using a HWF, you're only going to be able to index data. Lookups and ad-hoc searching will not work unless DB Connect is also installed on the search heads.
I'd recommend HWFs for any scripted / modular inputs (AWS is a great example). I avoid co-locating those functions with a search head when at all possible.