Getting Data In

Does a Heavy Forwarder fit my needs?

thomastaylor
Communicator

I have read in various places about "cooking" logs before sending them to a Splunk Enterprise instance. I'm curious to know if a Heavy Forwarder is an optimal solution for my team.

To give some background, my company has a department that handles the main Splunk environment. They have set up a deployment server that other departments can subscribe to in order to send their data to the environment; however, they limit users' actions and do not allow sensitive log information to be sent to them. They also don't readily allow HTTP Event Collection.

We are considering a heavy forwarder to transform the data AND handle extractions AND handle HTTP Event Collection before it is indexed by the Splunk environment; however, we have a few questions. I read that heavy forwarders perform "pre-indexing extractions," meaning they write the extracted fields into the data, whereas Splunk normally does "post-index extractions" at read time. From my understanding, Splunk usually applies extractions at search time instead of modifying the logs themselves, but does the heavy forwarder modify the logs themselves?

How much overhead are my team and I realistically looking at if we wanted to configure a heavy forwarder to handle transformations and extractions? On another note, does the heavy forwarder allow us to use the "Regex Tool"?

For trial purposes, can I install a heavy forwarder on the same Windows machine that my current demo Enterprise instance is on?

Thank you everyone!

1 Solution

beatus
Communicator

thomastaylor,

Let's break this down a bit...

  • HTTP Event Collector:
    A Heavy Forwarder is a great option here. You can manage the token and receive HEC inputs on the HWF without the main Splunk install needing to do anything. As the data is JSON, you'll also get your field extractions "for free" from autokv.

  • Transforming data:
    Yes you can use a Heavy Forwarder for this. I must caution that there are a number of pitfalls that come with using a HWF to "pre-parse" data before it hits the indexers.

  1. Cooked data is larger on the network than uncooked data: https://www.splunk.com/blog/2016/12/12/universal-or-heavy-that-is-the-question.html - Some have theorized that unless you're doing a massive amount of index-time operations, the CPU load on the indexers is actually higher too (still an argument in the community, so take this with a grain of salt).
  2. Heavy Forwarders tend to cause data imbalance on indexers (they get sticky to the indexer they send to, due to not having a break in incoming traffic; a common problem for syslog boxes that use an HWF).
  3. The indexers are not given a second chance to parse the data - this means if your main Splunk install needs to do sourcetyping, index renaming, or host renaming, it will be unable to (well, there are some special things you can do to cheat here, but it's a bad idea).
  • Creating field extracts:

You are unable to create "search time" field extracts with a Heavy Forwarder. The vast majority of TAs you'll find on Splunkbase are search time. Additionally, creating "index time" field extracts comes with a whole list of caveats (NOTE THE CAUTION WARNING: http://docs.splunk.com/Documentation/Splunk/7.1.1/Data/Configureindex-timefieldextraction). While possible, you're opening yourself up to a massive list of potential issues. To name a few:

  1. Greater storage requirements (index time fields are stored in the TSIDX files, uncompressed)
  2. Lack of flexibility (Once a field is written, it's "burnt" into the index)
  3. Potentially extreme CPU overhead at the HWF level
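To make the caveats above concrete, an index-time extraction on an HWF is wired up across props.conf and transforms.conf roughly like this (a minimal sketch; the sourcetype, field name, and regex are hypothetical):

```ini
# props.conf on the HWF -- hypothetical sourcetype
[my:app:logs]
TRANSFORMS-extract_user = extract_user_indexed

# transforms.conf -- WRITE_META = true writes the field into the index
[extract_user_indexed]
REGEX = user=(\w+)
FORMAT = user::$1
WRITE_META = true
```

You'd also need a matching fields.conf entry (INDEXED = true) on the search side so Splunk treats the field as indexed; once events are written this way, the field is "burnt" in as described above.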

Also, no, the HWF will not let you use the regex tool - that's for search-time field extracts. You'd have to have a dev search head / indexer for it, lift the extracts, AND convert them to index time. DO NOT RECOMMEND.

TLDR:
For HEC, I think it's a great use case for you. For everything else, I'd advise against it. I'd recommend attempting to fix the relationship with whoever owns your Splunk install. You're setting your team and the Splunk owners up for potential issues down the road (and a bunch of up-front work for yourself, as nothing on Splunkbase will be plug and play).


thomastaylor
Communicator

@beatus, thank you so much for this detailed response. You single-handedly answered so many lingering questions that my team and I had.

Just to get your opinion on one more aspect:

Our Splunk department does allow HTTP Event Collection; however, they are reluctant to open several different ports for several different applications. We have several applications that we would like to add to the Splunk instance, but we do not want to request that several different ports be opened. That's why we were thinking about creating an HWF. We wanted the logs to already be associated with their sourcetypes BEFORE they arrived at the Splunk department so that they would not have to configure anything other than one port on their end.

With that in mind, we were also considering the question, "Should we also go ahead and handle extractions?" You have already convinced us that a huge NO applies to this question; however, what is your opinion on our scenario?

Ideally, we are still thinking about using the HWF as an HTTP Event collector, but we may not need to use it after all if there's a way that the Splunk department does not have to get several requests to open up a ton of data inputs. Maybe we could specify the sourcetype on the API call?

PS. We don't want to have to install a UF on each application's server either. We want to leave as minimal a footprint as we can (even though I know the UF is minimal already).


beatus
Communicator

If all your data is coming in via HEC, I'd use a load balancer. The Splunk department can open up access to their HEC receivers (I suspect they have more than one) and you can load balance to them. The HWF would still work here as well, and that would allow you to manage your own tokens (which may come in handy, see below).

There are a number of things you can do in HEC to set the sourcetype / index.
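For example, the sourcetype and index can be set per event in the HEC payload itself, which is how one token and one port can carry many applications. A minimal sketch (the token, sourcetype, and index values are hypothetical):

```python
import json

def build_hec_payload(event, sourcetype, index=None, host=None):
    """Build a Splunk HEC JSON payload; sourcetype/index are set per event."""
    payload = {"event": event, "sourcetype": sourcetype}
    if index:
        payload["index"] = index
    if host:
        payload["host"] = host
    return payload

# HEC authenticates via an Authorization header -- hypothetical token
headers = {"Authorization": "Splunk 12345678-aaaa-bbbb-cccc-123456789012"}

payload = build_hec_payload({"msg": "user login"}, sourcetype="my:app:logs",
                            index="app_logs")
body = json.dumps(payload)  # POST this body to the /services/collector endpoint
print(body)
```

Each application can send to the same receiver and simply stamp its own sourcetype in the payload, so the Splunk department only opens one port.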

If you're writing code to integrate with HEC, I'd check out the SDKs from Splunk and some of the community efforts.

Ultimately it's just an HTTP POST, but you're responsible for handling the case where the receiver rejects your POST, in whatever manner you'd like (cache it? drop it on the floor? etc.).
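On the rejection-handling point, a retry-with-backoff wrapper is one common shape for this. A sketch, assuming the caller supplies a `send` callable (e.g. wrapping the actual HTTP POST) that returns True on a 2xx response:

```python
import time

def post_with_retry(send, event, max_attempts=3, backoff=1.0):
    """Try send(event) until it reports success, backing off between failures.

    If every attempt fails, the event is returned to the caller, who then
    decides whether to cache it or drop it on the floor.
    """
    delay = backoff
    for attempt in range(max_attempts):
        if send(event):
            return None  # delivered
        time.sleep(delay)
        delay *= 2  # exponential backoff
    return event  # undelivered; caller caches or drops

# Demo with a fake sender that fails twice, then succeeds.
calls = {"n": 0}
def flaky_send(event):
    calls["n"] += 1
    return calls["n"] >= 3

leftover = post_with_retry(flaky_send, {"msg": "hi"}, max_attempts=4, backoff=0.01)
print(leftover)  # None -- delivered on the third attempt
```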

Happy to help! Please accept the answer if you feel it's solved all of your issues.

thomastaylor
Communicator

When you mention "HEC receivers," do you mean going through and creating a new input source by generating a new token? Or are you talking about something different? If that is what you're referring to, can 113 different applications use the same token? Or is this bad practice?
@beatus


thomastaylor
Communicator

Once again, thank you so much for the clarification. I am going to accept your answer. How does the Heavy Forwarder work with the Splunk DB Connect app on it? Will pitfalls come from that as well? Or does it just grab the data from the DB and send it over to Splunk?

Great responses so far. If I could, I would submit a survey or something haha!


beatus
Communicator

DBX is a perfect candidate for use on an HWF. It's my go-to route for getting data in via DBX. In fact, it's one of the documented ways to install it (http://docs.splunk.com/Documentation/DBX/3.1.3/DeployDBX/Architectureandperformanceconsiderations#Di...).

It's not without its downsides - when using an HWF, you're only going to be able to index data. Lookups and ad-hoc searching will not work unless DB Connect is also installed on the search heads.

I'd recommend HWFs for any scripted / modular inputs (AWS is a great example). I avoid co-locating those functions with a search head when at all possible.
