Getting Data In

Why would I use the HTTP Event Collector when I can use TCP?

Graham_Hanningt
Builder

(This question encompasses single-instance Splunk installations and multisite indexer clusters.)

I'm working on a platform that does not have a Splunk Universal Forwarder. I want to send events to Splunk over an IP network. I don't want to use UDP.

I am successfully sending events in JSON format to a single Splunk instance via the HTTP Event Collector (EC) and TCP. So I'm already familiar with some of the differences between EC and TCP inputs.

For example, the EC protocol enables you to specify event time and source type as metadata, whereas using TCP involves configuring timestamp recognition and overriding source type per event (in .conf files).

So, that's one answer: EC separates metadata from data. Whereas, with TCP, you have to embed the time stamp and (if you want to send multiple source types to the same TCP port) source type as fields in the event data. (I've already bleated about this in the question "Can I use the HTTP Event Collector JSON event protocol for TCP inputs?".)
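To make the metadata-versus-data distinction concrete, here is a minimal Python sketch of the two shapes the same event might take. The `time`, `sourcetype`, and `event` keys are the ones the HEC JSON event protocol uses; the `event_time` and `event_sourcetype` field names in the TCP variant are hypothetical, chosen only for illustration, and would need matching timestamp and source type rules in the .conf files.

```python
import json


def hec_envelope(event, epoch_time, sourcetype):
    """HEC style: metadata sits beside the event, not inside it."""
    return json.dumps({"time": epoch_time, "sourcetype": sourcetype, "event": event})


def tcp_line(event, epoch_time, sourcetype):
    """TCP style: metadata is embedded as ordinary fields of the event.

    The field names below are hypothetical; Splunk would have to be
    configured (in .conf files) to recognize them.
    """
    record = {"event_time": epoch_time, "event_sourcetype": sourcetype, **event}
    return json.dumps(record) + "\n"  # newline-delimited, one event per line


event = {"my_field": "some_value"}
hec_body = hec_envelope(event, 1462881000, "my_json")
tcp_body = tcp_line(event, 1462881000, "my_json")
```

In the first form the event payload stays untouched; in the second, the receiver has to pick the metadata back out of the data.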

Another answer: using EC - HTTP - means you get a response (in JSON) that reports the success or failure of the request.
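As a rough sketch of what handling that response could look like on the sending side, the Python below builds a request for the `/services/collector/event` endpoint and interprets a reply body. The base URL and token are placeholders, and the helper names are made up for this example; the check assumes the success reply carries `"code": 0`, as in `{"text": "Success", "code": 0}`.

```python
import json
import urllib.request


def build_hec_request(base_url, token, envelope):
    """Prepare (but do not send) a POST to the HEC event endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/services/collector/event",
        data=json.dumps(envelope).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",  # HEC token authentication
            "Content-Type": "application/json",
        },
        method="POST",
    )


def hec_reply_ok(status, body):
    """Return True only for HTTP 200 with a JSON reply whose code is 0."""
    try:
        reply = json.loads(body)
    except ValueError:
        return False  # reply was not JSON at all
    return status == 200 and isinstance(reply, dict) and reply.get("code") == 0


# Placeholder URL and token, for illustration only.
request = build_hec_request(
    "https://splunk.example.com:8088",
    "00000000-0000-0000-0000-000000000000",
    {"event": {"my_field": "some_value"}},
)
```

Passing `request` to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch does not depend on a live server.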

However, neither of these answers is compelling to me.

In fact, while I want to know that there's a Splunk server listening - and I know that when I attempt to open a connection (which is why I don't want to use UDP) - I do not want to spend CPU time on the "sending" platform handling errors reported by Splunk. I'd prefer to capture and handle those errors via Splunk's own logging.
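The "fire and forget, but fail fast if nobody is listening" approach described above might look like the following Python sketch. The function name and the newline-delimited JSON framing are assumptions for illustration. The only feedback the sender acts on is whether the connection opens; `socket.create_connection` raises `OSError` (for example `ConnectionRefusedError`) if no listener accepts it.

```python
import json
import socket


def send_events_tcp(host, port, events, timeout=5.0):
    """Send events as newline-delimited JSON over one TCP connection.

    Raises OSError if the connection cannot be opened; nothing is read
    back from the receiver, so no per-event errors are handled here.
    """
    payload = "".join(json.dumps(e) + "\n" for e in events).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
    return len(payload)  # bytes handed to the network stack
```

A call such as `send_events_tcp("localhost", 6067, [{"my_field": "some_value"}])` would target a TCP input like the one configured later in this thread.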

(I have questions about that, that I might ask - in a separate question - here on Splunk Answers. As far as I can tell, Splunk does not log the details of individual EC request errors. For example, when I deliberately send badly formed JSON to EC, the data.num_of_parser_errors in the _introspection index for that time period has a value of 1, but I cannot find specific details of that error in any Splunk log... perhaps I'm just not looking in the right places, or perhaps I need to enable debug logging for some category, although I'd rather not do that for ongoing "production" use.)

I've read various Splunk blog posts and Splunk dev topics on EC (including "Introduction", "Walkthrough", and "Distributed deployment"), but I don't see any compelling reasons there to use EC when I can use TCP.

I'd be interested in results of high volume performance benchmark testing of EC versus TCP.

According to the Splunk docs topic "Getting Data In":

TCP ... is the recommended protocol for sending data from any remote host to your Splunk Enterprise server

While that recommendation pre-dates EC (the same text appears in pre-6.3 docs), it remains in the current (6.4) docs. Is it still true?

More broadly - outside of the specific context of Splunk - I've read discussions about using HTTP versus TCP. (When I write "versus", I know that, in this context, HTTP runs over TCP: that is, I can use a TCP client to open a connection to port 80 on a computer, send a "GET / HTTP..." request, and get the response.) Here, in this question, I'm specifically interested in using HTTP (EC) versus TCP for Splunk.

gblock_splunk
Splunk Employee

There's a bunch of reasons.

  1. There are many clients where TCP is not a viable option, such as sending from the browser.
  2. Scale. HEC is stateless and designed to easily scale out across a pool of instances behind a LB.
  3. Performance. We've heavily optimized HEC to handle 100K events or more per instance.
  4. Ease of use. HEC has really rich support for JSON out of the box, you don't have to mess with sourcetypes or bending over backwards with your JSON.
  5. Security. HEC's token based mechanism allows easily locking down which clients can send to it. With TCP you can lock things down as well but you end up messing with IP-ranges and such, or dealing with certs.

Graham_Hanningt
Builder

@gblock, thanks very much for weighing in on this question, much appreciated. I'd hoped to catch your attention.

I'm asking this question primarily on behalf of some developer colleagues who will soon be turning their attention to sending events to Splunk over an IP network. In particular, I want to present them with information to help them decide whether to use HEC or a TCP input. As mentioned in my question, I've already done some research, and have hands-on experience using both HEC and TCP inputs (albeit currently only on a small scale, on a single Splunk instance).

To recap: my question is "Why would I use HEC when I can use TCP?". That is, as further clarified in the details of the question, why would I choose to use HEC in situations where I can use either HEC or TCP?

I'm interpreting your answer in the context of that question.

Point by point:

  1. There are many clients where TCP is not a viable option, such as sending from the browser.

Yes, fair point. However - I sincerely don't mean to be adversarial or otherwise annoy you; I'm grateful for your time, and hope for more advice from you on your subsequent points - this point is not relevant to the specific context of this question, where TCP is a viable option.

Incidentally, and more or less just for fun, this morning I played around sending events from a web browser (Chrome) to a Splunk TCP input. Yes, really (and, yes, I do have better things to do ;-):

// POST a JSON event from the browser straight to the Splunk TCP input on port 6067
const xhr = new XMLHttpRequest();
xhr.open("POST", "http://localhost:6067");
xhr.send('{"my_field": "some_value"}');

with the following stanza in props.conf:

[source::tcp:6067]
KV_MODE = json
LINE_BREAKER = ((^[^{][^\n]*\r\n)*)\{\"[^}]+\}
SHOULD_LINEMERGE = false

The LINE_BREAKER is intended to ditch the multiline HTTP request header.

It kinda works: for each xhr.send, I get two events in Splunk:

  • The event I want, {"my_field": "some_value"}, with my_field correctly presented as a field.
  • An unwanted event, with a time stamp 10 seconds earlier (!), consisting only of the multiline HTTP header (which I thought I told LINE_BREAKER to discard!)

I spent some time Googling about automatic HTTP request retries, and whether I can set an Ajax request to use HTTP 1.0 instead of 1.1, but gave up. Maybe I just specified an inappropriate regex?
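One way to explore the regex question offline is to run the same pattern against a sample request in Python. This is only a sketch under stated assumptions: Python's `re` module is not the PCRE engine Splunk uses, the `re.MULTILINE` flag is a guess at how `^` is treated, and the sample request headers are made up. Under those assumptions, the blank `\r\n` line that ends the HTTP header does not fit `[^{][^\n]*\r\n` (the `\r` is consumed by `[^{]`, leaving only `\n`), so the first capturing group ends up empty. Whether that is related to the unwanted event in Splunk is not established here.

```python
import re

# The LINE_BREAKER value from the props.conf stanza above.
LINE_BREAKER = r'((^[^{][^\n]*\r\n)*)\{\"[^}]+\}'
# The per-line part of the first capturing group, without the anchor.
HEADER_LINE = r'[^{][^\n]*\r\n'

# Made-up sample of what a browser might send to the TCP input.
request = (
    "POST / HTTP/1.1\r\n"
    "Host: localhost:6067\r\n"
    "Content-Type: text/plain;charset=UTF-8\r\n"
    "\r\n"
    '{"my_field": "some_value"}'
)

match = re.search(LINE_BREAKER, request, flags=re.MULTILINE)
first_group = match.group(1)   # text the first capturing group covers
whole_match = match.group(0)   # text the entire pattern covers

# Check each physical line against the per-line pattern on its own.
line_fits = [
    re.fullmatch(HEADER_LINE, line) is not None
    for line in request.splitlines(keepends=True)
]

print(repr(first_group))  # ''
print(line_fits)          # [True, True, True, False, False]
```

The three header lines fit the per-line pattern, but the blank separator line does not, which is why the group cannot extend from the start of the request up to the opening brace in this engine.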

Interesting, but academic, thanks to HEC. Moving on.

  2. Scale. HEC is stateless and designed to easily scale out across a pool of instances behind a LB.

Again, fair point. But, again, the point of this question is to decide between using HEC and a Splunk TCP input.

A Splunk TCP input is also stateless. Right?

And a Splunk TCP input easily scales out across a pool of instances behind a load balancer (LB), too. Or am I missing something here?

The Splunk dev topic "High volume HTTP Event Collector data collection using distributed deployment" describes using a network traffic load balancer (such as NGINX) in front of several Splunk Enterprise indexers.

Is there any reason why I can't do the same thing - use a TCP LB, such as NGINX or HAProxy - for Splunk TCP traffic?

  3. Performance. We've heavily optimized HEC to handle 100K events or more per instance.

How does that compare with the performance of a Splunk TCP input?

HTTP involves processing that TCP does not, such as parsing an HTTP request header and returning a response with a header (and, in the case of HEC, a JSON-format body).
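To put a rough size on the request-side part of that overhead, the Python sketch below compares the bytes for one event sent as a bare JSON line with the bytes for the same event wrapped in a minimal HTTP/1.1 POST to `/services/collector/event`. The host, token, and header set are placeholders (real clients usually send more headers), and this counts bytes only; it says nothing about CPU cost on either side.

```python
import json


def raw_tcp_bytes(event):
    """One event as a newline-terminated JSON line, as sent to a TCP input."""
    return (json.dumps(event) + "\n").encode("utf-8")


def http_post_bytes(event, host="splunk.example.com:8088",
                    token="00000000-0000-0000-0000-000000000000"):
    """The same event inside a minimal HTTP/1.1 POST (placeholder host/token)."""
    body = json.dumps({"event": event}).encode("utf-8")
    head = (
        "POST /services/collector/event HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Authorization: Splunk {token}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


event = {"my_field": "some_value"}
tcp_size = len(raw_tcp_bytes(event))
http_size = len(http_post_bytes(event))
overhead = http_size - tcp_size  # extra bytes per single-event request
```

The per-request overhead shrinks relative to the payload when several events are batched into one POST, which is one reason byte counts alone do not settle the performance question.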

This is one reason for my original question: if I don't want or need the processing overhead of HTTP versus TCP, why use HEC?

Outside of this processing that is specific to HTTP - and so, an overhead, when compared to TCP - I would have thought that the remainder of the event processing would be common to both HEC and Splunk TCP inputs. Or could be, if it isn't: that's one reason why I recently asked the question "Can I use the HTTP Event Collector JSON event protocol for TCP inputs?".

As you mention in the next point, HEC has rich support for JSON out of the box. Does that "protocol" - for example, specifying the time in the metadata as a Unix Epoch value - improve the performance of HEC versus a Splunk TCP input? If so, why not offer that same JSON structure for TCP inputs? (Or are you deliberately deprecating TCP inputs in favor of HEC?)

Aside: It occurred to me that perhaps you deliberately chose "EC" as the official abbreviation for HEC for this very reason: that you had plans to "roll out" the JSON-based EC metadata/data protocol across other input methods, including TCP. But nope, I was wrong, because you've recently clarified the official abbreviation as being HEC, not EC.

  4. Ease of use. HEC has really rich support for JSON out of the box, you don't have to mess with sourcetypes or bending over backwards with your JSON.

Yes. I describe some of that "bending over" in my question. Much nicer with HEC, thanks.

However, as I mentioned, I don't find this (ease of use) a compelling enough reason to choose HEC over TCP. Unless the "rich support for JSON" comes with a performance benefit (that you don't plan to make available to TCP inputs).

  5. Security....

Yes.

However, in the use cases I expect to see - I didn't mention this in my question - I suspect (although I don't know for sure) that all of this traffic will occur behind a firewall on an intranet or over a VPN.


I look forward to hearing more from you, especially regarding performance.

Not wishing to put words in your mouth, but framing your answer in the context of my question, I think what you're telling me is:

There's a bunch of reasons [why you would use HEC when you can use TCP]: ... Performance

That is, in a nutshell: HEC offers better performance than using TCP inputs. I'd like to hear more about that.

And if that's true, then perhaps the Splunk docs recommendation I cited in my question needs revisiting (or at least, qualifying):

TCP ... is the recommended protocol for sending data from any remote host to your Splunk Enterprise server

gblock_splunk
Splunk Employee

Hi Graham,

Thanks for the detailed reply. At the end of the day I'd say our choice of HTTP was based on multiple factors, not just one. Several of those factors are not compatible with the design of our TCP input, though as you've shown there is some overlap.

  • Using a TCP Input from the browser. I did not know that would work, so kudos there! HTTP is layered on TCP, so that makes sense, but a TCP input won't support protocol-specific things such as CORS, keep-alive, gzip encoding, or honoring the auth header. Not supporting the auth header is a big one, as a big part of HEC is the security model. It is just using TCP as a transport for an HTTP message. It also won't support the specifics of our event protocol, like source type. Sure, you can use props.conf and start doing all sorts of custom extractions, but HTTP servers are built for this; the props.conf approach will be less efficient and more brittle.

  • Stateful vs Stateless - TCP is a stateful protocol. You are establishing a persistent connection over a port which you communicate with. HTTP abstracts this away in its layering and supports optimizations like keep-alive, hence why HTTP is stateless and TCP is not. In your example it is closing the TCP connection after each request which can be expensive.

  • Load balancers - Yes, you can load balance TCP, but it is expensive. For example, NGINX does have TCP load balancing, but only if you pay for the premium product (NGINX Plus). There are a boatload of free HTTP load balancing options out there (including NGINX). Also, when it comes to cloud providers, not all support load balancing for TCP, but they all support HTTP.
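The keep-alive point in the list above can be illustrated with a small Python sketch: several POSTs go over one persistent connection instead of opening a new TCP connection per request. The `StandInCollector` server is a hypothetical local stand-in written for this example, not Splunk's HEC; it only records the client's source port for each request so that connection reuse is visible.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StandInCollector(BaseHTTPRequestHandler):
    """Hypothetical stand-in for an HTTP event endpoint; not Splunk's HEC."""

    protocol_version = "HTTP/1.1"  # allows persistent (keep-alive) connections
    client_ports = []              # client source port seen for each request

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)    # consume the request body
        StandInCollector.client_ports.append(self.client_address[1])
        body = json.dumps({"text": "Success", "code": 0}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def post_events_keep_alive(host, port, events):
    """POST each event over a single reused HTTP connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    statuses = []
    try:
        for event in events:
            conn.request("POST", "/", body=json.dumps(event),
                         headers={"Content-Type": "application/json"})
            response = conn.getresponse()
            response.read()  # drain the reply so the socket can be reused
            statuses.append(response.status)
    finally:
        conn.close()
    return statuses


def demo():
    """Run three POSTs against the stand-in server; return statuses and ports."""
    StandInCollector.client_ports.clear()
    server = ThreadingHTTPServer(("127.0.0.1", 0), StandInCollector)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        statuses = post_events_keep_alive(
            "127.0.0.1", server.server_port, [{"n": 1}, {"n": 2}, {"n": 3}])
    finally:
        server.shutdown()
        server.server_close()
    return statuses, list(StandInCollector.client_ports)
```

If `demo()` reports the same source port for every request, all three POSTs shared one TCP connection, which is the saving keep-alive is meant to provide.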

hchinta
Explorer

@gblock_splunk Can you please describe the optimizations done on the HEC server so that it can receive 100K events per server? We are running into issues with the httpinput queue not receiving fast enough and Splunk not closing the TCP connections, which leaves active connections open.

Graham_Hanningt
Builder

@gblock,

Re:

Using a TCP Input from the browser.

Yes, for all the reasons you cite, that's pretty much just a party trick; certainly, not something I'd want to implement in a production environment. (That party trick works from curl, too.)

The unwanted event, containing the lines I'd told LINE_BREAKER to discard, irks me.

This morning, I used a TCP client to send an HTTP request - and also text that "looks" like an HTTP request, with \r\n-delimited preamble lines - to the Splunk TCP input. I still get that unwanted event.

I've just asked about this in a separate question, "Why do the contents of the first capturing group in this LINE_BREAKER regex appear as a separate eve...".

Graham_Hanningt
Builder

Hi @gblock,

Re:

you can load balance TCP but it is expensive

Any reason not to use HAProxy for TCP load balancing of Splunk TCP inputs?

From the HAProxy website:

HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for very high traffic web sites and powers quite a number of the world's most visited ones. Over the years it has become the de-facto standard opensource load balancer, is now shipped with most mainstream Linux distributions, and is often deployed by default in cloud platforms. Since it does not advertise itself, we only know it's used when the admins report it 🙂

gblock_splunk
Splunk Employee

I hear you. As I mentioned, HEC is the experience we've invested in to have a really simple and secure mechanism for sending from a multitude of clients and in a high scale manner.

If TCP works for you, by all means use it 🙂

woodcock
Esteemed Legend

It is still true; if you can use either HEC or TCP, then I would almost certainly pick TCP. However, there are many situations (embedded systems) where TCP is not possible, but HEC is. Also, I suspect that the likelihood of dropping data (useAck=false for S2S), or duplicating data (useAck=true for S2S) are both much lower when using HEC, but I have no evidence for this.

woodcock
Esteemed Legend

Picture hacking your thermostat which has an http-based phone-home capability. You clone that one line and use HEC. In some environments, you are the hacker, not the developer!

daveyc
New Member

In this case it's a mainframe logging to a distributed server using enterprise networking kit.

daveyc
New Member

I'm confused! How can HTTP be possible when TCP isn't? IIRC, HTTP is an application protocol that uses TCP for transport. Am I missing something?

Graham_Hanningt
Builder

Thanks! I'll wait until Monday to see if anyone pitches in with a different, and more compelling, answer (I would be surprised), but if not, I'll accept your answer. Thanks again, and have a good weekend.

cgardiner
Explorer

Just wanted to add this one for future readers.

Another important advantage of HEC over TCP is error handling.
Specifically, if you send data to a TCP endpoint, there is no interaction: no response from the TCP endpoint to let you know that data has been received and processed. If there are load issues on the server or queues fill up, data may get dropped, and the sending process will have no idea there was an issue.

With HEC, you get an HTTP response, such as a 400 or 500 error, indicating problems. While most of the possible errors are specific to HEC, at least two would be an advantage over TCP: "Server is busy" and "Internal Server Error".

https://docs.splunk.com/Documentation/Splunk/9.1.1/Data/TroubleshootHTTPEventCollector#Possible_erro...

On receiving these codes, a sender would know there is a problem and could attempt to resend the data later.
You can also configure indexer acknowledgment ("useAck"), which allows the sender to check and confirm that data has been received and indexed before purging those events from the system.
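The resend idea could be sketched in Python as follows. The `post` argument is a hypothetical caller-supplied function that sends one payload and returns the HTTP status code; the retry limits and delays are arbitrary illustration values. The sketch assumes "Server is busy" arrives as HTTP 503 and "Internal Server Error" as HTTP 500, and treats every other failure as not worth retrying.

```python
import time

# Assumed mapping: 503 = "Server is busy", 500 = "Internal Server Error".
RETRYABLE_STATUSES = {500, 503}


def send_with_retry(post, payload, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call post(payload) until it returns 200 or retries are exhausted.

    Returns True on success, False if the failure is not retryable or the
    attempt limit is reached. Waits base_delay, 2*base_delay, ... between tries.
    """
    for attempt in range(1, max_attempts + 1):
        status = post(payload)
        if status == 200:
            return True
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            return False
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return False
```

Injecting `post` and `sleep` keeps the retry policy separate from the transport, so the same logic can be exercised without a live collector.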

sohrab_keramat
New Member

Hi @Graham_Hanningt, thank you for your question.

In our organization, we have restrictions on installing agents on clients. We also use two products, Splunk and ELK, so in this particular scenario we need an event aggregation unit that then redirects events to each of the infrastructures.
SIEM is mentioned
