
Splunk HTTP Event Collector: How is the size limit defined?

King_Of_Shawn
Explorer

Hi guys, Happy New Year,

I'm doing some code testing with the Splunk HEC, and I need to transfer a large volume of gzip-compressed data.

1. The first limit I found is in $SPLUNK_HOME$/etc/system/default/limits.conf:

 

 

[http_input]
max_content_length = <integer>
* The maximum length, in bytes, of HTTP request content that is
  accepted by the HTTP Event Collector server.
* Default: 838860800 (~ 800 MB)

 

 

But I found that this value seems to be checked against the size after decompression. My test file is about 50 MiB compressed, far less than 800 MB, but when I send the request, Splunk returns:

 

 

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"><title>413 Content-Length of 838889996 too large (maximum is 838860800)</title></head><body><h1>Content-Length of 838889996 too large (maximum is 838860800)</h1><p>The request your client sent was too large.</p></body></html>

 

 

2. The second limit I found is in $SPLUNK_HOME$/etc/apps/splunk_httpinput/local/inputs.conf:

 

 

[http]
maxEventSize = <positive integer>[KB|MB|GB]
* The maximum size of a single HEC (HTTP Event Collector) event.
* HEC disregards and triggers a parsing error for events whose size is
  greater than 'maxEventSize'.
* Default: 5MB

 

 

I think this limit applies to the size of a single event? If I send a batch of events in one request to "/services/collector", will this limit apply to each event in the batch?

Can anyone confirm this behavior? If you need more details, feel free to let me know. Many thanks!
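Going only by the wording of the setting quoted above ("a single HEC event"), a client could pre-check each event in a batch before sending it. This is a minimal sketch, not a confirmation of how Splunk measures events: the helper name `oversized_event_indexes`, the reading of 5MB as 5 * 1024 * 1024 bytes, and measuring the JSON-serialized size are all assumptions made for illustration.

```python
import json

# Assumption: "5MB" is interpreted as 5 * 1024 * 1024 bytes.
MAX_EVENT_SIZE = 5 * 1024 * 1024


def oversized_event_indexes(events, limit=MAX_EVENT_SIZE):
    """Return the indexes of events whose serialized size exceeds `limit`.

    Assumption: the size of an event is approximated by the length of its
    UTF-8 encoded JSON form; Splunk may measure it differently.
    """
    too_big = []
    for index, event in enumerate(events):
        size = len(json.dumps(event).encode("utf-8"))
        if size > limit:
            too_big.append(index)
    return too_big
```

Events flagged this way could be split or truncated on the client side before the batch is posted.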



jamie00171
Communicator

Hi @King_Of_Shawn,

To confirm, are you setting the HTTP "Content-Encoding" header to be "gzip"?

Any other details you can provide about how you're sending the data would be useful. 

Thanks,

Jamie


King_Of_Shawn
Explorer

Hi @jamie00171 

Thanks for the reply. Here is a demo:

import requests

# Source resource: about 50 MB gzip-compressed, about 900 MB after decompression.
url = "https://{{fqdn}}/packages/{{uuid}}"

headers = {
    "Authorization": "Bearer {{token}}",
    "Accept-Encoding": "gzip",
    "Connection": "keep-alive",
}


def stream_transfer():
    # Download the gzip stream without decoding it and forward it to HEC as-is.
    with requests.get(url, headers=headers, stream=True) as r:
        print(r.headers)
        # The body is a generator of raw (still compressed) 1 MiB chunks,
        # so it is sent with chunked transfer encoding.
        r1 = requests.post(
            "http://10.64.21.32:8088/services/collector/raw",
            headers={
                "Authorization": "Splunk f98f1101-e880-49e9-88ac-d08e8ce0c1e5",
                "Content-Encoding": "gzip",
                "Transfer-Encoding": "chunked",
                "Connection": "keep-alive",
                # "Content-Length": r.headers["Content-Length"]
            },
            data=r.raw.stream(1024 * 1024, decode_content=False),
        )
        # Show the HEC response (status code and body).
        print(r1.status_code, r1.text)


if __name__ == "__main__":
    stream_transfer()

Here are the headers of the response from the source request; the Content-Length is among them:

{'Date': 'Tue, 24 Jan 2023 01:46:16 GMT', 'Content-Type': 'text/plain; charset=UTF-8', 'Content-Length': '49039690', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains'}

Then the POST to HEC fails with the error I mentioned above:

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"><title>413 Content-Length of 838889996 too large (maximum is 838860800)</title></head><body><h1>Content-Length of 838889996 too large (maximum is 838860800)</h1><p>The request your client sent was too large.</p></body></html>

 


PickleRick
SplunkTrust

Ok, first things first - what are you trying to _do_? Not what you're trying to achieve (which is apparently to get your blob of data into Splunk) but what you're actually doing.

Oh, and what is this gzipped file that you have? Is it a gzipped flat text file?


King_Of_Shawn
Explorer

Hi @PickleRick 

Thank you, please refer to my reply above.

In fact, the gzip data is not a local file; it is a remote resource returned in the response to another request.

I have verified that the gzip stream is good. After decompression the data is line-delimited JSON, which looks like:

{...}
{...}
{...}
......

So I think the HEC endpoint "/services/collector/raw" should be able to parse it directly and correctly.


PickleRick
SplunkTrust

OK. Two things.

1. Don't do it this way. If you send one huge blob of data, you have no control over whether everything went well, and if anything goes wrong you have to re-send all of the data. As you can see, every attempt pushes your whole batch of events back and forth. And you're hitting resource limits.

2. Transfer encoding is a mechanism used to encode data in transit. It doesn't influence the request size. If you're sending 900MB worth of data, it's still 900MB of data regardless of whether it's encoded with gzip, deflate or sent as is.
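One way to follow the first point is to decompress the stream on the client and cut it into smaller batches on line boundaries, so that each request stays well under the limit and only a failed batch has to be re-sent. The sketch below assumes the line-delimited JSON format described earlier in the thread; `iter_batches` and the 1 MiB default batch size are hypothetical choices for illustration, not anything Splunk prescribes.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def iter_batches(gz_stream: BinaryIO, max_batch_bytes: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield uncompressed batches of whole lines from a gzip stream.

    Each batch is at most `max_batch_bytes` long, except when a single line
    is itself longer than that; such a line is yielded as its own batch.
    """
    batch = []
    size = 0
    with gzip.GzipFile(fileobj=gz_stream, mode="rb") as lines:
        for line in lines:
            # Close the current batch before it would exceed the budget.
            if batch and size + len(line) > max_batch_bytes:
                yield b"".join(batch)
                batch, size = [], 0
            batch.append(line)
            size += len(line)
    if batch:
        yield b"".join(batch)
```

Each yielded batch could then be posted to "/services/collector/raw" in its own request (for example with `requests.post(..., data=batch)`), checking the response status before moving on to the next one.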


King_Of_Shawn
Explorer

I agree with your first point, but for the second, I want to confirm:

"Transfer encoding is a mechanism used to encode data in transit", I think you mean that for the CherryPy of Splunk, if I transmit a piece of compressed data, what it sees must be the size of the decompressed data, right?

For me (the HEC client), the size of my request should be 49039690 bytes. According to the relevant RFC, if gzip is used, Content-Length should be the length of the gzip-encoded payload, about 46.77 MiB, which is well under Splunk's 800 MB limit.

So Splunk limits the actual data size (after decompression) rather than the request's Content-Length?

0 Karma

PickleRick
SplunkTrust

My HTTP might be a bit rusty so I'm not gonna argue about particular headers but it's the actual raw payload that matters. After all, Splunk has to process uncompressed data. Look at it as different encapsulation levels. The rest application endpoint receives uncompressed data. It's the "outer layer" of HTTP server which handles the transport along with proper encoding.
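If the limit is indeed applied to the uncompressed payload, as the 413 message in this thread suggests (it reports 838889996 bytes for a 49039690-byte compressed body), a client can measure the decompressed size up front without holding the whole payload in memory. This is a rough sketch: `decompressed_size` and `fits_in_one_request` are hypothetical helper names, and the limit constant is simply the default quoted from limits.conf above.

```python
import gzip
import io
from typing import BinaryIO

# Default max_content_length quoted from limits.conf in this thread.
HEC_MAX_CONTENT_LENGTH = 838860800


def decompressed_size(gz_stream: BinaryIO, chunk_size: int = 1024 * 1024) -> int:
    """Count the uncompressed bytes in a gzip stream, one chunk at a time."""
    total = 0
    with gzip.GzipFile(fileobj=gz_stream, mode="rb") as plain:
        while True:
            chunk = plain.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def fits_in_one_request(gz_stream: BinaryIO, limit: int = HEC_MAX_CONTENT_LENGTH) -> bool:
    """Assumption: the server compares the uncompressed size with `limit`."""
    return decompressed_size(gz_stream) <= limit
```

For the data in this thread, such a check would report a size above 838860800 even though the compressed Content-Length is only 49039690, which matches the error returned.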
