Hi all, Happy New Year!
I'm doing some testing with the Splunk HEC, and I need to transfer a large volume of data that is gzip-compressed.
1. The first limit I found is in $SPLUNK_HOME/etc/system/default/limits.conf:
[http_input]
max_content_length = <integer>
* The maximum length, in bytes, of HTTP request content that is
accepted by the HTTP Event Collector server.
* Default: 838860800 (~ 800 MB)
However, this value seems to be checked against the size after decompression:
my test file is about 50 MiB, far less than 800 MB, but when I send the request
Splunk returns:
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"><title>413 Content-Length of 838889996 too large (maximum is 838860800)</title></head><body><h1>Content-Length of 838889996 too large (maximum is 838860800)</h1><p>The request your client sent was too large.</p></body></html>
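(As an aside, if the limit itself needs raising, my understanding is that this default can be overridden in a local config; an untested sketch, the value below is just an example:
# $SPLUNK_HOME/etc/system/local/limits.conf
[http_input]
# example: accept HTTP request content up to ~1.6 GB (value is in bytes)
max_content_length = 1677721600
A restart of the HEC-receiving instance would be needed afterwards.)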
2. The second limit I found is in $SPLUNK_HOME/etc/apps/splunk_httpinput/local/inputs.conf:
[http]
maxEventSize = <positive integer>[KB|MB|GB]
* The maximum size of a single HEC (HTTP Event Collector) event.
* HEC disregards and triggers a parsing error for events whose size is
greater than 'maxEventSize'.
* Default: 5MB
I think this limit applies to the size of a single event? If I send a batch of events in one request to "/services/collector", this limit should apply to each event in the batch individually, right?
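For illustration, by "batch" I mean a request body made of several event objects concatenated, something like this (made-up payload, just to show the format):
{"event": {"msg": "first event"}, "sourcetype": "_json"}
{"event": {"msg": "second event"}, "sourcetype": "_json"}
{"event": {"msg": "third event"}, "sourcetype": "_json"}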
Could any experts help confirm this behavior? If you need more details, feel free to let me know. Many thanks!
Hi @King_Of_Shawn,
To confirm, are you setting the HTTP "Content-Encoding" header to be "gzip"?
Any other details you can provide about how you're sending the data would be useful.
Thanks,
Jamie
Hi @jamie00171,
Thanks for the reply. Here is my demo:
import requests
import gzip
import ndjson
import functools
from io import BytesIO, SEEK_SET, SEEK_END

# About 50MB after gzip compression, and about 900MB after decompression
url = "https://{{fqdn}}/packages/{{uuid}}"
payload = {}
headers = {
    'Authorization': 'Bearer {{token}}',
    "Accept-Encoding": "gzip",
    'Connection': 'keep-alive'
}

def stream_transfer():
    # Stream the gzipped NDJSON from the remote service without decompressing it locally
    with requests.request("GET", url, headers=headers, data=payload, stream=True) as r:
        print(r.headers)
        # Forward the still-compressed chunks straight to the HEC raw endpoint
        r1 = requests.post("http://10.64.21.32:8088/services/collector/raw", headers={
            "Authorization": "Splunk f98f1101-e880-49e9-88ac-d08e8ce0c1e5",
            "Content-Encoding": "gzip",
            "Transfer-Encoding": "chunked",
            "Connection": "keep-alive",
            # "Content-Length": r.headers["Content-Length"]
        }, data=r.raw.stream(1024*1024, decode_content=False))

if __name__ == "__main__":
    stream_transfer()
And here are the headers from the source response; the Content-Length is in there:
{'Date': 'Tue, 24 Jan 2023 01:46:16 GMT', 'Content-Type': 'text/plain; charset=UTF-8', 'Content-Length': '49039690', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains'}
Then Splunk raises the error I mentioned above:
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"><title>413 Content-Length of 838889996 too large (maximum is 838860800)</title></head><body><h1>Content-Length of 838889996 too large (maximum is 838860800)</h1><p>The request your client sent was too large.</p></body></html>
Ok, first things first - what are you trying to _do_? Not what you're trying to achieve (which is apparently to get your blob of data into Splunk) but what you're actually doing.
Oh, and what is this gzipped file that you have? Is it a gzipped flat text file?
Hi @PickleRick
Thank you, please refer to my reply above.
In fact, the gzip is not a local file; it's a remote resource coming from another request's response,
and I'm sure the gzip stream is good. After decompression the data is line-delimited JSON, which looks like:
{...}
{...}
{...}
......
so I think the HEC endpoint "/services/collector/raw" should be able to parse it directly and correctly.
OK. Two things.
1. Don't do it this way. If you send one huge blob of data you have no control over whether everything went well, and if anything goes wrong you have to re-send all of the data. As you can see, every attempt pushes your whole batch of events back and forth, and you're hitting resource limits. Split the stream into smaller batches instead; see the sketch after point 2.
2. Transfer encoding is a mechanism used to encode data in transit. It doesn't influence the request size. If you're sending 900MB worth of data, it's still 900MB of data regardless of whether it's encoded with gzip, deflate or sent as is.
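If the source really is gzipped NDJSON, decompress it on the client and push it in smaller batches. Something roughly along these lines (an untested sketch based on your demo; HEC_URL, the token and BATCH_BYTES are placeholders):

import gzip
import requests

HEC_URL = "http://10.64.21.32:8088/services/collector/raw"
HEC_HEADERS = {"Authorization": "Splunk <your-hec-token>"}
BATCH_BYTES = 5 * 1024 * 1024  # keep each request well under the HEC limits

def send_in_batches(source_response):
    # source_response: the streaming GET response holding the gzipped NDJSON.
    # gzip.open accepts a file-like object, so it can decompress r.raw on the fly.
    buf, size = [], 0
    with gzip.open(source_response.raw, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            buf.append(line)
            size += len(line)  # rough size in characters, good enough for batching
            if size >= BATCH_BYTES:
                requests.post(HEC_URL, headers=HEC_HEADERS,
                              data="".join(buf).encode("utf-8")).raise_for_status()
                buf, size = [], 0
    if buf:  # flush the final partial batch
        requests.post(HEC_URL, headers=HEC_HEADERS,
                      data="".join(buf).encode("utf-8")).raise_for_status()

That way a failed request only costs you one batch to retry, and you stay nowhere near max_content_length.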
I agree with your point 1, but for point 2, I want to confirm:
"Transfer encoding is a mechanism used to encode data in transit" - I think you mean that for Splunk's CherryPy layer, if I transmit a piece of compressed data, what it sees is the size of the decompressed data, right?
Because from my side (the HEC client), the size of my request should be 49039690 bytes, and according to the RFC, if gzip is used, Content-Length should be the length of the gzip-encoded payload, about 46.77 MiB, well under Splunk's 800 MB limit.
So Splunk limits the actual data size (after decompression), rather than the request's Content-Length?
My HTTP might be a bit rusty so I'm not gonna argue about particular headers but it's the actual raw payload that matters. After all, Splunk has to process uncompressed data. Look at it as different encapsulation levels. The rest application endpoint receives uncompressed data. It's the "outer layer" of HTTP server which handles the transport along with proper encoding.
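If you want to sanity-check a payload before sending, you can measure its decompressed size on the client, e.g. (a rough sketch for a local gzip file; your data comes from a stream, so adapt accordingly, and the filename is just an example):

import gzip

def decompressed_size(path):
    # Stream-decompress and count bytes so the whole file never sits in memory
    total = 0
    with gzip.open(path, "rb") as f:
        while chunk := f.read(1024 * 1024):
            total += len(chunk)
    return total

print(decompressed_size("payload.ndjson.gz"))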