I ask this question after reading the following Splunk Dev articles, among others:
Some background:
HttpURLConnection, or even just cURL (but not necessarily the most performant). Redis also occurs to me as a possibility, although, to my knowledge, it is not an ingestion method directly supported by Splunk; I mention it because I use Redis to forward logs to different analytics software (not Splunk).
A related (or sub-) question: setting aside the method of data transport, what event data format does Splunk ingest most efficiently?
For example, assuming that each event consists of a flat list of fields, with no nested structures (no repeating groups),
should I use:
In particular, if I use EC, then, even though the body of the HTTP POST request to EC is JSON, which is the better choice for the value of the event property:
{ "event": {"key1": "string_value1", "key2": numeric_value2} }
or
{ "event": "key1=\"string_value1\", key2=numeric_value2" }
If the data does contain nested structures, is JSON the most efficient format for ingestion?
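For concreteness, here is a rough sketch of how I imagine the two variants as EC requests (the host, port, token, and field values are placeholders; I'm assuming the standard /services/collector/event endpoint and the "Authorization: Splunk <token>" header):

  # Variant 1: JSON object as the event value
  curl -k https://splunk.example.com:8088/services/collector/event \
    -H "Authorization: Splunk <hec_token>" \
    -d '{"event": {"key1": "string_value1", "key2": 2}}'

  # Variant 2: key=value text as the event value
  curl -k https://splunk.example.com:8088/services/collector/event \
    -H "Authorization: Splunk <hec_token>" \
    -d '{"event": "key1=\"string_value1\", key2=2"}'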
@Graham_Hannington HTTP Event Collector is highly performant, but it depends on what you mean by that. We've designed it to support hundreds of thousands to millions of events per second (distributed).
We support JSON and batching in a first-class manner.
I'm curious, what kind of throughput are you expecting?
In terms of the format of the event field, it is really up to you. However, if you have a nested structure, I'd recommend JSON, because at search time you can easily access the nested fields.
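To illustrate the batching, here is a minimal sketch (host and token are placeholders): batching with the Event Collector simply means concatenating event objects in a single POST body:

  curl -k https://splunk.example.com:8088/services/collector/event \
    -H "Authorization: Splunk <hec_token>" \
    -d '{"event": {"key1": "a", "key2": 1}}{"event": {"key1": "b", "key2": 2}}'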
Unless you have an unlimited budget, you are going to have to do something that does, at longest, daily batching, because if you push several days' worth of data through your system in just a few hours, you are going to have to buy a needlessly large amount of license to handle that "false" bandwidth. Given this, I would batch hourly files through SFTP and use batch with move_policy = sinkhole so that you do not need to keep up with housekeeping of processed files.
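For reference, a minimal inputs.conf sketch of that kind of batch input (the directory path, sourcetype, and index are placeholders):

  [batch:///opt/splunk_drop/hourly]
  move_policy = sinkhole
  sourcetype = my_hourly_events
  index = main
  disabled = 0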
The same consideration should be given to data format. I greatly prefer CSV to JSON because it is considerably more compact and impacts license less. This decision also impacts disk space. Somebody has got to think about the bandwidth and the budget.
Thanks very much for your answer.
Some comments (tl;dr? shrug):
Re:
at longest, daily batching
Yes, I've read several similar warnings about inadvertently exceeding the license limit by ingesting "historical" data.
Re:
I would batch hourly files through SFTP and use batch with move_policy = sinkhole so that you do not need to keep up with housekeeping of processed files.
This seems like a good answer to me.
In a different context (not Splunk), I already do much the same: I use SSH-based data transfer (SSH-based, but not, specifically, SFTP) to transfer CSV data. I was curious to know what Splunk users saw as the best method.
However, before I accept your answer (thanks again), I'm going to wait a few more days, to see if anyone else offers a different answer, perhaps with counterarguments.
Re:
I greatly prefer CSV to JSON because it is considerably more compact and impacts license less.
I use CSV in other contexts - where license usage is not an issue - because of its compactness: less to transfer, less to store on disk.
I deliberately avoided mentioning CSV in my question; I was curious to see if someone mentioned it. The Splunk product documentation
and Splunk Dev articles that I have read do not promote CSV as best practice. For example, the Splunk Dev article "Logging best practices" recommends "Use clear key-value pairs" and "Use developer-friendly formats ... like JavaScript Object Notation (JSON)".
However, as you point out:
CSV ... impacts license less
Related question on Splunk Answers: "Key/value pair vs CSV in relation to daily license".
So (I think you already know all of this; the following is for others who might read this later, or for someone to correct me, if I'm wrong): If you use JSON, or another "key-value" data format where every field value is explicitly tagged by its field name (okay: setting aside elements in a JSON array), you're paying for all those key names. If your key names are "verbose"/human-readable, you might be paying more for the key names than the values!
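A contrived illustration of that point (field names and values invented for the example): the same event as JSON and as CSV:

  {"transaction_id":12345,"program_name":"PAYROLL","elapsed_ms":42}
  12345,PAYROLL,42

That is roughly 66 bytes versus 16 bytes for the same three values, and, as I understand it, the license meter runs on the raw data you index.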
The only situation where this won't matter is with an "all you can ingest" license ("Contact Sales" 😉 ).
According to RFC 4180:
there is no formal specification in existence [for CSV]
The RFC goes on to document CSV for its own purposes:
Each line should contain the same number of fields throughout the file
This rules out a straightforward mapping of data from some of my original event records (as described in my question) to CSV.
One workaround I didn't mention in my question: I could split out items in nested structures into their own "flat" events, and create an ID field to correlate those coined/sub-events with their original/master events.
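A sketch of that workaround, with invented field and file names: one record type for the master events, another for the split-out items, correlated by event_id:

  masters.csv:
    event_id,program_name,elapsed_ms
    10001,PAYROLL,42

  items.csv:
    event_id,item_seq,item_name,item_count
    10001,1,READ,7
    10001,2,WRITE,3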
However, in practice, in my experience, despite the mantra to "index everything" (ka-ching! 😉 ), use cases for preserving those nested structures in Splunk are rare. Which is one reason I'm prepared to accept a "CSV" answer.
While we're talking about CSV...
CSV - at least, CSV as documented in RFC 4180 - offers no indication of data type:
Each field may or may not be enclosed in double quotes
whereas, in JSON, enclosing double quotes are significant: their presence indicates a string value, their absence a numeric value. I'm mildly curious to know whether this - implicit typing in the raw data - makes any significant difference to Splunk ingestion processing.
Still, JSON doesn't distinguish between a string that contains a program name and a string that contains an ISO 8601-format date and time value, so you still need a schema (or some method of mapping keys to data types) for those situations. However, in many situations, distinguishing between strings and numbers is enough.
CSV relies on fixed field positions. This makes it more concise than per-record key-value data formats such as JSON, but it can be a pain to diagnose when fields are added, deleted, or moved.
Even for strictly fixed, flat record structures, CSV requires one file per record type, whereas a JSON file (or a JSON Lines file; or a file containing a sequence of JSON objects) can contain any combination of different record types.
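For instance (continuing the invented example from earlier), a single JSON Lines file can carry both record types, discriminated by a field:

  {"record_type":"master","event_id":10001,"program_name":"PAYROLL","elapsed_ms":42}
  {"record_type":"item","event_id":10001,"item_seq":1,"item_name":"READ","item_count":7}
  {"record_type":"item","event_id":10001,"item_seq":2,"item_name":"WRITE","item_count":3}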
Hi Graham,
I'll leave the technical nuances to those who know better, although you might want to specify what measure of efficiency you are after (e.g. CPU, memory, IO, time). From experience, your disk IO is going to be your limiting factor.
Depending on what your constraints are, a multi-instance Splunk cluster will merrily index tens of thousands of events in real time (some people out there are ingesting tens of TB a day), so a few thousand as a batch input in a day isn't going to trouble it. I would guess you'll run out of license before you run out of indexing capacity.
In general, I'm interested in maximizing throughput: the rate at which Splunk ingests data. That is, given available resources (such as CPU, memory, disk space, network bandwidth), what is the best ingestion method (including data transport and data format) for ensuring maximum throughput? Which, I realize, raises questions about the details of those "available resources", because the answer might well depend on which of those resources (or combination of resources) is a bottleneck (constraint).
I've accepted the batch/CSV-based answer for various reasons, including the points that you make (thanks!):
your disk IO is going to be your limiting factor
and
you'll run out of license before you run out of indexing capacity
I would welcome rebuttals, counterarguments, and alternative answers.