Getting Data In

Best method for batched (periodic or en masse ad hoc, not real time) ingestion?

What I've read

I ask this question after reading the following Splunk Dev articles, among others:

Why I'm asking this question

Some background:

  • One Splunk usage scenario I need to consider involves loading many thousands of events into Splunk periodically (such as daily) or ad hoc, rather than one event or a few events as they occur, in real time.
  • The original event records are in hundreds of different proprietary binary formats. However, for the purpose of this question, I am dealing with the event records after they have been converted into a text format of my choice, such as JSON.
  • Some of the events contain hundreds of fields (or, if you prefer, properties; key-value pairs), resulting in JSON of a few kilobytes per event. 5 KB per event is common; many are much smaller (only a few fields), some are larger (many fields; long field values).
  • Some of the original binary event records contain nested structures such as repeating groups. When converting these binary records, and depending on whether it's important to preserve the granularity of the original data for analysis in Splunk, I can either flatten such structures by aggregating multiple values into a single average or total, or - when converting to formats such as JSON that inherently support nested structures - preserve them.
  • I don't mean to be coy about the platform on which these event records exist, but I'd like an answer that is independent of what that platform is, with the following considerations: it's a UNIX environment that conforms to POSIX standard 1003.2, and it's not one of the operating systems for which Splunk offers a Universal Forwarder.

Some potential answers

  • Splunk HTTP Event Collector (EC; works a treat - using Java HttpURLConnection, or even just cURL - but not necessarily the most performant)
  • TCP or UDP (Splunk Dev: "In terms of performance, UDP/TCP transfer ranks as the highest performing method")
  • Monitor files and directories (for example, FTP from the originating remote system to a file system available to Splunk, then use the batch input type)

Redis also occurs to me as a possibility, although, to my knowledge, it is not an ingestion method directly supported by Splunk; I mention it because I use Redis to forward logs to a different analytics software (not Splunk).

Data format?

A related or sub- question: setting aside the method of data transport, what event data format does Splunk most efficiently ingest?

For example, assuming that each event consists of a flat list of fields, with no nested structures (no repeating groups), should I use:

  • JSON
  • "Non-JSON" (syslog-like) key-value pairs (key1=value1, key2=value2, key3=value3 ...)

In particular, if I use EC, then, even though the body of the HTTP POST request to EC is JSON, what is a better choice for the value of the event property:

{ "event": {"key1":"string_value1","key2":numeric_value2} }

or

{ "event": "key1=\"string_value1\", key2=numeric_value2" }

If the data does contain nested structures, is JSON the most efficient format for ingestion?
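
To make the two options concrete, here is a minimal Python sketch that builds both forms of the event property and stacks several events into one request body, which is how I understand EC batching to work (event objects concatenated in a single POST). The function names hec_event and hec_batch and the sample field values are hypothetical, and actually sending the body (endpoint URL, token header) is left out on purpose.

```python
import json

def hec_event(fields, as_json=True):
    """Wrap one event's fields in an envelope with an "event" property.

    as_json=True  -> the event value is a nested JSON object
    as_json=False -> the event value is a key=value string
    """
    if as_json:
        payload = fields
    else:
        parts = []
        for key, value in fields.items():
            if isinstance(value, str):
                # Quote string values and escape embedded backslashes/quotes.
                escaped = value.replace("\\", "\\\\").replace('"', '\\"')
                parts.append(f'{key}="{escaped}"')
            else:
                parts.append(f"{key}={value}")
        payload = ", ".join(parts)
    # Compact separators keep the envelope as small as possible.
    return json.dumps({"event": payload}, separators=(",", ":"))

def hec_batch(events, as_json=True):
    """Concatenate envelopes into one request body (assumed batching style)."""
    return "".join(hec_event(e, as_json) for e in events)

# Hypothetical sample events.
events = [
    {"key1": "string_value1", "key2": 42},
    {"key1": "string_value2", "key2": 7},
]
json_body = hec_batch(events, as_json=True)
kv_body = hec_batch(events, as_json=False)
print(len(json_body), len(kv_body))
```

Comparing the two body lengths for a sample of real events gives a rough idea of how much each form costs in bytes, before considering how each is parsed.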

0 Karma
1 Solution

Esteemed Legend

Unless you have unlimited budget, you are going to have to do something that does, at longest, daily batching, because if you push several days' worth of data through your system in just a few hours, you are going to have to buy a needlessly large license to handle that "false" bandwidth. Given this, I would batch hourly files through SFTP and use a batch input with move_policy = sinkhole so that you do not need to keep up with housekeeping of processed files.

The same consideration should be given to data format. I greatly prefer CSV to JSON because it is considerably more compact and impacts license less. This decision also impacts disk space. Somebody has got to think about the bandwidth and the budget.

0 Karma

Splunk Employee

@Graham_Hannington HTTP Event Collector is highly performant, but it depends on what you mean by that. We've designed it to support hundreds of thousands to millions of events per second (distributed).

We support JSON and batching in a first class manner.

I'm curious, what kind of throughput are you expecting?

In terms of the format of the event field, it is really up to you. However, if you have a nested structure, I'd recommend JSON, because at search time you can easily access the nested fields.

0 Karma

Thanks very much for your answer.

Some comments (tl;dr? shrug):

Batch transfer

Re:

at longest, daily batching

Yes, I've read several similar warnings about inadvertently exceeding the license limit by ingesting "historical" data.

Re:

I would batch hourly files through SFTP and use batch with move_policy = sinkhole so that you do not need to keep up with housekeeping of processed files.

This seems like a good answer to me.

In a different context (not Splunk), I already do much the same: I use SSH-based data transfer (SSH-based, but not, specifically, SFTP) to transfer CSV data. I was curious to know what Splunk users saw as the best method.

However, before I accept your answer (thanks again), I'm going to wait a few more days, to see if anyone else offers a different answer, perhaps with counterarguments.

"CSV impacts license less"

Re:

I greatly prefer CSV to JSON because it is considerably more compact and impacts license less.

I use CSV in other contexts - where license usage is not an issue - because of its compactness: less to transfer, less to store on disk.

I deliberately avoided mentioning CSV in my question; I was curious to see if someone mentioned it. The Splunk product documentation and Splunk Dev articles that I have read do not promote CSV as best practice. For example, the Splunk Dev article "Logging best practices" recommends "Use clear key-value pairs" and "Use developer-friendly formats ... like JavaScript Object Notation (JSON)".

However, as you point out:

CSV ... impacts license less

Related question on Splunk Answers: "Key/value pair vs CSV in relation to daily license".

So (I think you already know all of this; the following is for others who might read this later, or for someone to correct me, if I'm wrong): If you use JSON, or another "key-value" data format where every field value is explicitly tagged by its field name (okay: setting aside elements in a JSON array), you're paying for all those key names. If your key names are "verbose"/human-readable, you might be paying more for the key names than the values!

The only situation where this won't matter is with an "all you can ingest" license ("Contact Sales" 😉 ).
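
A rough way to see how much of the volume is key names: serialize the same events both ways and compare byte counts. This is only a sketch. The field names and values are made up, the function names json_lines_size and csv_size are hypothetical, and raw byte counts are just a proxy for what the license meter would actually record.

```python
import csv
import io
import json

def json_lines_size(events):
    """Bytes needed to write each event as one compact JSON line."""
    return sum(
        len(json.dumps(e, separators=(",", ":")).encode("utf-8")) + 1  # +1 for newline
        for e in events
    )

def csv_size(events, fieldnames):
    """Bytes needed to write the same events as CSV with a single header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(events)
    return len(buf.getvalue().encode("utf-8"))

# Hypothetical events with deliberately verbose key names.
fieldnames = ["transaction_identifier", "response_time_ms", "program_name"]
events = [
    {"transaction_identifier": i, "response_time_ms": 12, "program_name": "PAYROLL"}
    for i in range(1000)
]

json_bytes = json_lines_size(events)
csv_bytes = csv_size(events, fieldnames)
print(f"JSON Lines: {json_bytes} bytes, CSV: {csv_bytes} bytes")
```

In CSV the key names are paid for once, in the header; in JSON they are paid for in every event, so the gap widens as key names get longer and values get shorter.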

But... CSV doesn't handle variable nested structures

According to RFC 4180:

there is no formal specification in existence [for CSV]

The RFC goes on to document CSV for its own purposes:

Each line should contain the same number of fields throughout the file

This rules out a straightforward mapping of data from some of my original event records (as described in my question) to CSV.

One workaround I didn't mention in my question: I could split out items in nested structures into their own "flat" events, and create an ID field to correlate those coined/sub- events with their original/master events.
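
As a sketch of that workaround, assuming the repeating group arrives as a list of dicts under a single key: the function below returns one flat master event plus one flat sub-event per item, all sharing an ID. The names split_nested, correlation_id, and seq are my own choices for illustration, not anything Splunk requires.

```python
import uuid

def split_nested(event, group_key, correlation_id=None):
    """Split one nested event into a flat master event and flat sub-events.

    event          -- dict that may hold a list of dicts under group_key
    group_key      -- name of the repeating group to split out
    correlation_id -- optional ID; a random one is generated if omitted
    """
    if correlation_id is None:
        correlation_id = uuid.uuid4().hex
    # Master event: everything except the repeating group, plus the ID.
    master = {k: v for k, v in event.items() if k != group_key}
    master["correlation_id"] = correlation_id
    # One sub-event per item, tagged with the same ID and its position.
    children = []
    for seq, item in enumerate(event.get(group_key, [])):
        child = dict(item)
        child["correlation_id"] = correlation_id
        child["seq"] = seq
        children.append(child)
    return master, children

# Hypothetical record with a repeating group called "steps".
master, children = split_nested(
    {"job": "A", "steps": [{"cpu": 1}, {"cpu": 2}]},
    "steps",
    correlation_id="x1",
)
print(master)
print(children)
```

The master events and the sub-events have different columns, so in CSV they would go to separate files, one per record layout.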

However, in practice, in my experience, despite the mantra to "index everything" (ka-ching! 😉 ), use cases for preserving those nested structures in Splunk are rare. Which is one reason I'm prepared to accept a "CSV" answer.

Other issues with CSV

While we're talking about CSV...

CSV - at least, CSV as documented in RFC 4180 - offers no indication of data type:

Each field may or may not be enclosed in double quotes

whereas, in JSON, enclosing double quotes are significant: their presence indicates a string value, their absence a numeric value. I'm mildly curious to know whether this - implicit typing in the raw data - makes any significant difference to Splunk ingestion processing.

Still, JSON doesn't distinguish between a string that contains a program name and a string that contains an ISO 8601-format date and time value, so you still need a schema (or some method of mapping keys to data types) for those situations. However, in many situations, distinguishing between strings and numbers is enough.

CSV relies on fixed field positions. This makes it more concise than per-record key-value data formats such as JSON, but it can be a pain to diagnose when fields are added, deleted, or moved.

Even for strictly fixed, flat record structures, CSV requires one file per record type, whereas a JSON file (or a JSON Lines file; or a file containing a sequence of JSON objects) can contain any combination of different record types.
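
To illustrate the one-file-per-record-type point, here is a small sketch that takes a mixed stream of records (as it might appear in a JSON Lines file) and routes each record type to its own CSV output. The record_type field and the function name csv_per_record_type are hypothetical, and the outputs are in-memory strings rather than real files.

```python
import csv
import io

def csv_per_record_type(records, type_field="record_type"):
    """Group mixed records by type and render one CSV document per type.

    The column order for each type is taken from the first record of that
    type; a later record with an unexpected extra field raises ValueError.
    """
    outputs = {}  # record type -> (buffer, writer)
    for rec in records:
        rtype = rec[type_field]
        row = {k: v for k, v in rec.items() if k != type_field}
        if rtype not in outputs:
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=list(row), lineterminator="\n")
            writer.writeheader()
            outputs[rtype] = (buf, writer)
        outputs[rtype][1].writerow(row)
    return {rtype: buf.getvalue() for rtype, (buf, _) in outputs.items()}

# Hypothetical mixed stream with two record types.
records = [
    {"record_type": "job", "name": "A", "cpu": 1},
    {"record_type": "io", "device": "d1", "count": 5},
    {"record_type": "job", "name": "B", "cpu": 2},
]
for rtype, text in csv_per_record_type(records).items():
    print(rtype)
    print(text)
```

The ValueError on an unexpected field is the fixed-position fragility mentioned above showing up at write time, which is at least easier to diagnose than a silent column shift.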

0 Karma

Influencer

Hi Graham,

I'll leave the technical nuances to those who know better, although you might want to specify what measure of efficiency you are after, e.g. CPU, memory, I/O, time. From experience, your disk I/O is going to be your limiting factor.

Depending on what your constraints are, a multi-instance Splunk cluster will merrily index tens of thousands of events in real time (some people out there are ingesting tens of TB a day), so a few thousand as a batch input in a day isn't going to trouble it. I would guess you'll be running out of license before you run out of indexing capacity.

0 Karma

In general, I'm interested in maximizing throughput: the rate at which Splunk ingests data. That is, given available resources (such as CPU, memory, disk space, network bandwidth), what is the best ingestion method (including data transport and data format) for ensuring maximum throughput? Which, I realize, raises questions about the details of those "available resources", because the answer might well depend on which of those resources (or combination of resources) is a bottleneck (constraint).

I've accepted the batch/CSV-based answer for various reasons, including the points that you make (thanks!):

your disk IO is going to be your limiting factor

and

you'll be running out of license before you run out of indexing capacity

I would welcome rebuttals, counterarguments, and alternative answers.

0 Karma