Splunk Search

Appropriate ingest structure for large lists and for analysis

splunked38
Communicator

Hi,

I've got a large list which is grouped in chronological order and I'd like to ingest it into Splunk.

The list structure is 'flexible' as we haven't defined one yet (and we're open to suggestions).
Data is:

  1. from multiple sources/hosts
  2. produced on a regular basis (anywhere between weekly to hourly)

A sample of the data:

Timestamp: 2019-01-01T01:01:01Z
12.345
30.314
52.143
.
(50k-100k values here)
.
34914.134

We have some use cases:

  1. The values will be summarised by their integer part, eg: rex "(?<int>\d+)" | stats count by int
  2. The values will be compared/charted/etc over time (eg 1: timechart count by int; eg 2: which values appear in time period 1 but not time period 2, and vice versa)
  3. The values will be compared/charted/etc between hosts (eg: which values are on host 'a' but not on host 'b', and vice versa)
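For example, use case 3 could be sketched like this, assuming the data ends up as one value per event (the index name is hypothetical):

```
index=mydata (host=a OR host=b)
| rex "(?<int>\d+)"
| stats values(host) as hosts by int
| where mvcount(hosts)=1
```

Any int left after the where clause exists on only one of the two hosts.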

However, I've been coming across a few issues:

  1. I tried JSON/multivalues as one event, but (a) there's a limit on mvexpand (with 50k+ values, MV is not very efficient), and (b) the event is extremely long (I had to set TRUNCATE = 5000000 for the sourcetype)
  2. I tried JSON/multivalues as multiple events, but the limit on mvexpand is still an issue
  3. Lookup tables are not an option due to the maintenance required; however, I thought about using exportcsv/importcsv dynamically but feel that would be over-engineering

So I'm open to a solution/discussion on:

  1. Is there a good/appropriate data structure for this data?
  2. If JSON is fine, is there a way to avoid using mvexpand?

Thanks in advance


DMohn
Motivator

You should consider two things here:

1) Splunk natively works with events, so the best performance is achieved when one data point is one event.
2) Splunk now supports metrics, which might be exactly what you want here...

My recommendation would be either to add a timestamp to each log line and ingest the file line by line (which is the Splunk default), or to leverage automatic timestamp recognition and the fact that Splunk will fall back to the last recognized timestamp when indexing events.
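A props.conf sketch of that second option (the sourcetype name is hypothetical): each value line carries no timestamp of its own, so Splunk stamps it with the last timestamp it recognized, i.e. the preceding "Timestamp:" line.

```
[my_values]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TIME_PREFIX = Timestamp:\s
TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ
MAX_TIMESTAMP_LOOKAHEAD = 30
```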

Thereafter you should consider converting your events to metric data, see here for reference:
https://docs.splunk.com/Documentation/Splunk/latest/Metrics/L2MOverview

If you have the chance of sending the metric data directly to Splunk by other means (collectd, HEC), you should try to do so, as it will significantly boost your search performance for the mentioned use cases.
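For instance, a single metric event sent to the HEC collector endpoint could look like this (the metric name and index are placeholders; the time is the epoch equivalent of the sample timestamp):

```
{
  "time": 1546304461,
  "host": "hosta",
  "index": "my_metrics",
  "event": "metric",
  "fields": {
    "metric_name": "reading",
    "_value": 12.345
  }
}
```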


splunked38
Communicator

@DMohn, unfortunately, we can't use metrics as the data isn't a metric (it just looks like one). What I displayed is the summarised data; the actual data point is:
13,45.45623,144 which is summarised to 13.45623.

We're still debating the need for precision. At present we've decided to go with one data point per event.


maciep
Champion

Is that literally the event? If so, I would probably prepend each int with the timestamp and ingest them separately:

2019-01-01T01:01:01Z 12.345
2019-01-01T01:01:01Z 30.314
2019-01-01T01:01:01Z 52.143

Imo, trying to do what you want with everything tossed into one event is going to be annoying and inefficient. Having them each in their own event should allow you to do everything you want with less of a headache. I mean, that is basically what mvexpand is going to do anyway, right? So just do it up front...
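With one value per event, the charting use case from the original post becomes a plain search with no mvexpand at all (index name and span are placeholders):

```
index=mydata
| rex "(?<int>\d+)"
| timechart span=1h limit=20 count by int
```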


splunked38
Communicator

@maciep we are moving that way, one data point per event, but I just wanted to know if there is any other structure I hadn't thought of. Agreed, having all 50k data points in one event is inefficient and inflexible.


maciep
Champion

If you expect low cardinality, maybe you could aggregate some of the data first and save it as JSON, e.g. as an array of "int: count" objects or something... but I think that would still make your life harder once it's in Splunk.
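Something along these lines (field names are made up):

```
{
  "timestamp": "2019-01-01T01:01:01Z",
  "host": "a",
  "counts": [
    {"int": 12, "count": 340},
    {"int": 30, "count": 211},
    {"int": 52, "count": 97}
  ]
}
```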


splunked38
Communicator

@maciep we explored aggregation, but by doing so we lose the granularity required from the data points.


vishaltaneja070
Motivator

@splunked38

I think you can change the limit on mvexpand, so the 2nd approach seems good.

You can adjust the limit by editing the max_mem_usage_mb setting in limits.conf to increase the memory available to mvexpand.
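Sketch of that change in limits.conf (the value is only an example; the setting lives in the [default] stanza):

```
[default]
# mvexpand output is capped by this per-search memory budget (the default is 200)
max_mem_usage_mb = 1000
```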


splunked38
Communicator

@vishaltaneja07011993, we'd rather not increase the limit, as the number will grow beyond 50k and this would become a maintenance task.
