Appropriate ingest structure for large lists and for analysis
Hi,
I've got a large list which is grouped in chronological order and I'd like to ingest it into Splunk.
The list structure is 'flexible' as we haven't defined one yet (and we're open to suggestions).
Data is:
- from multiple sources/hosts
- produced on a regular basis (anywhere between weekly to hourly)
A sample of the data:
Timestamp: 2019-01-01T01:01:01Z
12.345
30.314
52.143
.
(50k-100k values here)
.
34914.134
We have some use cases (rough example searches below):
- The values will be summarised by their integer value, eg: ... | rex "(?<int>\d+)" | stats count by int
- The values will be compared/charted/etc. over time (eg 1: timechart count by int; eg 2: which values appear in time period 1 but not in time period 2, and vice versa)
- The values will be compared/charted/etc. between hosts (eg: which values are on host 'a' but not on host 'b', and vice versa)
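Roughly, assuming each value ends up as its own event with an extracted field called value (the index and field names here are just placeholders), the searches would look something like:
eg 1 (summarise by integer value):
index=mydata | rex field=value "(?<int>\d+)" | stats count by int
eg 2 (chart over time):
index=mydata | rex field=value "(?<int>\d+)" | timechart count by int
eg 3 (values on host 'a' but not on host 'b'):
index=mydata (host=a OR host=b) | rex field=value "(?<int>\d+)" | stats values(host) as hosts by int | where mvcount(hosts)=1 AND hosts="a"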
However, I've been coming across a few issues:
- I tried JSON/multivalues as one event, but a) there's a limit on mvexpand (with 50k+ values, MV is not very efficient) and b) the event is extremely long (I had to set TRUNCATE = 5000000 for the sourcetype)
- I tried JSON/multivalues as multiple events, but the limit on mvexpand is still an issue
- Lookup tables are not an option due to the maintenance required; however, I thought about using exportcsv/importcsv dynamically, but that feels like over-engineering.
So I'm open to a solution/discussion on:
- Is there a good/appropriate data structure for this data?
- If JSON is fine, is there a way to avoid using mvexpand?
Thanks in advance

You should consider two things here:
1) Splunk natively works with events, so the best performance is achieved if each data point is one event.
2) Splunk now supports metrics, which might be exactly what you want here...
My recommendation would be either to add a timestamp to each log line and ingest the file line by line (which is the Splunk default), or to leverage automatic timestamp recognition and the fact that Splunk will fall back to the last recognized timestamp when indexing events.
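For example, a minimal props.conf sketch for the second option could look something like this (the sourcetype name and values are only placeholders):
[my_list_data]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TIME_PREFIX = ^Timestamp:\s*
TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ
MAX_TIMESTAMP_LOOKAHEAD = 25
With this, every line becomes its own event, and the value lines (which carry no timestamp of their own) inherit the timestamp from the preceding "Timestamp:" line.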
Thereafter, you should consider converting your events to metric data; see here for reference:
https://docs.splunk.com/Documentation/Splunk/latest/Metrics/L2MOverview
If you have the chance to send the metric data directly to Splunk by other means (collectd, HEC), you should try to do so, as it will significantly boost your search performance for the mentioned use cases.
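For illustration, a single HEC metrics payload could look roughly like this (the metric name, host and epoch time are just examples):
{ "time": 1546304461, "host": "host_a", "source": "mylist", "event": "metric", "fields": { "metric_name": "list.value", "_value": 12.345 } }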
@DMohn, unfortunately, we can't use metrics as it's not a metric (it just looks like one). What I displayed is the summarised data; the actual data point is:
13,45.45623,144 which is summarised to 13.45623.
We're still debating the need for precision. At present we've decided to go for one data point per event.

Is that literally the event? If so, I would probably prepend each value with the timestamp and ingest them separately:
2019-01-01T01:01:01Z 12.345
2019-01-01T01:01:01Z 30.314
2019-01-01T01:01:01Z 52.143
IMO, trying to do what you want with everything tossed into one event is going to be annoying and inefficient. Having them each in their own event should allow you to do everything you want with less of a headache. That is basically what mvexpand is going to do anyway, right? So just do it up front...
@maciep we are moving that way (one data point per event), but I just wanted to know if there is any other structure that I didn't think about. Agreed, having all 50k data points in one event is inefficient and inflexible.

If you expect low cardinality, maybe you could aggregate some of the data first and save it as JSON, e.g. an array of "int: count" objects, something like the sketch below... but I think that would still make your life harder once it's in Splunk.
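For example (field names and counts made up):
{ "timestamp": "2019-01-01T01:01:01Z", "host": "a", "counts": [ {"int": 12, "count": 431}, {"int": 30, "count": 17} ] }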
@maciep we explored aggregation but by doing so we lose the granularity required from the data points.
@splunked38
I think you can change the limit on mvexpand; the second approach seems good. You can adjust it by editing the max_mem_usage_mb setting in limits.conf to increase the memory available to mvexpand.
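For example (the value is only illustrative; the setting lives in the [default] stanza):
# limits.conf
[default]
max_mem_usage_mb = 1000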
@vishaltaneja07011993, we'd rather not increase the limit as the number will grow beyond 50k, and this will become a maintenance task.
