Splunk Search

Dedup Command Using All Available Memory

Dennis
Explorer

Hello,

I didn't get any hits when searching for this issue, so I'm starting a new thread; I also didn't find any previously reported defect for this possible issue.

We are running Splunk Enterprise (SE) 9.5.0.  We needed an accurate count of the exact number of duplicate events being indexed into Splunk, and ended up having to use the transaction command.

To do this, I ran a search over a 24-hour period, which completed in about 16 minutes, to get the total number of events and the data set size.

Then I ran the same search with the dedup command appended to remove all the duplicate events:  | dedup _time _raw
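
Roughly, the two searches had this shape (the index name and time bounds are placeholders, not the real ones):

index=my_index earliest=-24h@h latest=@h
| stats count AS total_events sum(eval(len(_raw))) AS total_bytes

and then the same base search with dedup appended:

index=my_index earliest=-24h@h latest=@h
| dedup _time _raw
| stats count AS unique_events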

The problem is that the dedup command commences to use up all the available memory until Splunk kills the search with the message "Your search has been terminated. This is most likely due to an out of memory condition."

The data set for the 24-hour search period is 13 billion+ events, and the data set size is 1.6 TB.

The 3 search heads (SHs) each have 374 GB of memory. They are usually at 14%-16% memory usage. Using the higher number of 16%, that leaves roughly 314 GB of memory available when the search starts.

The search with dedup proceeds to consume that 314 GB of available memory over a period of about 1 hour and 40 minutes until it is all used up, and Splunk kills the search when memory is near 100% utilization.

 

(Screenshots attached: Dennis_0-1698253727598.png, Dennis_2-1698253810534.png)

 

So these are the two possible reasons that dedup could be using all the remaining available memory:

  1. The dedup command is designed this way, and uses more and more memory as the data set increases in size / number of events.
  2. There is a defect with the dedup command in SE 9.5.0.

Can someone explain which of the two reasons it is?

 


Dennis
Explorer

Hello bowesmana,

The transaction command worked.  Memory was at 16% when the search was started, and the search ran for 72 hours with the transaction command, but memory utilization stayed at 16% every time I checked.

So the transaction command doesn't have the huge memory requirements issue that the dedup command has.

The overall count needed was all the events in that 24-hour period, and then all the events in that same 24-hour period minus exact duplicate events.

As mentioned, I was able to get the count of all the events, minus the duplicates, using the transaction command.
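
One hedged sketch of how the transaction command can produce such a count (the exact search used isn't shown here; the index name is a placeholder, and grouping on _raw with a short maxspan is only an approximation of dedup _time _raw):

index=my_index earliest=-24h@h latest=@h
| transaction _raw maxspan=1s
| stats count AS unique_events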

So all is good.

The overall reason for this post was to find out whether the dedup command possibly had a defect in SE 9.5.0, and you answered that the dedup command is designed that way. However, that means the dedup command is basically useless with larger data sets.


bowesmana
SplunkTrust

@Dennis Dedup is often used where stats could be used instead, and I would normally suggest stats as the tool for de-duplication, unless you specifically need to dedup and retain the events, or event order is important in the deduplication.

I wouldn't say it's useless for large data sets, but doing things with _raw will be inefficient.

A note of caution on transaction: its memory constraints are controlled through limits.conf (see the [transactions] stanza), so you will not see massive amounts of memory being consumed. What will happen instead is that evictions occur, and as a result your transactions MAY not reflect reality.

I believe the default for maxopentxn (max open transactions) is 5,000 and for maxopenevents it is 100,000, so in your 24-hour period with 13 billion events you are going to see around 150,000 events per second. The chances are reasonably high that you will have more than that number of open transactions, so there is a reasonable possibility that you will miss a duplicate because an open transaction has been evicted to make way for a new one.
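
For reference, those limits are set in limits.conf under the [transactions] stanza, along these lines (the values shown are just the defaults quoted above, so check your own environment):

[transactions]
maxopentxn = 5000
maxopenevents = 100000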

Unless you can prove that the numbers you have reached are in fact correct, which on the size of your data may be challenging, I would treat that with an element of scepticism.

I would suggest that you do a shorter time range search with the sha256() technique and the transaction command to see that they come up with the same results. You should be able to get a good feeling of trust by running each of the searches over a 5-10 minute period.
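
As a rough sketch of that comparison (index name and window are placeholders), the hash-based count over a short window would be:

index=my_index earliest=-10m@m latest=@m
| eval sha=sha256(_raw)
| stats count by _time sha
| stats sum(count) AS total_events count AS unique_events

Then run the transaction-based search over the same 10 minutes and check that the number of transactions it returns matches unique_events.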

 


bowesmana
SplunkTrust

Using dedup on _raw is always going to give you problems! Imagine what is going on in the server trying to hold all that data, so it can determine whether the next event is a duplicate of one seen before.

The transaction command is also going to give you false information working on that data set - it has memory limitations, and you will not know that it is failing; it just won't give you correct results.

Let's wind back - what information do you need to know about duplicates and what do you want to show as a result of duplicates?

You could do something like

| eval sha=sha256(_raw)
| fields - _raw
| stats count by _time sha
| where count > 1

where you turn _raw into a hash and then stats on the hash to find the duplicate count of hashes. You could include 

| eval sha=sha256(_raw)
| stats count values(_raw) as rawVals by _time sha
| where count > 1

so that you can see the raw values. Do this on a small data set before you suddenly jump to 1.6 TB of data.
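
If what you ultimately need is just the two totals (all events, and all events minus exact duplicates), a variation on the same idea is the following sketch, where the _time plus hash key mirrors dedup _time _raw:

| eval key=_time.":".sha256(_raw)
| stats count AS total_events dc(key) AS unique_events

On a data set this large, dc() itself can get expensive, so estdc(key) can be substituted for an approximate but much cheaper distinct count.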

 
