Solved: Is there an equivalent for Dedup distributable on search indexes

DalJeanis · ‎07-17-2017

Background in a moment, but here's the question:

Is there a way to have the equivalent of dedup running against each index's results before they get sent to the search head?

I'm running a search against a huge amount of data (dozens of terabytes at minimum) and need only one example of each combination of a couple of fields (let's say 3 at most). There will be tens of thousands of distinct value combinations and tens of millions of dups, across a few dozen indexers. The results distribution is likely to be neither sparse nor dense, but long-tail - a few combinations will predominate, with hundreds of thousands of dups each over the time period, and lots of combinations will be rare, with only a few dozen dups each.

Is there any streamable command or strategy that can kill the lion's share of those duplicates?

Obviously, I'll have some non-streamable commands after I pull them all together, but it seems like a pre-sort strategy for dedup would save a lot of transmission bandwidth, if nothing else, and could push the work out to the indexers instead of dropping it on the search head.

Ideally, I'd love an option to force streamstats to run at each indexer (perhaps with a different name to avoid confusion). That would be much more versatile for these kinds of optimization problems.

Alternatively, I guess I'll need to pursue some kind of custom command...

At the moment I'm leaning toward creating a lookup table (local=f) and killing all events where the last copy is less than x days old. Initially, I can randomly backdate the most recent report to spread them out.

This strategy would give me a "most recent date" that is inaccurate by up to x-1 days, but would cut a 9-hour extract to something more manageable.
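
A rough sketch of that lookup idea, under assumptions not stated above: a lookup named seen_combos (hypothetical) keyed on X, Y and Z, with a last_seen field holding an epoch time, and x set to 7 days. With local=f the lookup is allowed to run on the indexers, so combinations that were reported recently should be dropped before anything reaches the search head.

```spl
.... | lookup local=f seen_combos X Y Z OUTPUT last_seen
     | where isnull(last_seen) OR last_seen < relative_time(now(), "-7d")
     | stats latest(_time) as _time by X Y Z
```

This only works as intended if the lookup is available to the indexers, and the final stats is just one possible way to collapse whatever survives the filter.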

jplumsdaine22 · ‎07-18-2017

As I understand it, stats runs in a semi-distributed mode if it is the first transforming command. So instead of

.... | dedup X Y Z

try this

.... | stats latest(X) as X latest(Y)  as Y by Z
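
One caveat, offered as a sketch rather than a tested recipe: the stats form above returns one row per value of Z, whereas dedup X Y Z keeps one event per combination of all three fields. Assuming the goal is one row per combination, putting all three fields in the by clause should be closer to the dedup behaviour. X, Y and Z are the placeholder field names used in this thread.

```spl
.... | stats latest(_time) as _time count by X Y Z
```

Like dedup with its default keepempty=false, stats drops events in which any of the by fields is missing. The count column is optional; it just shows how many duplicates each combination collapsed.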

Also if your search head supports it (and if it's logical for your data) use event sampling - you will get a tremendous speed up!

DalJeanis · ‎07-20-2017

Still testing - I found a different issue that sped the query up impressively, though. I was assuming table was a distributed streaming command, because it obviously operates at the event level, but it turns out that it has some group effects and is decidedly not streaming.

Changing | table + fieldX fieldY to | fields fieldX fieldY | fields - _* gave me more than a 40x speed boost, which cut the search from 9 hours to well under half an hour.

I'll accept this one if it verifies as giving any boost at all
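
Putting the two findings in this thread together, the pipeline might look like the sketch below. fieldX and fieldY are the placeholder names from the post; the trailing stats is an assumption about how the dedup step would be done, not something confirmed in this thread.

```spl
.... | fields fieldX fieldY
     | fields - _*
     | stats count by fieldX fieldY
```

fields keeps internal fields such as _raw and _time unless they are removed explicitly, which is what the second fields does; dropping _raw early is likely where much of the saving comes from.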

tmoser · ‎08-02-2022

Yes, "fields" is a distributable streaming command, so it runs on the indexer layer. Check the following links:

Types of commands - https://docs.splunk.com/Documentation/Splunk/9.0.0/Search/Typesofcommands#:~:text=Distributable%20st....
Streaming commands by type - https://docs.splunk.com/Documentation/Splunk/9.0.0/SearchReference/Commandsbytype#Streaming_commands

DalJeanis · ‎07-18-2017

Nice! Let me try that and see if it has a significant effect.
