Splunk Search

Is there a distributable equivalent of dedup that runs on the indexers?

DalJeanis
Legend

Background in a moment, but here's the question:

Is there a way to run the equivalent of dedup against each indexer's results before they get sent to the search head?

I'm running a search against a huge amount of data (dozens of terabytes at minimum) and need only one example of each combination of a couple of fields (let's say 3 at most). There will be tens of thousands of distinct value combinations and tens of millions of duplicates, spread across a few dozen indexers. The distribution of results is likely to be neither sparse nor dense but long-tailed: a few combinations will predominate, with hundreds of thousands of duplicates each over the time period, while many combinations will be rare, with only a few dozen duplicates each.

Is there any streamable command or strategy that can kill the lion's share of those duplicates?

Obviously, I'll have some non-streamable commands after I pull the results together, but it seems like a pre-sort strategy for dedup would save a lot of transmission bandwidth, if nothing else, and could push the work out to the indexers instead of dropping it on the search head.


Ideally, I'd love an option to force streamstats to run at each indexer (perhaps with a different name to avoid confusion). That would be much more versatile for these kinds of optimization problems.

Alternatively, I guess I'll need to pursue some kind of custom command...


At the moment I'm leaning toward creating a lookup table (local=f) and killing all events where the last copy is less than x days old. Initially, I can randomly backdate the most recent report to spread them out.

This strategy would give me a "most recent date" that is inaccurate by up to x-1 days, but would cut a 9-hour extract to something more manageable.
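A rough sketch of that lookup idea, offered as an illustration rather than a tested search. Every name here is hypothetical: seen_combos is an assumed CSV lookup keyed on X, Y and Z with a last_seen column holding an epoch time, and 7 days stands in for x. With local=f the lookup is allowed to run on the indexers, so events whose combination was reported recently can be dropped before they are sent to the search head.

```spl
.... | fields _time X Y Z
     | lookup local=f seen_combos X Y Z OUTPUT last_seen
     | where isnull(last_seen) OR last_seen < relative_time(now(), "-7d")
     | stats max(_time) as last_seen by X Y Z
```

Writing the surviving rows back into the lookup (for example with outputlookup) would be a separate step and is not shown here.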

1 Solution

jplumsdaine22
Influencer

As I understand it, stats runs in a semi-distributed mode if it is the first transforming command. So instead of

.... | dedup X Y Z

try this

.... | stats latest(X) as X latest(Y) as Y by Z

Also if your search head supports it (and if it's logical for your data) use event sampling - you will get a tremendous speed up!
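If one row per distinct combination of all three fields is needed, which is what dedup X Y Z returns, a variant that puts every field in the by clause may be closer to the original intent. This is a sketch under that assumption; the output names count and latest_time are made up for illustration. Each indexer computes partial results for stats and the search head only merges them.

```spl
.... | fields _time X Y Z
     | stats count max(_time) as latest_time by X Y Z
```

As with dedup's default behavior, events missing any of X, Y or Z do not contribute a row.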

DalJeanis
Legend

Still testing - I found a different issue that sped the query up impressively, though. I was assuming table was a distributed streaming command, because it obviously operates at the event level, but it turns out that it has some group effects and is decidedly not streaming.

Changing |table + fieldX fieldY to | fields fieldX fieldY | fields - _* gave me more than a 40x speed boost, which pulls the search from 9 hours to well under a half hour.
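For illustration, the change might sit in a full pipeline as below. The trailing stats and table lines are an assumption about the rest of the search, not something taken from the original query; the point is only that fields trims events at the indexers, while table is left for final presentation at the very end.

```spl
.... | fields fieldX fieldY
     | fields - _*
     | stats count by fieldX fieldY
     | table fieldX fieldY count
```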

I'll accept this answer if testing shows it gives any boost at all.

tmoser
Splunk Employee

Yes, "fields" is a distributable streaming command, so it runs at the indexer layer. Check the following links:

DalJeanis
Legend

Nice! Let me try that and see if it has a significant effect.
