How can dedup with multiple fields return fewer results?

jrfreeze
Explorer

I think I must be misunderstanding how dedup works. It seems to me if you add fields to the dedup field list, you should never get fewer events returned.
| dedup fieldA
Should get rid of all extra events with the same value of fieldA
| dedup fieldA fieldB
Should only get rid of those where BOTH fieldA and fieldB have duplicate values, which set theory suggests to me must leave a result set at least as large as the one where we get rid of duplicates on fieldA alone.

But I'm getting far more results for:
| dedup _time
Than I do for
| dedup _time wma_set wma_filename

Any idea what's going on? For reference, here's the query:

index="main" host="designsafe01.tacc.utexas.edu" "designsafe.storage.community" "SimCenter/Datasets" (op=download OR op=preview OR op=copy OR op=agave_file_download OR op=agave_file_preview OR op=data_depot_copy)
| rex mode=sed "s/%20/ /g"
| rex mode=sed field=info "s/\'/\"/g"
| rex mode=sed field=info "s/\: u\"/: \"/g"
| eval thepath=case(in(op,"download","preview","agave_file_download","agave_file_preview"),json_extract(info,"filePath"),op="copy", json_extract(info,"path"), op="data_depot_copy", json_extract(info,"fromFilePath"))
| rex field=thepath "\/?SimCenter\/Datasets\/(?<wma_set>\w+)(?<wma_path>\/(.*\/)*)(?<wma_filename>[-\w\s\.]+)"
| rex field=wma_filename ".+\.(?<wma_extension>\w*)"
| dedup _time wma_set wma_filename

somesoni2
Revered Legend

Your dedup can return fewer rows if one or more of the dedup fields have null values. By default, dedup drops any event in which one of the listed fields is missing, so each field you add can remove events instead of keeping more of them. Try something like this to confirm:

index="main" host="designsafe01.tacc.utexas.edu" "designsafe.storage.community" "SimCenter/Datasets" (op=download OR op=preview OR op=copy OR op=agave_file_download OR op=agave_file_preview OR op=data_depot_copy)
| rex mode=sed "s/%20/ /g"
| rex mode=sed field=info "s/\'/\"/g"
| rex mode=sed field=info "s/\: u\"/: \"/g"
| eval thepath=case(in(op,"download","preview","agave_file_download","agave_file_preview"),json_extract(info,"filePath"),op="copy", json_extract(info,"path"), op="data_depot_copy", json_extract(info,"fromFilePath"))
| rex field=thepath "\/?SimCenter\/Datasets\/(?<wma_set>\w+)(?<wma_path>\/(.*\/)*)(?<wma_filename>[-\w\s\.]+)"
| rex field=wma_filename ".+\.(?<wma_extension>\w*)"
| eval wma_set=coalesce(wma_set,"Not_Available"), wma_filename=coalesce(wma_filename,"Not_Available")
| dedup _time wma_set wma_filename
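
For a self-contained check of this behavior, here is a minimal sketch that uses generated events rather than the real data. The helper field n is made up for illustration, fieldA and fieldB are the placeholder names from the question, and the search assumes dedup is left at its default of keepempty=false.

```spl
| makeresults count=4
| streamstats count AS n
| eval fieldA=if(n<=2, "x", "y")
| eval fieldB=if(n==1, "b1", null())
| dedup fieldA fieldB
```

Only the first generated event has a value for fieldB, so this should return a single row, whereas ending the search with | dedup fieldA returns two rows (one per value of fieldA). Adding keepempty=true to the dedup command is another way to keep events in which a listed field is null, as an alternative to filling the gaps with coalesce.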

jrfreeze
Explorer

That did the trick - thanks!
