dedup gives a different result if a 'table' command is used before it. A bug?

patng323
Explorer

I am running a search that uses the dedup command:

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | 
dedup id, _time | stats count

The above query returns 794.

However, if I add a table command before dedup:

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | 
table id, _time | dedup id, _time | stats count

The result is 798! This really puzzles me as I don't expect the table command will change my answer. Am I missing something? Or is it a bug?

0 Karma

lukejadamec
Super Champion

If you use rex to extract id from _raw, and then run the two searches on the results of rex, what do you get? If you don't know how to use rex, then post an example event.
Without rex, after you run these searches, what id count do you see in Interesting Fields?
Have you identified the 4 combinations of id and _time that are getting missed? Is it a difference in id or in _time?
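
One way to list the combinations that show up in only one of the two result sets is sketched below. It assumes the index and time range from the original question. The variant field and its two labels are invented for this illustration, and append runs a subsearch, so the usual subsearch limits apply (they should not matter at roughly 800 rows).

```spl
index=myindex earliest=-5d@d latest=@d
| bin _time span=1d
| table id, _time
| dedup id, _time
| eval variant="with_table"
| append
    [ search index=myindex earliest=-5d@d latest=@d
      | bin _time span=1d
      | dedup id, _time
      | fields id, _time
      | eval variant="without_table" ]
| stats values(variant) AS variants dc(variant) AS variant_count by id, _time
| where variant_count < 2
```

Any rows returned are id/_time pairs that only one of the two pipelines produced, with variants showing which one. Note that stats ... by skips rows where id is null, so if nothing comes back, null or multivalue ids would be worth checking next.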

0 Karma

cmerriman
Super Champion

It may be because, with the table command, the only two fields available are id and _time, whereas without it all fields are still present? Dedup keeps the first event for each combination of the specified fields and discards the rest.

Not sure if that's right but it's the best I've got.

https://docs.splunk.com/Documentation/Splunk/6.5.0/SearchReference/Dedup
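
A small self-contained illustration of that "keeps the first event" behaviour, using generated rows rather than real data. The field names n, id and extra are invented for the example and have nothing to do with the index in the question.

```spl
| makeresults count=6
| streamstats count AS n
| eval id = n % 3, extra = "row_" . n
| dedup id
| table n, id, extra
```

This should leave three rows, one per distinct id, each being the first row encountered for that id in search order. It only demonstrates the documented behaviour of dedup. It does not by itself explain why a preceding table would change the count.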

0 Karma

patng323
Explorer

To make sure there isn't any "missing values"-related problem, I changed my queries to:

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | search id=* AND _time=* |
dedup id, _time | stats count

AND

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | search id=* AND _time=* |
table id, _time | dedup id, _time | stats count

But once again, they gave different results (794 and 798 respectively). 😞

0 Karma

cmerriman
Super Champion

What I meant was that table only brings back two fields and all other fields are lost, whereas when you run the dedup without table, all other fields are still available (the same applies when you use "fields" instead of table). That could be why. It is not that field values are missing, but that the fields themselves are gone.

0 Karma

cmerriman
Super Champion

Try adding a stats dc(id) by _time to your searches and see how many ids you actually have per _time.
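
Something along these lines, as a rough sketch that assumes the same index and time range as the original searches (distinct_ids is just a label chosen here):

```spl
index=myindex earliest=-5d@d latest=@d
| bin _time span=1d
| stats dc(id) AS distinct_ids by _time
| addcoltotals distinct_ids
```

The extra total row adds up the per-day distinct counts, which is the number of distinct id/_time pairs. That gives an independent figure to compare against the 794 and 798 coming out of the two dedup searches.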

0 Karma

patng323
Explorer

If I run this:

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | search id=* AND _time=* |
table id, _time | stats dc(id) by _time

I got identical results with or without the table command. But again, if I replace the dc part with dedup id, _time | stats count, then I get a different answer when the table command is present.

Regarding the worry that "table is only bringing back two fields": as a test I changed that part to table id, _time, host, source, but the problem still isn't resolved. So I think it's not about "table is only bringing back two fields".

0 Karma

cmerriman
Super Champion

If you try |table *|stats count, are your results different? Unless the only fields in your index are id, _time, host and source.

0 Karma

patng323
Explorer

Again, |table *|dedup id, _time|stats count and |fields *|dedup id, _time|stats count give different results, with the table answer always larger than the fields one.

And there are many other fields in the events.

0 Karma

cmerriman
Super Champion

so |table * gives you 798 and |fields * gives you 794?

this is quite the tricky widget...

0 Karma

somesoni2
Revered Legend

How about this query's result?

index=myindex earliest=-5d@d latest=@d | 
 bin _time span=1d | 
 fields id, _time | dedup id, _time | stats count

AND

index=myindex earliest=-5d@d latest=@d id=* | 
 bin _time span=1d | 
 table id, _time | dedup id, _time | stats count
0 Karma

patng323
Explorer

Thanks. Actually I tried your first one as well and it also gave 794, the same answer as the query without the fields id, _time part.

0 Karma

richgalloway
SplunkTrust

Is it possible the event list changed between the two queries? Try running them each with the same explicit start and end times (use the time selector and remove the earliest and latest keywords from the query).

---
If this reply helps you, Karma would be appreciated.
0 Karma

patng323
Explorer

Thanks, but I don't think the event list was changing because the events are updated once a day in a nightly batch job. And I switched between these two queries many times (to debug them) and I always got back 794 and 798 respectively.

And since I use the @d in both the earliest and latest, the time range was fixed while I was debugging it.

0 Karma