Re: dedup gives different result if a 'table' command is used before it. A bug??

patng323 · ‎10-12-2016

I am running a search that uses the dedup command:

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | 
dedup id, _time | stats count

The above query returns 794.

However, if I add a table command before dedup:

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | 
table id, _time | dedup id, _time | stats count

The result is 798! This really puzzles me, as I wouldn't expect the table command to change the answer. Am I missing something? Or is it a bug?

lukejadamec · ‎10-14-2016

If you use rex to extract id from _raw, and then run the two searches from the results of rex what do you get? If you don't know how to use rex, then post an example event.
Without rex, after you run these searches, what id count do you see in Interesting Fields?
Have you identified the 4 combinations of id and _time that are getting missed - is it a difference in id or _time?
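
One way to isolate the four extra id/_time pairs is sketched below with the set command. This is an untested suggestion, not a confirmed diagnosis. It assumes both result sets stay under the set command's subsearch limits, which should hold for roughly 800 rows:

```
| set diff
    [search index=myindex earliest=-5d@d latest=@d | bin _time span=1d | table id, _time | dedup id, _time]
    [search index=myindex earliest=-5d@d latest=@d | bin _time span=1d | dedup id, _time | table id, _time]
```

The rows that come back are the pairs present in only one of the two result sets, which should show whether the difference lies in id or in _time.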

cmerriman · ‎10-13-2016

It may be because with the table command the only two fields available are id and _time, whereas without it all fields are still present? Dedup keeps the first event for each combination of the specified fields and discards the rest.

Not sure if that's right but it's the best I've got.

https://docs.splunk.com/Documentation/Splunk/6.5.0/SearchReference/Dedup

patng323 · ‎10-13-2016

To make sure there isn't any "missing values"-related problem, I changed my queries to:

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | search id=* AND _time=* |
dedup id, _time | stats count

AND

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | search id=* AND _time=* |
table id, _time | dedup id, _time | stats count

But once again, they gave different results (794 and 798 respectively). 😞

cmerriman · ‎10-13-2016

What I meant was that table brings back only two fields and all other fields are lost, whereas when you run dedup without table, all other fields are still available (the same applies when you use "fields" instead of table). That could be why. Not that field values are missing, but that the fields themselves are gone.

cmerriman · ‎10-13-2016

Try adding a dc(id) by _time to your searches and see how many ids you actually have per _time
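
A sketch of that check, assuming the same base search as in the original question. The name distinct_ids is only an illustrative label. addcoltotals appends a row with the sum across all days, which is the number to compare against 794 and 798:

```
index=myindex earliest=-5d@d latest=@d
| bin _time span=1d
| stats dc(id) AS distinct_ids by _time
| addcoltotals distinct_ids
```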

patng323 · ‎10-14-2016

If I run this:

index=myindex earliest=-5d@d latest=@d | 
bin _time span=1d | search id=* AND _time=* |
table id, _time | stats dc(id) by _time

I got identical results with or without the table command. But again, if I replace the dc with dedup id, _time | stats count, then I get a different answer when the table command is present.

Regarding the worry that "table is only bringing back two fields": as a test, I changed that part to table id, _time, host, source, but the problem still isn't resolved. So I think it's not about "table is only bringing back two fields".

cmerriman · ‎10-14-2016

If you try |table *|stats count, are your results different? Unless the only fields in your index are id, _time, host and source.

patng323 · ‎10-14-2016

And there are many other fields in the events.

cmerriman · ‎10-14-2016

so |table * gives you 798 and |fields * gives you 794?

this is quite the tricky widget...

somesoni2 · ‎10-12-2016

How about this query's result?

index=myindex earliest=-5d@d latest=@d | 
 bin _time span=1d | 
 fields id, _time | dedup id, _time | stats count

AND

index=myindex earliest=-5d@d latest=@d id=* | 
 bin _time span=1d | 
 table id, _time | dedup id, _time | stats count

patng323 · ‎10-13-2016

Thanks. Actually I tried your first one as well and it also gave 794, the same answer from the query without the fields id, _time part.

richgalloway · ‎10-12-2016

Is it possible the event list changed between the two queries? Try running them each with the same explicit start and end times (use the time selector and remove the earliest and latest keywords from the query).


patng323 · ‎10-13-2016

Thanks, but I don't think the event list was changing because the events are updated once a day in a nightly batch job. And I switched between these two queries many times (to debug them) and I always got back 794 and 798 respectively.

And since I use the @d in both the earliest and latest, the time range was fixed while I was debugging it.
