I want a distinct count for a given field by day, but the count also needs to look at all previous days in the given time range so that I can see the remaining (newly seen) distinct count.
For example:
When I search for all events that are on Monday, the result is: a,b,c,d,c,z
When I search for all events that are on Tuesday, the result is: a,b,c,d,e,f,a,b
When I search for all events that are on Wednesday, the result is: a,b,c,d,e,f,z
The distinct count for Monday is 5, for Tuesday it is 6, and for Wednesday it is 7. The remaining distinct count for Tuesday would be 2, since a, b, c, and d have all already appeared on Monday, and the remaining distinct count for Wednesday would be 0, since all of its values have already appeared on Monday or Tuesday.
How would I be able to do this?
Thanks
As an example of collecting today's values that were not seen previously, you can do the following:
index=_internal earliest=-3d@d latest=-d@d
| stats count by sourcetype
| append [
search index=_internal earliest=@d latest=now
| stats count by sourcetype
]
| stats count by sourcetype
| where count=1
but I guess this is not quite what you're after - with count=1 it would also return a sourcetype that was seen on previous days but not today.
I suspect you will need to look at streamstats.
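For example, here is a minimal sketch of that streamstats approach, using the same _internal/sourcetype data as the search above (untested, and daily_dc/remaining_dc are just illustrative field names):
index=_internal earliest=-3d@d latest=@d
| bucket _time span=1d
| stats values(sourcetype) as sourcetype by _time
| mvexpand sourcetype
| streamstats count by sourcetype
| eval is_new=if(count=1, 1, 0)
| stats dc(sourcetype) as daily_dc sum(is_new) as remaining_dc by _time
Because stats sorts its output by _time ascending, streamstats sees each value's earliest day first, so count=1 flags newly seen values; daily_dc is then the per-day distinct count and remaining_dc the count of values not seen on any earlier day.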
Does the count ever reset? So if you get the same results the following week, is the remaining DC then always 0, since you are looking back to include this week when determining what is 'new'?
Thanks for the reply! No the count doesn't reset.
I made a search that does what I need but it is quite slow:
index=_internal earliest=-3d@d latest=-d@d
| reverse
| dedup sourcetype
| bucket _time span=1d
| stats dc(sourcetype) by _time
The above search makes use of the fact that dedup keeps the first event it sees: since Splunk returns events newest first, reversing them first means the first event dedup sees for each value is its chronologically earliest occurrence.
Cheers,
Hey cool, I worked out how to make it faster.
Rather than reversing all the data, first bucket and collect the values of the field, then expand that to rows and sort by ascending time. The data volume is likely to be much smaller, and it achieves the same thing based on your dedup principle - nice usage @boo
index=_internal earliest=-7d@d latest=-d@d
| bucket _time span=1d
| stats values(sourcetype) as sourcetype dc(sourcetype) by _time
| mvexpand sourcetype
| sort 0 _time
| dedup sourcetype
| stats values(sourcetype) as sourcetype dc(sourcetype) by _time
Note the 0 on the sort - without it, sort silently truncates to 10,000 results, which could bite on a wider time range.
I included the values(sourcetype) in the second stats so I could see it working. In my test it reduced the run time over the last 7 days from 50 seconds for 5.5 million rows to 15 seconds.
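If you also need the per-day distinct count alongside the newly seen count, one possible variant (an untested sketch; daily_dc and new_dc are just illustrative field names) is to name the first dc() and carry it through:
index=_internal earliest=-7d@d latest=-d@d
| bucket _time span=1d
| stats values(sourcetype) as sourcetype dc(sourcetype) as daily_dc by _time
| mvexpand sourcetype
| sort 0 _time
| dedup sourcetype
| stats dc(sourcetype) as new_dc max(daily_dc) as daily_dc by _time
Since daily_dc is constant across a given day's rows, max() simply carries it through. One caveat: a day on which every value has already been seen loses all of its rows at the dedup, so it drops out of the results entirely rather than showing new_dc=0.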