I’m building a report that finds the number of unique users in our activity log each day:
sourcetype="accountTransaction" | timechart span="1d" dc(accountID)
The results are in the neighborhood of 12,000 each day.
This search takes forever to complete, so it seems like a perfect opportunity to use a summary index. So, I changed the search to this:
sourcetype="accountTransaction" | sitimechart span="1d" dc(accountID)
Saved it and scheduled it to run hourly and to use summary indexing. The job runs, but then when I run the search against it:
index=summary search_name="30-day DAU summary" |timechart span=1d dc(accountID)
The result (while nearly instantaneous) is dc(accountID)=1000 every single day – a flat line. Any idea what’s going on? Am I hitting a limit somewhere that I don’t know about?
So, first of all, I must ask if you ever need to have a distinct count of unique users by anything other than a day? If so, will you need it for arbitrary periods or just fixed specific ones, i.e., would you need the count for some random 6-day period starting 43 days ago, or would you only need it for an entire month from the 1st to the end, or a week from Sunday through Saturday?
The reason this matters is that si versions of timechart can be very space-inefficient when you use dc(), because they must store information to let you aggregate up to any arbitrary interval. If you don't need that, you can conserve a lot of space by using plain stats to store just the specific periods you want. For example, in your case sistats would have to store about 12,000 items per day (each actual item), while a plain stats will store only one entry (just the count). But you can't figure out the complete distinct count over (e.g.) three days from just the distinct counts of each of the three days.
If you can use plain stats instead of sistats, you won't have your limits problem.
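To see why daily distinct counts can't be combined into a multi-day distinct count, here is a small Python sketch (not SPL). The account IDs are made up purely for illustration, and the sets only stand in for what each kind of summary would have to keep.

```python
# Hypothetical account IDs seen on three consecutive days (made-up data).
day1 = {"a1", "a2", "a3"}
day2 = {"a2", "a3", "a4"}
day3 = {"a3", "a5"}

# A plain stats summary would keep only one number per day.
daily_counts = [len(day1), len(day2), len(day3)]

# The true three-day distinct count needs the actual IDs, not the counts.
true_three_day_count = len(day1 | day2 | day3)

print(daily_counts, sum(daily_counts), true_three_day_count)
# prints: [3, 3, 2] 8 5
```

Adding the daily numbers gives 8, but only 5 distinct accounts were active, because some accounts show up on more than one day. That overlap is the information the si commands have to carry around, and it is what you can drop if you only ever chart one fixed period.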
Now, the limits problem: the si commands have a limit on the number of distinct values they track. This limit is set in limits.conf under the [sistats] section by maxvalues. You can raise this, but if you raise it to over 12,000 to accommodate your data, it is likely that you will also need to increase the TRUNCATE limit (a props.conf setting) on the [stash] sourcetype. Yes, it's getting complicated. I also don't know if you'll bump into other limits when reporting back. Hopefully not.
You can avoid making these configuration changes if you use sistats count by user instead of sistats dc(user), but this will substantially slow down reporting, and require you to slightly change your reporting queries.
...|stats count by user | stats count is the same as ...|stats dc(user) as count with the exception that the former won't hit the same limits. So if you use ...|sistats count by user and add | stats count when you get the data back out of the summary index, you will have the distinct count.
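If the equivalence between "count by user, then count the rows" and a direct distinct count isn't obvious, this short Python sketch (again not SPL, with a made-up list of user values) mirrors the two routes. The comments only mark which step each line is meant to be analogous to.

```python
from collections import Counter

# Hypothetical values of the user field across six events (made-up data).
events = ["u1", "u2", "u1", "u3", "u2", "u1"]

# Analogous to "stats count by user": one row per user with its event count.
per_user = Counter(events)

# Analogous to the trailing "| stats count": count the rows, one per user.
distinct_via_rows = len(per_user)

# Analogous to "stats dc(user)": count the distinct values directly.
distinct_direct = len(set(events))

print(distinct_via_rows, distinct_direct)
# prints: 3 3
```

Both routes give the same number. The difference is that the first one stores a row per user along the way, which is why it costs more at reporting time.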
Thanks for the reply.
As to your questions - for the dashboard, I need a chart of unique users by day for the past 30 days. We wouldn't need the count over an arbitrary time window.
Ideally I should only have to run this search for each day - the unique user count isn't going to change historically.
I'll play around with "stats count by user" but I'm not sure how to handle uniques in that situation.