We have a relatively small set of devices, each of which emits on the order of a million events per day. Each device has a unique ID (serial number), which is included in every event.
What would be an efficient method of collecting a list of unique IDs?
index=abc | stats count by ID
index=abc | stats values(ID) as IDs | mvexpand IDs
index=abc | fields ID | dedup ID
Anything else?
In terms of efficiency, the stats command is _likely_ to be the most efficient. However, make sure you put as many filter criteria as possible in the initial search. For example, if each device produces several types of event and you know it always emits an event with type=X, then include that type filter in the search so it does not scan ALL events produced by the device, only the limited subset.
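For example, assuming a hypothetical event field called type (substitute whatever field your devices actually emit), the filter belongs before the first pipe so the indexers can discard non-matching events early:

```
index=abc type=X
| stats count by ID
```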
The job inspector should give you a good idea as to which is the most efficient in your environment.
As @richgalloway says, if your ID field is indexed, then tstats will be by far the most efficient way of collecting the list of IDs, at the expense of some extra disk space to index that field for each event.
As usual, this depends, and the best way to check which one is best for your particular case is to use the Job Inspector, as @bowesmana already said. From time to time dedup can be more efficient than stats (which is efficient in most cases).
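As a sketch of the dedup variant (restricting the output to the ID field; worth comparing against the stats version in the Job Inspector for your data):

```
index=abc ID=*
| dedup ID
| table ID
```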
r. Ismo
I would combine the first and last.
index=abc ID=*
| fields ID
| stats count by ID
If the ID field is indexed then tstats would be more efficient.
| tstats count where index=abc by ID
@richgalloway
I understand the ID=* part. Why would I need fields before stats?
Can you please explain?
The fields command reduces the amount of data being processed. It probably is not of much benefit in this example, but is something to keep in mind when thinking about performance.
As @richgalloway says, fields is a useful command, particularly when dealing with large data sets, as it instructs the search to remove unwanted data from the event, thus improving efficiency.
An important point about fields is that it typically runs on the indexers before the data is returned to the search head, so it can be very important in minimising the data flow through the Splunk environment, improving your search performance while also reducing the impact on other users' searches.