<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Getting a list of unique IDs from a large data set efficiently in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574112#M200077</link>
    <description>&lt;P&gt;I would combine the first and last.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index=abc ID=* 
| fields ID 
| stats count by ID&lt;/LI-CODE&gt;&lt;P&gt;If the ID field is indexed then &lt;FONT face="courier new,courier"&gt;tstats&lt;/FONT&gt; would be more efficient.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| tstats count where index=abc by ID&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 09 Nov 2021 01:11:25 GMT</pubDate>
    <dc:creator>richgalloway</dc:creator>
    <dc:date>2021-11-09T01:11:25Z</dc:date>
    <item>
      <title>Getting a list of unique IDs from a large data set efficiently</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574091#M200071</link>
      <description>&lt;P&gt;We have a relatively small set of devices that each emit daily in the vicinity of a million events.&amp;nbsp; Each device has a unique ID (Serial #) which is included in its events.&lt;/P&gt;&lt;P&gt;What would be an efficient method of collecting a list of unique IDs?&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;index=abc | stats count by ID&amp;nbsp;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;index=abc | stats values(ID) as IDs | mvexpand IDs&lt;BR /&gt;&lt;BR /&gt;index=abc | fields ID | dedup ID&lt;/P&gt;&lt;P&gt;Anything else?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 08 Nov 2021 21:55:28 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574091#M200071</guid>
      <dc:creator>pm771</dc:creator>
      <dc:date>2021-11-08T21:55:28Z</dc:date>
    </item>
    <item>
      <title>Re: Getting a list of unique IDs from a large data set efficiently</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574112#M200077</link>
      <description>&lt;P&gt;I would combine the first and last.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index=abc ID=* 
| fields ID 
| stats count by ID&lt;/LI-CODE&gt;&lt;P&gt;If the ID field is indexed then &lt;FONT face="courier new,courier"&gt;tstats&lt;/FONT&gt; would be more efficient.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| tstats count where index=abc by ID&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 09 Nov 2021 01:11:25 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574112#M200077</guid>
      <dc:creator>richgalloway</dc:creator>
      <dc:date>2021-11-09T01:11:25Z</dc:date>
    </item>
    <item>
      <title>Re: Getting a list of unique IDs from a large data set efficiently</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574133#M200089</link>
      <description>&lt;P&gt;In terms of efficiency, the stats command is _likely_ to be the most efficient. However, make sure you put as many filter criteria in the initial search as possible. For example, if each device produces different types of events and you know it always emits an event with type=X, then include that type filter in the search so it does not scan ALL events produced by the device, only the limited subset.&lt;/P&gt;&lt;P&gt;The job inspector should give you a good idea as to which is the most efficient in your environment.&lt;/P&gt;&lt;P&gt;As&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/213957"&gt;@richgalloway&lt;/a&gt;&amp;nbsp;says, if your ID field is indexed, then tstats will be by far the most efficient way of collecting the list of IDs, at the expense of some extra disk space to index that field for each event.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 09 Nov 2021 06:19:05 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574133#M200089</guid>
      <dc:creator>bowesmana</dc:creator>
      <dc:date>2021-11-09T06:19:05Z</dc:date>
    </item>
    <item>
      <title>Re: Getting a list of unique IDs from a large data set efficiently</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574136#M200092</link>
      <description>&lt;P&gt;As usual, this depends, and the best way to check which one is best for your particular case is to use the Job Inspector, as&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/6367"&gt;@bowesmana&lt;/a&gt;&amp;nbsp;already said. From time to time, dedup can be more efficient than stats (which is efficient in most cases).&lt;/P&gt;&lt;P&gt;r. Ismo&lt;/P&gt;</description>
      <pubDate>Tue, 09 Nov 2021 06:30:02 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574136#M200092</guid>
      <dc:creator>isoutamo</dc:creator>
      <dc:date>2021-11-09T06:30:02Z</dc:date>
    </item>
    <item>
      <title>Re: Getting a list of unique IDs from a large data set efficiently</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574223#M200118</link>
      <description>&lt;P&gt;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/213957"&gt;@richgalloway&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I understand the &lt;FONT face="andale mono,times" color="#0000FF"&gt;ID=*&lt;/FONT&gt; part.&amp;nbsp; &amp;nbsp;Why would I need &lt;FONT color="#0000FF"&gt;fields&lt;/FONT&gt; before &lt;FONT color="#0000FF"&gt;stats&lt;/FONT&gt;?&lt;BR /&gt;&lt;BR /&gt;Can you please explain?&lt;/P&gt;</description>
      <pubDate>Tue, 09 Nov 2021 15:52:20 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574223#M200118</guid>
      <dc:creator>pm771</dc:creator>
      <dc:date>2021-11-09T15:52:20Z</dc:date>
    </item>
    <item>
      <title>Re: Getting a list of unique IDs from a large data set efficiently</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574240#M200123</link>
      <description>&lt;P&gt;The &lt;FONT face="courier new,courier"&gt;fields&lt;/FONT&gt; command reduces the amount of data being processed.&amp;nbsp; It is probably not of much benefit in this example, but it is something to keep in mind when thinking about performance.&lt;/P&gt;</description>
      <pubDate>Tue, 09 Nov 2021 16:34:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574240#M200123</guid>
      <dc:creator>richgalloway</dc:creator>
      <dc:date>2021-11-09T16:34:39Z</dc:date>
    </item>
    <item>
      <title>Re: Getting a list of unique IDs from a large data set efficiently</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574288#M200134</link>
      <description>&lt;P&gt;As&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/213957"&gt;@richgalloway&lt;/a&gt;&amp;nbsp;says, &lt;STRONG&gt;fields&lt;/STRONG&gt; is a useful command, particularly when dealing with large data sets, as it instructs the search to remove unwanted data from the event, thus improving efficiency.&lt;/P&gt;&lt;P&gt;An important point about &lt;STRONG&gt;fields&lt;/STRONG&gt; is that it typically runs on the indexer before the data is returned to a search head, so it can be very important in minimising the data flow through the Splunk environment, therefore improving your search performance, but also having less impact on others' search performance.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 09 Nov 2021 21:08:44 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Getting-a-list-of-unique-IDs-from-a-large-data-set-efficiently/m-p/574288#M200134</guid>
      <dc:creator>bowesmana</dc:creator>
      <dc:date>2021-11-09T21:08:44Z</dc:date>
    </item>
  </channel>
</rss>

