Getting a list of unique IDs from a large data set...

pm771 · ‎11-08-2021

We have a relatively small set of devices, each of which emits in the vicinity of a million events daily. Each device has a unique ID (Serial #), which is included in its events.

Anything else?

bowesmana · ‎11-08-2021

In terms of efficiency, the stats command is _likely_ to be the most efficient. However, make sure you put as many filter criteria in the initial search as possible. For example, if each device produces different types of events and you know it always produces an event with type=X, then include that type filter in the search, so that it searches only that limited subset rather than ALL events produced by the device.
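
A minimal sketch of that kind of filtering, assuming an index named abc (the placeholder used elsewhere in this thread) and a hypothetical field called type whose value X appears at least once per device in the search window:

```spl
index=abc type=X ID=*
| stats count by ID
```

The filters in the first line are applied when events are retrieved, so stats only sees the type=X events. The count column is incidental here; the rows themselves are the unique ID list.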

The job inspector should give you a good idea as to which is the most efficient in your environment.

As @richgalloway says, if your ID field is indexed, then tstats will be by far the most efficient way of collecting the list of IDs, at the expense of some extra disk space to index that field for each event.

isoutamo · ‎11-08-2021

As usual, it depends, and the best way to check which option is best for your particular case is to use the Job Inspector, as @bowesmana already said. Occasionally dedup can be more efficient than stats (which is efficient in most cases).
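
For comparison, a dedup-based version might look like the sketch below. The index name abc is a placeholder, and whether this beats stats on your data is something only the Job Inspector can tell you:

```spl
index=abc ID=*
| fields ID
| dedup ID
| table ID
```

dedup keeps the first event it encounters for each ID value and discards the rest, and table then reduces the output to the ID column. Running this and the stats version over the same time range and comparing their run times in the Job Inspector is one way to decide between them.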

r. Ismo

richgalloway · ‎11-08-2021

I would combine the first and last.

index=abc ID=* 
| fields ID 
| stats count by ID

If the ID field is indexed then tstats would be more efficient.

| tstats count where index=abc by ID

---
If this reply helps you, Karma would be appreciated.

pm771 · ‎11-09-2021

@richgalloway

I understand the ID=* part. Why would I need fields before stats?

Can you please explain?

richgalloway · ‎11-09-2021

The fields command reduces the amount of data being processed. It probably is not of much benefit in this example, but is something to keep in mind when thinking about performance.

bowesmana · ‎11-09-2021

As @richgalloway says, fields is a useful command, particularly when dealing with large data sets, as it instructs the search to remove unwanted data from the event, thus improving efficiency.

An important point about fields is that it typically runs on the indexer before the data is returned to a search head. This can be very important in minimising the data flow through the Splunk environment, which improves your own search performance and also reduces the impact on others' searches.
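
As a rough illustration, here is a variant of the earlier search that trims events as far as possible before they are aggregated. The index name abc is a placeholder, and the second fields line is an optional extra, not something suggested earlier in the thread:

```spl
index=abc ID=*
| fields ID
| fields - _raw
| stats count by ID
```

The first fields keeps only ID plus the internal fields, which fields retains by default; the second explicitly drops _raw, usually the largest part of each event. Splunk may already optimise some of this away when stats follows, so it is worth confirming any gain in the Job Inspector.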
