Splunk Search

Getting a list of unique IDs from a large data set efficiently

pm771
Communicator

We have a relatively small set of devices that each emit in the vicinity of a million events daily.  Each device has a unique ID (Serial #) which is included in its events.

What would be an efficient method of collecting a list of unique IDs? 

index=abc | stats count by ID  

index=abc | stats values(ID) as IDs | mvexpand IDs

index=abc | fields ID | dedup ID

Anything else?

 


bowesmana
SplunkTrust

In terms of efficiency, the stats command is _likely_ to be the most efficient. However, make sure you put as many filter criteria in the initial search as possible. For example, if each device produces different types of events and you know it always emits an event with type=X, then include that type filter in the search so it does not scan ALL events produced by the device, only the limited subset.
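A minimal sketch of that idea, where type=X stands in for whatever attribute narrows the events in your data (the field name and value here are assumptions, not something from your events):

index=abc type=X
| stats count by ID

The narrower the base search, the fewer events the indexers have to read before stats ever runs.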

The job inspector should give you a good idea as to which is the most efficient in your environment.

As @richgalloway says, if your ID field is indexed, then tstats will be by far the most efficient way of collecting the list of IDs, at the expense of some extra disk space to index that field for each event.

 

isoutamo
SplunkTrust

As usual, it depends, and the best way to check which one is best for your particular case is to use the Job Inspector, as @bowesmana already said. From time to time dedup can be more efficient than stats (which is efficient in most cases).

r. Ismo


richgalloway
SplunkTrust

I would combine the first and last.

index=abc ID=* 
| fields ID 
| stats count by ID

If the ID field is indexed then tstats would be more efficient.

| tstats count where index=abc by ID
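If you only want the list of IDs themselves rather than the counts, one small variation (again assuming ID is an indexed field) is to drop the count column afterwards:

| tstats count where index=abc by ID
| fields - count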

 

---
If this reply helps you, Karma would be appreciated.

pm771
Communicator

@richgalloway 

I understand the ID=* part.   Why would I need fields before stats?

Can you please explain?


richgalloway
SplunkTrust

The fields command reduces the amount of data being processed.  It probably is not of much benefit in this example, but is something to keep in mind when thinking about performance.

---
If this reply helps you, Karma would be appreciated.

bowesmana
SplunkTrust

As @richgalloway says, fields is a useful command, particularly when dealing with large data sets, as it instructs the search to remove unwanted data from the event, thus improving efficiency.

An important point about fields is that it typically runs on the indexers before the data is returned to the search head, so it can be very important in minimising the data flow through the Splunk environment, thereby improving your search performance while also having less impact on others' searches.
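A minimal sketch of that placement, reusing the dedup variant from the question and putting fields immediately after the base search so the indexers can discard the unneeded fields before results are sent to the search head:

index=abc ID=*
| fields ID
| dedup ID
| table ID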

 
