Splunk Search

Getting a list of unique IDs from a large data set efficiently

pm771
Path Finder

We have a relatively small set of devices, each of which emits in the vicinity of a million events per day. Each device has a unique ID (serial number) that is included in its events.

What would be an efficient method of collecting a list of unique IDs? 

index=abc | stats count by ID  

index=abc | stats values(ID) as IDs | mvexpand IDs

index=abc | fields ID | dedup ID

Anything else?

 


bowesmana
SplunkTrust

In terms of efficiency, the stats command is _likely_ to be the most efficient. However, make sure you put as many filter criteria as possible in the initial search. For example, if each device produces several types of event and you know it always produces one with type=X, include that type filter in the search, so that it scans only that limited subset rather than ALL events produced by the device.
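
As a rough sketch of the filtering advice above: this assumes every device emits at least one event with type=X within the search time range. The field name `type` and the value `X` are placeholders taken from the example, not real names in your data.

```spl
index=abc type=X ID=*
| fields ID
| stats count by ID
```

Note that the count then reflects only the type=X events per device, which does not matter if all you need is the list of IDs.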

The job inspector should give you a good idea as to which is the most efficient in your environment.

As @richgalloway says, if your ID field is indexed, then tstats will be by far the most efficient way of collecting the list of IDs, at the expense of some extra disk space to index that field for each event.

 

isoutamo
SplunkTrust

As usual, it depends, and the best way to check which one is best for your particular case is to use the Job Inspector, as @bowesmana already said. From time to time dedup can be more efficient than stats (which is the efficient choice in most cases).
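
For such a comparison, a dedup variant to run next to the stats search might look like the sketch below. The fixed time range is an assumption added for illustration, so that both searches scan the same events and their run times in the Job Inspector are comparable.

```spl
index=abc ID=* earliest=-24h@h latest=@h
| fields ID
| dedup ID
| table ID
```

Run the stats version with the same earliest and latest values, then compare the duration and scanned event counts that the Job Inspector reports for each job.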

r. Ismo


richgalloway
SplunkTrust

I would combine the first and last.

index=abc ID=* 
| fields ID 
| stats count by ID

If the ID field is indexed then tstats would be more efficient.

| tstats count where index=abc by ID
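
Since the devices report daily, the tstats search could also be bounded to a single day. This is only a sketch: it still assumes ID is an indexed field, and the time modifiers (yesterday, midnight to midnight) are chosen purely for illustration.

```spl
| tstats count where index=abc earliest=-1d@d latest=@d by ID
| fields ID
```

The trailing fields command just drops the count column when only the list of IDs is wanted.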

 

---
If this reply helps you, Karma would be appreciated.

pm771
Path Finder

@richgalloway 

I understand the ID=* part. Why would I need fields before stats?

Can you please explain?


richgalloway
SplunkTrust

The fields command reduces the amount of data being processed.  It probably is not of much benefit in this example, but is something to keep in mind when thinking about performance.


bowesmana
SplunkTrust

As @richgalloway says, fields is a useful command, particularly when dealing with large data sets, as it instructs the search to remove unwanted data from the event, thus improving efficiency.

An important point about fields is that it typically runs on the indexer before the data is returned to a search head, so it can be very important in minimising the data flow through the Splunk environment, therefore improving your search performance, but also having less impact on others' search performance.

 
