Splunk Search

How to filter on over 50K values across different indexes/sources?

sumarri
Path Finder

So, I have a loadjob with all the data I need, with a primary field (account number). But I have a CSV with about 104K account numbers, and they only want those accounts in this report. How do I filter this loadjob down to just those 104K account numbers?  I don't have admin access to change the join limit... Can lookups do the job? I also don't want the values to be grouped together in each row... I just want to remove the account numbers that are not in the CSV from the loadjob...  

Labels (2)
0 Karma
1 Solution


antoniolamonica
Explorer

At the end of your query, add "0" after the sort function so that all results are returned (without it, sort keeps only the first 10,000 results):

Example:

....

| table _time, accountnumber, field1
| sort 0 _time


https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Sort

Tags (1)
0 Karma

PickleRick
SplunkTrust
SplunkTrust

Regardless of whether by "loadjob" you mean some form of batch-ingesting events or an actual invocation of Splunk's loadjob command, the typical approach to filtering events by the contents of a lookup is to use the lookup to assign a field value and then filter on that value. This way you'll get only those events that have the wanted values.

Keep in mind though that:

1) You still need to read all matching "before lookup" events, so if you're filtering down to a very small subset of events, another approach might be better.

2) If your lookup is big, moving to the KV store can indeed be the thing to do.

Anyway, this is the approach:

<your initial search>
| lookup mylookup.csv lookupfield AS eventfield OUTPUT lookupfield AS somefieldwewanttofilterby
| where isnotnull(somefieldwewanttofilterby)
0 Karma

yuanliu
SplunkTrust
SplunkTrust

Can you explain what a "loadjob" is? Normally, if data is already ingested, and you have this lookup file, all you need to do is a subsearch

index=myindex sourcetype=mysourcetype
  [| inputlookup mylookup
  | fields account]

If you are trying to filter before ingestion, Splunk cannot really do anything.

Tags (1)
0 Karma

sumarri
Path Finder

A loadjob is the result of a previously completed search job from one of the reports I created. I am trying to filter after ingestion. I have all the data there, I just need certain account numbers, and I don't want to break the data into multiple files to get all the data needed... hence why I asked. 

Your way would work, but I only have a 50K join limit, so I will not get all the results. I need all 104K account numbers to pass through this subsearch.
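
For reference, the base search is just a loadjob of the finished report, something like this (the owner/app/report reference below is only a placeholder):

| loadjob savedsearch="sumarri:search:account_report"
``` the saved results already contain the account number field and the other report fields ```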

0 Karma

yuanliu
SplunkTrust
SplunkTrust

Your original idea of a lookup should work.  Assuming your loadjob gives you a field named account_number, and that your lookup has a column named account_number, you can do this:

 

``` search above gives account_number and other fields ```
| lookup mylookup account_number output account_number as match_account
| where account_number == match_account
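
Put together with your loadjob as the base search, the whole thing would look roughly like this (the report reference is a placeholder):

| loadjob savedsearch="sumarri:search:account_report"
| lookup mylookup account_number output account_number as match_account
| where account_number == match_account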

 

Is this something you are looking for?

sumarri
Path Finder

Yes, it is! Thank you so much! I truly appreciate this! 

0 Karma

danspav
SplunkTrust
SplunkTrust

Hi @sumarri,


I created a dummy search to mock up your data, and created a lookup with 104,000 entries:

| makeresults count=104000
| streamstats count as id 
| eval account="account" . substr("000000000".tostring(id),-6), keep="true"
| table account, keep
| outputlookup "accounts_to_keep.csv"

This will be our lookup file, replicating what you have in your lookup. It has the account ID and a "keep" field, and that's it.

 

Next, I created a dummy search to generate a bunch of data, with accounts we don't care about and the 104,000 we do care about:

| makeresults count=200000
| streamstats count as id 
| eval account="account" . substr("000000000".tostring(id),-6)
| eval data=random()%10000, label="whatever", _time=relative_time(now(), "-" + tostring(random()%1000) + "m")
| table account, data, label, _time

 

To use the lookup to identify the accounts we want to keep, you can use this SPL:

| inputlookup accounts_to_keep.csv append=t
``` use eventstats if stats messes up your data 
| eventstats values(keep) as keep by account
```
| stats values(*) as * by account
| search keep="true"
| fields - keep
  1. This adds the contents of the lookup to the results (append=t)
  2. Then we use stats to combine the keep field with the events in the search
    If this messes up your data, you can run eventstats instead, but that may run into memory issues with massive result sets.
  3. Finally, we search for all the events where the keep field is set to "true"

Depending on how big your lookup gets, you may want to make the lookup a KV store collection.
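
If you go that route, a KV store lookup is typically backed by a collection defined in collections.conf plus a lookup definition in transforms.conf, and the SPL stays the same. A minimal sketch, with placeholder stanza names:

# collections.conf
[accounts_to_keep_coll]

# transforms.conf
[accounts_to_keep_kv]
external_type = kvstore
collection = accounts_to_keep_coll
fields_list = _key, account, keep

After that, | inputlookup accounts_to_keep_kv append=t (or | lookup accounts_to_keep_kv account OUTPUT keep) can be used in place of the CSV file.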


0 Karma

sumarri
Path Finder

This logic makes sense; however, I will not get the other fields such as data, label, and _time. I need those fields populated with the correct information. But thank you for your help. 

0 Karma