Solved: How to Filter over 50K on different indexes/source...

sumarri · ‎05-23-2024

So, I have a loadjob with all the data I need with a primary field (account number). But, I have a CSV with about 104K account number that they only want in this report. How do I filter only 104K account numbers in this load job? I don't have access to admin to change the join limit... Can Lookups do the job? I also don't want the values to grouped together in each row... I just want to remove the account numbers that are not on the csv from the loadjob...

yuanliu · ‎05-24-2024

Your original idea of lookup should work. Assuming your loadjob gives you a field named account_number, and that your lookup has a column account_number, you can do this

``` search above gives account_number and other fields ```
| lookup mylookup account_number output account_number as match_account
| where account_number == match_account

Is this something you are looking for?

View solution in original post

antoniolamonica · ‎05-31-2024

at the send of your query add "0" after a sort function:

Example:

....

| table _time,  accountnumber, field1, etc.
| sort 0

https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Sort

PickleRick · ‎05-25-2024

Regardless of whether you mean the loadjob as some form of batch ingesting events or an actual invocation of Splunk's loadjob command, the typical approach to filtering events by the contents of a lookup is to use a lookup to assign a field value and then filter on that value. This way you'll get only those events that do have wanted values.

Keep in mind thought that:

1) You still need to read all matching "before lookup" events so if you're filtering to a very small subset of events, another approach might be better.

2) If your lookup is big, indeed moving to KVstore can be the thing to do.

Anyway, this is the approach:

<your initial search>
| lookup mylookup.csv lookupfield AS eventfield OUTPUT lookupfield AS somefieldwewanttofilterby
| where isnotnull(somefieldwewanttofilterby)

yuanliu · ‎05-24-2024

Can you explain what a "loadjob" is? Normally, if data is already ingested, and you have this lookup file, all you need to do is a subsearch

index=myindex sourcetype=mysourcetype
  [inputlookup mylookup
  | fields account]

If you are trying to filter before ingestion, Splunk cannot really do anything.

sumarri · ‎05-24-2024

A loadjob is results of previously completed search job from my reports created. I am trying to filter after the ingestion. I have all the data there, I just need to some account numbers, and I don't want to break the data into multiple files to get all the data needed... hence why I asked.

You way would work, but I only have a 50K join limit, so I will not get all the results. I need all 104K to pass through this subsearch.

yuanliu · ‎05-24-2024

Your original idea of lookup should work. Assuming your loadjob gives you a field named account_number, and that your lookup has a column account_number, you can do this

``` search above gives account_number and other fields ```
| lookup mylookup account_number output account_number as match_account
| where account_number == match_account

Is this something you are looking for?

sumarri · ‎05-28-2024

Yes, it is! Thank you so much! I truly appreciate this!

danspav · ‎05-23-2024

Hi @sumarri,

I created a dummy search to mock up your data, and created a lookup with 104,000 entries:

| makeresults count=140000
| streamstats count as id 
| eval account="account" . substr("000000000".tostring(id),-6), keep="true"
| table account, keep
| outputlookup "accounts_to_keep.csv"

This will be our lookup file, replicating what you have in your lookup. It has the account ID and a "keep" field, and that's it.

Next, I created a dummy search to generate a bunch of data, with accounts we don't care about and the 104,000 we do care about:

| makeresults count=200000
| streamstats count as id 
| eval account="account" . substr("000000000".tostring(id),-6)
| eval data=random()%10000, label="whatever", _time=relative_time(now(), "-" + tostring(random()%1000) + "m")
| table account, data, label, _time

To use the lookup to identify the accounts we want to keep you can use this SPL:

| inputlookup accounts_to_keep.csv append=t
``` use eventstats if stats messes up your data 
| eventstats values(keep) as keep by account
```
| stats values(*) as * by account
| search keep="true"
| fields - keep

This add the contents of the lookup to the results (append=t)
Then we use stats to combine the keep field with the events in the search
If this messes up your data, you can run eventstats instead, but that may run into memory issues with massive result sets.
Finally, we search for all the events where the keep field is set to "true"

Depending on how big your lookup gets, you may want to make the lookup a KV store collection.

sumarri · ‎05-24-2024

This logic makes sense, however, I will not get the other fields such as data, label, _time. I need those fields populated with the correct information. But thank you for your help.

How to Filter over 50K on different indexes/sources?

lookup

subsearch

Splunk MCP & Agentic AI: Machine Data Without Limits

Application management with Targeted Application Install for Victoria Experience

[Puzzles] Solve, Learn, Repeat: Dynamic formatting from XML events

Join the Conversation