So, I have a loadjob with all the data I need, with a primary field (account number). But I have a CSV with about 104K account numbers that they only want in this report. How do I filter for only those 104K account numbers in this loadjob? I don't have admin access to change the join limit... Can lookups do the job? I also don't want the values to be grouped together in each row... I just want to remove the account numbers that are not in the CSV from the loadjob...
At the end of your query, add a sort command with a count of "0", which removes sort's default 10,000-result limit:
Example:
....
| table _time, accountnumber, field1, etc.
| sort 0 accountnumber
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Sort
Regardless of whether you mean loadjob as some form of batch-ingested events or an actual invocation of Splunk's loadjob command, the typical approach to filtering events by the contents of a lookup is to use the lookup to assign a field value and then filter on that value. This way you'll get only those events that have the wanted values.
Keep in mind though that:
1) You still need to read all the "before lookup" events, so if you're filtering down to a very small subset of events, another approach might be better.
2) If your lookup is big, moving to KV store can indeed be the thing to do.
Anyway, this is the approach:
<your initial search>
| lookup mylookup.csv lookupfield AS eventfield OUTPUT lookupfield AS somefieldwewanttofilterby
| where isnotnull(somefieldwewanttofilterby)
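For example, applied to the loadjob case from the question, it could look like this (the saved search reference and field names are made up for illustration):
| loadjob savedsearch="myuser:myapp:my_account_report"
| lookup mylookup.csv account_number OUTPUT account_number AS matched
| where isnotnull(matched)
| fields - matched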
Can you explain what a "loadjob" is? Normally, if the data is already ingested and you have this lookup file, all you need is a subsearch:
index=myindex sourcetype=mysourcetype
[| inputlookup mylookup
| fields account]
If you are trying to filter before ingestion, Splunk cannot really do anything.
A loadjob is the result of a previously completed search job from one of the reports I created. I am trying to filter after ingestion. I have all the data there, I just need some of the account numbers, and I don't want to break the data into multiple files to get everything I need... hence why I asked.
Your way would work, but I only have a 50K join limit, so I would not get all the results. I need all 104K to pass through this subsearch.
Your original idea of lookup should work. Assuming your loadjob gives you a field named account_number, and that your lookup has a column account_number, you can do this
``` search above gives account_number and other fields ```
| lookup mylookup account_number output account_number as match_account
| where account_number == match_account
Is this something you are looking for?
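The same pattern also works with a null check instead of the field comparison, if you'd rather not keep the helper field around:
| lookup mylookup account_number output account_number as match_account
| where isnotnull(match_account)
| fields - match_account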
Yes, it is! Thank you so much! I truly appreciate this!
Hi @sumarri,
I created a dummy search to mock up your data, and created a lookup with 104,000 entries:
| makeresults count=104000
| streamstats count as id
| eval account="account" . substr("000000000".tostring(id),-6), keep="true"
| table account, keep
| outputlookup "accounts_to_keep.csv"
This will be our lookup file, replicating what you have in your lookup. It has the account ID and a "keep" field, and that's it.
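If you want to sanity-check the file, a quick row count should come back as 104000:
| inputlookup accounts_to_keep.csv
| stats count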
Next, I created a dummy search to generate a bunch of data, with accounts we don't care about and the 104,000 we do care about:
| makeresults count=200000
| streamstats count as id
| eval account="account" . substr("000000000".tostring(id),-6)
| eval data=random()%10000, label="whatever", _time=relative_time(now(), "-" + tostring(random()%1000) + "m")
| table account, data, label, _time
To use the lookup to identify the accounts we want to keep, you can use this SPL:
| inputlookup accounts_to_keep.csv append=t
``` use eventstats if stats messes up your data
| eventstats values(keep) as keep by account
```
| stats values(*) as * by account
| search keep="true"
| fields - keep
Depending on how big your lookup gets, you may want to make the lookup a KV store collection.
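As a rough sketch of what a KV store migration could look like (all stanza and collection names here are hypothetical), you would define a collection and a matching lookup definition, then copy the CSV into it:
collections.conf:
[accounts_to_keep_kv]

transforms.conf:
[accounts_to_keep_kv]
external_type = kvstore
collection = accounts_to_keep_kv
fields_list = _key, account, keep

``` one-time copy of the CSV contents into the collection ```
| inputlookup accounts_to_keep.csv
| outputlookup accounts_to_keep_kv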
This logic makes sense; however, I will not get the other fields such as data, label, and _time, and I need those fields populated with the correct information. But thank you for your help.
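Note that the commented-out eventstats variant in the answer above keeps every original row (and therefore data, label, and _time) intact. A minimal sketch, assuming data exists only on real events so the appended lookup rows can be dropped at the end:
<your loadjob search>
| inputlookup accounts_to_keep.csv append=t
| eventstats values(keep) as keep by account
| search keep="true"
| where isnotnull(data)
| fields - keep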