Hello @Splunkers,
Can someone please help me with this? I'm trying to use the "lookup/inputlookup" command in a search.
Use case: I'm trying to extract some specific values from logs for given session IDs, but there are more than 200K session IDs to check.
So I created a lookup table containing the 200K sessions and then used the query below.
Problem: nothing is returned, even though there should be matches; I verified some session IDs manually.
index=testing_car hostname=*prod* "/api/update"
| rex field=_raw "CUSTOMER\":(?<FName>[^\,]+)"
| rex field=_raw "Session\":\"(?<SID>[^\"]+)"
| search [ | lookup Sessions.csv SID
| fields SID]
| table SID, FName
P.S. The SID field is available in the Sessions.csv file.
This is an interesting case.
Firstly, there is a general syntax issue.
You can't use lookup as a generating command, so your subsearch is wrong. Theoretically, you could do something like
| search [ | inputlookup ...]
to limit your results only to the ones matching the contents of your lookup file. But your subsearch contains way too many rows (by default a subsearch is limited to 10k results).
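For reference, a minimal sketch of that corrected subsearch form (reusing the file and field names from your question) could look like this:
index=testing_car hostname=*prod* "/api/update"
| rex field=_raw "Session\":\"(?<SID>[^\"]+)"
| search [ | inputlookup Sessions.csv | fields SID ]
| table SID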
But (again, in a general case; not your particular one, since yours has additional limitations and we'll get to that shortly) it would be way more efficient if you could limit your initial search instead of adding that search command further down the pipeline.
As your search is built, you have to read all events with the /api/update path and then "manually" extract the field from them. That requires reading and processing all /api/update events even if only a small subset of them would finally match your search terms. So in the general case, it would be better to do something like
index=testing_car hostname=*prod* "/api/update" [ | inputlookup Sessions.csv | fields SID | rename SID as search ]
Assuming your SIDs are properly tokenized, this would work much faster than putting the search (or lookup, if using the lookup method) further down the pipeline.
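To illustrate (with made-up SID values), the subsearch above effectively expands into bare search terms in the base search, roughly like:
index=testing_car hostname=*prod* "/api/update" ( SID12345 OR SID67890 OR ... )
which lets the indexers discard non-matching events early instead of filtering them later in the pipeline.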
But this is impossible in your case.
Firstly, your lookup is way too big for the default limits. Secondly, even if you went and raised those limits, spawning a search with 200k search terms... doesn't seem like a good idea.
So now we come to the second part of the puzzle.
The lookup-based solution presented by @kiran_panchavat - search for all events, use the lookup to "flag" matching events, and filter out those not flagged - is generally OK. But there is one pretty big caveat, and it comes back to the size of your lookup.
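As a rough sketch of that flag-and-filter pattern (reusing the extractions from the original question; the matched_sid field name is just for illustration):
index=testing_car hostname=*prod* "/api/update"
| rex field=_raw "CUSTOMER\":(?<FName>[^\,]+)"
| rex field=_raw "Session\":\"(?<SID>[^\"]+)"
| lookup Sessions.csv SID OUTPUT SID as matched_sid
| where isnotnull(matched_sid)
| table SID, FName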
A simple csv-based lookup is OK for small lookups. It compares values using a linear search through the file, so it can be pretty efficient if the lookup file is small (and in the best scenario the most often used values sit at the top of the file)... I suppose you know where this is going.
If you have 200k rows in your lookup then, assuming a uniform distribution, you'd get 100k comparisons per match on average for the values that do match your lookup. For the values that don't match, Splunk would need to do 200k comparisons just to decide the value isn't there! So it quickly gets very inefficient, especially when the lookup file is big and the hit ratio is low.
Therefore, for bigger lookups you should not use csv as the backend but instead create a kvstore collection.
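A minimal sketch of what that could look like (the collection and stanza names here are made up for illustration): define the collection in collections.conf, expose it as a lookup in transforms.conf, then populate it from the existing csv.
collections.conf:
[session_ids]
field.SID = string
transforms.conf:
[session_ids_kv]
external_type = kvstore
collection = session_ids
fields_list = _key, SID
Populate it once from the csv:
| inputlookup Sessions.csv | outputlookup session_ids_kv
After that you'd reference session_ids_kv in the lookup command instead of Sessions.csv.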
But here we hit another issue - kvstore runs on the SH; you don't run kvstore on indexers. If you have an all-in-one installation that might not be a problem, but if you're running this in a distributed environment, your kvstore contents would be replicated as a csv lookup to the indexers for indexer-tier search operations. That creates a conundrum - generally you want to do as much as you can on the indexer level with streaming commands, since that parallelizes the work and uses your hardware more efficiently, but in this case you'd fall back to csv-based lookup performance, which with a lookup of this size would be awful.
You can get around that by using the local=t option of the lookup command, as @kiran_panchavat did, but... this forces the search pipeline to move from the indexers to the search head at the lookup command. That means all intermediate, unfiltered events have to be streamed to the calling SH to be processed and filtered, so you lose the advantage of multiple indexers doing the work in parallel.
So whichever way you turn, there is always some obstacle. Either it's slow because the lookup is huge, or it's slow because you have heaps of data to dig through.
If this is a one-off job, that just means you have to wait longer for your results, but if it's something you want to repeat regularly, you might want to rethink the original problem.
Maybe those SIDs follow some pattern and you could partially pre-filter your data? Or maybe you could split that csv into smaller parts? Where does this list come from in the first place? Do you create it from another search in Splunk? If so, maybe you can fold the original search logic into this search. Or maybe, instead of creating a lookup, you could index your SIDs into a temporary index and use a search combining results from that index with the index you're searching (a rough sketch of that idea follows). There might be alternative approaches depending on your underlying problem.
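A very rough sketch of that last idea, assuming a hypothetical staging index called sid_staging holding one event per SID, with SID available as an extracted field on those events (none of these names come from the thread):
(index=testing_car hostname=*prod* "/api/update") OR index=sid_staging
| rex field=_raw "CUSTOMER\":(?<FName>[^\,]+)"
| rex field=_raw "Session\":\"(?<SID>[^\"]+)"
| eval on_list=if(index="sid_staging", 1, 0)
| stats max(on_list) as on_list, values(FName) as FName by SID
| where on_list=1 AND isnotnull(FName)
This is the usual stats-based alternative to a join: events from both indexes are grouped by SID, and only SIDs that appear in the staging index and also have a customer name extracted from the real events survive the final filter.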
@PickleRick Thank you so much for your valuable insights.
To test it, I generated a "Sessions.csv" file using makeresults and outputlookup.
| makeresults
| eval SID = "SID12345;SID67890;SID99999;SID00000"
| makemv delim=";" SID
| mvexpand SID
| table SID
| outputlookup Sessions.csv
Final query:
| makeresults
| eval data = "SID12345,John Doe;SID67890,Jane Smith;SID99999,Bob Johnson;SID00000,Alice Brown"
| makemv delim=";" data
| mvexpand data
| eval SID = mvindex(split(data, ","), 0), FName = mvindex(split(data, ","), 1)
| eval _raw = "2025-03-13T10:00:00 INFO hostname=prod* /api/update CUSTOMER:\"" . FName . "\", Session:\"" . SID . "\""
| rex field=_raw "CUSTOMER:\"(?<FName>[^\"]+)\""
| rex field=_raw "Session:\"(?<SID>[^\"]+)\""
| lookup local=t Sessions.csv SID OUTPUT SID as matching_sid
| table SID, FName, matching_sid
| where isnotnull(matching_sid)
@kiran_panchavat Thank you so much for your insights and the assistance extended.
@mpk_24 You're welcome. If this resolves your issue, please consider accepting the solution, as it may be helpful for others as well.