Splunk Search

lookup/inputlookup in search not providing any results

mpk_24
Explorer

Hello @Splunkers,

Can someone please help me with this? I am trying to use the "lookup/inputlookup" commands in a search.

Use case: trying to extract some specific values from logs for given session IDs. But there are more than 200K session IDs to check. 

So I created a lookup table which includes the 200K sessions and then used the query below.

 

Problem: nothing is returned, although there should be results; I checked some session IDs manually and they do appear in the logs.

index=testing_car hostname=*prod* "/api/update"
| rex field=_raw "CUSTOMER\":(?<FName>[^\,]+)"
| rex field=_raw "Session\":\"(?<SID>[^\"]+)"
| search [ | lookup Sessions.csv SID
| fields SID]
| table SID, FName

P.S. The SID field is available in the Sessions.csv file.


PickleRick
SplunkTrust

This is an interesting case.

Firstly, there is a general syntax issue.

You don't use lookup as a generating command, so your subsearch is wrong. Theoretically, you could do something like

| search [ | inputlookup ...]

to limit your results to only the ones matching the contents of your lookup file. But your subsearch contains way too many rows (by default a subsearch is limited to 10k results).
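A minimal sketch of that pattern applied to your own field and file names (purely illustrative, and still subject to the subsearch limits mentioned above):

index=testing_car hostname=*prod* "/api/update"
| rex field=_raw "CUSTOMER\":(?<FName>[^\,]+)"
| rex field=_raw "Session\":\"(?<SID>[^\"]+)"
| search [ | inputlookup Sessions.csv | fields SID ]
| table SID, FName

Here the subsearch returns only the SID column, so it expands into SID="..." OR SID="..." terms matched against the extracted field.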

But (again - in a general case; not your particular one, since yours has additional limitations, and we'll get to that shortly) it would be way more efficient if you could limit your initial search rather than add that search command further down the pipeline.

As your search is built, you have to read all events with the /api/update path and then "manually" extract the field from them. It requires reading and processing all /api/update events even if only a small subset of them would finally match your search terms. So in the general case, it would be better to do something like

index=testing_car hostname=*prod* "/api/update" [ | inputlookup Sessions.csv | fields SID | rename SID as search ]

Assuming your SIDs are properly tokenized, it would work much faster than putting the search (or the lookup, if using the lookup method) further down the pipeline.
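For illustration, with the rename SID as search trick the subsearch expands into bare terms ORed into the base search, roughly like this (the values are made up):

index=testing_car hostname=*prod* "/api/update" ( ( SID12345 ) OR ( SID67890 ) OR ( SID99999 ) )

Each SID is matched as a raw term against the index, which is why tokenization matters here.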

But this is impossible in your case.

Firstly, your lookup is way too big for the default limits. Secondly, even if you went and raised those limits, spawning a search with 200k search terms... it doesn't seem like a good idea.

So now we come to the second part of the puzzle.

The lookup-based solution presented by @kiran_panchavat - search for all events, use lookup to "flag" matching events and filter out those not flagged - is generally OK. But there is one pretty big caveat, and it comes back to the size of your lookup.

A simple csv-based lookup is OK for small lookups. It compares values using a linear search through the file, so it can be pretty effective if the lookup file is small (and, in the best case, the most often used values sit at the top of the file). I suppose you know where this is going...

If you have 200k rows in your lookup, assuming uniform distribution, you'd get 100k comparisons per match on average for the values that do match your lookup. For those not matching, Splunk would need to do 200k comparisons just to decide the value isn't there! So it quickly gets very, very inefficient, especially if the lookup file is big and the hit ratio is low.

Therefore, for bigger lookups you should not use csv as the backend but instead create a kvstore collection.
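A minimal sketch of that migration, assuming a collection named session_ids and a lookup definition named sessions_kv (both names are made up; place the stanzas in an app you control and set permissions accordingly):

# collections.conf
[session_ids]
field.SID = string

# transforms.conf
[sessions_kv]
external_type = kvstore
collection = session_ids
fields_list = _key, SID

Then populate it once from the existing file:

| inputlookup Sessions.csv | outputlookup sessions_kv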

But here we hit another issue - kvstore runs on the SH. You don't run kvstore on indexers. If you have an all-in-one installation that might not be a problem, but if you're trying to run this in a distributed environment, your kvstore contents would be replicated as a csv lookup to the indexers for indexer-tier search operations. That creates a conundrum - generally you want to do as much as you can at the indexer level with streaming commands, since this parallelizes your work and uses your hardware more efficiently, but in this case you'd fall back to csv-based lookup performance, which with a lookup this size would be awful.

You can get around it by using the local=t option of the lookup command, as @kiran_panchavat did, but... this forces the search pipeline to move from the indexers to the search head at the lookup command. That means all intermediate, unfiltered values need to be streamed to the calling SH to be processed and filtered there, so you lose the advantage of multiple indexers doing the work in parallel.
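For illustration, the flag-and-filter variant against that hypothetical sessions_kv definition looks much like @kiran_panchavat's query, just with the kvstore-backed definition and local=t:

index=testing_car hostname=*prod* "/api/update"
| rex field=_raw "CUSTOMER\":(?<FName>[^\,]+)"
| rex field=_raw "Session\":\"(?<SID>[^\"]+)"
| lookup local=t sessions_kv SID OUTPUT SID as matching_sid
| where isnotnull(matching_sid)
| table SID, FName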

So whichever way you turn, there is always some obstacle: either it's slow because of a huge lookup, or it's slow because you have heaps of data to dig through.

If this is a one-off job, it just means you have to wait longer for your results, but if it's something you will want to repeat regularly, you might want to rethink the original issue.

Maybe those SIDs follow some pattern and you could partially pre-filter your data? Or maybe you could split that csv into smaller parts? Where does this list come from in the first place? Do you create it from another search in Splunk? If so, then maybe you can add the original search logic to this search. Or maybe instead of creating a lookup you could index your SIDs into a temporary index and use a search combining results from that index with the index you're trying to search from. There might be alternative approaches depending on your underlying problem.
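As a rough sketch of that last idea (the index name session_scratch is made up; it assumes you are allowed to write to that index and that the fields written by collect are re-extracted at search time), you would index the list once:

| inputlookup Sessions.csv
| collect index=session_scratch

and then combine both indexes in a single search, keeping only the SIDs that appear in the list:

(index=testing_car hostname=*prod* "/api/update") OR index=session_scratch
| rex field=_raw "CUSTOMER\":(?<FName>[^\,]+)"
| rex field=_raw "Session\":\"(?<SID>[^\"]+)"
| eval from_list=if(index=="session_scratch", 1, 0)
| stats max(from_list) as from_list values(FName) as FName by SID
| where from_list > 0 AND isnotnull(FName)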

mpk_24
Explorer

@PickleRick Thank you so much for your valuable insights. 


kiran_panchavat
Champion

@mpk_24 

To test it, I generated a "Sessions.csv" file using makeresults and outputlookup.

 

| makeresults | eval SID = "SID12345;SID67890;SID99999;SID00000"
| makemv delim=";" SID
| mvexpand SID
| table SID
| outputlookup Sessions.csv
 
This creates a CSV file named Sessions.csv with a single column SID containing the session IDs. 
 
Final Query:-

  • mvindex(split(data, ","), 0) extracts the first part (SID).
  • mvindex(split(data, ","), 1) extracts the second part (FName).
  • The lookup outputs SID as matching_sid to flag matches, and where isnotnull(matching_sid) keeps only the events whose SID exists in the lookup.
 
| makeresults 
| eval data = "SID12345,John Doe;SID67890,Jane Smith;SID99999,Bob Johnson;SID00000,Alice Brown"
| makemv delim=";" data
| mvexpand data
| eval SID = mvindex(split(data, ","), 0), FName = mvindex(split(data, ","), 1)
| eval _raw = "2025-03-13T10:00:00 INFO hostname=prod* /api/update CUSTOMER:\"" . FName . "\", Session:\"" . SID . "\""
| rex field=_raw "CUSTOMER:\"(?<FName>[^\"]+)\""
| rex field=_raw "Session:\"(?<SID>[^\"]+)\""
| lookup local=t Sessions.csv SID OUTPUT SID as matching_sid
| table SID, FName, matching_sid
| where isnotnull(matching_sid)
 
[screenshot: results table from the final query]

 

Did this help? If yes, please consider giving kudos, marking it as the solution, or commenting for clarification — your feedback keeps the community going!

mpk_24
Explorer

@kiran_panchavat Thank you so much for your insights and the assistance extended. 

kiran_panchavat
Champion

@mpk_24 You're welcome. If this resolves your issue, please consider accepting the solution, as it may be helpful for others as well.


kiran_panchavat
Champion

@mpk_24 

The subsearch syntax using lookup is incorrect.  The search command with a subsearch needs proper formatting.  Using lookup in a subsearch won't work as expected in this context. 
 
Test a small sample:- 
 
index=testing_car hostname=*prod* "/api/update"
| rex field=_raw "Session\":\"(?<SID>[^\"]+)"
| head 10
 
Verify SID extraction is working. 
 
Here are some makeresults examples to create dummy data for testing your Splunk query with session IDs and customer names. These can help you simulate your use case without needing real log data.
 
[screenshot: makeresults examples generating dummy session data]

 
