I'm stepping through the main Splunk Search Tutorial. I'm at the "subsearch" section: https://docs.splunk.com/Documentation/Splunk/6.4.3/SearchTutorial/Useasubsearch
The cited example search is the following:
sourcetype=access_* status=200 action=purchase [search sourcetype=access_* status=200 action=purchase | top limit=1 clientip | table clientip] | stats count, dc(productId), values(productId) by clientip
What seems curious to me is that the subsearch begins with the entire content of the "outer search", being sourcetype=access_* status=200 action=purchase
. It seems odd to me that the subsearch needs to repeat the entire outer search, and then qualifying it. Is it perhaps that this is just a nonsensical subsearch use case?
The answer lies in the requirement. Below is the requirement of search, for that example
You want to find the single most frequent shopper on the Buttercup Games online store and what that shopper has purchased. Use the top command to return the most frequent shopper.
Now, remember data for both frequent shopper and purchases is coming from same data.
So, Step 1 was to find single most frequent shopper, If you check the subsearch, that's what it gets (gets the clientip of the single most frequent buyer).
sourcetype=access_* status=200 action=purchase | top limit=1 clientip | table clientip
Now, for this clientip, we need to get all the purchases, which we'll find in the same data using which we calculated most frequent buyer. So the outer search uses same data (successful purchases) and filter it for just that single clietip as returned by subsearch.
This is how it'll look if you don't use this simplistic method
sourcetype=access_* status=200 action=purchase | stats count, dc(productId), values(productId) by clientip | sort 0 -count | head 1
Here, you're getting list of purchases of all buyers/clientip and then in the end, getting the most frequent ( by sorting and taking top 1 record). The former method, applies the filter in the base search itself, even though it has to run a subsearch, drastically (based on data of course) reducing the number of records that search has to process.
The answer lies in the requirement. Below is the requirement of search, for that example
You want to find the single most frequent shopper on the Buttercup Games online store and what that shopper has purchased. Use the top command to return the most frequent shopper.
Now, remember data for both frequent shopper and purchases is coming from same data.
So, Step 1 was to find single most frequent shopper, If you check the subsearch, that's what it gets (gets the clientip of the single most frequent buyer).
sourcetype=access_* status=200 action=purchase | top limit=1 clientip | table clientip
Now, for this clientip, we need to get all the purchases, which we'll find in the same data using which we calculated most frequent buyer. So the outer search uses same data (successful purchases) and filter it for just that single clietip as returned by subsearch.
This is how it'll look if you don't use this simplistic method
sourcetype=access_* status=200 action=purchase | stats count, dc(productId), values(productId) by clientip | sort 0 -count | head 1
Here, you're getting list of purchases of all buyers/clientip and then in the end, getting the most frequent ( by sorting and taking top 1 record). The former method, applies the filter in the base search itself, even though it has to run a subsearch, drastically (based on data of course) reducing the number of records that search has to process.
Thank you Somesoni2, really clear explanation !
I will add this to the Search Tutorial and to the Search Reference so that others are not confused.