Sorry for the delay, but it took a while to think through the solution. The more obvious imperative approach using iteration and subsearches proved futile, since it was subsearching the wrong end of the problem.
Instead of running a large, complex search iteratively across each of my large data sets, I opted for a single large boolean term that defines the (data set, most recent start) pairings relevant to what I'm calling 'Global Current'.
In other words, I run a summarized saved search every minute that collects all "name" and "start" fields, which only incurs a cost when new inserts are indexed. By sorting this table by "start" descending, I can dedup it down to the most recent pairing per data set.
Using loadjob on this saved search inside a subsearch to define the Global Current data set terms for the base search solves the problem.
If someone is trying to solve a similar problem, here is a more concrete example of the problem and my solution.
Example Events:
name="Data Set B" start="2013-05-28 02:00:00" ... log data ...
name="Data Set B" start="2013-04-28 02:00:00" ... log data ...
name="Data Set D" start="2013-04-21 10:00:00" ... log data ...
name="Data Set C" start="2013-04-10 07:00:00" ... log data ...
name="Data Set A" start="2013-04-01 17:00:00" ... log data ...
name="Data Set C" start="2013-03-05 07:00:00" ... log data ...
name="Data Set A" start="2013-03-01 17:00:00" ... log data ...
name="Data Set A" start="2013-02-01 17:00:00" ... log data ...
name="Data Set D" start="2013-01-21 10:00:00" ... log data ...
The summarized saved search just includes name/start pairs, sorted by start (descending):
name start
"Data Set B" "2013-05-28 02:00:00"
"Data Set B" "2013-04-28 02:00:00"
"Data Set D" "2013-04-21 10:00:00"
"Data Set C" "2013-04-10 07:00:00"
"Data Set A" "2013-04-01 17:00:00"
"Data Set C" "2013-03-05 07:00:00"
"Data Set A" "2013-03-01 17:00:00"
"Data Set A" "2013-02-01 17:00:00"
"Data Set D" "2013-01-21 10:00:00"
dedup name then keeps only the most recent name/start pair for each data set:
name start
"Data Set B" "2013-05-28 02:00:00"
"Data Set D" "2013-04-21 10:00:00"
"Data Set C" "2013-04-10 07:00:00"
"Data Set A" "2013-04-01 17:00:00"
Running loadjob in a subsearch converts that table into the search term:
( ("Data Set B" AND "2013-05-28 02:00:00") OR ("Data Set D" AND "2013-04-21 10:00:00") OR ("Data Set C" AND "2013-04-10 07:00:00") OR ("Data Set A" AND "2013-04-01 17:00:00") )
This results in matching only the 'Global Current' events without incurring the large subsearch/map/join issues in subsequent searches.
I was concerned about the size of the term string once I hit 40+ data sets (80 key=value pairs in the OR'd AND term), not to mention the other filtering terms needed for the search. The two worries were the search string character limit and the browser URI character limit. After reading the documentation, the search string itself shouldn't have a practical length limit, and the browser URI limit can be avoided by using a macro.
The final search looks something like this (pseudo code):
index=data termA termB termC [| loadjob savedsearch="bla:bla:bla" | sort 0 - start | dedup name | fields name, start] | ...
After moving the subsearch into a macro:
index=data termA termB termC `current_global` | ...
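The macro itself is nothing more than that subsearch saved under a name, e.g. in macros.conf (or Settings > Advanced search > Search macros):

[current_global]
definition = [| loadjob savedsearch="bla:bla:bla" | sort 0 - start | dedup name | fields name, start]

Since the macro is expanded on the search head, the enormous OR'd AND term never has to travel in the browser URI.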
A note on efficiency: though it has worked and scaled well so far, I've made sure to put the simpler terms before the enormous OR'd AND logic, and I've had excellent results.