We have several summary searches that collect data into metric indexes. They run nightly, and some of them create quite a large number of events (~100k). As a result we sometimes see warnings that the metric indexes cannot be optimised fast enough.
A typical query looks like
index=uhdbox sourcetype="tvclients:log:analytics" name="app*" name="*Play*" OR name="*Open*" earliest=-1d@d+3h latest=-0d@d+3h
| bin _time AS day span=24h aligntime=@d+3h
| stats count as eventCount earliest(_time) as _time by day, eventName, releaseTrack, partnerId, deviceId
| fields - day
| mcollect index=uhdbox_summary_metrics split=true marker="name=UHD_AppsDetails, version=1.1.0" eventName, releaseTrack, partnerId, deviceId
The main contributor to the large number of events is the cardinality of deviceId (~100k), which is effectively a MAC-like address with a common prefix and a defined length. I could create 4/8/16 reports, each selecting a subset of deviceIds, and schedule them at different times, but it would be quite a burden to maintain those basically identical copies.
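For illustration, each of those near-identical copies would differ only in an extra deviceId prefix filter in the base search (the prefix values below are made up), something like:

```
index=uhdbox sourcetype="tvclients:log:analytics" name="app*" name="*Play*" OR name="*Open*" (deviceId="AABBCC0*" OR deviceId="AABBCC1*") earliest=-1d@d+3h latest=-0d@d+3h
| bin _time AS day span=24h aligntime=@d+3h
| stats count as eventCount earliest(_time) as _time by day, eventName, releaseTrack, partnerId, deviceId
| fields - day
| mcollect index=uhdbox_summary_metrics split=true marker="name=UHD_AppsDetails, version=1.1.0" eventName, releaseTrack, partnerId, deviceId
```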
So...
I wonder if there is a mechanism to shard the search results and feed them into many separate mcollects that are spaced apart by some delay. Something like
index=uhdbox sourcetype="tvclients:log:analytics" name="app*" name="*Play*" OR name="*Open*" earliest=-1d@d+3h latest=-0d@d+3h
| shard by deviceId bins=10 sleep=60s
| stats count as eventCount earliest(_time) as _time by day, eventName, releaseTrack, partnerId, deviceId
| fields - day
| mcollect index=uhdbox_summary_metrics split=true marker="name=UHD_AppsDetails, version=1.1.0" eventName, releaseTrack, partnerId, deviceId
Maybe my pseudo code above is not so clear. What I would like to achieve is that, instead of one huge mcollect, I get 10 mcollects (each for approximately 1/10th of the events), scheduled approximately 60s apart from each other...
What you suggest is not possible in a single search. Assuming the cardinality does not change much over the 24h period, I don't suppose there is much benefit in running the search hourly - that would produce more metrics, which would then need to be aggregated on consumption.
However, you could create N searches where the body of each search is a single macro: the macro runs your base SPL, and you call it with the deviceId prefix you want to search for. Not an elegant solution, but functional.
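As a sketch of that macro approach (the macro and argument names below are mine, not anything built in), the shared SPL would live once in macros.conf, and each of the N scheduled searches would just call it with a different prefix:

```
# macros.conf - hypothetical one-argument macro wrapping the base search;
# $device_prefix$ is substituted into the deviceId filter
[uhd_app_summary(1)]
args = device_prefix
definition = index=uhdbox sourcetype="tvclients:log:analytics" name="app*" name="*Play*" OR name="*Open*" deviceId="$device_prefix$*" earliest=-1d@d+3h latest=-0d@d+3h \
| bin _time AS day span=24h aligntime=@d+3h \
| stats count as eventCount earliest(_time) as _time by day, eventName, releaseTrack, partnerId, deviceId \
| fields - day \
| mcollect index=uhdbox_summary_metrics split=true marker="name=UHD_AppsDetails, version=1.1.0" eventName, releaseTrack, partnerId, deviceId
```

Each scheduled search is then just a one-line call, e.g. `` `uhd_app_summary(AABBCC0)` `` for the first shard and `` `uhd_app_summary(AABBCC1)` `` for the next, with their cron schedules staggered a few minutes apart so the mcollects do not land at the same time.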
I don't understand the message you say you are getting, though - I am not familiar with it. Secondly, what is the impact of that message occurring - does it corrupt the collected data in some way, or does it stop other searches from working?