Is it possible, via Splunk's Python SDK, to specify event sampling ratio (say 1:1000) or some equivalent random evaluation in a subsearch which returns a long OR expression while specifying that the outer search does not sample?
For concreteness, the subsearch is:
[search index=my_index | rex "(?i)deviceId=(?P<DevId>[^ ]+)" | dedup DevId | return 1000000 $DevId]
This returns a long OR list of device IDs, each of which can match one or more events. It is critical to extract all events associated with each randomly sampled device.
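For context, here is a sketch of how the combined query might be assembled as a string before being handed to the SDK; the dispatch call itself (e.g. splunklib's `service.jobs.oneshot`) is omitted, and `my_index` is just the index name from the question:

```python
# Sketch: assemble the outer search plus the OR-generating subsearch as
# one query string; this is what would be passed to the Python SDK's
# jobs.create / jobs.oneshot (dispatch itself not shown here).
subsearch = (
    'search index=my_index '
    '| rex "(?i)deviceId=(?P<DevId>[^ ]+)" '
    '| dedup DevId '
    '| return 1000000 $DevId'
)
outer = f"search index=my_index [{subsearch}]"
print(outer)
```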
Event sampling applies to the result of your search. If you use a subsearch to generate an OR filter, the filter itself will not be subject to sampling, but the result of the filtered search will be.
If a search matches 1,000,000 events when sampling is not used, using a sample ratio value of 100 would result in returning approximately 10,000 events.
So whatever you filter on in your search will be applied as is and then the sampling will take place.
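The arithmetic behind that ratio can be sketched as:

```python
# A sample ratio of N keeps roughly 1 event in N.
total_events = 1_000_000
sample_ratio = 100
expected = total_events // sample_ratio
print(expected)  # approximately 10,000 events returned
```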
Hope that helps.
You can use modulus to do so in your subsearch, making it look something like this:
[search index=my_index | rex "(?i)deviceId=(?P<DevId>[^ ]+)" | dedup DevId | streamstats count as sampler | eval sampler=sampler%5 | where sampler=0 | return 1000000 $DevId]
This will use a fixed sampling rate of 20% (modulus 5).
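The modulus trick is easy to sanity-check outside Splunk; this toy version numbers the deduplicated IDs the way `streamstats count` would and keeps every fifth one:

```python
# Illustration of the modulus sampler: keep rows whose running count is
# divisible by 5, i.e. a fixed 20% sample of the device IDs.
device_ids = [f"dev{i}" for i in range(1, 101)]  # 100 distinct devices
sampled = [d for i, d in enumerate(device_ids, start=1) if i % 5 == 0]
print(len(sampled))  # 20 of 100 -> 20%
```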
That's clever. That sampler strategy, coupled with the outer query to return events, seems to return reasonable results for short time spans, e.g. 1 hour, but when increasing the time range to, say, 24 hours, only a few events are matched (keeping the sampler rate fixed at, say, sampler%1000). Any idea why?
Could be that the subsearch is timing out and returning what it can at timeout. Test how long the subsearch is taking by checking the job inspector or by running it separately.
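For what it's worth, subsearches are also capped by the `[subsearch]` stanza in limits.conf, which by default allows 10,000 results (`maxout`) and 60 seconds of runtime (`maxtime`); these are stock defaults and may differ on your deployment. A hypothetical helper for reading the job inspector numbers against those limits:

```python
# Hypothetical helper: flag a subsearch that probably hit a limit.
# Defaults mirror Splunk's stock limits.conf [subsearch] stanza
# (maxout = 10000 results, maxtime = 60 seconds); adjust to match
# your own configuration.
def likely_truncated(result_count, run_duration_s,
                     maxout=10_000, maxtime_s=60.0):
    """True if the subsearch plausibly hit a result or time limit."""
    return result_count >= maxout or run_duration_s >= maxtime_s

print(likely_truncated(10_000, 12.0))  # True: hit the result cap
print(likely_truncated(2_500, 61.5))   # True: hit the time cap
print(likely_truncated(2_500, 12.0))   # False: within both limits
```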