Splunk Search

Random sampling ratio in subsearch (long OR list) only?

Path Finder

Is it possible, via Splunk's Python SDK, to specify an event sampling ratio (say 1:1000), or some equivalent random evaluation, in a subsearch that returns a long OR expression, while specifying that the outer search does not sample?

For concreteness, the subsearch is:

[search index=my_index   |  rex "(?i)deviceId=(?P<DevId>[^ ]+)" | dedup DevId | return 1000000 $DevId]

This returns a long OR list, each term of which can match one or more events. It is critical to extract all events associated with each randomly sampled device.


SplunkTrust

Hi @alancalvitti,

Event sampling applies to the results of your search. If you use a subsearch to generate an OR filter, the filter itself will not be subject to sampling, but the results of the filtered search will be.

As mentioned here: https://docs.splunk.com/Documentation/Splunk/latest/Search/Retrieveasamplesetofevents

If a search matches 1,000,000 events when sampling is not used, using a sample ratio value of 100 would result in returning approximately 10,000 events.

So whatever you filter on in your search will be applied as is and then the sampling will take place.

Hope that helps.

Cheers,
David
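As a rough illustration of the ordering described above (filter first, sample second), here is a small Python model. It is not Splunk code: the device names, event counts, and the `sample_events` helper are all hypothetical, and it simply assumes each matching event is kept with probability 1/ratio.

```python
import random

def sample_events(events, sample_ratio, seed=0):
    """Keep each event with probability 1/sample_ratio (illustrative model only)."""
    rng = random.Random(seed)
    return [e for e in events if rng.randrange(sample_ratio) == 0]

# Hypothetical data: 1,000 devices with 100 events each.
events = [(f"dev{d}", i) for d in range(1000) for i in range(100)]

# The OR filter produced by the subsearch is applied in full ...
wanted = {f"dev{d}" for d in range(0, 1000, 2)}
filtered = [e for e in events if e[0] in wanted]

# ... and only then are the matching events sampled.
sampled = sample_events(filtered, sample_ratio=100)

print(len(filtered), len(sampled))
```

In this model every sampled event still belongs to a device in the filter, but most devices lose nearly all of their events, which is why sampling the outer search does not preserve complete per-device histories.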


Path Finder

Thanks, but I need the logic the other way around: sampling (with a specified ratio) in the subsearch, and no sampling in the outer search. Is there a way to emulate this behavior?


SplunkTrust

You can use a modulus in your subsearch to do this, making it look something like this:

[search index=my_index | rex "(?i)deviceId=(?P<DevId>[^ ]+)" | dedup DevId | streamstats count as sampler | eval sampler=sampler%5 | where sampler=0 | return 1000000 $DevId]

This uses a fixed sampling rate of 20% (modulus 5): every fifth deduplicated device is kept.
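Since the original question asks about the Python SDK, here is a minimal sketch of building that query string in Python. The function name `build_sampled_search` and its parameters are hypothetical, and the outer search (`search index=... [subsearch]`) is an assumption because the thread never shows it. The `use_random` variant swaps the fixed every-Nth counter for SPL's `random()` eval function, which is a different technique from the one in the reply above.

```python
def build_sampled_search(index, modulus, use_random=False, max_terms=1000000):
    """Build an outer search whose subsearch keeps roughly 1 in `modulus` devices."""
    if modulus < 1:
        raise ValueError("modulus must be a positive integer")

    if use_random:
        # Variant (not from the thread): pick devices pseudo-randomly.
        sampler = f"eval sampler=random() % {modulus}"
    else:
        # As in the reply above: number the deduplicated devices, keep every Nth.
        sampler = f"streamstats count as sampler | eval sampler=sampler % {modulus}"

    subsearch = (
        f"search index={index} "
        '| rex "(?i)deviceId=(?P<DevId>[^ ]+)" '
        "| dedup DevId "
        f"| {sampler} "
        "| where sampler=0 "
        f"| return {max_terms} $DevId"
    )
    # Assumed outer search: same index, filtered by the subsearch's OR list.
    return f"search index={index} [{subsearch}]"


query = build_sampled_search("my_index", 5)
print(query)
```

The resulting string could then be passed to the SDK, for example via `service.jobs.create(query)` or `service.jobs.oneshot(query)` in `splunklib`, without setting a sample ratio on the job, so that the outer search itself is not sampled.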


Path Finder

That's clever. That sampler strategy, coupled with the outer query to return events, seems to return reasonable results for short time spans, e.g. 1 hour, but when I increase the time range to, say, 24 hours, only a few events are matched (keeping the sampler rate fixed at, say, sampler%1000). Any idea why?


SplunkTrust

It could be that the subsearch is timing out and returning whatever it has collected by then. Check how long the subsearch takes in the job inspector, or by running it separately.
