Random sampling ratio in subsearch (long OR list) only?

alancalvitti · ‎12-11-2019

Is it possible, via Splunk's Python SDK, to specify an event sampling ratio (say 1:1000), or some equivalent random evaluation, in a subsearch that returns a long OR expression, while leaving the outer search unsampled?

For concreteness, the subsearch is:

[search index=my_index   |  rex "(?i)deviceId=(?P<DevId>[^ ]+)" | dedup DevId | return 1000000 $DevId]

This returns a long OR list, each term of which can match one or more events. It is critical to extract all events associated with each randomly sampled device.

DavidHourani · ‎12-11-2019

Hi @alancalvitti,

Event sampling applies to the results of your search. If you use a subsearch to generate an OR filter, the filter itself will not be subject to sampling, but the results of the filtered search will be.

As mentioned here: https://docs.splunk.com/Documentation/Splunk/latest/Search/Retrieveasamplesetofevents

If a search matches 1,000,000 events when sampling is not used, using a sample ratio value of 100 would result in returning approximately 10,000 events.

So whatever you filter on in your search will be applied as is and then the sampling will take place.
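To illustrate the order of operations described above (filter first, then sample), here is a minimal Python simulation. It assumes each matched event is kept independently with probability 1/ratio; `sample_events` is a hypothetical helper for illustration, not part of Splunk or its SDK.

```python
import random

def sample_events(matched_events, sample_ratio, seed=None):
    """Keep each already-filtered event with probability 1/sample_ratio."""
    rng = random.Random(seed)
    # randrange(n) == 0 is true for about 1 in n draws
    return [e for e in matched_events if rng.randrange(sample_ratio) == 0]

# 1,000,000 events match the filter; a ratio of 100 keeps about 1% of them.
matched = range(1_000_000)
sampled = sample_events(matched, 100, seed=42)
print(len(sampled))  # roughly 10,000; the exact count varies with the seed
```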

Hope that helps.

Cheers,
David

alancalvitti · ‎12-11-2019

Thanks, but I need the logic the other way around: sampling (with specified ratio) in subsearch, and no sampling in outer search. Is there a way to emulate this behavior?

DavidHourani · ‎12-11-2019

You can use modulus to do so in your subsearch, making it look something like this :

[search index=my_index | rex "(?i)deviceId=(?P<DevId>[^ ]+)" | dedup DevId | streamstats count as sampler | eval sampler=sampler%5 | where sampler=0 | return 1000000 $DevId]

This keeps every fifth distinct DevId, i.e. a fixed sampling rate of 20% (modulus 5). Note that the selection is deterministic rather than random.
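Since the question mentions the Python SDK, here is a rough sketch of how the subsearch string could be assembled in Python with a configurable ratio. `build_sampled_subsearch` is a hypothetical helper, not an SDK function. The `use_random` variant swaps the `streamstats` counter for SPL's `random()` eval function, which is a different technique from the modulus-on-count approach above: it gives a pseudo-random selection of roughly 1 in `ratio` devices instead of every `ratio`-th one.

```python
def build_sampled_subsearch(index, ratio, use_random=False, max_terms=1000000):
    """Return a bracketed SPL subsearch keeping about 1 in `ratio` distinct devices."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    if use_random:
        # Pseudo-random selection: each distinct DevId is kept with chance 1/ratio.
        sampler = f"eval sampler=random() % {ratio}"
    else:
        # Deterministic selection: every ratio-th distinct DevId is kept.
        sampler = f"streamstats count as sampler | eval sampler=sampler % {ratio}"
    return (
        f"[search index={index} "
        '| rex "(?i)deviceId=(?P<DevId>[^ ]+)" '
        "| dedup DevId "
        f"| {sampler} "
        "| where sampler=0 "
        f"| return {max_terms} $DevId]"
    )

# The outer search is not sampled; only the device list is.
outer_query = "search index=my_index " + build_sampled_subsearch("my_index", 1000, use_random=True)
print(outer_query)
```

The resulting string could then be submitted as an ordinary search through the SDK (for example via `service.jobs.oneshot`), with no sample ratio set on the job itself.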

alancalvitti · ‎12-12-2019

That's clever. That sampler strategy, coupled with the outer query to return events, seems to return reasonable results for short time spans, e.g. 1 hour, but when I increase the time range to, say, 24 hours, only a few events are matched (keeping the sampler rate fixed at, say, sampler%1000). Any idea why?

DavidHourani · ‎12-12-2019

It could be that the subsearch is timing out and returning only what it has collected by then. Check how long the subsearch takes in the job inspector, or run it separately as its own search.
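If the subsearch limits (the [subsearch] stanza of limits.conf) turn out to be the cause, one possible workaround is to run the device-list search as a normal standalone search, sample the IDs client-side, and build the OR filter in Python. The sketch below covers only the sampling and filter-building step; `sample_device_filter` is a hypothetical helper, and it assumes the device IDs contain no double quotes.

```python
import random

def sample_device_filter(dev_ids, ratio, seed=None):
    """Pick about 1 in `ratio` distinct device IDs and join them into an OR filter."""
    rng = random.Random(seed)
    unique = sorted(set(dev_ids))  # dedupe, like `dedup DevId`
    if not unique:
        return ""
    k = max(1, len(unique) // ratio)  # always keep at least one device
    chosen = sorted(rng.sample(unique, k))
    # Quoted raw values, similar in spirit to what `return $DevId` emits.
    return " OR ".join(f'"{d}"' for d in chosen)

# Example with made-up IDs; in practice these would come from the standalone search.
ids = [f"dev{i}" for i in range(1000)]
or_filter = sample_device_filter(ids, 100, seed=0)
outer_query = f"search index=my_index ({or_filter})"
print(outer_query)
```

Because whole devices are chosen before the outer search runs, every event for a sampled device is still returned. A very long OR list may still run into search-string limits, so this is only a sketch.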
