ML-SPL sample command count=[subsearch value] issue

isaiz · ‎12-13-2022

Hi. I have an issue for which I can't find a solution, or anyone who has had the same problem, so I'm posting it here.

I want to downsample the training part of a dataset using the sample command. To do so, I get the count of the minority class with a subsearch and use it to sample my dataset by class value.

| sample [
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return $samples
      ]
      seed=29 by stroke

This works fine, but when I use it inside an accelerated data model it raises the following error:

Error in 'sample' command: Unrecognized argument: [ | search index=main sourcetype="heartstroke_csv" | sample partitions=10 seed=29 | where partition_number<8 and stroke=1 | stats count as samples | return count=samples ]

So I tried assigning the count argument explicitly:

| sample count= [
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return $samples
      ]
      seed=29 by stroke

This doesn't work even in a regular, unaccelerated dataset. It returns:

Error in 'sample' command: Invalid value for count: must be an int

I'm using Splunk Enterprise 8.2.5 and Splunk Machine Learning Toolkit 5.3.3. I don't know whether this is a version issue or whether my syntax is wrong.

EDIT: I have also tried:

| sample [
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return count=samples
      ] 
      seed=29 by stroke

This works only in the unaccelerated data model. In the accelerated one, it again fails to recognize the argument.

EDIT2: As a final attempt, I passed explicit integer values to the count parameter ( ... | sample count=193 ... ). It seems my model cannot be accelerated. I have two datasets: one for the data, and another for the ids that belong to the regular and downsampled versions of the train and test sets.

dataset: DATA

index=main sourcetype="heartstroke_csv"

dataset: SETS

index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| eval set=if(partition_number<8,"train","test")
| appendpipe [
  | where set="train" 
  | appendpipe [
      | where stroke=0 and bmi!="N/A" or stroke=1
      | sample count=193 ```[
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return count=samples
      ] ```
      seed=29 by stroke
      | stats values(id) as ids by set | eval downsample=1
      ]
  | appendpipe [
      | stats values(id) as ids by set | eval downsample=0
      ]
  | search ids=*
  ]
| appendpipe [
  | where set="test" 
  | appendpipe [
      | sample count=56 ```[
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number>=8 and stroke=1 
         | stats count as samples
         | return count=samples
         ]```   
      seed=29 by stroke
      | stats values(id) as ids by set | eval downsample=1
      ]
  | appendpipe [
      | stats values(id) as ids by set | eval downsample=0
      ]
  | search ids=*
  ]
| search ids=*
| rename ids as id
| table id set downsample
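
A possible way to get the downsampled train set without the subsearch or the hardcoded 193, sketched here but not tested: count the minority class with eventstats and keep that many rows per class with streamstats. The field names minority and row_in_class are made up for this example. It keeps the first rows of each class rather than a seeded random draw, and eventstats is not a streaming command, so I would not expect it to fix the acceleration problem either.

```spl
index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| where partition_number<8 ```train rows only```
| eventstats count(eval(stroke=1)) as minority ```minority is a hypothetical field: size of the stroke=1 class```
| streamstats count as row_in_class by stroke ```row_in_class is a hypothetical field: running row number within each class```
| where row_in_class<=minority ```keep at most minority rows per class```
| fields - minority row_in_class
```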

isaiz · ‎12-14-2022

I'm posting an update on my issue in a separate post:

| sample [subsearch] fails to be accelerated because neither the sample command (not a streaming command?) nor subsearches are allowed in accelerated datasets.

Nevertheless, according to the data model docs:

To accelerate a data model, it must contain at least one root event dataset, or one root search dataset that only uses streaming commands. Acceleration only affects these dataset types and datasets that are children of those root datasets. You cannot accelerate root search datasets that use nonstreaming commands (including transforming commands), root transaction datasets, and children of those datasets. Data models can contain a mixture of accelerated and unaccelerated datasets.

It is strange that Splunk fails to parse my search query. It should simply leave the datasets that cannot be accelerated unaccelerated. Why does this happen?

The other unresolved issue is why this:

| sample count= [
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return $samples
      ]
      seed=29 by stroke

raises this error in regular search queries:

Error in 'sample' command: Invalid value for count: must be an int

As far as I know, using return $value at the end of a subsearch that generates one value returns that value to the main search. Why isn't | sample count=[subsearch | return $value] the same as | sample count=<valuewrittenexplicitly>?
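
For reference, this is my understanding of what the outer search sees once the subsearch is expanded, assuming the subsearch counts 193 rows (the value I hardcoded in EDIT2). The three lines are separate alternatives, not one pipeline, and they reflect how I read the return docs rather than anything I have confirmed for sample specifically.

```spl
| sample 193 seed=29 by stroke ```from "sample [ ... | return $samples ]": the bare value```
| sample count=193 seed=29 by stroke ```from "sample [ ... | return count=samples ]": alias=value```
| sample count= 193 seed=29 by stroke ```from "sample count= [ ... | return $samples ]": assumed result, note the space```
```

If the expansion is purely textual, the third form leaves a space between count= and 193, which might be why the parser reports an invalid count. That is an untested guess. Removing the space before the opening bracket would be the first thing to try.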
