ML-SPL sample command count=[subsearch value] issue

isaiz

Hi. I have an issue for which I can't find a solution, or anyone who has had the same problem, so I'm posting it here.

I want to downsample the training part of a dataset using the sample command. To do so, I get the count of the minority class with a subsearch and use it to sample my dataset by class value.

 

| sample [
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return $samples
      ]
      seed=29 by stroke

 

This works fine, but when I use it inside an accelerated data model it raises the following error:

 

Error in 'sample' command: Unrecognized argument: [ | search index=main sourcetype="heartstroke_csv" | sample partitions=10 seed=29 | where partition_number<8 and stroke=1 | stats count as samples | return count=samples ]

 

So I tried assigning the count argument explicitly:

 

| sample count= [
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return $samples
      ]
      seed=29 by stroke

 

This doesn't work even in a regular, unaccelerated dataset. It returns:

 

Error in 'sample' command: Invalid value for count: must be an int

 

I'm using Splunk Enterprise 8.2.5 and Splunk Machine Learning Toolkit 5.3.3. I don't know whether it's a version issue or my syntax is wrong.

EDIT: I have also tried:

 

| sample [
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return count=samples
      ] 
      seed=29 by stroke

 

This works only in the unaccelerated data model. In the accelerated one, it again fails to recognize the argument.

EDIT2: As a final attempt, I passed explicit integer values to the count parameter ( ... | sample count=193 ... ). It seems my model cannot be accelerated. I have two datasets: one for the data, and the other for the ids that belong to the regular and downsampled versions of the train and test sets.

 

dataset: DATA

index=main sourcetype="heartstroke_csv"

dataset: SETS

index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| eval set=if(partition_number<8,"train","test")
| appendpipe [
  | where set="train" 
  | appendpipe [
      | where stroke=0 and bmi!="N/A" or stroke=1
      | sample count=193 ```[
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return count=samples
      ] ```
      seed=29 by stroke
      | stats values(id) as ids by set | eval downsample=1
      ]
  | appendpipe [
      | stats values(id) as ids by set | eval downsample=0
      ]
  | search ids=*
  ]
| appendpipe [
  | where set="test" 
  | appendpipe [
      | sample count=56 ```[
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number>=8 and stroke=1 
         | stats count as samples
         | return count=samples
         ]```   
      seed=29 by stroke
      | stats values(id) as ids by set | eval downsample=1
      ]
  | appendpipe [
      | stats values(id) as ids by set | eval downsample=0
      ]
  | search ids=*
  ]
| search ids=*
| rename ids as id
| table id set downsample

 


isaiz

I'm updating the status of my issue in a separate post:


| sample [subsearch] fails to be accelerated because both the sample command (not a streaming command?) and subsearches aren't allowed in accelerated datasets.

Nevertheless, according to data model docs:

To accelerate a data model, it must contain at least one root event dataset, or one root search dataset that only uses streaming commands. Acceleration only affects these dataset types and datasets that are children of those root datasets. You cannot accelerate root search datasets that use nonstreaming commands (including transforming commands), root transaction datasets, and children of those datasets. Data models can contain a mixture of accelerated and unaccelerated datasets.

It is strange that Splunk fails to parse my search query. It should simply leave datasets that cannot be accelerated unaccelerated. Why does this happen?

The other unresolved issue is why this:

 

| sample count= [
         | search index=main sourcetype="heartstroke_csv" 
         | sample partitions=10 seed=29 
         | where partition_number<8 and stroke=1 
         | stats count as samples
         | return $samples
      ]
      seed=29 by stroke

 

raises this error in regular search queries:

 

Error in 'sample' command: Invalid value for count: must be an int

 

As far as I know, using return $value at the end of a subsearch that generates one value returns that value to the main search. Why isn't | sample count=[subsearch | return $value] the same as | sample count=<valuewrittenexplicitly>?
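One way to check what the subsearch actually hands back, as a rough sketch: run it on its own as a regular search. The return command writes its output to a field named search, so the result shows the exact text that would be substituted into the outer search. This only inspects the substitution; it does not by itself explain the "must be an int" error.

```spl
index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| where partition_number<8 and stroke=1
| stats count as samples
| return $samples
```

Replacing the last line with | return count=samples shows the other form, which should come back as a key-value pair rather than a bare value, so the two variants can be compared side by side.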
