Hi. I have an issue for which I can't find a solution, or anyone who has had the same problem, so I'm posting it here.
I want to downsample the training part of a dataset using the sample command. To do so, I get the count of the minority class with a subsearch and use it to sample my dataset by class value.
| sample [
| search index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| where partition_number<8 and stroke=1
| stats count as samples
| return $samples
]
seed=29 by stroke
This works fine, but when I use it inside an accelerated data model it raises the following error:
Error in 'sample' command: Unrecognized argument: [ | search index=main sourcetype="heartstroke_csv" | sample partitions=10 seed=29 | where partition_number<8 and stroke=1 | stats count as samples | return count=samples ]
So I tried assigning the count argument explicitly:
| sample count= [
| search index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| where partition_number<8 and stroke=1
| stats count as samples
| return $samples
]
seed=29 by stroke
This doesn't work even in a regular, unaccelerated dataset. It returns:
Error in 'sample' command: Invalid value for count: must be an int
I'm using Splunk Enterprise 8.2.5 and Splunk Machine Learning Toolkit 5.3.3. I don't know whether this is a version issue or whether my syntax is wrong.
EDIT: I have also tried:
| sample [
| search index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| where partition_number<8 and stroke=1
| stats count as samples
| return count=samples
]
seed=29 by stroke
This works only in the unaccelerated data model. In the accelerated one, it again fails to recognize the argument.
EDIT2: As a final attempt, I passed explicit integer values to the count parameter ( ... | sample count=193 ... ). It seems my model cannot be accelerated. I have 2 datasets: one for the data, and another for the ids that belong to the regular and downsampled versions of the train and test sets.
dataset: DATA
index=main sourcetype="heartstroke_csv"
dataset: SETS
index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| eval set=if(partition_number<8,"train","test")
| appendpipe [
| where set="train"
| appendpipe [
| where stroke=0 and bmi!="N/A" or stroke=1
| sample count=193 ```[
| search index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| where partition_number<8 and stroke=1
| stats count as samples
| return count=samples
] ```
seed=29 by stroke
| stats values(id) as ids by set | eval downsample=1
]
| appendpipe [
| stats values(id) as ids by set | eval downsample=0
]
| search ids=*
]
| appendpipe [
| where set="test"
| appendpipe [
| sample count=56 ```[
| search index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| where partition_number>=8 and stroke=1
| stats count as samples
| return count=samples
]```
seed=29 by stroke
| stats values(id) as ids by set | eval downsample=1
]
| appendpipe [
| stats values(id) as ids by set | eval downsample=0
]
| search ids=*
]
| search ids=*
| rename ids as id
| table id set downsample
I'm updating the status of my issue in a separate post:
| sample [subsearch] fails to be accelerated because both the sample command (not a streaming command?) and subsearches aren't allowed in accelerated datasets.
Nevertheless, according to the data model docs:
To accelerate a data model, it must contain at least one root event dataset, or one root search dataset that only uses streaming commands. Acceleration only affects these dataset types and datasets that are children of those root datasets. You cannot accelerate root search datasets that use nonstreaming commands (including transforming commands), root transaction datasets, and children of those datasets. Data models can contain a mixture of accelerated and unaccelerated datasets.
It is strange that Splunk fails to parse my search query. It should simply leave datasets that cannot be accelerated unaccelerated. Why does this happen?
The other unresolved issue is why this:
| sample count= [
| search index=main sourcetype="heartstroke_csv"
| sample partitions=10 seed=29
| where partition_number<8 and stroke=1
| stats count as samples
| return $samples
]
seed=29 by stroke
raises this error in regular search queries:
Error in 'sample' command: Invalid value for count: must be an int
As far as I know, ending a subsearch that produces a single value with return $value passes that value to the main search. Why isn't | sample count=[subsearch | return $value] the same as | sample count=<valuewrittenexplicitly>?