I have a test dataset containing users, timestamps and the websites that they access.
I used the Splunk Machine Learning Toolkit with the Detecting Outliers Assistant to find increases in the number of visits to job-search websites:
index="http" sourcetype="http" AND "monster.com" OR "careerbuilder.com" OR "job-hunt.org" OR "aol.com/jobs" OR "simplyhired.com" OR "yahoo.com/hotjobs"
| where user="ABB0427"
| sort _time | bucket _time span=1d
| stats count by _time, user
| eventstats median("count") as median
| eval absDev=(abs('count'-median))
| eventstats median(absDev) as medianAbsDev
| eval lowerBound=(median-medianAbsDev*exact(2)), upperBound=(median+medianAbsDev*exact(1))
| eval isOutlier=if('count' < lowerBound OR 'count' > upperBound, 1, 0)
| fields _time, "count", lowerBound, upperBound, isOutlier, *
This gives 10 outliers for this user.
Now I am trying to build a similar search that returns the number of outliers for each user. If I just remove the filter for this user and run the search over all 1000 users, the number of outliers for this user is no longer 10, but 4. It looks like lowerBound and upperBound change when the user filter is removed, and that all users are evaluated against the same lowerBound and upperBound. I expected the calculation to be done separately for each user. I have attached some pictures.
Any suggestions on how to calculate the outliers for each user?
Hi @jorjiana88
You need to add the user to the split bys...
| eventstats median("count") as median by user
| eval absDev=(abs('count'-median))
| eventstats median(absDev) as medianAbsDev by user
Additionally, you may want to add
| makecontinuous _time
after your bucket command to fill in any empty time gaps.
I think that's all you're missing.
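Putting the suggestion together with the original search, a per-user version might look like the sketch below. It keeps the question's base search and bound multipliers unchanged, drops the single-user filter, and leaves out the | sort _time step (the thread later finds that step was affecting the results). The final stats line and the field name outliers are my own additions, assumed here only to collapse the output to one row per user.

```spl
index="http" sourcetype="http" AND "monster.com" OR "careerbuilder.com" OR "job-hunt.org" OR "aol.com/jobs" OR "simplyhired.com" OR "yahoo.com/hotjobs"
| bucket _time span=1d
| stats count by _time, user
| eventstats median("count") as median by user
| eval absDev=(abs('count'-median))
| eventstats median(absDev) as medianAbsDev by user
| eval lowerBound=(median-medianAbsDev*exact(2)), upperBound=(median+medianAbsDev*exact(1))
| eval isOutlier=if('count' < lowerBound OR 'count' > upperBound, 1, 0)
| stats sum(isOutlier) as outliers by user
```

Because both eventstats calls are split by user, each user's daily counts are compared against that user's own median and median absolute deviation rather than against values computed over all users.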
Thanks a lot! If I split by user, the calculation is different for each user, as expected, but I still have another problem. Not sure, maybe I should ask a new question for this 🙂
The result is still not the same: I don't get 10 outliers for that user. I think the problem is further up in the query.
This query, where I count for all users and filter only at the end, shows counts for only the first 9 days:
index="http" sourcetype="http" AND "monster.com" OR "careerbuilder.com" OR "job-hunt.org" OR "aol.com/jobs" OR "simplyhired.com" OR "yahoo.com/hotjobs"
| sort _time | bucket _time span=1d
| stats count as counts by _time, user
| search user=ABB0427
This one, where I filter by this user from the beginning, shows counts for 25 days:
index="http" sourcetype="http" AND "monster.com" OR "careerbuilder.com" OR "job-hunt.org" OR "aol.com/jobs" OR "simplyhired.com" OR "yahoo.com/hotjobs" AND abb0427
| sort _time | bucket _time span=1d
| stats count as counts by _time, user
The difference is that the first one filters on the value of the field, whereas the second one (25 days) searches for occurrences of the string abb0427 in the _raw field.
Given this data, the result should be the same even when searching only for the string.
Actually, after removing the | sort _time, both queries return the same result, so the issue is solved. Thank you very much for the super fast response.
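A likely explanation, though not confirmed in the thread: the sort command returns at most 10,000 results unless it is given an explicit count. With all users in the pipeline, the events would be cut off after the earliest 10,000, which would fit seeing only the first 9 days for this user. Removing the sort works because stats ... by _time does not need sorted input. If the ordering is wanted anyway, a count of 0 lifts the limit, as in this sketch of the first query:

```spl
index="http" sourcetype="http" AND "monster.com" OR "careerbuilder.com" OR "job-hunt.org" OR "aol.com/jobs" OR "simplyhired.com" OR "yahoo.com/hotjobs"
| sort 0 _time
| bucket _time span=1d
| stats count as counts by _time, user
| search user=ABB0427
```

Sorting every event before aggregating can be expensive on large datasets, so dropping the sort, or sorting after the stats step, is usually the cheaper choice.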