Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Splunk Search

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Community
- :
- Splunk Answers
- :
- Using Splunk
- :
- Splunk Search
- :
- How to calculate stdev for a count of one field ba...

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark Topic
- Subscribe to Topic
- Mute Topic
- Printer Friendly Page

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

How to calculate stdev for a count of one field based on another?

jwalzerpitt

Motivator

03-17-2017
09:06 AM

I am trying to figure out how to calculate the stdev of the number of emails a user sends. I have the following search so far where I am calculating the count of MessageId per SenderAddress:

`index=exchange | stats count as MessageId by SenderAddress`

I'm getting tripped up trying to bring in/carry over the count of MessageId per SenderAddress. Do I need to create an eval for this field and then plug that into the rest of the search of:

`| eventstats mean(field) AS mean_field, stdev(field) AS stdev_field | eval Z_score=round(((field-mean_field)/stdev_field),2) | where Z_score>1.5 OR Z_score<-1.5 | table _time, SenderAddress, FromIP, field, mean_field, Z_score | sort -Z_score`

Thx

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

DalJeanis

SplunkTrust

03-17-2017
08:07 PM

The main question is defining what item you are trying to calculate the stdev for.

If you want to calculate the stdev for the number of recipients per email, then you need to calculate individual records for the number of recipients in each email, and then calculate the stdev.

```
index=exchange
| eval RecipCount=mvcount(myrecipientMVfield)
| stats avg(RecipCount) as RecipCountAvg, stdev(RecipCount) as RecipCountStdev by SenderAddress
```

On the other hand, if you want to calculate the stdev of the average number of emails he sends per day, regardless of how many recipient they are to, then you need to calculate how many he sends each day, then calculate the stdev.

```
index=exchange | bin _time span=1d
| stats count as EmailsSentPerDay by SenderAddress _time
| stats count as DaysCalculated, avg(EmailsSentPerDay) as EmailsSentAvg, stdev(EmailsSentPerDay) as EmailsSentStdev by SenderAddress
```

So, FIRST figure out what exact statistic it is that you are counting that you want to know how variable it is. Then isolate that thing as a single record whose variability is in question. Then calculate ( and apply z scores etc).

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

jwalzerpitt

Motivator

03-20-2017
07:07 AM

Thx for the reply. I want to calculate the stdev of the avg number of emails per sender per day, regardless of recipients while eliminating the senders who legitimately send a high number of emails (mailing lists, etc), which would leave me with senders who get flagged as outliers.

As always the problem is defining the time range for the baseline to compare against. 1 day? 7 days? 30 days? And so on. Eventually, I'd like to put the second search through the ML toolkit to see what can be identified.

Thx again

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

DalJeanis

SplunkTrust

03-20-2017
10:19 AM

Okay, so I'd start with a very long baseline to calculate your typical users. Say 90 days.

You're probably going to have to develop a verified blacklist of some sort in order to accomplish "not counting 'legitimate' high-volume users". Probably create a standing csv file of your blacklist, use a left-join from your data to inputcsv and kill any record that doesn't come back null from the join.

Before that, you need to explore your data and find out how many types of users you have.

Here's what I'd do. First, produce a single file of how many emails are produced each day by each user, going back at least 90 days...

```
index=exchange earliest=-91d@d latest=-1d@d | bin _time span=1d
| stats count as EmailsSentPerDay by SenderAddress _time
| outputcsv dailyuserstats.csv
```

Next, do a little statistical analysis with this to determine your "classes" of user.

```
| inputcsv dailyuserstats.csv
| stats count as DaysCalculated, avg(EmailsSentPerDay) as EmailsSentAvg, stdev(EmailsSentPerDay) as EmailsSentStdev by SenderAddress
| eval RelPctSD = EmailsSentStdev/EmailsSentAvg
| kmeans k=10 EmailsSentAvg EmailsSentStdev RelPctSD cfield=SenderClusters
```

Now you have the users clustered by their characteristics. The cluster numbers are essentially random. You can play with the k= value (probably keeping it in the range of 3-10) in order to see how the system thinks about different partitions. Do some visualization of the clusters, perhaps by (A) aggregating each cluster and mapping each on a scatterplot at the point of its centroid, size based on count of users, or (B) doing a scatterplot with the points colored as per their cluster. You can pick any two of the three measures, since the third is just a combination of the other two, and any pair could present the data slightly different than the other two pairs.

Hopefully, after the first time you do that, you will have an idea of what clusters you want to exclude from the data. You may even already have the information you are looking for. Hard to tell, without having your actual data to play with.

Once you've done a little futzing around with the data, come back with a new question and we'll help you take the next step.

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

jwalzerpitt

Motivator

03-20-2017
10:30 AM

Thx - I'll start plugging away at the two searches you created above (greatly appreciated).

Fooling around, I combined your second original search with niketnaily's:

`index=exchange | bin _time span=1d `

| stats count as EmailsSent by SenderAddress

| eventstats mean(EmailsSent) AS mean_sent, stdev(EmailsSent) AS stdev_sent

| eval Z_score=round(((EmailsSent-mean_sent)/stdev_sent),2)

| table SenderAddress, EmailsSent, mean_sent, Z_score

It appears the mean_sent is comparing every user against the overall average instead of against the user's own average. How would I individually calculate the avg of number of emails sent per user and then calculate the z_score against that avg? Looking for users who are outliers based on the amount of email they send, not the total of emails sent by all users.

Thx

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

niketnilay

Legend

03-25-2017
10:26 AM

@jwalzerpitt seems like you need to add by SenderAddress to eventstats as well.

```
| eventstats mean(EmailsSent) AS mean_sent, stdev(EmailsSent) AS stdev_sent by SenderAddress
```

Please try it out and let us know if it worked.

____________________________________________

| makeresults | eval message= "Happy Splunking!!!"

| makeresults | eval message= "Happy Splunking!!!"

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

jwalzerpitt

Motivator

03-28-2017
01:18 PM

Thx for the reply

The only thing that's not being calculated is the Z score and I'm trying to figure out why.

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

jwalzerpitt

Motivator

03-28-2017
01:20 PM

Looks like something is possibly wrong with the way I'm calculating stdev:

`eventstats mean(EmailsSent) AS mean_sent, stdev(EmailsSent) AS stdev_sent by SenderAddress`

Both of these are coming up blank

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

niketnilay

Legend

03-17-2017
03:12 PM

I am not sure why you are performing eventstats for mean and stdev ans not through stats. However, hope you are looking for the following:

```
index=exchange
| eventstats mean(field) AS mean_field, stdev(field) AS stdev_field
| stats count as MessageId last(mean_field) as mean_field last(stdev_field) as stdev_field by SenderAddress
| eval Z_score=round(((field-mean_field)/stdev_field),2)
| where Z_score>1.5 OR Z_score<-1.5
```

Besides, if you are computing Z score for anomaly detection you can refer to Splunks anomalydetection command. https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Anomalydetection

You might also find some useful SPL commands for your usecase if you go through 5-6 pages of advanced statistics in Splunk documentation: https://docs.splunk.com/Documentation/Splunk/latest/Search/Aboutadvancedstatistics

____________________________________________

| makeresults | eval message= "Happy Splunking!!!"

| makeresults | eval message= "Happy Splunking!!!"

- Mark as New
- Bookmark Message
- Subscribe to Message
- Mute Message
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

jwalzerpitt

Motivator

03-20-2017
06:57 AM

Thx for the search and the links to anomaly detection and the advanced stats,

.conf21 Now Fully Virtual!

Register for FREE Today!

Register for FREE Today!