Splunk Search

How do I generate a search to monitor devices that send out an unusual amount of logs?

willluo
Engager

Dear fellows,

I am trying to write a search string to monitor which of my devices send out an unusual amount of logs.

I think I should find the log volume by host by day, then work out the average daily log volume for each host. This average should be updated daily.
If a device's real-time log volume exceeds the device's (avg_value*2), then send an alert.

I tried to use the following search string but I don't know how to continue. And there are null values to consider.

index=_internal source=*license_usage.log* type=Usage 
| convert ctime(_time) as timestamp timeformat="%d/%m/%Y"
| chart sum(b) AS volume_b by h timestamp

Would you mind helping me to solve the problem? Thank you very much.

1 Solution

DalJeanis
SplunkTrust

Here are some things you need to consider:

1) License usage log is often something that an organization does not want users looking at. Have you verified that you have access, and that your base search returns any records at all? If this returns some records, then you can proceed.

index=_internal source=*license_usage.log* type=Usage | head 5

2) In order to detect a device that starts sending out unaccustomed levels of logs, you have to know what the normal level for THAT DEVICE is. So, you'll need the device name. It looks like you have a field "h" in your chart command... The word "host" is probably as short as I would go for a real field name. Likewise, "b" for bytes is far less descriptive than what I would use.
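
For example (a sketch, assuming the default single-letter field names that license_usage.log emits), you could rename them up front and work with readable names from there on:

index=_internal source=*license_usage.log* type=Usage
| rename h as host, b as bytes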

3) You are reformatting the timestamp before you chart it, and the "%d/%m/%Y" format you chose will not sort into actual date order. Sorted as text, a typical year would run January first, February first, March first and so on through December first, then January second, and so on. For most purposes, get in the habit of either reformatting at the chart itself, or using "%Y-%m-%d" so the values sort into a useful order.
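
For instance, the sortable version of your convert line would be:

| convert ctime(_time) as timestamp timeformat="%Y-%m-%d"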

4) This counts events rather than bytes, but it can provide you with some quick and useful information about your hosts.

This part gets you a single record of the number of events per day for each host...

| metasearch earliest=-30d@d latest=-0d@d host=*
| bin _time span=1d 
| stats count as hostdaycount by host _time 

...This part adds a record per host counting the number of days the host has been reporting, the last date it reported, the average daily count, the standard deviation of the count, and the maximum daily count. It then calculates two potential limits: twice the average, and 2.5 standard deviations above the average. With a highly variable host, the latter might be a better choice, but you'll have to review and see what you think.

| appendpipe [| stats count as countdays, max(_time) as today, avg(hostdaycount) as avghostday, stdev(hostdaycount) as stdevhostday, max(hostdaycount) as maxhostday by host
   | eval _time = today
   | eval avghostday=round(avghostday,1)
   | eval stdevhostday=round(stdevhostday,1)
   | eval hostlimit2=avghostday+2.5*stdevhostday
   | eval hostlimit1=2*avghostday]

Now, we still have individual records for each day, and most of them we don't need. The first command rolls the stats we've appended together with the record for the last day that each host reported. The second command drops all the records that were NOT the last one reported for the host.

| stats values(*) as * by host _time
| where isnotnull(today) 

Now, one more thing: if a host stopped reporting prior to today, then we don't want to do anything about it in this report/alert, so find the overall last date and drop any records that don't have that date. We also don't want to alert on brand-new hosts, so we're dropping any host with 5 or fewer days of data. You can adjust the number to your needs.

| eventstats max(_time) as FinalDay
| where _time = FinalDay AND countdays>5

Now, we table the results so that you can review it and decide what kind of limit you want to run in your organization.

| table host _time countdays hostdaycount avghostday stdevhostday maxhostday hostlimit1 hostlimit2
| rename countdays as "days active", 
     hostdaycount as "events on final day", 
     avghostday as "average events for this host", 
     stdevhostday as "stdev of events for this host", 
     maxhostday as "max events for this host",
     hostlimit1 as "2x average for this host",
     hostlimit2 as "2.5 stdevs above average"
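
To turn the review into an alert, a minimal sketch (filtering on the raw field names, i.e. run without the rename above) would keep only the hosts that crossed whichever limit you settle on:

| where hostdaycount > hostlimit1

Save that version of the search as an alert that triggers when the number of results is greater than zero.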

Given the data from the above metasearch, you will be able to decide on a limit. If a COUNT, rather than a BYTE COUNT, will serve your needs, then you can use this sample with whatever minor mods you might like to institute (number of SDs above average, limit on which hosts, and so on).
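
If you do need bytes rather than an event count, a sketch of the equivalent first stage would look like this, assuming the default h and b fields in license_usage.log; the rest of the pipeline stays the same, with hostdaybytes in place of hostdaycount:

index=_internal source=*license_usage.log* type=Usage earliest=-30d@d latest=@d
| bin _time span=1d
| stats sum(b) as hostdaybytes by h _time
| rename h as host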


willluo
Engager

It's really nice of you to provide such a comprehensive answer. It helps a lot. Thank you very much.
