We have alerts for high Windows Server CPU usage, and automated vulnerability scanners can trip those alerts; we'd like to ignore the scanner-driven hits. The existing alert has just one nested search to check CPU usage, and the main search uses the host names it returns to find the processes with high CPU, roughly like this:
```analyze processes on high-cpu hosts```
index=perf source=process cpu>25
```find high-CPU hosts```
[search
index=perf source=cpudata cpu>95
|stats count by host
|search count > 2
|fields host ]
```omitted: other macros, formatting, etc.```
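For anyone unfamiliar with the mechanics: the subsearch's host values are expanded back into the outer search as an implicit OR filter. Assuming two hypothetical hosts, web01 and web04, pass the inner checks, the outer search effectively becomes:
index=perf source=process cpu>25 ( ( host="web01" ) OR ( host="web04" ) )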
In order to ignore the vulnerability scans, we have to check the IIS index for requests from clients whose resolved hostnames start with a particular prefix (scanner*) -- this works fine:
```analyze processes on high-cpu hosts```
index=perf source=process cpu>25
```disregard hosts taking scanner traffic```
[search
index=iis
|lookup dnslookup clientip OUTPUT clienthost
|eval scanner=if(like(clienthost,"scanner%"),1,0)
|stats sum(scanner) as count by host
|search count<10
|fields host
```find high-CPU hosts```
|search [search
index=perf source=cpudata cpu>95
|stats count by host
|search count > 2
|fields host ]]
```omitted: other macros, formatting, etc.```
The problem is that not all servers use IIS. For example, a SQL Server will never appear in the IIS index. So I'm trying to find a way to have a host's value passed to the main search even when that host has no hits in the IIS index at all.
I kind of suspect I should combine the IIS search into the high-CPU subsearch (referencing both indexes), but I'm having a hard time wrapping my head around how that would work.
As a side note, performance seems pretty bad. Each subsearch runs subsecond stand-alone (even though indexes like IIS log over a million events per minute), but the multi-subsearch version takes a little over a minute -- which surprises me, since only a few host values are passed out of the innermost search. Performance suggestions are very welcome, but we can live with the processing time if I can solve this other issue.
If I understand correctly, you want the hosts with high CPU, but not if the client host begins with "scanner"? Does something like this work for you?
```analyze processes on high-cpu hosts```
index=perf source=process cpu>25
```find high-CPU hosts```
[search
index=perf source=cpudata cpu>95
|stats count by host
|search count > 2
|fields host ]
```disregard hosts taking scanner traffic```
NOT [search
index=iis
|lookup dnslookup clientip OUTPUT clienthost
|where like(clienthost,"scanner%")
|fields host
|format]
```omitted: other macros, formatting, etc.```
Almost ... I want hosts with high CPU, excluding any that receive requests from client hosts beginning with "scanner", but still including hosts that don't appear in the IIS index at all. That's why I think I probably need to combine them, then eval and stats some 1-or-0 fields by host based on which index produced a match.
However, the IIS index is enormous (about 80% of our roughly 4000 servers run IIS), which is why a nested search (so specific hosts are identified first) would be preferable to a combined search.
In fact, now I think I have to do both, due to the sheer volume of IIS data -- query for high CPU first to generate a target list of hosts, then run the combined IIS/CPU query so that the CPU index will still generate hits for hosts without any IIS traffic.
```analyze processes on high-cpu hosts```
index=perf source=process cpu>25
```filter the high-CPU hosts without scan traffic```
[search
(index=perf source=cpudata cpu>95) OR index=iis
|lookup dnslookup clientip OUTPUT clienthost
|eval scantraffic=if(index="iis" AND like(clienthost,"scanner%"), 1, 0)
|eval highcpu=if(index="perf", 1, 0)
|stats sum(scantraffic) as scans, sum(highcpu) as cpu by host
|search cpu>2 AND scans<10
|fields host
```generate list of high-CPU hosts```
|search [search
index=perf source=cpudata cpu>95
|stats count by host
|search count > 2
|fields host ]]
```omitted: other macros, formatting, etc.```
Quite frustrating. That works when I run the series of searches individually (appending a format command and manually pasting the host list into each of them). Run individually, the three searches' total execution time is maybe four seconds.
But combined as nested searches or subsearches (not sure which term is correct, maybe both)?
It times out at 60 seconds without finding anything.
Completely ridiculous.
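To be concrete about the by-hand version (with hypothetical host names): the innermost search gets a format command appended, and I paste its output into the next search as a literal filter:
index=perf source=cpudata cpu>95
|stats count by host
|search count > 2
|fields host
|format
```format returns a single search string, e.g. ( ( host="web01" ) OR ( host="web04" ) ), which then goes into the next search by hand```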
What about something like this?
```analyze processes on high-cpu hosts```
index=perf source=process cpu>25
```find high-CPU hosts```
[search
index=perf source=cpudata cpu>95
```disregard hosts taking scanner traffic```
NOT [search
index=iis
|lookup dnslookup clientip OUTPUT clienthost
|where like(clienthost,"scanner%")
|fields host
|format]
|stats count by host
|search count > 2
|fields host
|format]
```omitted: other macros, formatting, etc.```
That has the same problem I was trying to avoid above -- if I search the IIS logs without a host list from some other subsearch, it has to look through ~15 million events every time the alert runs (15-minute intervals, with about 1 million IIS events logged per minute).
The high-CPU host list is typically only a few machines at most, and the CPU log is relatively small (each host logs an event every 5 minutes), so it makes a good starting-point filter.
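So what I think I want is roughly this shape (a sketch, untested at our data volumes): move the high-CPU host list into the base of the combined search, so the IIS scan is constrained before the lookup and stats ever run:
((index=perf source=cpudata cpu>95) OR index=iis)
    [search index=perf source=cpudata cpu>95
    |stats count by host
    |search count > 2
    |fields host]
|lookup dnslookup clientip OUTPUT clienthost
|eval scantraffic=if(index="iis" AND like(clienthost,"scanner%"), 1, 0)
|eval highcpu=if(index="perf", 1, 0)
|stats sum(scantraffic) as scans, sum(highcpu) as cpu by host
|search cpu>2 AND scans<10
|fields host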
Create a summary index from the iis index which has just the distinct values in it.
Yeah that's what johnhua is suggesting below. It's a long process to get anything into production here, unfortunately, but it may come to that.
I appreciate the help.
@mv10 Keep in mind that using dnslookup to resolve hostnames is very slow. You should try to avoid it by either 1. specifying a list of scanner IPs in your search, or 2. creating a lookup to resolve your scanner hostnames and IPs.
NOT [search
index=iis clientip IN (172.168.1.1, 172.168.1.2, 172.168.1.3, 172.168.1.4)
| fields host
| format]
Thanks for the thought. Now that you mention it, I do remember running into dnslookup perf issues in the past. Unfortunately this is an enormous company and there's near zero chance of getting anyone to provide and maintain a lookup.
However, that doesn't explain why the nested searches are so slow that they fail, when the identical searches executed individually are very, very fast (specifically, approximate averages, inner to outer, on a 15-minute period known to produce six host names: 0.7 sec, 1.6 sec, and 0.9 sec respectively).
Let's narrow down where the issue is occurring.
Try this search without IIS and without the format command on the subsearch host results, which could be making things slower:
index=perf source=process cpu>25 earliest=-15m@m
[search index=perf source=cpudata cpu>95 earliest=-15m@m
|stats count by host | where count>2 | fields host]
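If that baseline is still fast, a possible next step is to time just the IIS leg by itself over the same window (a sketch built from the searches above):
index=iis earliest=-15m@m
|lookup dnslookup clientip OUTPUT clienthost
|where like(clienthost,"scanner%")
|stats dc(host) as scanner_target_hosts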
Generating and maintaining a reverse-DNS lookup for the clientip values in the iis index is pretty simple to automate.
1. Run this manually to create and populate the initial lookup data:
index=iis clientip=*
| dedup clientip
| lookup dnslookup clientip OUTPUT clienthost
| where len(clienthost)>1
| eval update_time=now()
| table update_time clientip clienthost
| outputlookup iis_rdns_ip_host_lookup.csv
2. Create a scheduled job, hourly or daily, to update and refresh the lookup file. To speed up the process, this query only resolves IPs that aren't found in the lookup table, or whose records are older than 7 days. You should also adjust the data retention (currently set to 30 days):
index=iis clientip=*
| dedup clientip
| lookup iis_rdns_ip_host_lookup.csv clientip OUTPUT clienthost update_time
``` Exclude existing records that were updated within the last 7 days ```
| where update_time<relative_time(now(), "-7d@d") OR isnull(clienthost)
| lookup dnslookup clientip OUTPUT clienthost
| where len(clienthost)>1
| eval update_time=now()
| append [
| inputlookup iis_rdns_ip_host_lookup.csv | where
``` Removed stale records >30 days from being saved```
update_time>relative_time(now(), "-30d@d")]
| dedup clientip
| table update_time clientip clienthost
| outputlookup iis_rdns_ip_host_lookup.csv
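Once the file is in place, the alert's IIS subsearch can swap the slow dnslookup for the maintained lookup, something like:
NOT [search index=iis
| lookup iis_rdns_ip_host_lookup.csv clientip OUTPUT clienthost
| where like(clienthost,"scanner%")
| fields host
| format]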
What you suggested I try is the existing, current-state alert (the first in my original post). It works fine -- it's what I was trying to modify and improve.
I appreciate all the details about creating a lookup, but I don't have permissions or any way to get that into production so it's a moot point, unfortunately.