Compare two indexes and report on mismatch

Richy_s · ‎09-25-2024

I'm comparing two indexes, A and B, using the hostname as the common field. My current search successfully identifies whether each hostname in index A is present in index B. However, I also want to include additional information from index A, such as the operating system and device type, in the output. This information is not present in index B. How can I modify my query to display the operating system alongside the status (missing/ok) for each hostname?

below is the query I am using

index=A sourcetype="Any" | eval Hostname=lower(Hostname) | table Hostname | dedup Hostname | append [ search index=B sourcetype="foo" | eval Hostname=lower(Reporting_Host) | table Hostname | dedup Hostname ] | stats count by Hostname
| eval match=if(count=1, "missing", "ok")

yuanliu · ‎09-25-2024

First, unless you have prior knowledge that number of Hostname in index A is always larger than that in index B in any search period, "missing" simply means that the name appears only in one index. The following does not try to address this problem, but will give you what you want, and is much simpler, perhaps more performant.

(index=A sourcetype="Any") OR (index=B sourcetype="foo")
| eval Hostname=coalesce(lower(Hostname), lower(Reporting_Host))
| fields index Hostname operating_system device_type
| stats values(*) as * by Hostname
| eval match=if(mvcount(index) == 1, "missing", "ok")

Not only operating system and device type, you can add any other fields of interest that may only exist in one of indices.

Richy_s · ‎09-26-2024

Thank you @yuanliu @jawahir007 Both of your solutions are working absolutely fine. @yuanliu yes, index A always has larger number of hosts compared to index B.

I would like to further expand this query to match the IP address aswell. Can you provide some guidance around that.

index A data

Hostname	IP address	OS
xyz	190.1.1.1, 101.2.2.2, 102.3.3.3, 4.3.2.1	Windows
zbc	100.0.1.0	Linux
alb	190.1.0.2	Windows
cgf	20.4.2.1	Windows
bcn	20.5.3.4, 30.4.6.1	Solaris

Index B

Hostname

zbc

30.4.6.1

alb

101.2.2.2

Results

Hostname	IP address	OS	match
xyz	190.1.1.1, 101.2.2.2, 102.3.3.3, 4.3.2.1	Windows	ok(because IP address 101.2.2.2 is matching)
zbc	100.0.1.0	Linux	ok
alb	190.1.0.2	Windows	ok
cgf	20.4.2.1	Windows	missing(neither hostname is present nor the IP is matching)
bcn	20.5.3.4, 30.4.6.1	Solaris	yes(IP is matching)

In my initial use case, I compared the hostnames in index A with those in index B. Now, I want to check if the hosts in index A are reporting their IP addresses in index B. If there’s a match, I will mark the corresponding hostname in index A as "ok."

yuanliu · ‎09-26-2024

Assuming that, like in the OP, index A still carries Hostname field that you want to compare with Reporting_Host in index B. In addition, index A has a field "IP address". This should get your desired result.

index=A sourcetype="Any"
| stats values("IP address") as "IP address" by Hostname OS
| append
    [search index=B sourcetype="foo"
    | stats values(Reporting_Host) as Reporting_Host]
| eventstats values(eval(lower(Reporting_Host))) as Reporting_Host
| where index != "B"
| mvexpand "IP address"
| eval match = if(lower(Hostname) IN (Reporting_Host) OR 'IP address' IN (Reporting_Host), "ok", null())
| stats values("IP address") as "IP address" values(match) as match by Hostname OS
| fillnull match value="missing"

Use the following emulation:

| makeresults format=csv data="Hostname,	IP address,	OS
xyz,	190.1.1.1:101.2.2.2:102.3.3.3:4.3.2.1,Windows
zbc,	100.0.1.0,	Linux
alb,	190.1.0.2,	Windows
cgf,	20.4.2.1,	Windows
bcn,	20.5.3.4:30.4.6.1,	Solaris"
| eval "IP address" = split('IP address', ":")
| eval index = "A"
| append
    [makeresults format=csv data="Reporting_Host
zbc
30.4.6.1
alb
101.2.2.2"
    | eval index = "B"]
``` the above emulates
index=A sourcetype="Any"
| stats values("IP address") as "IP address" by Hostname OS
| append
    [search index=B sourcetype="foo"
    | stats values(Reporting_Host) as Reporting_Host]
```

the result is

Hostname	OS	IP address	match
alb	Windows	190.1.0.2	ok
bcn	Solaris	20.5.3.4 30.4.6.1	ok
cgf	Windows	20.4.2.1	missing
xyz	Windows	101.2.2.2 102.3.3.3 190.1.1.1 4.3.2.1	ok
zbc	Linux	100.0.1.0	ok

Richy_s · ‎09-27-2024

@yuanliu Thank you for your response. I tried below query but it doesn't seem to be working. When I further cut down the query for testing, looks like "|where index!=B" is not working. Everything before this query is working but when I add this condition, I get 0 results.

also, the query seems to be very aggressive. My index A has almost close to 70k events and index B has around 10k events. Splunk was crashing few times when I try to run the query.

Any suggestions, how to address this ?

yuanliu · ‎09-27-2024

My apologies. I was switching between two different approaches and the filters got crossed. To use the subsearch method above, modify that line to | where isnotnull(OS)

index=A sourcetype="Any"
| stats values("IP address") as "IP address" by Hostname OS
| append
    [search index=B sourcetype="foo"
    | stats values(Reporting_Host) as Reporting_Host]
| eventstats values(eval(lower(Reporting_Host))) as Reporting_Host
| where isnotnull(OS)
| mvexpand "IP address"
| eval match = if(lower(Hostname) IN (Reporting_Host) OR 'IP address' IN (Reporting_Host), "ok", null())
| stats values("IP address") as "IP address" values(match) as match by Hostname OS
| fillnull match value="missing"

Depending on your deployment, combining the two index searches could improve performance, like this

(index=A sourcetype="Any") OR (index=B sourcetype="foo")
| eventstats values(eval(lower(Reporting_Host))) as Reporting_Host
| where index != "B"
| mvexpand "IP address"
| eval match = if(lower(Hostname) IN (Reporting_Host) OR 'IP address' IN (Reporting_Host), "ok", null())
| stats values("IP address") as "IP address" values(match) as match by Hostname OS
| fillnull match value="missing"

But eventstats and mvexpand could be bigger performance hindrances. There could be ways to avoid mvexpand; there could be ways to improve eventstats. But unless you can isolate the main contributor to slowness, they are not worth exploring.

Performance is a complex subject with any querying language. You can start by doing some basic tests. For example, run those two subsearches separately and compare with combined search. If the total time is comparable, index search is the main hindrance. That will be very difficult to improve. Another test could be to add dedup before stats. And so on.

Richy_s · ‎10-03-2024

Hi @yuanliu Unfortunately none of the below queries are working for me. First one is crashing splunk so unable to test it. Second one, I don't get any results. Could be because the field "Reporting_Host" is present only in index B and since we are excluding index B in the next step, the results are 0. However I tried renaming the Hostname field in index A and try running the query but no results.

Can we test this scenario using a look up table, that might improve the search performance. Can you give me something in this regards?

Richy_s · ‎10-03-2024

Something like this

index=B | stats count by Reporting_Host | search NOT [| inputlookup inventory.csv ] | table Hostname ]

inventory.csv has the table pickup from index A. Lookup query is - index=B | stats values(IP address) by Hostname Operating_system

PS This query is not working, just a thought.

yuanliu · ‎10-03-2024

You said crash several times. This makes me think that your server may not have enough memory for that giant eventstats. (Again, this all depends on how many unique host name and IP addresses are in those tens of thousands of events. If that number is unusually large, it could exceed system RAM. But it also makes me suspect that your search heads might be under provisioned.)

If you are willing to use inventory lookup, things can improve. Given that only one field from index B is useful in your logic, inventory should come from this index. (You also said index B is smaller.)

Here is my suggested setup for inventory.csv.

index = B sourcetype="foo"
| dedup Reporting_Host
| table Reporting_Host
| outputlookup inventory.csv

After this,

index=A sourcetype="Any"
| fields "IP address" Hostname OS
| dedup "IP address" Hostname OS
| eval Hostname = lower(Hostname)
| lookup inventory.csv Reporting_Host as Hostname output Reporting_Host as match
| lookup inventory.csv Reporting_Host as "IP address" output Reporting_Host as match
| eval match = if(isnull(match), "missing", "ok")
| table Hostname	"IP address"	OS	match

Richy_s · ‎10-04-2024

Hi @yuanliu

Thank you for your feedback! I will definitely look into the performance issue and plan for further improvements.

Regarding the query, I tried it out, and here’s how it’s working: Index A contains around 70k assets, which serves as our asset inventory. Some hosts in this index have multiple IP addresses assigned to them. Index B has just the hostname, but this can include a mix of IP addresses, FQDNs, and hostnames.

When I ran the query, the first lookup compared the Reporting_Host with the hostnames in Index A and determined whether there was a match. The second lookup compared the Reporting_Host against the IP addresses in Index A to check for matches. However, when we combined these lookups as shown in the query you shared, the results only reflected matches from the second lookup, meaning only the IP addresses were being compared. Additionally, since a host in Index A has multiple IP addresses, the query gives a match for the IP address that corresponds, but for the remaining IP addresses associated with that host, it shows them as missing.

yuanliu · ‎10-04-2024

when we combined these lookups as shown in the query you shared, the results only reflected matches from the second lookup, meaning only the IP addresses were being compared. Additionally,

My mistake again. When using the same output name, the second lookup overrides the first. Use outputnew in the second.

index=A sourcetype="Any"
| fields "IP address" Hostname OS
| dedup "IP address" Hostname OS
| eval Hostname = lower(Hostname)
| lookup inventory.csv Reporting_Host as Hostname output Reporting_Host as match
| lookup inventory.csv Reporting_Host as "IP address" OUTPUTNEW Reporting_Host as match
| eval match = if(isnull(match), "missing", "ok")
| table Hostname	"IP address"	OS	match

jawahir007 · ‎09-25-2024

index=A sourcetype="Any" 
| eval Hostname=lower(Hostname) 
| table Hostname os device_type ```# Include os and device_type fields```
| dedup Hostname 
| append [ 
search index=B sourcetype="foo" 
| eval Hostname=lower(Reporting_Host) 
| table Hostname 
| dedup Hostname 
] 
| stats values(os) as os values(device_type) as device_type count by Hostname 
| eval match=if(count=1, "missing", "ok") 
| table Hostname os device_type match

------

Compare two indexes and report on mismatch

eval

subsearch

If you find this solution helpful, please consider accepting it and awarding karma points !!

Why You Can't Miss .conf25: Unleashing the Power of Agentic AI with Splunk & Cisco

Deep Dive into Federated Analytics: Unlocking the Full Power of Your Security Data

Your summer travels continue with new course releases

Are you a member of the Splunk Community?

Compare two indexes and report on mismatch

eval

subsearch

If you find this solution helpful, please consider accepting it and awarding karma points !!

Why You Can't Miss .conf25: Unleashing the Power of Agentic AI with Splunk & Cisco

Deep Dive into Federated Analytics: Unlocking the Full Power of Your Security Data

Your summer travels continue with new course releases