Hello Community,
I have a challenge finding and isolating the unique hosts out of two sources (DHCL and SysMon in my case)
I did try the following but it did work as expected:
EXAMPLE 1:
index=dhcp_source_index | stats count by host | eval source="dhcp" | append [ search index=sysmon_index | stats count by host | eval source="sysmon" ] | stats values(source) as sources by host | where mvcount(sources)=1 AND sources="dhcp"
EXAMPLE 2:
index=my_index | dedup host, source | stats list(source) as sources by host | append [search index=my_index | stats latest(_time) as last_seen by host] | eventstats max(last_seen) as last_seen by host | where mvcount(sources)=1 | table host, last_seen
The numbers from the manual findings and the above SPLs differ
Thanks in advance
| set diff
[| tstats count where source_1 by host | table host]
[| tstats count where source_2 by host | table host]That SPL provides a list of all of the hosts not seen in source_2
The search is not wrong but the last statement is inaccurate because set diff as shown produces a list of all hosts in source_1 not seen in source_2, plus all hosts in source_2 not seen in source_1. (The statement is correct only if hosts in source_2 is a subset of that in source_1. Maybe this is a condition known in your use case?) So, it is equivalent to the search I posted in this one.
To get list of only those hosts in source_1 that are not in source_2, use my search in this earlier one or, as @PickleRick suggested, improve it with tstats like
| tstats values(host) as host
where source_1 NOT
[tstats values(host) as host
where source_2]
If hosts in source_2 is a subset of that in source_1 as may be the case, this method will produce the exact same result, and will still be more efficient.
Two questions.
With these assumptions, you don't need stats. The following subsearch should suffice:
index=dhcp_source_index NOT
[search index=sysmon_index
| stats values(host) as host]
It's not clear how you distinguish your "sources". Your first search simply pulls data from two separate indexes while the second one does something completely strange.
Please describe what constitutes those sets of sources you want to calculate difference from.
@PickleRick Thanks for the reply.
Please, ignore both searches.
What I want to pull out the total unique assets in the DHCP source. I then want to be able to compare to the totals of unique assets in the SysMon source and output these assets that do not have SysMon present.
Thanks in advance.
Sure. But how do you define "assets"? How do you differentiate between them?
Because while you can use the general approach of combining two separate searches (either by means of append or multisearch) with additional field to classify your results into one of two sets, there might be more effective ways in specific cases.
Thanks for the reply @PickleRick
Sure, there will be a third column containing only assets that are not seen in both sources simultaneously and in addition at the end of the list there should be Totals of these assets.
Would you be able to develop a sample solution for this, please?
Thank you.
But what do you mean by asset? What in your data tells you that this is one "asset" and this is another one? Is it the host field or some other field within your data? Or any combination of fields?
Hi, thanks for the reply.
To simplify it let us say we have two lists of items of same type it could be anything.
How can we compare both lists and list only the subset of items not common to both lists.
Regards,
D
You can see my earlier comment. Essentially you don't need append/stats for this job.
index=dhcp_source_index NOT
[search index=sysmon_index
| stats values(host) as host]
If you only want to know which DHCP hosts are not in sysmon_index, add stats values(host) or stats count by host after this search.
Thanks for your reply @yuanliu
Unfortunately, your search did not provide the results I wanted. After executing the separate searches and abstract manually the result differs from the resultant of your search. Please do try it out.
After lots of try/error I finally found the one that does the trick. It is by using 'set diff' command. I will provide my solution tomorrow for everyone to use.
Regards,
Dan
set diff does not give host that do not have SysMon as the original question specifies. So, you want to know which sets of hosts are unique to each search, and not care that the only come from dhcp_source_index? (That is why I was asking very specific clarification questions, and stated clear assumptions of what my search is intended to do.)
Again, set is an expensive operation. You should be able to use stats to achieve it. The following is equivalent to set diff:
index=dhcp_source_index OR index=sysmon_index
| stats values(index) as index by host
| where mvcount(index) == 1
Maybe you have some requirements that you are not telling us?
Thanks for the reply @yuanliu
Agree to disagree.
If you look at the very beginning of my post I asked:
"I have a challenge finding and isolating the unique hosts out of two sources"
I think this is clear and SysMon and DHCP were just an example. Nothing concrete.
During the communication I have reiterated this statement.
Apologies if misunderstood.
Thanks all for your help.
Honestly, I have no idea what was the point of this thread. You asked some vague question (apparently wih some assumptions known only to you). People tried to help you by asking more precise questions (because often there is more than one way to do something in Splunk and depending on the data and requirements some solutions may be way way more effective than others) and you kept throwing more and more remarks leading nowhere.
Even if you simply asked it as a purely theoretical exercise - to find out what approaches you can take to solve this kind of problem, you should have said so clearly - the descriptive answer about general possible approaches is something completely different than a solution to a specific problem.
And no, the set operators are not very effective and are very rarely used since there are usually much better solutions - either filtering out by a subsearch as @yuanliu showed or by classifying, statsing and filtering (or even faster - tstatsing if you can use indexed fields).
Frustration is the least emotional state I wanted to achieve here. Apologies!
I still believe it was not me confusing people, I just wanted help with simply being able to compare two datasets and printing out the only hosts seen in one of the data sources. if that sounds confusing then again apologies.
I have been a member for some time now and always admired all the help given here.
I have tested all solutions provided so far, and seen no results (it might be my fault).
The only solution that provided me with the results I want was the following for all to use.
I agree (do not have great knowledge around set command as of now) set might not be as efficient as other commands bet here is what worked for me.
| set diff
[| tstats count where source_1 by host | table host]
[| tstats count where source_2 by host | table host]
That SPL provides a list of all of the hosts not seen in source_2
At the end of the day it is important for people to get some working examples. They are testing and either working or not working. Silence is gold time to time and no one wants frustration waves. I was not meant to do any harm. Been in the industry for a long time enough to realize we all different and have different emotions.
Please, by all means if you can create something that would prove me wrong or anything better than set command for the community to use.
Thank you!
| set diff
[| tstats count where source_1 by host | table host]
[| tstats count where source_2 by host | table host]That SPL provides a list of all of the hosts not seen in source_2
The search is not wrong but the last statement is inaccurate because set diff as shown produces a list of all hosts in source_1 not seen in source_2, plus all hosts in source_2 not seen in source_1. (The statement is correct only if hosts in source_2 is a subset of that in source_1. Maybe this is a condition known in your use case?) So, it is equivalent to the search I posted in this one.
To get list of only those hosts in source_1 that are not in source_2, use my search in this earlier one or, as @PickleRick suggested, improve it with tstats like
| tstats values(host) as host
where source_1 NOT
[tstats values(host) as host
where source_2]
If hosts in source_2 is a subset of that in source_1 as may be the case, this method will produce the exact same result, and will still be more efficient.
Thanks very much for the solution @yuanliu
Much appreciated!