Solved: UserID exists in query 1 but not query 2 by host ?

sjringo · ‎07-14-2023

I have been trying to figure this out but getting stumped. I have seen other questions similar but just slightly different. I have read from other posts to try to not use outer join's ?

I have two queries that are executing the same query just over two different time frames and I want to find only those UserID's that exists in Query1 but not in Query2 which would tell me the user is currently active but was not active before...

Query 1:

index=anIndex sourcetype=aSourceypte earliest=-17m@m latest=-2m@m

| rex field=_raw "^(?:[^,\n]*,){2}(?P<LoginUserID>\w+\.\w+)"

| dedup host LoginUserID

| sort host LoginUserID

| table host LoginUserID

Query2:

index=anIndex sourcetype=aSourceypte earliest=-32m@m latest=-17m@m

| rex field=_raw "^(?:[^,\n]*,){2}(?P<LoginUserID>\w+\.\w+)"

| dedup host LoginUserID

| sort host LoginUserID

| table host LoginUserID

Been trying to figure out a way to use the count of the occurrences of LoginUserID and if it = 1 then that tells me it only exists in one query but stumped there too.

Am I close or thinking about this the wrong way ?

yuanliu · ‎07-14-2023

Sounds like a perfect job for subsearch

index=anIndex sourcetype=aSourceypte earliest=-17m@m latest=-2m@m
| rex field=_raw "^(?:[^,\n]*,){2}(?P<LoginUserID>\w+\.\w+)"
| search NOT
  [ search index=anIndex sourcetype=aSourceypte earliest=-32m@m latest=-17m@m
   | rex field=_raw "^(?:[^,\n]*,){2}(?P<LoginUserID>\w+\.\w+)"
   | stats values(LoginUserID) as LoginUserID by host
   | mvexpand LoginUserID ]
| dedup host LoginUserID
| sort host LoginUserID
| table host LoginUserID

View solution in original post

yuanliu · ‎07-14-2023

Sounds like a perfect job for subsearch

index=anIndex sourcetype=aSourceypte earliest=-17m@m latest=-2m@m
| rex field=_raw "^(?:[^,\n]*,){2}(?P<LoginUserID>\w+\.\w+)"
| search NOT
  [ search index=anIndex sourcetype=aSourceypte earliest=-32m@m latest=-17m@m
   | rex field=_raw "^(?:[^,\n]*,){2}(?P<LoginUserID>\w+\.\w+)"
   | stats values(LoginUserID) as LoginUserID by host
   | mvexpand LoginUserID ]
| dedup host LoginUserID
| sort host LoginUserID
| table host LoginUserID

PickleRick · ‎07-15-2023

I will disagree here 😉 Since the base search is the same, I'd say it's a perfect job for avoiding the subsearch by classification.

<base search> earliest=-32m@m latest=-2m@m
| rex "..."
| eval period=if(time<relative_time(now(),"-17m@m"),"first_half","second_half")
|stats values(LoginUserID) as LoginUserID by host period

sjringo · ‎07-15-2023

Im working through this and have a question about one line.

| eval period=if( time < relative_time(now(),"-17m@m"),"first_half","second_half")

In this line what is "time" ?

I am aware of _time and have made the necessary changes to get my query to run with results.

I have changed the query to use "time" and "_time" and both return the same results ?

PickleRick · ‎07-15-2023

Nice catch. I meant _time of course. I was writing this on a tablet and fighting with the screen keyboard must have made me forget the underscore 🙂

And what do you mean by "same results"? My search should give you separate lists of users per host per each period. So you'd get a list of users from first 15 minutes and another list from second 15 minutes. Then you can work on that.

sjringo · ‎07-15-2023

My mistake. I had modified the query/code and it was giving me false results.

What I am working on now is that I have the results.

i.e.

Host, Period, LoginUserID

X,first_half, user1

X, second_half, user2

X, second_half, user1

Im not interested in specific users, just the total for each host. Or the difference between the two period's?

i.e. Users that are active in the second_half that did not exist in the first_half

yuanliu · ‎07-15-2023

@sjringo wrote:
Im not interested in specific users, just the total for each host. Or the difference between the two period's?
i.e. Users that are active in the second_half that did not exist in the first_half

I am a strong proponent of not using subsearch/join, etc. One reason why I lean into subsearch for this particular problem is in part because my use of subsearch "naturally" excludes uers that are active in both periods. Another good reason to not avoid subsearch is because the two searches in this case have zero overlapping event. (I analyzed this factor in an earlier post involving more complex intervals.) Large number of duplicate events is the stronger reason behind advocating against subsearch. (The reason against join can vary.)

This said, my use of subsearch does incur the cost of launching a second search, and execution of two identical stats commands. These are relative small. (Even here, whether one stats over larger value set is more efficient than two stats over smaller value sets is debatable.)

However, if you absolutely must avoid subsearch, the difference can easily be calculated if you changing stats terms:

index=anIndex sourcetype=aSourceypte earliest=-32m@m latest=-2m@m
| rex field=_raw "^(?:[^,\n]*,){2}(?P<LoginUserID>\w+\.\w+)"
| eval period=if(_time < relative_time(now(), "-17m@m"), "first_half", "second_half")
| stats values(period) as period by LoginUserID host
| where period == "second_half" AND period != "first_half"

sjringo · ‎07-15-2023

The only reason I mentioned staying away from subsearches and/or joins is that when I was reading other posts with similar questions or looking for a similar solution I kept seeing comments to stay away from it.

I actually stumbled across a different 'solution' which was to use: set diff [..] using two overlapping time frames.

1st timeframe is -32 to now, 2nd is -32 to -17. then removing duplicates and then doing a stats count(LoginUserID) by host

I wanted to get all three 'solutions' working and then compare the results to see if the all were the same which they are.

My queries are not running long nor am I worried about performance.

I prefer less code so I am leaning toward giving 'credit' to your first solution even though PickleRicks also does the same job with a few more 'steps' and PickleRick has helped me in the past so I have to figure a way to recognize both of your time/effort !!!

PickleRick · ‎07-15-2023

OK. After stats values() you should have two lists per each host - one from the earlier period, one from the later one.

Now you have two ways of dealing with this.

1) mvexpand those lists and do a "reverse stats"

| stats values(period) as period by host LoginUserID

This way for each user for each host you'll have info whether he logged in only during the first period, second one or both. You can then filter it and merge back. Like

| where period="first_half" AND NOT period="second_half"
| stats values(LoginUserID) by host

This way you'll get a list of users which logged in only during the first half. (mind you, the AND NOT condition is important and it works differently than if you wrote period!="second_half").

2) Or you can use xyseries to merge your results into single rows per host.

And now since there is no command to compare two fields in terms of set elements, you have to use foreach with mode=multivalue to iterate over elements from one field to search another field with mvfind and compose your output field accordingly. But I find this approach less "splunky".

UserID exists in query 1 but not query 2 by host ?

count

field extraction

join

rex

subsearch

table

Earn a $35 Gift Card for Answering our Splunk Admins & App Developer Survey

Continuing Innovation & New Integrations Unlock Full Stack Observability For Your ...

Monitoring Amazon Elastic Kubernetes Service (EKS)