Solved: How to compare fields over multiple sourcetypes wi...

MuS · ‎03-31-2014

Hello everyone,

Now, this one bugs me for some time and this question got my attention back to this topic.
How can one compare fields over multiple source types without the use of join, append or any other subsearch?

I know, there some use cases where one has to use either of the above commands. But I don't want to and I don't need to, so what can be done?

cheers, MuS

MuS · ‎07-02-2015

join or append are the most intuitive direction to go in, but it's not very efficient and it's pretty cumbersome.

I will show what can be done by using a run everywhere example. This will report on one field user which is available in both sourcetypes and some others group or run_time which is only available in one sourcetype.

The use case here is to show which user, group and run_time we have per date_hour.

Let's start with the obvious one: Have you met chart?

index=_internal earliest=-2h@h latest=-0h@h sourcetype=splunkd OR sourcetype=scheduler | chart values(user) AS user values(group) AS group values(run_time) AS run_time by date_hour

The next obvious will be: Have you met stats?

index=_internal earliest=-2h@h latest=-0h@h sourcetype=splunkd OR sourcetype=scheduler | stats values(user) AS user values(group) AS group values(run_time) AS run_time by date_hour

Now move on for some fancy stats stuff?

index=_internal earliest=-2h@h latest=-0h@h sourcetype=splunkd OR sourcetype=scheduler 
| eval hour-{sourcetype}=date_hour 
| stats values(hour-*) AS * values(user) AS user values(group) AS group values(run_time) AS run_time 
| mvexpand splunkd  
| mvexpand scheduler 
| where splunkd=scheduler 
| rename splunkd AS date_hour 
| fields - scheduler

the eval on hour-{sourcetype}=date_hour will create a new field which looks in this case like hour-splunkd and hour-scheduler which will be used later in the where clause....

or another nice run everywhere example:

index=_internal sourcetype="splunkd" OR sourcetype="splunk_web_access" | streamstats count by status, idx, sourcetype | stats values(idx) AS idx, values(status) AS status, values(sourcetype) AS sourcetype | mvexpand status | eval Status = if(match(idx,status), "MATCH", "NO MATCH") | table status, idx, Status

So why all this hustle and not use join instead?

Because it is not neccesary to join all search just because you have different sources.
Because you can use stats or chart for it.
Because it will be faster.
Because it is fun to challenge basic Splunk commands and do some tricks with them 🙂

Just for the record, all the above example run on my laptop and Splunk 6 for about 2.5 seconds while the join to get the same result take about 4.5 seconds:

index=_internal sourcetype=splunkd earliest=-2h@h latest=-0h@h |  join date_hour [ search index=_internal sourcetype=scheduler earliest=-2h@h latest=-0h@h ] | stats values(date_hour) AS date_hour values(user) AS user values(group) AS group

One more thing: I know that in some use cases one is forced to use join or append, but before that - just give stats a chance .....

cheers, MuS

View solution in original post

MuS · ‎07-02-2015

join or append are the most intuitive direction to go in, but it's not very efficient and it's pretty cumbersome.

I will show what can be done by using a run everywhere example. This will report on one field user which is available in both sourcetypes and some others group or run_time which is only available in one sourcetype.

The use case here is to show which user, group and run_time we have per date_hour.

Let's start with the obvious one: Have you met chart?

index=_internal earliest=-2h@h latest=-0h@h sourcetype=splunkd OR sourcetype=scheduler | chart values(user) AS user values(group) AS group values(run_time) AS run_time by date_hour

The next obvious will be: Have you met stats?

index=_internal earliest=-2h@h latest=-0h@h sourcetype=splunkd OR sourcetype=scheduler | stats values(user) AS user values(group) AS group values(run_time) AS run_time by date_hour

Now move on for some fancy stats stuff?

index=_internal earliest=-2h@h latest=-0h@h sourcetype=splunkd OR sourcetype=scheduler 
| eval hour-{sourcetype}=date_hour 
| stats values(hour-*) AS * values(user) AS user values(group) AS group values(run_time) AS run_time 
| mvexpand splunkd  
| mvexpand scheduler 
| where splunkd=scheduler 
| rename splunkd AS date_hour 
| fields - scheduler

the eval on hour-{sourcetype}=date_hour will create a new field which looks in this case like hour-splunkd and hour-scheduler which will be used later in the where clause....

or another nice run everywhere example:

index=_internal sourcetype="splunkd" OR sourcetype="splunk_web_access" | streamstats count by status, idx, sourcetype | stats values(idx) AS idx, values(status) AS status, values(sourcetype) AS sourcetype | mvexpand status | eval Status = if(match(idx,status), "MATCH", "NO MATCH") | table status, idx, Status

So why all this hustle and not use join instead?

Because it is not neccesary to join all search just because you have different sources.
Because you can use stats or chart for it.
Because it will be faster.
Because it is fun to challenge basic Splunk commands and do some tricks with them 🙂

Just for the record, all the above example run on my laptop and Splunk 6 for about 2.5 seconds while the join to get the same result take about 4.5 seconds:

index=_internal sourcetype=splunkd earliest=-2h@h latest=-0h@h |  join date_hour [ search index=_internal sourcetype=scheduler earliest=-2h@h latest=-0h@h ] | stats values(date_hour) AS date_hour values(user) AS user values(group) AS group

One more thing: I know that in some use cases one is forced to use join or append, but before that - just give stats a chance .....

cheers, MuS

aa70627 · ‎05-04-2020

@MuS how would this work if one of your dataset is a CSV file, where the CSV file is greater than 50k ? Is it possible to run this sort of query without using |append or join?

MuS · ‎05-04-2020

You can always use | inputcsv append=t file.csv in your SPL.

cheers, MuS

ismarslomic · ‎04-15-2017

@Mus thanks for very useful answer! To avoid messing up this thread I have created new Question which is highly related to this one and is about extending to three searches: https://answers.splunk.com/answers/521078/joining-fields-from-three-different-indexes-and-so.html

I would appreciate your help!

Lowell · ‎02-11-2016

I know I'm late to the party, just wanted to throw in one caution.

It's interesting that streamstats is actually faster than a join or append in this case. I could see that working for a small amount of data, but I suspect that factors like data set size (of both the primary and secondary sources) as well as search mode (single server vs distributed) could have a significant impact on performance.

The overall advice here is great, I just think it's prudent to point out that search commands like transaction, streamstats, and eventstats require that ALL events be streamed to the search head which can have some significant network bandwidth and performance implications. Compare this to stats where map-reduce allows the data to be precomputed on the indexers. Bottom line, don't forget to look a the "remote search" and the "report search" in the job inspector.

imanpoeiri · ‎09-14-2015

Hi @MuS,

After I POC the 3 of the method, I notice each of these are giving me different numbers of events.

Stats give me 2967 events
Chart give me 3296 events
Join give me 3092 events

Now I am questioning the accuracy of the methodology.

MuS · ‎09-15-2015

Again provide more Information and details and do it please in your other question.

wrangler2x · ‎09-09-2015

@MuS -- Where you said, "the eval on hour-{sourcetype}=date_hour will create a new field which looks in this case like hour-splunkd and hour-scheduler which will be used later in the where clause...." I am wondering about that, because in the where clause it says, where splunkd=scheduler so I don't understand that explanation.

Another thing puzzling me is the *values(hour-) AS *** in the stats command. I don't understand at all what that is doing.

MuS · ‎09-09-2015

In addition try this simple Splunk> Fu search and you will see how stats values(*) AS * works 🙂

index=_internal earliest=-24h@h latest=-0h@h sourcetype=splunkd OR sourcetype=scheduler 
| stats values(sourcet*) AS Splunk>Fu-t* count

wrangler2x · ‎09-09-2015

Okay, I think I missed something in the first explanation that is now cleared up (maybe!). So the use of brackets {} around an existing field yields the possible values for that field in that search in an eval. So you use those appended to something to create new field names to which you can assign anything I suppose, but date_hour in the example.

So it seems that in the stats command (and perhaps elsewhere) use of a partial field name followed by a * will cause splunk to auto-complete all possible field names with that specified beginning.

But in the rename Splunk>fu-t* it looks like the * autocompletes based on what ending was previously matched, which in this case is ype. Have I got it?

MuS · ‎09-10-2015

@wrangler2x, both is correct.

MuS · ‎09-09-2015

@wrangler2x
HeHe, this can be really mind twisting, but it's not - let me explain:

| eval hour-{sourcetype}=date_hour

will create fields that are named hour-splunkd or hour-scheduler with the value of the corresponding date_hour field of splunkd or scheduler. Next the stats will use the hour-* fields like this:

 | stats values(hour-*) AS *

and renames them into the value of star; hour-splunkd becomes splunkd and hour-scheduler becomes scheduler - Does that make sense?

khubyarb · ‎06-30-2016

Can you apply this same method to search across multiple indices instead of multiple sourcetypes?

martin_mueller · ‎06-30-2016

Yes, any combination of search filters.

khubyarb · ‎06-30-2016

Thanks for clearing that up!

wrangler2x · ‎09-09-2015

Oh, that is twisted sister! Yes, makes perfect sense. I had no idea the star could be used that way. That's pretty wow, and I see how it allows you access to the data for the two sourcetypes.

So, I take it then that when you do where splunkd=scheduler that you are pairing the data_hour for each set of information?

MuS · ‎09-10-2015

@wrangler2x: correct.

martin_mueller · ‎07-02-2015

I've occasionally found eventstats to be particularly helpful as well.

0range · ‎07-02-2015

append0 a chance ....
append0?

MuS · ‎07-02-2015

HeHe that is funny, is the 'raw' text it says just give stats a chance but in the answer it is append() ?!???

For sure it should also say ...give stats a chance 😉

How to compare fields over multiple sourcetypes without 'join', 'append' or use of subsearches?

[Puzzles] Solve, Learn, Repeat: Dynamic formatting from XML events

Enter the Agentic Era with Splunk AI Assistant for SPL 1.4

Stronger Security with Federated Search for S3, GCP SQL & Australian Threat ...

Join the Conversation

How to compare fields over multiple sourcetypes without 'join', 'append' or use of subsearches?

[Puzzles] Solve, Learn, Repeat: Dynamic formatting from XML events

Enter the Agentic Era with Splunk AI Assistant for SPL 1.4

Stronger Security with Federated Search for S3, GCP SQL & Australian Threat ...