Re: What are alternatives to using the join comman...

tsunamii · ‎04-01-2016

Hello Splunkers, I would like to seek advice on how to achieve the same goal without having to use the join command. This is the current search, with the use of the join command:

index=myidx sourcetype=myapp_log status=startSecurityUpgrade earliest=-6h@h latest=now | eval Start_Time=strftime(_time, "%m/%d/%y %H:%M:%S") 
| join host_ip 
[search index=idms_sat sourcetype=myapp_log status=sendCode  earliest=-6h@h latest=now | eval Enrolled_Time=strftime(_time, "%m/%d/%y %H:%M:%S") ] 
| where Enrolled_Time > Start_Time | timechart span=10m count(host_ip)

The problem I have with the above search (join command) is that if the data set is large, the count would become inconsistent.

nagendra008 · ‎04-03-2016

Hi,

If you have a large data then don't prefer join, because it is too slow. I suggest you to go with streamstats to join based on the host_ip.

--- >  First read the multiple indexs which you want to search, Now you will see all the events from different indexes
----> coalesce which helps you to group two columns in to single
--- >  Streamstats   .... your_search | streamstats values(x), values(y) by common_field | .......

Please take this as reference :

index=myidx sourcetype=myapp_log status=startSecurityUpgrade earliest=-6h@h latest=now OR index=idms_sat sourcetype=myapp_log status=sendCode |
eval HOST_IP = coallesce(host_ip1, host_ip2)|
streamstats current=t values(Enrolled_Time) , values(Start_Time) by HOST_IP |
timechart span=10m count(HOST_IP)

Regards,
Nagee,

sideview · ‎04-03-2016

This answer has a number of serious problems. The "OR" is a bit haphazard, here falling between "latest=now" and "index=idms_sat", so as written the OR would anly apply to those two terms. I believe you intended to wrap the two sets of search terms in parens. Also you can't have earliest/latest included in boolean expressions; I believe Splunk would give an error message about this. you're using coalesce to normalize "host_ip1" and "host_ip2", but in the question there's already a single consistent field called "host_ip" so maybe this is just a vague example. The choice of streamstats values() here is very strange. Possibly you intended to do eventstats... however it doesn't matter because the work done by streamstats is not used in any way by the subsequent timechart command. I like some of the things you're trying to point the asker towards, but offering these observations so that you might do some more further reading.

sideview · ‎04-01-2016

Certainly.

Step 1. Reformulate your two base searches as a single search with a "disjunction" ie a big "OR".

(index=myidx sourcetype=myapp_log status=startSecurityUpgrade) OR (search index=idms_sat sourcetype=myapp_log status=sendCode)

Next we use a technique broadly called "conditional eval", to paint fields called Start_Time and Enrolled_Time on the appropriate events.

(and I'm actually... throwing away your strftime because it's not helping here. _time is epochtime-valued, and a) you need to compare it with the > operator so that's a good thing, and b) the timechart command wants it as an epochtime value, so turning it into a string with strftime isnt going to help either of those things)

So base search plus "conditional eval" now looks like this:

(index=myidx sourcetype=myapp_log status=startSecurityUpgrade) OR (search index=idms_sat sourcetype=myapp_log status=sendCode) 
| eval Start_Time=if (status="startSecurityUpgrade",_time,null())
| eval Enrolled_Time = if (status="sendCode",_time,null())

Now we're ready to use stats to do the "joining"

(index=myidx sourcetype=myapp_log status=startSecurityUpgrade) OR (search index=idms_sat sourcetype=myapp_log status=sendCode) 
| eval Start_Time=if (status="startSecurityUpgrade",_time,null())
| eval Enrolled_Time = if (status="sendCode",_time,null())
| stats values(Start_Time) as Start_Time values(Enrolled_Time) as Enrolled_Time by host_ip

I'm using values here just... so you can get the lay of the land. If there's only one of each per value of host_ip, all the better. If there are sometimes more than one, you'll have to think about which one you want to use... EIther way I advise starting from values() and thinking carefully about the one-to-many possibilities.

Last but not least, apply your filtering and do your timechart. Oh and let's say you picked min(Start_Time) and max(Enrolled_Time) respectively, once you pondered the values() situation.

(index=myidx sourcetype=myapp_log status=startSecurityUpgrade) OR (search index=idms_sat sourcetype=myapp_log status=sendCode) 
| eval Start_Time=if (status="startSecurityUpgrade",_time,null())
| eval Enrolled_Time = if (status="sendCode",_time,null())
| stats min(Start_Time) as Start_Time max(Enrolled_Time) as Enrolled_Time by host_ip
| where Enrolled_Time > Start_Time 
| timechart span=10m count

One more note: you had timechart count(host_ip), which is a little strange. count(host_ip) will count all the rows that have any non-null value of host_ip, and since every row coming out of stats, or for that matter coming out of your join, will have a non-null value of host_ip..... it's the exact same result as doing timechart count. Perhaps your intention was to do dc(host_ip) (ie the distinct-count of host_ip) but.... again in this case that would also be the same as doing count, since each row coming out of this stats command (or out of your join command), has a host_ip value that is guaranteed unique. Long story short... you want just timechart count here.

This is vastly preferable to using join. Not only will you avoid the row-limits and the execution-time limits, (and avoid a terrible deathspiral of increasing maxresultrows in limits.conf), you will do more of the processing out at your indexers, and you'll correspondingly send far less data back over the wire to your search heads. Oh and it'll run much faster and use less resources overall.

piebob · ‎04-01-2016

ALL HAIL SIDEVIEW! UPGOATS ALL 'ROUND!

ddrillic · ‎04-02-2016

Trying it a bit-

(index=provider source="part-m-00009")
| eval tin_provider=if(source="part-m-00009","XXXX","ccccc")

why tin_provider ends as ccccc?

sideview · ‎04-03-2016

I'm not sure what you're asking? In this particular eval syntax you have here, it's saying create a field called "tin_provider" and if the "source" field has the value "part-m-00009" (which in this example it always will), then set tin_provider to "XXXX". Otherwise set tin_provider to "ccccc". In the syntax of my answer I use the if() function inside eval a few times, but note that my third argument is often null(), indicating that the field should not be created at all in the case that the condition is false.

ddrillic · ‎04-03-2016

Right, that's the thing - I tried to work on your code and got stuck as the new field tin_provider was not created because null was assigned to it. In order to debug it, I hard-coded these two values.
Then I see that I get the value ccccc and not the expected XXXX.
Does it make sense?

sideview · ‎04-03-2016

I'm not sure. Have you simplified that syntax before posting the comment? As written it makes no sense that the if() would fall through to the "ccccc". Also, "part-m-00009" is a pretty unusual value for "source". Have you translated field names or something? Can you post the actual question you have as a separate question? It's a bit confusing to have these tin_provider comments in here since it has nothing to do with this user's question.

tsunamii · ‎04-01-2016

Thank you very much @sideview, for the detailed information and advice.

What are alternatives to using the join command for my search?

Data Management Digest – December 2025

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...

Join the Conversation

What are alternatives to using the join command for my search?

Data Management Digest – December 2025

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...