Hi all, thanks in advance for your time!
I have a problem writing a properly working query with this case study:
I need to take data from index=email1 and find the matching data in index=email2. My approach: take the src_user and recipient fields from index=email1 and use them in a search against the email2 index.
Query examples that I used:
index=email1 sourcetype=my_sourcetype source_user=*
[ search index=email2 sourcetype=my_sourcetype source_user=* | fields source_user ]
OR
index=email1 sourcetype=my_sourcetype
| join src_user, recipient [search index=email2 *filters*]
Everything looked OK on a control sample: in a 10-minute window (e.g. 06:00-06:10) I found events that, at first glance, matched. But when I extended the time range, e.g. to 24 hours, the search returned no events at all, not even the ones that had matched in the short window (although they fall within those 24 hours).
Thank you for any ideas or solutions for this case.
You have already had some suggestions, which are OK, but the question is what your constraints on this search are. How many events do you expect from each of those data sets, and how long is the search supposed to take? The answers can warrant a different approach to the problem.
For example, since you're dealing with email data, it's a fair question why you aren't using the CIM Email data model (with acceleration enabled).
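As a rough sketch of what the data-model route could look like: this assumes both indexes are mapped to the CIM Email data model and that the model is accelerated, neither of which is confirmed in this thread. Note that CIM names the sender field src_user, while the queries above use source_user. Treat it as a starting point, not a tested solution.

```spl
| tstats summariesonly=true count
    from datamodel=Email.All_Email
    where (index=email1 OR index=email2)
    by index All_Email.src_user All_Email.recipient
| rename All_Email.* AS *
| stats dc(index) AS index_count sum(count) AS events BY src_user recipient
| where index_count > 1
```

With summariesonly=true only accelerated summaries are read, so events outside the acceleration range are not counted.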
Hi @BigJohnQ ,
Your first solution or the one from @ITWhisperer is the most efficient if the subsearch returns fewer than 50,000 results.
If the subsearch could instead return more than 50,000 results, you should try another solution:
index IN (email1,email2) sourcetype=my_sourcetype source_user=*
| stats dc(index) AS index_count values(*) AS * BY source_user
| where index_count>1
You can replace values(*) AS * with the list of fields you need in the results.
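Since the original question asks for a match on both sender and recipient, a variant of the search above could group by both fields and keep only a few explicit aggregates. This is a sketch that assumes the sender field is called source_user in both indexes (the thread uses both src_user and source_user) and that recipient exists in both.

```spl
index IN (email1, email2) sourcetype=my_sourcetype source_user=* recipient=*
| stats dc(index) AS index_count values(index) AS indexes count AS events BY source_user recipient
| where index_count > 1
```

Because there is no subsearch here, the subsearch result limits do not apply, which is why this form keeps working over longer time ranges.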
Avoid your second solution, because it's very slow!
Ciao.
Giuseppe
10k results, not 50k. The 50k results limit is for the join command. A "normal" subsearch has a default limit of 10k results.
(Yes, all those limits can be confusing and are easy to mix up with one another.)
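One way to check whether the limit explains the empty 24-hour result is to run the subsearch from the first query on its own, over the same time range, and count what it returns. This is a minimal sketch. The 10,000 figure is the default and may have been changed in limits.conf on your deployment.

```spl
index=email2 sourcetype=my_sourcetype source_user=*
| stats count AS events dc(source_user) AS distinct_users
```

Without a dedup in the subsearch, events is the number to compare against the limit. With a dedup on source_user, distinct_users is the relevant one. If the number is above the limit, the subsearch output is truncated and the outer search only sees part of the values.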
Try something like this:
index=email2 sourcetype=my_sourcetype source_user=*
    [ search index=email1 sourcetype=my_sourcetype source_user=*
    | eval recipient = source_user
    | fields recipient
    | dedup recipient
    | format ]