I need to join data from two (or more, ultimately) different sourcetypes based on the shared "host" field. Just a subsearch will not achieve what I want here due to the way I want to stats the data out. This join almost works:
index=customer_1 sourcetype=http
[search index=customer_1 sourcetype=dns | dedup host | fields host]
| join host [search index=customer_1 sourcetype=dns]
| stats values(dns_answer) values(http_uri) values(http_response) values(sourcetype) dc(sourcetype) values(_time) count by http_domain
It displays the expected data, almost! However, values(sourcetype) is only returning "dns" as a value! Join is not pushing the sourcetype value of "http" into the events, even though all of the other http fields are being joined correctly and displaying in the stats output. I imagine any other fields that share a name would only get the right side's value as well but I haven't tested. What gives?
Side note - I use a subsearch initially on the left side of the join because it would be too much data to feed into the right side without it. I'm not entirely sure what I'm doing is best practice but the way joins work are actually backwards in terms of the way I want to use them here, efficiently speaking (I wish I could flip the left and right queries but that doesn't work).
Anyways, anyone know what is going on here with the missing values from the left side?
Have you tried it without the subsearch and join, like this
index=customer_1 sourcetype=http OR sourcetype=dns
| stats dc(host) as hosts values(dns_answer) values(http_uri) values(http_response) values(sourcetype) dc(sourcetype) values(_time) count by http_domain | where hosts=1
So this would work under normal circumstances. However, I need to group the events together if and only if a certain kind of event exists in the first place. If that events exists, it I take out the "host" with a subsearch and then search for that in other places. To confuse the point further,, I'm not really using "host" in my real query. The value could be a host, ip, etc, so I cannot effectively "count by http_domain" because I'll never know what was given to me first.
These logs are coming from an alert system.