I'm new to Splunk and it is not quite clear to me how one would assign hostnames to remote computers based on the DNS queries made by local computers.
I understand that an index on the DNS data could be created such that querying_ip + response_ip => external_hostname.
Similarly, a matching index could be created on the flow records, keyed on internal_ip + external_ip.
How does one perform the join such that the external_hostname will be the query that immediately preceded the flow record and not stale data?
Other questions answered here involve performing reverse DNS queries, which is fine for internal hostnames but will not, in general, work for external hostnames because of CDNs and other realities of the internet. I need to use the data from the DNS query that most likely occurred immediately before the flow.
Reverse lookups do not give the correct answer and it is impossible to maintain a DNS lookup table at scale in real time.
For example:
A user goes to *.cnn.com. A reverse lookup of that connection's IP address returns the name of the Content Delivery Network (CDN) provider contracted by CNN, for instance *.akamai.com ... NOT cnn.com.
Next, the user goes to *.cbs.com. The IP address could be the same as for the previous connection, because CNN and CBS might both contract with Akamai for content delivery.
I need the answers to be *.cnn.com and *.cbs.com respectively for these sessions, even when the IP addresses are the same.
In SQL I would use a windowed query to perform this match constrained by the time of the HTTP connection and take the first record from the matching DNS sorted in reverse order. In a map-reduce approach I would combine the two data sets and order by (querying_ip/client_ip, response_ip/server_ip, time, "DNS"/"HTTP") and use a stateful transform to return one row per HTTP record. I don't see a way to perform this sort of complicated windowed join in the Splunk syntax.
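To make the map-reduce approach above concrete, here is a minimal Python sketch of the logic I have in mind. It is not Splunk syntax. The function name label_flows, the tuple layouts, the max_age parameter, and the sample IPs and hostnames are all hypothetical and only illustrate the sort order and the stateful pass.

```python
from typing import List, Optional, Tuple

# Hypothetical input shapes (not taken from any real Splunk schema):
#   DNS record:  (querying_ip, response_ip, time, external_hostname)
#   flow record: (internal_ip, external_ip, time)
DnsRecord = Tuple[str, str, float, str]
FlowRecord = Tuple[str, str, float]
Labeled = Tuple[str, str, float, Optional[str]]


def label_flows(
    dns: List[DnsRecord],
    flows: List[FlowRecord],
    max_age: Optional[float] = None,
) -> List[Labeled]:
    """Attach to each flow the hostname from the latest preceding DNS answer."""
    # Combine both data sets into one stream tagged "DNS" or "HTTP".
    combined = [(q, r, t, "DNS", h) for q, r, t, h in dns]
    combined += [(c, s, t, "HTTP", None) for c, s, t in flows]

    # Order by (client, server, time, kind). "DNS" sorts before "HTTP",
    # so a DNS answer with the same timestamp as a flow is seen first.
    combined.sort(key=lambda rec: rec[:4])

    out: List[Labeled] = []
    key = None          # current (client, server) pair
    host = None         # hostname from the most recent DNS record for this pair
    host_time = None    # time of that DNS record
    for c, s, t, kind, h in combined:
        if (c, s) != key:
            # New client/server pair: forget state from the previous pair.
            key, host, host_time = (c, s), None, None
        if kind == "DNS":
            host, host_time = h, t
        else:
            # Emit one row per HTTP record; drop the hostname if it is stale.
            fresh = host is not None and (max_age is None or t - host_time <= max_age)
            out.append((c, s, t, host if fresh else None))
    return out


if __name__ == "__main__":
    dns = [
        ("10.0.0.5", "203.0.113.7", 100.0, "www.cnn.com"),
        ("10.0.0.5", "203.0.113.7", 200.0, "www.cbs.com"),
    ]
    flows = [
        ("10.0.0.5", "203.0.113.7", 101.0),
        ("10.0.0.5", "203.0.113.7", 201.0),
    ]
    for row in label_flows(dns, flows):
        print(row)
```

With the sample data, both flows go to the same server IP, but the first is labeled from the DNS answer at time 100 and the second from the answer at time 200. Passing max_age would additionally discard a hostname whose DNS answer is older than that many time units.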