I'm new to Splunk and it is not quite clear to me how one would assign hostnames to remote computers based on the DNS queries made by local computers.
I understand that an index on the DNS data could be created such that querying_ip + response_ip => external_hostname.
Similarly, a matching index could be created on flow records, keyed on internal_ip + external_ip.
How does one perform the join such that the external_hostname will be the query that immediately preceded the flow record and not stale data?
Other questions answered here involve performing reverse DNS queries which is fine for internal hostnames but will not in general work for external hostnames due to CDNs and other realities of the internet. I need to use the data from the DNS query that likely occurred immediately before the flow.
Reverse lookups do not give the correct answer and it is impossible to maintain a DNS lookup table at scale in real time.
For example:
User goes to *.cnn.com. A reverse lookup of the IP of that connection will be the name of the Content Delivery Network (CDN) provider contracted by CNN, for instance *.akamai.com ... NOT cnn.com.
Next the User goes to *.cbs.com. The IP address could be the same as the previous connection because CNN and CBS might both contract with Akamai for content delivery.
I need the answers to be the *.cnn.com and *.cbs.com for each of these sessions even in the case when the IP addresses are the same.
In SQL I would use a windowed query to perform this match constrained by the time of the HTTP connection and take the first record from the matching DNS sorted in reverse order. In a map-reduce approach I would combine the two data sets and order by (querying_ip/client_ip, response_ip/server_ip, time, "DNS"/"HTTP") and use a stateful transform to return one row per HTTP record. I don't see a way to perform this sort of complicated windowed join in the Splunk syntax.
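To make the map-reduce approach described above concrete, here is a minimal Python sketch (not Splunk syntax). The `Event` record layout, the field names, and the sample IP addresses and hostnames are assumptions made up for illustration; real DNS and flow logs will differ. It combines both data sets, sorts by (client_ip, server_ip, time, "DNS"/"HTTP"), and walks the sorted stream with a small amount of state so that each HTTP record picks up the hostname from the most recent preceding DNS answer for the same client/server pair.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    # Hypothetical record layout; field names are illustrative only.
    client_ip: str            # querying_ip for DNS, client_ip for HTTP
    server_ip: str            # response_ip for DNS, server_ip for HTTP
    time: float               # event timestamp (epoch seconds)
    kind: str                 # "DNS" or "HTTP"
    hostname: Optional[str] = None  # only set on DNS events


def attach_hostnames(dns_events, http_events):
    """Return one row per HTTP event with the preceding DNS hostname."""
    # "DNS" sorts before "HTTP", so a DNS answer with the same timestamp
    # as a flow is seen first.
    combined = sorted(
        list(dns_events) + list(http_events),
        key=lambda e: (e.client_ip, e.server_ip, e.time, e.kind),
    )
    rows = []
    current_key = None
    last_hostname = None
    for e in combined:
        key = (e.client_ip, e.server_ip)
        if key != current_key:
            # New client/server pair: forget state from the previous pair.
            current_key, last_hostname = key, None
        if e.kind == "DNS":
            last_hostname = e.hostname
        else:
            rows.append((e.time, e.client_ip, e.server_ip, last_hostname))
    return rows


# Same CDN address serves two different sites at different times.
dns = [
    Event("10.0.0.5", "23.1.2.3", 100.0, "DNS", "www.cnn.com"),
    Event("10.0.0.5", "23.1.2.3", 200.0, "DNS", "www.cbs.com"),
]
http = [
    Event("10.0.0.5", "23.1.2.3", 101.0, "HTTP"),
    Event("10.0.0.5", "23.1.2.3", 201.0, "HTTP"),
]
rows = attach_hostnames(dns, http)
```

With this sample data the two flows to the same address resolve to different hostnames, which is the behaviour I am trying to reproduce in Splunk.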
Please read this answer; it should help you:
https://answers.splunk.com/answers/105246/dns-resolution-in-a-search.html
I saw that answer and the related answers. All of them involve either a reverse DNS lookup or a table lookup. Neither will give the correct answer in this case, and I tried to explain why.
I saw this answer and the related answers. I tried to make it clear that these do not answer my question, and explained why in the question itself. I need the equivalent of a SQL windowed query with a TOP clause that matches the current flow record against the preceding DNS query. It is not clear how to accomplish this in Splunk's query format.
Reverse lookups do not give the correct answer. For instance, if a user goes to *.cnn.com, the reverse lookup will return the Content Delivery Network (CDN) host contracted by CNN, such as *.akamai.com, and not *.cnn.com. It will be impossible to maintain a table fast enough for table lookups to be valid. I need to match the current connection to the most recent preceding DNS query; there is no other option. That query could have occurred a fraction of a second earlier or hours earlier, depending on caching and the TTLs set by the DNS provider.
Also note that the same IP can have different answers at different times and for different clients. This is why I combined the client IP with the result IP. However, this does not solve the "immediately before the current session" problem. I don't want the most recent DNS query response overall; I want the most recent DNS query response just before the flow in question. A query over a time range should be able to provide multiple answers for the same IP that reflect the user's actions.
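To illustrate the "most recent response just before the flow" semantics, here is a minimal Python sketch of an as-of lookup (again, not Splunk syntax). The tuple layout, the function names `build_dns_index` and `hostname_as_of`, and the sample values are all hypothetical. DNS answers are grouped by (client IP, response IP) and kept sorted by time, and a binary search finds the last answer at or before each flow's timestamp.

```python
from bisect import bisect_right
from collections import defaultdict


def build_dns_index(dns_records):
    """dns_records: iterable of (time, client_ip, response_ip, hostname)."""
    index = defaultdict(lambda: ([], []))
    # Sorting by time first keeps each per-key list in chronological order.
    for t, client, server, host in sorted(dns_records):
        times, hosts = index[(client, server)]
        times.append(t)
        hosts.append(host)
    return index


def hostname_as_of(index, client, server, flow_time):
    """Hostname from the latest DNS answer at or before flow_time, else None."""
    times, hosts = index.get((client, server), ([], []))
    i = bisect_right(times, flow_time) - 1
    return hosts[i] if i >= 0 else None


index = build_dns_index([
    (100.0, "10.0.0.5", "23.1.2.3", "www.cnn.com"),
    (200.0, "10.0.0.5", "23.1.2.3", "www.cbs.com"),
])
first = hostname_as_of(index, "10.0.0.5", "23.1.2.3", 150.0)
second = hostname_as_of(index, "10.0.0.5", "23.1.2.3", 250.0)
```

Here the same server IP yields a different hostname depending on when the flow occurred, and a flow with no earlier DNS answer for that client yields `None` rather than stale data from another client.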