Solved: Correlating DNS logs with HTTP logs within Zeek/Co...

dtaylor

Good day,

I've been tasked with gathering a list of all users who've accessed an internal site over a couple months. Those logs are kept in Zeek/Corelight HTTP logs. Unfortunately, those logs only show the IP address of the machine accessing, not the hostname of the device and given how much time has passed, the only way I can think of to find which device was assigned a specific IP at a past time is to check Zeek's known_hosts logs(which I'm pretty sure gets generated from DNS logs?).

So far, this is the search I'm using

(index=zeek_logs sourcetype=zeek_known_devices) OR (index=zeek_logs sourcetype=zeek_http mysite.mydomain.com) 172.31.154.91
| eval device_src_ip = coalesce(assigned_ip, id_orig_h)
| transaction device_src_ip maxspan=1d
| stats values(dhcp_host_name) as hostname count by id_orig_h

The known_devices logs contain the assigned_ip and dhcp_host_name fields. The http logs contain the id_orig_h field which has the IP's of those machines which accessed mysite.mydomain.com. I'd like to correlate the address taken from id_orig_h with the address found in assigned_ip to get the hostname of the device, however, I need to limit the time somehow such that id_orig_h doesn't match on logs from yesterday where some random device happened to pull the same IP.

I'm trying to use transaction, and it does work....but I quickly run into the issue of transaction hitting its memory limits. The number of HTTP logs is very finite(only a couple dozen), but there are millions of logs in known_hosts. If I could limit those known_hosts logs to only pulling events for the IP's found in the HTTP logs, that'd solve the issue, I think. I tried doing exactly this using a subsearch.....but I quickly ran into the issue of me just having no clue how to properly use a subsearch. There the one part of Splunk that I struggle to wrap my head around despite trying to read up on them.

Thank you for any assistance, and naturally is anyone has ay better suggestions, I'm all ears. This won't be a regularly run search, so it doesn't matter if it takes 5+ minutes to run.....but it also can't take an hour plus.

PickleRick

First things first. Don't know about others but for me it's much easier to _see_ the data (even if it's some kind of a mockup, as long as it contains the relevant info) than read the _description_ of the data.

Still, your description seems prety solid, let's work with it.

Disclaimer: I don't have my Splunk environment at hand so I'm writing on a "something like that should work" basis - the syntax might be slightly off here and there - have nowhere to check at this time.

So you have a huge set of zeek_known_devices data like (mockup format)

_time=timestamp, assigned_ip=X.X.X.X, dhcp_host=myhost1

And a handful of zeek_http events like:

_time=timestamp, id_orig_h=X.X.X.X, [... url and stuff...]

And you want to "join" the data on (let's use SQL-like syntax) zeek_known_devices.assigned_ip=zeek_http_events.id_orig.h, right to get the hostname into the http events, right?

It's an interesting problem. Of course the "intuitive" way to get rid of the join command is to use clever statsing. But you have to somehow include the time restrictions so that you don't get the wrong IP which was assigned on a completely different day.

So there could be more than one approach to this.

If you can safely assume that each device gets only one DHCP lease per day (or any other timespan), you can simply bin your time field and use that to your advantage.

| bin _time as timespan
| stats values(dhcp_host_name) as hostame count by id_orig_h timespan

That's one of the approaches.

Another could be to merge the timestamp with the hostname (I'm assuming that you're limiting the http logs to just one day or other consistent period)

| eval timehost=_time.":".hostname
| stats values(timehost) as timehost count dc(sourcetype) as sourcetypes by device_src_ip
| where sourcetypes=2

Now you have a multivalued iphost field which you can split

| mvexpand timehost

And split again to get your original data

| eval hosttime=mvfind(split(timehost,":"),0),hostname=mvfind(split(timehost,":"),1)

Now not only our data set is limited only to those events matching our IPs (which should be way less events than the original full known_devices set). And we can also limit it a bit further since we're only interested in leases assigned _before_ our http logs (we'll use the additional field later so we're not using just where on its own)

| eval delay=_time - hosttime
| where delay>0

So we're still left with all dhcp entries from the beginning of time up until our http log entry, right? So what can we do? There are several options.

Option 1:

Since we should already have limited data set, we can use eventstats to filter out only the minimum delay

| eventstats min(delay) as mindelay by device_src_ip
| where delay=mindelay

Option 2:

Since your data is by default sorted in reverse chronological order you might skip this first step

| sort -_time

As your have newer events first, the first value for given timehost will be the latest DHCP lease for it. So use dedup to leave only that one

| dedup timehost

Option 3:

I'm not going to dig too deeply into this one but you could also try to use autoregress/streamstats to copy over previous values and detect when those changed (which means that you're dealing with a new entry).

View solution in original post

PickleRick