Getting Data In

Correlate and combine data from two sources, but only when there's a match.

rononeill
Explorer

Mighty Splunk people... I'm having a problem getting my data further refined-- I've got source A that shows me a hostname and a MAC address (basically DHCP logs) and source B that indicates the network segment joined, which unfortunately only contains the MAC address. I need to be able to correlate the two sources, but only when there's a match (against a subset of source B, which I already have.) I want to disregard the other data and getting just the correlated data is the problem.

Here's a sanitized set of my source records:

Sep 21 16:52:45 sourceA dhcpd: DHCPREQUEST for 192.168.0.10 from aa:aa:aa:aa:aa:aa (homer)  
Sep 21 16:52:24 sourceA dhcpd: DHCPREQUEST for 192.168.0.11 from bb:bb:bb:bb:bb:bb (marge)
Sep 21 16:52:22 sourceB syslog: eventd_to_syslog():User[bb:bb:bb:bb:bb:bb] joins specific_network   
Sep 21 16:52:20 sourceA dhcpd: DHCPREQUEST for 192.168.0.12 from cc:cc:cc:c:cc: (bart)

So in this case there's no accompanying sourceB entry for homer or bart. How can I filter so that the only remaining data is the one that's correlated?

I want to be able to eventually take this to a table with an output something like this:

timestamp         hostname 
Sep 1 16:52:22  marge 

(without homer or bart.)

Right now I'm doing basically the following:

(host=SourceA) OR ("specific_network") | eval macaddress=coalesce(sourceA_mac,sourceB_mac) |  table computername macaddress 

In this case the key field, macaddress is showing in the table as null, although in specific fields, I can see where it is applied in the detail view.

Expanding it out to

(host=SourceA) OR ("specific_network") | eval macaddress=coalesce(sourceA_mac,sourceB_mac) |  table computername macaddress  sourceA_mac sourceB_mac 

Then i can see that the fields aren't keying off the matching MAC address. I get separate lines, homer marge and bart included. I'd really rather treat this data as if both sources were combined if and only if there were a match, but it doesn't seem like a join is the right answer.

Help?

0 Karma
1 Solution

MuS
SplunkTrust
SplunkTrust

Hi rononeil,

subsearches like join are expensive to run and have hard limits. Most of the time a stats will do the job more efficient and without hard limits. Try something like this:

your base search here to get all mac events 
| stats count(source) AS source_count by mac computername macaddress sourceA_mac sourceB_mac
| search source_count > "1" 
| table computername macaddress  sourceA_mac sourceB_mac 

This assumes you have a field called mac in both sources. It will count the source of each mac and only shows events if they are available in both source.

You can learn more about stats vs join in this answer http://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-joi...

Hope this helps ...

cheers, MuS

View solution in original post

MuS
SplunkTrust
SplunkTrust

Hi rononeil,

subsearches like join are expensive to run and have hard limits. Most of the time a stats will do the job more efficient and without hard limits. Try something like this:

your base search here to get all mac events 
| stats count(source) AS source_count by mac computername macaddress sourceA_mac sourceB_mac
| search source_count > "1" 
| table computername macaddress  sourceA_mac sourceB_mac 

This assumes you have a field called mac in both sources. It will count the source of each mac and only shows events if they are available in both source.

You can learn more about stats vs join in this answer http://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-joi...

Hope this helps ...

cheers, MuS

rononeill
Explorer

This got me very close to the right solution. Here's what I ended up with.

(search that builds sanitized search records) | eval mac=coalesce(sourceA_mac sourceB_mac) | transaction mac maxpause=300s | stats values (*) by computername mac sourceA_mac | table dhcp_hostname mac

(will work on the timestamp next, but that's trivial.)

So to break it down--

eval mac=coalesce() adds a field to each set of records "mac" so that the mac address can be found within one common field (in addition to the individual fields they were stored in previously.) You can't use rename here because the second rename clobbers the first.

transaction mac maxpause=300s This says that if you have two records with matching mac fields within 5 minutes, join the fields (without the computationally expensive join command.) This is close enough for my report.

stats values (*) fieldA fieldB this is basically a test to ensure that the fields are non-null. By looking at computername and SourceA_mac, I know that it's a validly joined field. The original dataset wouldn't have both the computername and sourceA_mac fields together unless they've been joined by the transaction command. This effectively discards all fields that don't match those. I also pass mac through this command because I want it in my table.

table - should be self explanitory. I don't show sourceA_mac since it's redundant

I know that's long-winded but this query took me a while. Hopefully this might be of help to others.

0 Karma

MuS
SplunkTrust
SplunkTrust

@rononeil, if this answered your question, please accept this answer.

0 Karma

rononeill
Explorer

It's very close-- I'm still trying to tune the code and report back a successful code snippet. I actually didn't have a valid field like 'mac' but the eval coalesce function did that trick.

0 Karma

masonmorales
Influencer

Try using a subsearch: http://docs.splunk.com/Documentation/Splunk/6.2.5/SearchTutorial/Useasubsearch

It basically acts as a boolean AND between the field values in each set of search results. Note: Field names will have to match in both sources (i.e. use rename if different).

Get Updates on the Splunk Community!

Automatic Discovery Part 1: What is Automatic Discovery in Splunk Observability Cloud ...

If you’ve ever deployed a new database cluster, spun up a caching layer, or added a load balancer, you know it ...

Real-Time Fraud Detection: How Splunk Dashboards Protect Financial Institutions

Financial fraud isn't slowing down. If anything, it's getting more sophisticated. Account takeovers, credit ...

Splunk + ThousandEyes: Correlate frontend, app, and network data to troubleshoot ...

 Are you tired of troubleshooting delays caused by siloed frontend, application, and network data? We've got a ...