I'm new to Splunk and I'm trying to figure out how to merge five different fields, containing an IP address, as the only value together. I want it to overwrite the duplicate data but retain any unique data when consolidating the rows.
My source data is using a wildcard, I've looked at the join function, but I cant seem to quite get my head around how to use it with a wildcard as the source.
This is what I have so far, it dedups
my IP addresses but it also needs to merge the data between the different IP address fields together. Otherwise I get too many matches.
source="dataset:*" | dedup device_ip_address keepempty=true | dedup end_host_ip_address keepempty=true | dedup switch_ip_address keepempty=true | dedup ip_address keepempty=true | dedup IPADDR keepempty=true | table device_ip_address device_model end_host_ip_address end_host_mac_address end_host_name ip_address interface_admin_status interface_description IPADDR switch_interface switch_ip_address switch_model switch_name
I want to just consolidate the data using the ip address fields as the common denominator. Also there may be more ip addresses in one field than another, i want to retain that data too.
Hope I was clear enough!
Here are two solutions I got with Daljeanis and maciep help.
First solution
source="devices:*"
| fields IP_field1, IP_field2, IP_field3 switch_name interface_status
| eval myIPlist = mvdedup(mvappend(IP_field1, IP_field2, IP_field3))
| eval myMAC = mvdedup(mvappend(MAC1, MAC2))
| fields myIPlist switch_name interface_status
| stats values(myMAC) as myMAC, latest(*) as * by myIPlist
| table myIPlist myMAC switch_name interface_status
And here is the second solution
source="devices:*"
| eval myIPlist = coalesce(IP_field1, IP_field2, IP_field3)
| eval myMAC = coalesce(MAC1, MAC2)
| stats values(myMAC) as myMAC, latest(*) as * by myIPlist
| table myIPlist myMAC switch_name interface_status
Chose the one that works best for you
You can't necessarily do the dedup on multiple fields without losing data. Try something like this...
your search that gets all the records
| fields ip1 ip2 ip3 ip4... data1 data2 data3 data4 ...
| eval myIPlist = mvdedup(mvappend(ip1,ip2,ip3,ip4))
| fields myIPlist ... whatever other non-IP fields you need to keep ...
| stats count values(*) as * by myIPlist
In the above, ip1...ip4 represent any field you might have an IP address in, whatever their names might be, and data1... data4 represent any other data you want to keep, whatever their names might be.
This seems to be the same as doing what maciep suggested
| eval myIPlist = coalesce(ip1 ip2 ip3 ip4)
it is similar to what i was doing. however, if you have more than one type of IP field in a single event, then i think this will handle it better than coalesce does.
it would be helpful if you could specify exactly what you'd like the end result to be. Do you only want to see a single ip field and then other fields associated with the ip? Or do you still need to know what "type" of ip it is? Also, could one event have more than one type of IP address in it? And if so, how would you expect that to show up in the end result?
In general too, be careful with the dedup command. It works, but you may end up removing data that you want. Maybe try stats instead. A very simple, but maybe not useful example, would be to "normalize" the ip to a single field and the do stats on that field. The coalesce command will return the first non-null value in the list.
source = dataset:*
| eval my_ip = coalesce(device_ip_address, end_host_ip_address, switch_ip_address, ip_address, IPADDR,"n/a")
| stats values(*) as * by my_ip
Yes end result should see only one IP field with all associated fields attached to it, I am trying to export the entire database presenting all fields of data while removing duplicates and consolidating the fields together.
I am not at the office right now, but I will try the coalesce option and see if it gets me any closer to my end goal. Thanks for helping!
Edited*
That command worked, however now I have an issue where I get multiple values in a signal field. It's a last discovered date field and it is pulling multiple dates into the field, I only want to see the earliest date. Any idea how to do that? I searched around and tried
convert mktime(end_host_last_discovered) as date timeformat="%Y %m %d %H:%M:%S"
| eval Createdate=now()+(60*60*24*365)
| foreach date[eval Createdate= if (typeof('<<FIELD>>')=="Invalid", CreatedateEpoch, min(CreatedateEpoch, '<<FIELD>>'))]
| eval Createdate = strftime(CreatedateEpoch, "%Y-%m-%d %H:%M:%S")
But It effectively did nothing so I know I'm doing something wrong.
This is my current output for the "end_host_last_discovered" field,
end_host_last_discovered
2018-10-09 08:28:59
2018-10-09 09:20:55
2018-10-09 09:21:34
2018-10-09 09:31:38
etc...
This needs to just show the latest date, how do I do that?
Side note, this field will grow in size as the search runs.
you should be able to get the latest value of the field with the stats command as well...something like this maybe
| stats latest(discovered_date) as discovered_date, values(*) as * by my_ip
Doesn't seem to work
source="devices:*"
| eval my_ip = coalesce(device_ip_address, end_host_ip_address, switch_ip_address, ip_address, IPADDR,"n/a")
| stats values(*) as * by my_ip
| stats latest(end_host_last_discovered) as end_host_last_discovered, values(*) as * by my_ip
| table my_ip end_host_last_discovered
Will still return multiple dates in the end_host_last_discovered field, any other ideas?
ah...nope just update the existing stats command you were using, don't pipe to it again. i should have been more specific.
source="devices:*"
| eval my_ip = coalesce(device_ip_address, end_host_ip_address, switch_ip_address, ip_address, IPADDR,"n/a")
| stats latest(end_host_last_discovered) as end_host_last_discovered, values(*) as * by my_ip
| table my_ip end_host_last_discovered
That did not work either, also i think i might have to backtrack what i said about it working earlier. As I ran the scan I started to notice I was getting a lot of additional values in no particular order to almost all of my fields not just the end_host_last_discovered field. Making it difficult to know what I am looking at if I see some 20 mac addresses associated to one ip address 😕
Sorry for making the problem difficult to figure out, but I do really appreciate you helping.
I feel like I'm bashing my head against a wall trying to figure out the best way to export meaningful data in a readable format.
Well, that's what the values() aggregate does - gives you a list of unique values for a field. If that's not what you want, how would you like to consolidate multiple field values associated to one IP address? Or would you? Would just want the latest? Does matter if it's a device_ip or a switch_ip, etc? Would that be the same case for all of the other fields?
It would probably be helpful to figure out how you would tackle this manually w/o splunk - which fields/values to keep and what that end result would look like...so that then you would have a goal to work toward with Splunk.
Thought I did, but I see your point. I will look over this more and see if I can come up with a better plan to organize the data.
After sleeping on it, it now makes sense how stats works. I was able to get the latest data for all of my fields using
| stats latest(*) as * by my_ip
But there are a few fields where I would like to actually see all the values, like the MAC address field ( can help in finding duplicate IP's ). Is there a way to have a stats command do something like this?
| stats latest(*) values(MAC, hostname) as * by my_IP
Where it excludes "MAC" and "hostname" from the wildcard and instead does the values function, for those two fields?
yep, pretty sure something like this should do it...i think...hopefully.
| stats values(MAC) as MACs, values(hostname) as hosts, latest(*) as * by my_IP
That worked! Thanks a million maciep!
good to hear!