This question was recently sent in by a customer. I'm re-posting it here for the community:
We are rolling out Splunk for managing the entire IT Infrastructure. Until now we've used it for the Security team (using the Enterprise Security app) and more recently we've brought in OS health from our AD servers using universal forwarders and the Windows Infrastructure app. So far, Splunk is really helping out.
What I want to do is analyze the traffic flows between several network devices: Juniper VPN, managed switches, Cisco routers, F5 load balancers, and the virtual switches in our VMware environment.
In particular I'm looking to identify top talkers and want to be alerted when a network link is approaching full capacity. When a switch port is experiencing degraded service, I want to know which servers and applications are affected.
I know there are other products out there that will do this, but those tools only look at the network. I'd prefer to put this functionality into Splunk as part of our tools consolidation process. What are my options?
First, there are a number of Splunk Apps that will help you bring in and visualize information from the devices you mentioned. I'd start there, using the appropriate F5, Juniper, and Cisco apps. All are on http://splunkbase.splunk.com
Those apps won't help with the end-to-end monitoring you're asking about (which I'll get to as well) but what they will do is provide deeper insight into the operation of those different products.
Most (or all) of those devices also support sending information via syslog. Typically this will include hardware level events, logs of configuration changes, and so on. This is generally low volume data but worth looking at. I'd certainly want Splunk to open a ServiceNow ticket whenever a device reports a fan failure or loss of a redundant power supply!
With all of that being said, it still doesn't address your need: I'm looking to identify top talkers and want to be alerted when a network link is approaching full capacity. When a switch port is experiencing degraded service, I want to know which servers and applications are affected.
What you'll want is the network meta-information that's available from each of your devices. Commonly called "flow" data, this is the data set that will help you answer your question. This comes in many flavors, including:
- NetFlow v5 or v9 data from older Cisco gear
- J-Flow data from your Juniper equipment
- sFlow (short for "sampled flow", a separate protocol from NetFlow) from your switches
- IPFIX data from your VMware virtual switches and pretty much every other intelligent networking device released since 2015
Note that all of these network flow types are in binary format; Splunk cannot ingest them directly.
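To give a feel for what "binary format" means here, this is a minimal Python sketch that decodes only the fixed 24-byte header of a NetFlow v5 export datagram. The function name and the synthetic sample packet are my own illustration, and this is not how any Splunk TA or NetFlow Integrator is implemented. A real collector also has to decode the flow records that follow the header, and v9/IPFIX use templates, which is considerably more work.

```python
import struct

# NetFlow v5 header: 24 bytes, network (big-endian) byte order.
# Fields: version, count, sys_uptime, unix_secs, unix_nsecs,
#         flow_sequence, engine_type, engine_id, sampling_interval
V5_HEADER = struct.Struct("!HHIIIIBBH")


def parse_v5_header(datagram: bytes) -> dict:
    """Decode the fixed header of a NetFlow v5 export datagram."""
    if len(datagram) < V5_HEADER.size:
        raise ValueError("datagram too short for a NetFlow v5 header")
    (version, count, sys_uptime_ms, unix_secs, unix_nsecs,
     flow_sequence, engine_type, engine_id, sampling) = V5_HEADER.unpack_from(datagram)
    if version != 5:
        raise ValueError(f"not a NetFlow v5 datagram (version={version})")
    return {
        "version": version,
        "count": count,                  # number of flow records that follow
        "sys_uptime_ms": sys_uptime_ms,
        "unix_secs": unix_secs,
        "flow_sequence": flow_sequence,
    }


# Synthetic header for illustration only (not captured from a real device).
sample = V5_HEADER.pack(5, 2, 123456, 1700000000, 0, 42, 0, 0, 0)
header = parse_v5_header(sample)
print(header)
```

The point is simply that the payload is packed integers rather than text, so something has to translate it before Splunk can index it as events.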
Wikipedia has a great write-up on each of these for those who are interested. The TLDR version: NetFlow was invented by Cisco, and other vendors had their own versions. IPFIX replaces them as a common, universal standard.
In Splunkbase, you'll find a few different TAs from Splunk, one for IPFIX data and one for NetFlow v5/v9 data. They'd help you bring in some of the data, but would not address your Juniper devices or the sampled flow data from your switches.
I believe that to accomplish what you're after, you'll want to use NetFlow Logic's "NetFlow Integrator". (See their app at https://splunkbase.splunk.com/app/489/)
How it works:
First, NetFlow Integrator acts as a sort of middleware. It takes in all of the different flow data types and converts them from binary format. When Integrator sees data coming in, it reaches back to the sending network device to do some SNMP-based data collection. This allows Integrator to determine details such as port speed and duplex, along with other device information.
What Integrator does next is up to you. It can send each flow record to Splunk (converted to syslog format), send aggregated information periodically, or both. Sending the aggregated data is the best fit for most Splunk environments: flow data can be VERY high volume, and aggregation allows you to keep your Splunk license usage low.
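As a rough illustration of why aggregation cuts volume, here is a small Python sketch that rolls individual flow records up into a "top talkers" summary. The `Flow` type, its field names, and the sample addresses are all hypothetical, and this is not NetFlow Integrator's actual logic, just the general idea of summarizing before sending.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Flow(NamedTuple):
    """A simplified, hypothetical flow record (real records carry many more fields)."""
    src_ip: str
    dst_ip: str
    byte_count: int


def top_talkers(flows: Iterable[Flow], n: int = 3) -> list[tuple[str, int]]:
    """Sum bytes per source IP and return the n largest senders."""
    totals: Counter = Counter()
    for flow in flows:
        totals[flow.src_ip] += flow.byte_count
    return totals.most_common(n)


# Made-up flows for one reporting interval.
flows = [
    Flow("10.0.0.5", "10.0.1.9", 1200),
    Flow("10.0.0.7", "10.0.1.9", 300),
    Flow("10.0.0.5", "10.0.2.4", 800),
    Flow("10.0.0.8", "10.0.1.9", 50),
]
summary = top_talkers(flows, n=2)
print(summary)
```

Four input records become two summary rows here. On a busy link the ratio is far larger, which is where the license savings come from.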
Once the data gets to Splunk, you'll finally have your answer. NetFlow Logic has apps on Splunkbase that use the Splunk platform to tie all of the data together. This includes reports on top talkers, network utilization/health/saturation, and traffic flows affected by networking issues. It does this even with your VMware switches, top of rack devices, and (as I saw at VMworld this week) VMware NSX.
I hope that helps point you in the right direction!
While we're talking about the network - I'll mention "Splunk App for Stream" as well. It wouldn't help with the use case you asked about, but it's worth knowing about. Stream allows you to look at the application protocol level to analyze communication between servers via TCP or UDP. When the log files don't give the information you want, Stream allows you to bring in data for both IT Ops and Security use cases.
Great answer by @mdonnelly about flows, but I wanted to mention more about SNMP. All network gear speaks this protocol. The reason I bring it up is because of this part:
...to be alerted when a network link is approaching full capacity. When a switch port is experiencing degraded service...
Flows won't tell you about the health of an individual switch directly. You can infer this by doing the math in Splunk to add up the bandwidth seen, and compare it to what you know about the switch capacity. Also, flows don't know anything about physical ports, they only see the IP address and no lower. However, you can inspect this detail more directly by querying the switch with SNMP.
One way to work with SNMP is to use this app: SNMP Modular Input. Because Splunk doesn't do device discovery, you are going to need to configure this modinput to point it to your switches, and tell it which SNMP OIDs to query. This might be a non-trivial task, just warning you up front. But there's a wealth of detail to be found there.
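For the "approaching full capacity" alert, the usual approach is to poll an interface octet counter twice and compare the rate to the port speed. Below is a minimal Python sketch of just that arithmetic. It assumes you are polling the standard IF-MIB objects ifHCInOctets (1.3.6.1.2.1.31.1.1.1.6, a 64-bit counter) and ifHighSpeed (1.3.6.1.2.1.31.1.1.1.15, in Mbps). The function name and sample values are mine, and the SNMP polling itself is not shown.

```python
COUNTER64_MODULUS = 2 ** 64  # ifHCInOctets / ifHCOutOctets are 64-bit counters


def link_utilization(octets_start: int, octets_end: int,
                     interval_seconds: float, speed_mbps: int) -> float:
    """Return utilization (0.0 to 1.0) from two counter samples and the port speed."""
    if interval_seconds <= 0 or speed_mbps <= 0:
        raise ValueError("interval and speed must be positive")
    # Modulo arithmetic handles a counter that wrapped between the two polls.
    delta_octets = (octets_end - octets_start) % COUNTER64_MODULUS
    bits_per_second = delta_octets * 8 / interval_seconds
    return bits_per_second / (speed_mbps * 1_000_000)


# Made-up example: a 1 Gbps port polled 60 seconds apart.
utilization = link_utilization(
    octets_start=1_000_000,
    octets_end=6_001_000_000,
    interval_seconds=60,
    speed_mbps=1000,
)
print(f"{utilization:.0%}")
```

If the raw counters are indexed instead, the same calculation can be done at search time in Splunk. Either way, an alert threshold is then just a comparison against this ratio.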
Also, flows don't know anything about physical ports, they only see the IP address and no lower.
Typically flow records do include SNMP indexes of input and output interfaces, and could be mapped to physical port names by querying switches and routers. NetFlow Integrator with NetFlow for Splunk App will do it.
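To make the mapping concrete, here is a small Python sketch of the lookup step. It assumes you have already collected an ifIndex-to-ifName table per device (for example via SNMP) and that each flow record exposes the exporter address plus input and output interface indexes. The dictionary keys, device addresses, and port names are hypothetical, and this is not the NetFlow for Splunk App's actual implementation.

```python
def label_interfaces(flow: dict, if_names: dict) -> dict:
    """Return a copy of the flow with human-readable port names added.

    if_names maps (exporter_ip, ifIndex) -> interface name.
    Unknown indexes fall back to a placeholder such as "ifIndex 7".
    """
    exporter = flow["exporter"]
    labeled = dict(flow)
    for direction in ("input", "output"):
        index = flow[f"{direction}_snmp"]
        labeled[f"{direction}_port"] = if_names.get(
            (exporter, index), f"ifIndex {index}"
        )
    return labeled


# Made-up lookup table and flow record for illustration.
if_names = {
    ("192.0.2.1", 3): "Gi1/0/3",
    ("192.0.2.1", 24): "Gi1/0/24",
}
flow = {"exporter": "192.0.2.1", "src_ip": "10.0.0.5",
        "input_snmp": 3, "output_snmp": 24}
labeled = label_interfaces(flow, if_names)
print(labeled["input_port"], "->", labeled["output_port"])
```

In Splunk terms, the same idea would typically be a lookup table keyed on device and interface index.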
Correct, flows themselves don't tell you capacity, and using SNMP is the best way to collect that. But SNMP collection is part of what NetFlow Integrator is doing. It's also doing the math on the fly in their middleware, so you don't have to calculate the values in Splunk after the fact. See the How it works section above for details.
The SNMP Modular Input is good for small numbers of devices, but when you get to several hundred devices to poll, it simply does not scale well for a variety of reasons.