Our server administrators have asked us to enable NIC bonding on our blade servers to make chassis administration more efficient. I have read the RHEL documentation on how to enable NIC bonding, but before I do, I wanted to see if anyone has experience running Splunk on a server with bonded NICs. We would be enabling this on 5 blades: 4 indexers and 1 ES search head, all running RHEL 5.8 x64.
Any information concerning this issue would be appreciated.
Thanks
I would argue this is largely a non-issue for Splunk. In terms of abstraction, Splunk is unaware (and does not care) about the physical layer / link layer details of the underlying operating system's IP stack. As long as that OS's IP stack presents a fairly sane sockets-style interface, Splunk does not, should not, and cannot care. This is why operating systems abstract details of hardware away, so applications can count on a consistent interface. From a socket API perspective, Splunk won't be able to tell that you have bonded NICs underneath.
Some of the Unix app inputs MIGHT care, but only because they gather hardware-level statistics from the operating system. For example, network interface utilization dashboards may double-count traffic: once for the logical bond interface and once for the physical adapter it actually leaves on. But it is double-counted only because the operating system itself double-counts it.
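To illustrate the double-counting (a toy sketch with made-up counters, not live data): naively summing per-interface receive bytes from a /proc/net/dev-style sample counts the bond's traffic twice, once on bond0 and once on the active slave.

```shell
# Hypothetical lines in /proc/net/dev format from a bonded host
# (a real file also carries a two-line header, omitted here):
sample='bond0: 123456 800 0 0
 eth0: 123456 800 0 0
 eth1: 0 0 0 0'

# Summing rx_bytes across every interface double-counts the bond traffic:
total=$(printf '%s\n' "$sample" | awk '{gsub(":",""); sum += $2} END {print sum}')
echo "$total"   # 246912 -- double the 123456 bytes that really arrived
```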
That said, we use a fairly simple active/passive bonding configuration with an arp-check. On RHEL6, this is all that's needed:
[root@box network-scripts]# cat ifcfg-bond0
DEVICE=bond0
IPADDR=172.16.0.178
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS="mode=active-backup arp_interval=1000 arp_ip_target=172.16.0.1"
GATEWAY=172.16.0.1
[root@box network-scripts]# cat ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no
[root@box network-scripts]# cat ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no
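Not mentioned above, but once the bond is up you can sanity-check it via the kernel's bonding status file (standard path on RHEL 5/6; the grep patterns match the stock bonding driver's output):

```shell
# Inspect the bonding driver's view of bond0; falls back to a message on
# hosts without a bond, so the snippet is safe to paste anywhere.
status_file=/proc/net/bonding/bond0
if [ -r "$status_file" ]; then
    bond_state=$(grep -E 'Bonding Mode|Currently Active Slave|MII Status' "$status_file")
else
    bond_state="bond0 not configured on this host"
fi
echo "$bond_state"
```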
Marking this as the answer, with the note that we will run balance-alb mode, not active-backup.
Thanks
I have it on my CentOS 6.3 x64 machine. I have not seen any issues and my server has been running for almost a year now in production. I index around 15GB to 18GB per day. Here is what I did.
vim /etc/sysconfig/network-scripts/ifcfg-bond0
vim /etc/sysconfig/network-scripts/ifcfg-em1
vim /etc/sysconfig/network-scripts/ifcfg-em2
vim /etc/modprobe.d/modprobe.conf
modprobe bonding
service network restart
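The modprobe.conf contents aren't shown; on CentOS 6 the line you typically need (an assumption on my part, with the mode set via BONDING_OPTS in ifcfg-bond0 rather than here) is just the alias:

```
alias bond0 bonding
```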
cat /etc/sysconfig/network-scripts/ifcfg-em2
DEVICE="em2"
ONBOOT="no"
USERCTL=no
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
cat /etc/sysconfig/network-scripts/ifcfg-em1
DEVICE="em1"
ONBOOT="no"
USERCTL=no
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE="bond0"
ONBOOT=yes
TYPE=Ethernet
BOOTPROTO=none
IPADDR=10.10.10.10
PREFIX=24
GATEWAY=10.10.10.254
DNS1=10.10.10.20
DNS2=10.10.10.30
DOMAIN="mydomain.local"
DEFROUTE=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
If you reboot your box with those NIC settings you are going to lose networking, and you had better hope you have iLO or datacenter access.
The bond will come up but have no devices to use, since you have "ONBOOT=no" on both of your physical NICs.
Those configs are a timebomb waiting to go off, usually at the most inconvenient time.
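For reference, a minimal fix (same files as posted above, with only ONBOOT changed) so the slaves actually join the bond at boot:

```
# /etc/sysconfig/network-scripts/ifcfg-em1
# (ifcfg-em2 is identical apart from DEVICE="em2")
DEVICE="em1"
ONBOOT="yes"
USERCTL=no
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
```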
What is the purpose of bonding two NICs? Isn't it true that Splunk wouldn't need that much throughput anyway? It seems like a configuration hassle that doesn't add any real performance benefit. Of course, based on the original post, it was a management decision, not a technical one. So why? Because buying extra NICs is a lot cheaper than upgrading the CPU or buying additional RAM. Am I wrong?
Thanks for your assistance with this issue. I selected the other answer as it is specific for RHEL.
Also, a requirement is NOT to use LACP (802.3ad). RHEL has a bonding mode, "balance-alb", which balances egress traffic based on slave load and ingress traffic via ARP negotiation. This would be the optimal bonding mode for our environment.
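A sketch of how the earlier bond config would change for balance-alb (the miimon value is my assumption; note that balance-alb also requires NIC drivers that allow changing the MAC address while the link is up):

```
# ifcfg-bond0 excerpt -- link monitoring via miimon instead of ARP probes
BONDING_OPTS="mode=balance-alb miimon=100"
```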
Awesome. Thanks for the information. I have the instructions on how to implement it on RHEL, which are very similar to what you have described above. Since it is at the OS level, it should be transparent to Splunk. But I wanted to just get a feeler thread out to see if there were any odd issues.
I appreciate the feedback.
Early morning bump. Find it hard to believe that no one is running a bonded NIC setup with RHEL and Splunk. Thanks.
May I revive this post? The question of how to implement RHEL NIC bonding with Splunk has been answered, but I would like to ask why someone trying to maximize performance should bond their private internal network interfaces, essentially using 2 x 10Gb interfaces instead of a single 10Gb interface. Do you think people could see a real-world performance improvement with NIC bonding? If so, about how much?