
Topics/models in UBA have gone into a Pending state even after stop-all and start-all

Splunk Employee

After running stop-all and start-all to restart UBA, various topics/models have gone into a Pending state and are not receiving data. The data source connectors also appear to have stopped receiving data: the home page shows no EPS (events per second), and the number of processed events is not growing.
We have run stop-all and start-all several times since the issue started, but the problem persists.
What should I look into further?


Re: topics/models in UBA have gone into a pending state even after stop-all and start-all

Splunk Employee

If the output of the health check scripts shows errors like the following:

Errors:

kubelet journalctl:

kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
kubelet_node_status.go:82] Attempting to register node YOURFQDNSERVERNAME
kubelet_node_status.go:106] Unable to register node "YOURFQDNSERVERNAME" with API server: nodes "YOURFQDNSERVERNAME" is forbidden: node "shorthostname" cannot modify node "YOURFQDNSERVERNAME"
kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
kubelet_node_status.go:82] Attempting to register node YOURFQDN
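To confirm that a node is hitting this particular error rather than some other scheduling problem, you can filter the kubelet journal for the "forbidden" registration message. This is a minimal sketch, assuming a POSIX shell with grep and sed; the function name find_forbidden_registration is hypothetical, and the pattern is copied from the error quoted above, so adjust it if your kubelet version words the message differently.

```shell
#!/bin/sh
# Hypothetical helper: read kubelet log text on stdin and print each
# "<identity> -> <requested node>" pair from the forbidden-registration error.
find_forbidden_registration() {
  grep -o 'node "[^"]*" cannot modify node "[^"]*"' |
    sed -E 's/node "([^"]*)" cannot modify node "([^"]*)"/\1 -> \2/' |
    sort -u
}

# Usage on a UBA node (systemd journal):
#   sudo journalctl -u kubelet --no-pager | find_forbidden_registration
```

If the identity on the left is the short hostname and the name on the right is the FQDN, the node matches the situation described in this answer.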

Output of the health check:

concern summary: Checking YOURFQDNSERVERNAME first ...

McAfee_ePO | Splunk | Processing | | SPLUNK/DIRECT | 2019-01-15 05:18:10 | 0 | 0 | 0 | 949874 | 0 | 485746 | 49 | <== significant failed/skipped events; review datasource SPL
eps: 0   <== no response from redis 'YOUR_FQDN'
status not OK ...   <== check UI system health monitor for errors
splunkuba     analyticsaggregator-rc-v6csh             0/1       Pending   0          14m       <none>          <none>   <== pod 'analyticsaggregator-rc-v6csh' is 'Pending'; not 'Running'
splunkuba     analyticsviewsbuilder-rc-nwffg           0/1       Pending   0          14m       <none>          <none>   <== pod 'analyticsviewsbuilder-rc-nwffg' is 'Pending'; not 'Running'
splunkuba     analyticswriter-rc-jscjf                 0/1       Pending   0          14m       <none>          <none>   <== pod 'analyticswriter-rc-jscjf' is 'Pending'; not 'Running'
splunkuba     anomalyaggregationmodel-rc-j968d         0/1       Pending   0          14m       <none>          <none>   <== pod 'anomalyaggregationmodel-rc-j968d' is 'Pending'; not 'Running'
splunkuba     devicetopic-modelgroup01-rc-d5ljq        0/1       Pending   0          14m       <none>          <none>   <== pod 'devicetopic-modelgroup01-rc-d5ljq' is 'Pending'; not 'Running'
splunkuba     devicetopic-modelgroup01-rc-z5mzc        0/1       Pending   0          14m       <none>          <none>   <== pod 'devicetopic-modelgroup01-rc-z5mzc' is 'Pending'; not 'Running'
splunkuba     domaintopic-modelgroup01-rc-8pvcx        0/1       Pending   0          14m       <none>          <none>   <== pod 'domaintopic-modelgroup01-rc-8pvcx' is 'Pending'; not 'Running'
splunkuba     domaintopic-modelgroup01-rc-mmzhw        0/1       Pending   0          14m       <none>          <none>   <== pod 'domaintopic-modelgroup01-rc-mmzhw' is 'Pending'; not 'Running'
...

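The problem is likely the way the hostnames are set: they use FQDNs instead of short names. In the kubelet error above, the node identifies itself as "shorthostname" but tries to register a node named after the FQDN, and the API server forbids that.
For example:
hostnamectl status: Static hostname: YOURFQDNSERVERNAME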

For example, /etc/hosts on the master node:

127.0.0.1 localhost
IP_ADDRESS YOURFQDNSERVERNAME (e.g. uba1.splunk.com)
IP_ADDRESS YOURFQDNSERVERNAME (e.g. uba2.splunk.com)
IP_ADDRESS YOURFQDNSERVERNAME (e.g. uba3.splunk.com)
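A quick way to check each node for this condition is to test whether its static hostname contains a dot. This is a sketch that assumes the cluster expects short names such as uba1; is_short_hostname is a hypothetical helper, not part of UBA.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 is a non-empty short hostname (no dots),
# fail if it is empty or looks like an FQDN.
is_short_hostname() {
  case "$1" in
    "" | *.*) return 1 ;;  # empty, or contains a dot (FQDN)
    *) return 0 ;;
  esac
}

# Usage on each node (hostnamectl --static prints just the static hostname):
#   h=$(hostnamectl --static)
#   is_short_hostname "$h" || echo "static hostname '$h' is an FQDN"
```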

To resolve the issue, change the hostnames to short names on all nodes:

1. Check the current status of the nodes and pods
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide --all-namespaces
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces
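To see at a glance how many pods are stuck, you can count the rows whose STATUS is Pending. This sketch assumes the column layout kubectl prints with --all-namespaces (NAMESPACE, NAME, READY, STATUS, ...); count_pending is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: read "kubectl get pods --all-namespaces" output on stdin
# and print how many pods have STATUS (column 4) equal to Pending.
count_pending() {
  # NR > 1 skips the header row; "n + 0" prints 0 when nothing matched.
  awk 'NR > 1 && $4 == "Pending" { n++ } END { print n + 0 }'
}

# Usage:
#   sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces | count_pending
```

kubectl can also do this filtering on the server side with --field-selector=status.phase=Pending.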

2. Stop the containers and container services
/opt/caspida/bin/Caspida stop-containers
/opt/caspida/bin/Caspida stop-container-services # not available in UBA 4.1; added in 4.2

3. Check the current status of the nodes and pods
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide --all-namespaces
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces

4. Stop kubelet and docker on all nodes (the first command runs locally on uba1)
sudo service kubelet stop && sudo service docker stop
ssh uba2 "sudo service kubelet stop && sudo service docker stop"
ssh uba3 "sudo service kubelet stop && sudo service docker stop"
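Steps 4 and 7 run the same command locally and then on every other node. A hypothetical wrapper that does this is sketched below; it assumes the local node is uba1, the remote nodes are uba2 and uba3, and passwordless ssh works between them. Setting DRY_RUN=1 only prints what would run.

```shell
#!/bin/sh
# Hypothetical wrapper: run one command on the local node (assumed uba1),
# then on each remote node over ssh, in that order.
REMOTE_NODES="uba2 uba3"  # assumption: adjust to your cluster

run_on_all_nodes() {
  cmd=$1
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "local: $cmd"
  else
    sh -c "$cmd"
  fi
  for node in $REMOTE_NODES; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "$node: $cmd"
    else
      ssh "$node" "$cmd"
    fi
  done
}

# Step 4:
#   run_on_all_nodes "sudo service kubelet stop && sudo service docker stop"
# Step 7:
#   run_on_all_nodes "sudo service docker start && sudo service kubelet start"
```

Running it with DRY_RUN=1 first is a cheap way to confirm the node list before stopping anything.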

5. Check the current status of the nodes and pods
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide --all-namespaces
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces

6. Change the hostnames to short names
sudo hostnamectl set-hostname uba1
ssh uba2 "sudo hostnamectl set-hostname uba2"
ssh uba3 "sudo hostnamectl set-hostname uba3"
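If you prefer to derive each short name from the node's current FQDN instead of typing it, shell parameter expansion can strip everything after the first dot. This is a sketch; short_name is a hypothetical helper, and it assumes the short name you want is the first label of the FQDN (uba2.splunk.com becomes uba2).

```shell
#!/bin/sh
# Hypothetical helper: print the first DNS label of a hostname,
# e.g. uba2.splunk.com -> uba2. A name without dots is printed unchanged.
short_name() {
  printf '%s\n' "${1%%.*}"
}

# Usage on a node whose static hostname is still the FQDN:
#   sudo hostnamectl set-hostname "$(short_name "$(hostnamectl --static)")"
```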

7. Start docker and kubelet again on all nodes
sudo service docker start && sudo service kubelet start
ssh uba2 "sudo service docker start && sudo service kubelet start"
ssh uba3 "sudo service docker start && sudo service kubelet start"

8. Start the container services and containers
/opt/caspida/bin/Caspida start-container-services # not available in UBA 4.1; added in 4.2
/opt/caspida/bin/Caspida start-containers

9. Check the current status of the nodes and pods
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide --all-namespaces
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces

10. The pods should now change from Pending to Running.
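Rather than re-running the check by hand, you can poll until no pod is Pending. This is a sketch, assuming a POSIX shell; wait_for_no_pending and the POD_LIST_CMD override are hypothetical, and the default 600-second timeout and 15-second interval are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical hook: the command that lists pods. It defaults to the kubectl
# command from the steps above and can be overridden (e.g. for testing).
POD_LIST_CMD=${POD_LIST_CMD:-"sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces"}

# Poll until at least one pod is listed and none is Pending.
# $1 = timeout in seconds (default 600), $2 = poll interval (default 15).
wait_for_no_pending() {
  limit=${1:-600}
  interval=${2:-15}
  waited=0
  while :; do
    # Print "<total> <pending>"; STATUS is column 4 with --all-namespaces.
    # A failed kubectl call lists no pods, so it never counts as success.
    counts=$($POD_LIST_CMD 2>/dev/null |
      awk 'NR > 1 { t++; if ($4 == "Pending") p++ } END { print t + 0, p + 0 }')
    total=${counts% *}
    pending=${counts#* }
    if [ "$total" -gt 0 ] && [ "$pending" -eq 0 ]; then
      echo "no pods Pending after ${waited}s"
      return 0
    fi
    if [ "$waited" -ge "$limit" ]; then
      echo "still $pending of $total pod(s) Pending after ${limit}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage:
#   wait_for_no_pending 600 15
```

Requiring at least one listed pod keeps the loop from reporting success while the API server is still unreachable after the restart.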