All Apps and Add-ons

topics/models in UBA have gone into a pending state even after stop-all and start-all

dchoi_splunk
Splunk Employee

After running a stop-all and start-all to restart, various topics/models have gone into a pending state and are not getting data. The data source connectors also appear to have stopped receiving data: no EPS is displayed on the home page, and the number of processed events does not appear to be growing.
A stop-all/start-all has been run a number of times since the issue started to try to rectify the problem.
What should I look into further?


dchoi_splunk
Splunk Employee

If you see errors like the following in the output of the health check scripts:

Errors:

kubelet journalctl:

kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
kubelet_node_status.go:82] Attempting to register node YOUR_FQDN_SERVERNAME
kubelet_node_status.go:106] Unable to register node "YOUR_FQDN_SERVERNAME" with API server: nodes "YOUR_FQDN_SERVERNAME" is forbidden: node "short_hostname" cannot modify node "YOUR_FQDN_SERVERNAME"
kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
kubelet_node_status.go:82] Attempting to register node YOUR_FQDN

Output of health check:

concern summary: Checking YOUR_FQDN_SERVERNAME first ...

                                                         McAfee_ePO           | Splunk | Processing |       | SPLUNK/DIRECT     | 2019-01-15 05:18:10 | 0          | 0           | 0       |      949874 |                 0 |             485746 |           49 |    <== significant failed/skipped events; review datasource SPL
                              eps:                      0   <== no response from redis 'YOUR_FQDN'
                              status not OK ...            <== check UI system health monitor for errors
                                                        splunkuba     analyticsaggregator-rc-v6csh             0/1       Pending   0          14m       <none>          <none>   <== pod 'analyticsaggregator-rc-v6csh' is 'Pending'; not 'Running'
                                                        splunkuba     analyticsviewsbuilder-rc-nwffg           0/1       Pending   0          14m       <none>          <none>   <== pod 'analyticsviewsbuilder-rc-nwffg' is 'Pending'; not 'Running'
                                                        splunkuba     analyticswriter-rc-jscjf                 0/1       Pending   0          14m       <none>          <none>   <== pod 'analyticswriter-rc-jscjf' is 'Pending'; not 'Running'
                                                        splunkuba     anomalyaggregationmodel-rc-j968d         0/1       Pending   0          14m       <none>          <none>   <== pod 'anomalyaggregationmodel-rc-j968d' is 'Pending'; not 'Running'
                                                        splunkuba     devicetopic-modelgroup01-rc-d5ljq        0/1       Pending   0          14m       <none>          <none>   <== pod 'devicetopic-modelgroup01-rc-d5ljq' is 'Pending'; not 'Running'
                                                        splunkuba     devicetopic-modelgroup01-rc-z5mzc        0/1       Pending   0          14m       <none>          <none>   <== pod 'devicetopic-modelgroup01-rc-z5mzc' is 'Pending'; not 'Running'
                                                        splunkuba     domaintopic-modelgroup01-rc-8pvcx        0/1       Pending   0          14m       <none>          <none>   <== pod 'domaintopic-modelgroup01-rc-8pvcx' is 'Pending'; not 'Running'
                                                        splunkuba     domaintopic-modelgroup01-rc-mmzhw        0/1       Pending   0          14m       <none>          <none>   <== pod 'domaintopic-modelgroup01-rc-mmzhw' is 'Pending'; not 'Running'

...
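As a quick triage aid, the Pending pods in output like the above can be pulled out with a small filter (a hypothetical helper, not part of the UBA health check scripts):

```shell
# Hypothetical helper: print the names of pods stuck in Pending.
# In `kubectl get pods -o wide --all-namespaces` output, column 2 is
# the pod name and column 4 is the STATUS.
list_pending() {
  awk '$4 == "Pending" { print $2 }'
}

# Intended usage (pipe the real output in):
#   sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces | list_pending
```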

The problem is how the hostnames are set: they use the FQDN rather than the short hostname, i.e.:
hostnamectl status: Static hostname: YOUR_FQDN_SERVERNAME

e.g. /etc/hosts (on the master node):

127.0.0.1 localhost
IP_ADDRESS YOUR_FQDN_SERVERNAME (e.g. uba1.splunk.com)
IP_ADDRESS YOUR_FQDN_SERVERNAME (e.g. uba2.splunk.com)
IP_ADDRESS YOUR_FQDN_SERVERNAME (e.g. uba3.splunk.com)
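The kubelet error above comes from this mismatch: kubelet attempts to register the node under the static (FQDN) hostname, while its node credentials are bound to the short name. A minimal sketch of the mismatch (hostnames are illustrative, not your actual values):

```shell
# Illustrative only: the shortname-vs-FQDN mismatch behind the error
#   node "short_hostname" cannot modify node "YOUR_FQDN_SERVERNAME"
static_hostname="uba1.splunk.com"      # what `hostnamectl status` reports
short_name="${static_hostname%%.*}"    # strip everything after the first dot

if [ "$static_hostname" != "$short_name" ]; then
  echo "mismatch: node '$short_name' cannot modify node '$static_hostname'"
fi
```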

To resolve the issue:

1. Check the current status of the containers
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide --all-namespaces
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces

2. Stop containers and services
/opt/caspida/bin/Caspida stop-containers
/opt/caspida/bin/Caspida stop-container-services  # command not in 4.1; added in 4.2

3. Check the current status of the containers
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide --all-namespaces
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces

4. Stop kubelet and docker on all nodes
sudo service kubelet stop && sudo service docker stop
ssh uba2 "sudo service kubelet stop && sudo service docker stop"
ssh uba3 "sudo service kubelet stop && sudo service docker stop"

5. Check the current status of the containers
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide --all-namespaces
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces

6. Update hostnames to use short names
sudo hostnamectl set-hostname uba1
ssh uba2 "sudo hostnamectl set-hostname uba2"
ssh uba3 "sudo hostnamectl set-hostname uba3"

7. Restart docker and kubelet on all nodes
sudo service docker start && sudo service kubelet start
ssh uba2 "sudo service docker start && sudo service kubelet start"
ssh uba3 "sudo service docker start && sudo service kubelet start"

8. Start containers and services
/opt/caspida/bin/Caspida start-container-services  # command not in 4.1; added in 4.2
/opt/caspida/bin/Caspida start-containers

9. Check the current status of the containers
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide --all-namespaces
sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces

10. The pods should now transition from Pending to Running.
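To confirm the final step without scanning the pod table by eye, the listing can be fed through a small check (a hypothetical helper, not a UBA command):

```shell
# Hypothetical check: succeed (exit 0) only when no pod is still Pending.
no_pending() {
  ! grep -qw "Pending"
}

# Intended usage:
#   sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -o wide --all-namespaces \
#     | no_pending && echo "all pods out of Pending"
```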