After upgrading from 8.2.4 to 9.0.4.1, the forwarders reconnected to the indexers once the indexer cluster had stabilized. Everything looked good: new data is delivered and indexed, and searching works fine.
However, we then started seeing the following WARN-level message in splunkd.log:
05-31-2023 06:47:21.407 -0500 WARN SystemInfo [15415 TcpChannelThread] - Invalid file path /proc/1/cgroup while checking container status
No new apps were added during the upgrade, and Splunk is not run in a container. The message appears 3-4 times per minute, logged by different components, on practically every Splunk instance, including the SH, deployer, LM, indexers and CM.
We would like an analysis of this message.
i) What's the WARN message?
This is a new check, added in versions after 9.0.3 to support containerized hosts. If splunkd confirms that it is running in a container, it uses different library calls to detect the system configuration and capacity.
Basically, we query /sys/fs/cgroup/memory/ and /sys/fs/cgroup/cpu,cpuacct and read /proc/1/cgroup to see whether splunkd is running inside a container. splunkd first validates that it is running inside a container; once that is confirmed, it reads the CPU/RAM values from the cgroup rather than through the traditional system calls.
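To make the idea concrete, here is a minimal sketch of such a check. It assumes a simple heuristic (looking for container runtime names such as docker or kubepods in the cgroup paths of PID 1); Splunk's actual detection logic is not described here, so the function name `looks_containerized` and the marker list are hypothetical.

```sh
# Hypothetical heuristic, NOT Splunk's implementation: decide whether a
# cgroup listing in the format of /proc/1/cgroup looks containerized.
# Each line has the form  hierarchy-ID:controller-list:cgroup-path
looks_containerized() {
    local cgroup_file=${1:-/proc/1/cgroup}
    # This is the situation behind the WARN message: if the file cannot be
    # read, the question cannot be answered, so say so instead of guessing.
    if [ ! -r "$cgroup_file" ]; then
        echo "unreadable"
        return 2
    fi
    # Inspect only the cgroup-path field (third field onward) for
    # common container runtime markers (assumed list).
    if cut -d: -f3- "$cgroup_file" | grep -Eq '(docker|kubepods|containerd|libpod|lxc)'; then
        echo "container"
        return 0
    fi
    echo "host"
    return 1
}
```

On a host such as the one in this case, running `looks_containerized` as the splunk user would be expected to print `unreadable`, which corresponds to the WARN message above.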
ii) Why it started after the upgrade:
- The check does not exist in 8.2.4. The message only started appearing after the 9.0.4.1 upgrade because the check was added in versions after 9.0.3.
iii) What the log message tells us:
- The splunk user is not allowed to access /proc/1/cgroup, even though the permissions on it look correct:
# ls -ld /proc/1
dr-xr-xr-x 9 root root 0 Apr 12 00:45 /proc/1
# ls -l /proc/1/cgroup
-r--r--r-- 1 root root 0 Jun 15 17:37 /proc/1/cgroup
- Based on these modes, any user ("other") should be able to read and traverse /proc/1 (r-x) and read /proc/1/cgroup (r--), yet the splunk account fails to access it and logs the WARN message.
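One way to see the discrepancy is to check the path as the affected account rather than as root. The sketch below is an illustrative helper (the name `access_status` is hypothetical) that reports how the current user sees a path.

```sh
# Report how the *current* user sees a path:
#   "readable"     - exists and can be read
#   "not-readable" - exists, but access is denied
#   "missing"      - hidden or nonexistent for this user
access_status() {
    local path=$1
    if [ ! -e "$path" ]; then
        echo "missing"
    elif [ -r "$path" ]; then
        echo "readable"
    else
        echo "not-readable"
    fi
}
```

Run it as the splunk account (for example through `sudo -u splunk`) against `/proc/1` and `/proc/1/cgroup`. With the proc mount options shown in iv), the splunk account should get `missing`, while root gets `readable`, matching the ls output above.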
iv) The reason for the access failure:
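The proc filesystem is mounted with gid=14001 and hidepid=2, as shown below. As a result, the per-process directories of other users, including root-owned /proc/1, are accessible only to members of group 14001, and the splunk account does not belong to that group. The ls output above was presumably taken as root (note the # prompt), which is not restricted by hidepid, so it does not reflect what the splunk user sees.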
----- mount options for proc --
$ findmnt /proc
proc on /proc type proc (rw,relatime,gid=14001,hidepid=2)
------
- hidepid=0: the default. Every user can see all data in /proc.
- hidepid=1: the /proc/<pid> directories remain visible, but users cannot access those belonging to other users' processes.
- hidepid=2: as with 1, but other users' /proc/<pid> directories are hidden altogether.
Reference: https://linux-audit.com/linux-system-hardening-adding-hidepid-to-proc/
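The mount options can also be checked programmatically. This is a minimal sketch with hypothetical helper names (`mount_opt`, `describe_hidepid`); it assumes the option string comes from `findmnt -n -o OPTIONS /proc`, and it also accepts the keywords off, noaccess and invisible, which newer kernels accept as alternatives to 0, 1 and 2.

```sh
# Print the value of one option (e.g. hidepid or gid) from a comma-separated
# mount option string; prints nothing if the option is absent.
mount_opt() {
    local opts=$1 name=$2
    printf '%s\n' "$opts" | tr ',' '\n' | sed -n "s/^$name=//p" | head -n 1
}

# Translate a hidepid value into the behaviour listed above.
describe_hidepid() {
    case $1 in
        ""|0|off)    echo "all /proc/<pid> directories visible to every user" ;;
        1|noaccess)  echo "other users' /proc/<pid> directories visible but not accessible" ;;
        2|invisible) echo "other users' /proc/<pid> directories hidden" ;;
        *)           echo "unrecognized hidepid value: $1" ;;
    esac
}
```

Usage: `opts=$(findmnt -n -o OPTIONS /proc)`, then `describe_hidepid "$(mount_opt "$opts" hidepid)"` and `mount_opt "$opts" gid`. For the mount in this case these give the "hidden" description and `14001`.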
v) Resolutions:
i) Add the splunk user to group 14001, then restart splunkd so that the running process picks up the new group membership:
$ usermod -aG 14001 splunk
Or
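ii) Remove the gid and hidepid restrictions. A remount does not survive a reboot, so also remove these options from wherever /proc is configured at boot (typically /etc/fstab):
$ mount -o remount,gid=0,hidepid=0 /proc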
FYI, Red Hat recommends against using hidepid with systemd on RHEL 7 and later; the reasons are given in https://access.redhat.com/solutions/6704531 .
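After applying either resolution, a quick sanity check can help. The function below is a simplified, hypothetical model of the hidepid rule for root-owned PID 1 (root and members of the gid= group are exempt, everyone else is blocked when hidepid is 1 or higher); it is not an exhaustive reimplementation of the kernel's checks.

```sh
# Simplified model: should this account be able to read /proc/1?
#   $1 = numeric uid, $2 = space-separated group IDs (as from `id -G user`),
#   $3 = proc mount options (as from `findmnt -n -o OPTIONS /proc`)
proc1_readable_for() {
    local uid=$1 groups=$2 opts=$3
    local hidepid gid g
    hidepid=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^hidepid=//p' | head -n 1)
    gid=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^gid=//p' | head -n 1)
    # No hidepid restriction: everyone can read /proc/1.
    case $hidepid in ""|0|off) echo yes; return 0 ;; esac
    # root is exempt from hidepid.
    [ "$uid" = 0 ] && { echo yes; return 0; }
    # Members of the gid= group are exempt.
    for g in $groups; do
        [ -n "$gid" ] && [ "$g" = "$gid" ] && { echo yes; return 0; }
    done
    echo no
    return 1
}
```

Usage: `proc1_readable_for "$(id -u splunk)" "$(id -G splunk)" "$(findmnt -n -o OPTIONS /proc)"`. Note that `id -G` reflects the group database immediately after `usermod`, while an already running splunkd keeps its old groups until it is restarted.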