Searches are delayed by 10s before any results are displayed.

Splunk Employee

In a distributed environment with search-head pooling enabled, it takes about 10s from executing a search to events being displayed in the UI.
A cursory look at the search job inspector reveals that startup.handoff is taking an inexplicable ~10s.
The behavior is the same across different browsers, search queries, and timespans.

1 Solution

Splunk Employee

The culprit for this condition in our case was a leftover pooling.ini.lock file under /etc/pooling/.
Essentially, when a pool member (search head) validates itself with the pool, a check is performed against pooling.ini and a lock is created by the requesting search head.
The 'lock' comprises two files under /etc/pooling/:
1. pooling.ini.lock
2. pooling.ini.

On occasion, search heads fail to clean up after themselves and leave their lock files behind in the location above. Another search head attempting to execute a search then cannot validate itself against pooling.ini due to the stale lock; by default it waits up to 10s for the lock, at which point it proceeds to access pooling.ini regardless.
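Splunk's internal locking code is not public, but the effect of a stale lock can be illustrated with a small shell sketch using flock(1) (an assumption for illustration only; Splunk does not necessarily use flock, and the timeouts here are shortened from 10s to keep the demo quick):

```shell
# Illustration only: a stale file lock forces a second process to wait
# out a timeout before proceeding (Splunk waits 10s; we use 2s here).
LOCK=$(mktemp)

# "Search head A" acquires the lock and fails to release it in time.
( flock -x 9 && sleep 5 ) 9>"$LOCK" &
holder=$!
sleep 1   # give the holder time to acquire the lock

# "Search head B" waits up to 2s for the lock, then proceeds anyway.
start=$(date +%s)
( flock -w 2 -x 9 || echo "lock wait timed out; proceeding anyway" ) 9>"$LOCK"
elapsed=$(( $(date +%s) - start ))
echo "waited ${elapsed}s"

kill "$holder" 2>/dev/null   # clean up the demo lock holder
rm -f "$LOCK"
```

Search head B pays the full timeout on every search as long as the stale lock file sits on the shared storage, which is exactly the ~10s startup.handoff delay described above.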

To confirm whether you are hitting this issue please check:

1) splunkd.log or btool.log (it is unclear why the message appears in both places, or only one of the two) will contain messages like the following:
ERROR SearchHeadPoolInfo - Error reading search head pool info: Failed to lock //****/pool/etc/pooling/pooling.ini with code 1, possible reason: No such file or directory
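A quick way to run this check from the command line (assuming the default log location under $SPLUNK_HOME; adjust the path if your installation differs):

```shell
# Search both logs for the lock-failure message; on a host without the
# problem (or without these logs) this prints the fallback line instead.
: "${SPLUNK_HOME:=/opt/splunk}"

grep -h "Failed to lock" \
  "$SPLUNK_HOME/var/log/splunk/splunkd.log" \
  "$SPLUNK_HOME/var/log/splunk/btool.log" 2>/dev/null \
  || echo "no matching lock errors found"
```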

2) The job inspector output for any search job in version 5+ will include, under Execution Costs, the measurement:
startup.handoff = 10000
This value is in the 10s range, matching the current by-design timeout. In version 5.0.6 the behavior will change (SPL-66563): Splunk will optimistically attempt to open pooling.ini first, and only fall back on a file-based lock around pooling.ini.
Note: in pre-v5 Splunk the issue may be present but startup.handoff is not calculated, so it may be harder to verify that you have hit the condition. Also, "total run time" in the Job Inspector does NOT include the startup.handoff time.
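The 5.0.6 change (SPL-66563) can be sketched as follows. This is a hypothetical shell illustration, not Splunk's actual implementation: read the file optimistically, and only take the file-based lock, with its timeout, if the plain read fails.

```shell
# Hypothetical sketch of SPL-66563: optimistic read first, lock as fallback.
read_pooling_ini() {
  local ini="$1"
  if cat "$ini" 2>/dev/null; then
    return 0                          # optimistic path: no lock taken
  fi
  # Fallback: serialize access with a file lock, waiting up to 10s,
  # mirroring the pre-5.0.6 behavior described above.
  ( flock -w 10 -x 9 && cat "$ini" ) 9>"$ini.lock"
}
```

The practical upshot is that a stale lock file no longer costs 10s on the common path, because the lock is only consulted when the optimistic read fails.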

3) Another simple check is to execute ANY search and measure the time against the wall clock. If it takes around 10s before you see anything in the UI, AND the two checks above are positive, then you have hit this behavior.
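The wall-clock check can be scripted with a small timing helper. The splunk search invocation in the comment is illustrative only (credentials are placeholders); the demo at the bottom times a plain sleep so the helper can be verified anywhere.

```shell
# Time a command in milliseconds. On a real search head you might run, e.g.:
#   time_ms "$SPLUNK_HOME/bin/splunk" search 'index=_internal | head 1' -auth admin:changeme
# A search hitting the stale-lock timeout would show roughly 10000 ms of
# extra latency on top of its normal run time.
time_ms() {
  local start end
  start=$(date +%s%N)
  "$@" > /dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

time_ms sleep 1   # demo: reports a value around 1000
```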

If you have hit this condition, the workaround is simply to remove:
1. pooling.ini.lock
2. pooling.ini.
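On the shared pool storage this amounts to the following (POOL_DIR is a placeholder for your pool's etc/pooling directory; double-check the path before removing anything):

```shell
# Remove the stale lock files from the shared pool location, as
# described above. POOL_DIR is a placeholder -- set it to your pool's
# etc/pooling directory before running.
POOL_DIR="/path/to/pool/etc/pooling"

rm -f "$POOL_DIR/pooling.ini.lock" "$POOL_DIR/pooling.ini"
```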
