My Splunk instance indexes data constantly (24x7), but I've recently noticed some gaps in the indexed data timeline. I have also noticed that data I could search yesterday is not being returned today. This doesn't happen consistently, but often enough to cause concern. I looked in splunkd.log and index=_internal to confirm that the buckets have not rotated out of the DB, and also confirmed that the buckets spanning the time period of the gap are present and in good shape. What else can I do to track down this missing data?
In splunkd.log I see the following:
05-07-2011 05:44:45.466 +0000 WARN MetaData - /opt/splunk/var/lib/splunk/apache/db/hot_v1_59/Hosts.data: attempting safeService to attempt to fix up metadata
My environment consists of 4 indexers running 4.2, 300 UF instances (also 4.2) and a standalone deployment server, also 4.2. We use the deployment server to manage the configs of all instances.
Hi, this may be caused by defect SPL-39127, which affects Splunk 4.2.0 and 4.2.1.
The corruption is triggered when a push from the deployment server restarts Splunkweb on indexers that are deployment clients (this includes search heads that perform summary indexing).
If Splunk is still at 4.2, first apply the latest 4.2.1 release, which fixes an associated defect, SPL-38464: in rare cases, concurrent hash table and string length collisions for metadata field values can cause index-level metadata files to grow to very large sizes, up to several gigabytes.
Reference: http://splunk.com/base/Documentation/4.2/ReleaseNotes/Knownissues
If you encounter this problem, please file a case with Splunk Support:
http://www.splunk.com/support
To find out whether this is the case, look in splunkd.log for a message like:
05-07-2011 05:44:45.466 +0000 WARN MetaData - /opt/splunk/var/lib/splunk/apache/db/hot_v1_59/Hosts.data: attempting safeService to attempt to fix up metadata
To find those warnings in the internal logs (and identify the indexer, in the case of search peers), you can use this search:
index=_internal host="indexer hostname" source=splunkd.log safeService | rex " MetaData - (?P<bucket>.*)/" | stats count by bucket splunk_server
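If you would rather check the log file directly on an indexer, the same extraction can be sketched in shell. This is only an illustration: the function name list_suspect_buckets is made up here, and the sed pattern assumes the warning has exactly the format shown in the sample message above.

```shell
#!/bin/sh
# Read splunkd.log lines on stdin and print each bucket directory named in a
# safeService warning, prefixed by how many times it appears.
# Hypothetical helper; it mirrors the rex/stats search on a single log file.
list_suspect_buckets() {
    grep 'safeService' \
        | sed -n 's|.* MetaData - \(.*\)/[^/]*: attempting safeService.*|\1|p' \
        | sort | uniq -c
}

# Example use (default log location of a standard install):
# list_suspect_buckets < "$SPLUNK_HOME/var/log/splunk/splunkd.log"
```

The output is one line per bucket directory, which can serve as the list of buckets to rebuild.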
Here is the manual procedure to fix it.
Note: for the rebuild in step 4 there are 2 options: a single sequential rebuild, or multiple rebuilds in parallel, as detailed below.
1 - disable the deployment client to prevent new corruption
(until the fix for SPL-39127, targeted for the upcoming maintenance release 4.2.2)
mv $SPLUNK_HOME/etc/system/local/deploymentclient.conf $SPLUNK_HOME/etc/system/local/deploymentclient.disabled
2 - run fsck to identify the buckets with corrupted metadata
cd $SPLUNK_HOME/bin
./splunk cmd splunkd fsck --mode metadata --all > /tmp/trash
3 - stop Splunk to prevent bucket rotation
4 - for each affected bucket, rebuild the tsidx files
The process is long. If you have several buckets, it is faster to run several rebuilds in parallel (use & on Linux).
./splunk cmd splunkd rebuild /pathtothebucketfolder/
./splunk cmd splunkd rebuild /pathtothebucketfolder1/ &
./splunk cmd splunkd rebuild /pathtothebucketfolder2/ &
etc...
./splunk cmd splunkd fsck --mode metadata --all --repair
5 - check the result with
./splunk cmd splunkd fsck --mode metadata --all
6 - restart Splunk (the change to the deployment client config also takes effect at this point)
For further information on splunkd fsck, refer to the Community Wiki:
http://www.splunk.com/wiki/Check_and_Repair_Metadata
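For a long list of buckets, the parallel variant of step 4 can be wrapped in a small loop so that only a few rebuilds run at once. This is a rough sketch, not an official tool: the function name, the MAX_JOBS and REBUILD_CMD variables, and the idea of keeping the bucket paths in a text file (one per line) are all assumptions made for illustration. The only Splunk command used is the rebuild command from step 4, and Splunk should be stopped first, as in step 3.

```shell
#!/bin/sh
# Rebuild every bucket listed in a file (one bucket directory per line),
# running at most MAX_JOBS rebuilds at a time.
# REBUILD_CMD defaults to the rebuild command from step 4; set it to "echo"
# for a dry run. It is expanded unquoted on purpose so the words split,
# which means SPLUNK_HOME must not contain spaces.
REBUILD_CMD=${REBUILD_CMD:-"$SPLUNK_HOME/bin/splunk cmd splunkd rebuild"}
MAX_JOBS=${MAX_JOBS:-2}

rebuild_buckets() {
    bucket_list=$1
    running=0
    while IFS= read -r bucket; do
        [ -n "$bucket" ] || continue            # skip blank lines
        $REBUILD_CMD "$bucket" < /dev/null &    # same "&" approach as in step 4
        running=$((running + 1))
        if [ "$running" -ge "$MAX_JOBS" ]; then
            wait                                # let the current batch finish
            running=0
        fi
    done < "$bucket_list"
    wait                                        # wait for any remaining rebuilds
}

# Example use (the list file name is hypothetical):
# rebuild_buckets /tmp/buckets_to_rebuild.txt
```

Batching with wait is the simplest way to cap concurrency in plain sh. Once all rebuilds have finished, continue with the fsck --repair command and step 5 as described above.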
How can you prevent this from happening until 4.2.2 comes out?
There are two workarounds to address this.
The workaround for the associated bug SPL-38464 (setting "inPlaceUpdates = false" as a global parameter in the [default] stanza of indexes.conf) is still valid:
[default]
inPlaceUpdates = false
Another workaround is to set both "restartSplunkWeb=false" AND "restartSplunkd=false" in the relevant serverclass.conf stanzas on the deployment server to disable restarts. The corruption happens in the splunkweb restart code path, but restarting splunkd also triggers a splunkweb restart, so both settings are needed.
If applied, these workarounds should be removed once 4.2.2 is installed.
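One way to double-check that both restart settings are in place is a plain text check on the file. The sketch below is an assumption-laden illustration: check_restart_flags is a made-up helper, it only looks for the two keys anywhere in the file, and it does not parse stanzas, so it cannot tell you which server class the settings apply to.

```shell
#!/bin/sh
# Report whether a serverclass.conf-style file sets both restart flags to
# false somewhere. Plain text check only, not a Splunk config parser.
# Returns 0 if both flags are set to false, 1 otherwise.
check_restart_flags() {
    conf=$1
    rc=0
    for key in restartSplunkWeb restartSplunkd; do
        if grep -Eq "^[[:space:]]*$key[[:space:]]*=[[:space:]]*false[[:space:]]*$" "$conf"; then
            echo "$key: false"
        else
            echo "$key: not set to false"
            rc=1
        fi
    done
    return $rc
}

# Example use (the file location is an assumption; use wherever your
# deployment server keeps serverclass.conf):
# check_restart_flags "$SPLUNK_HOME/etc/system/local/serverclass.conf"
```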
If I upgrade to 4.2.2, do I still need to run the rebuild/repair operations?
If you have forwarders sending data, you can look for forwarder connectivity within the splunkd.log of both the indexers and forwarders. I would first check to make sure the forwarder indeed had connectivity during that time. Are these systems picking up network data or monitoring files? Some keys to debugging:
The above steps are typically enough to figure out if it is a problem getting the data, or indexing the data.