I recently had to realign our storage: specifically, write cold data to one NFS share and hot/warm data to another. Prior to this, all data was being written to the same storage, contrary to our design. I placed our cluster master in maintenance mode, stopped Splunk on all indexers, then used rsync to copy data to the proper shares.
After moving data around and ensuring that NFS shares were then mounted in the proper locations, I attempted to bring everything back online. The cluster master starts fine. The indexers, though, do not.
I have only been able to start one indexer out of four, and the failure doesn't seem tied to one specific indexer. I had Splunk running on indexer1, but indexer2, indexer3, and indexer4 then failed. Later, I was able to start Splunk on indexer2, but indexer1, indexer3, and indexer4 failed.
Examples of the errors I'm seeing:
ERROR STMgr - dir='/splunk/audit/db/hot_v1_64' st_open failure: opts=1 tsidxWritingLevel=1 (No such file or directory)
ERROR StreamGroup - Failed to open THING for dir=/splunk/audit/db/hot_v1_64 exists=false isDir=false isRW=false errno='No such file or directory' Your .tsidx files will be incomplete for this bucket, and you may have to rebuild it.
ERROR StreamGroup - failed to add corrupt marker to dir=/splunk/audit/db/hot_v1_64 errno=No such file or directory
and
ERROR HotDBManager - Could not service the bucket: path=/splunk/_introspection/db/hot_v1_388/rawdata not found. Remove it from host bucket list.
WARN TimeInvertedIndex - Directory /splunk/_introspection/db/hot_v1_388 appears to have been deleted
FATAL MetaData - Unable to open tempfile=/splunk/_introspection/db/hot_v1_388/Strings.data.temp for reason="No such file or directory"; this=MetaData: {file=/splunk/_introspection/db/hot_v1_388/Strings.data description=Strings totalCount=761 secsSinceFullService=0 global=WordPositionData: { count=0 ET=n/a LT=n/a mostRecent=n/a }
and
FATAL HotDBManager - Hot bucket with id=389 already exists. idx=_introspection dir=/splunk/_introspection/db/hot_v1_389
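The common thread in these errors is a hot bucket directory that no longer exists on disk, or exists without its rawdata subdirectory. A sketch for enumerating such buckets is below; it runs here on a throwaway tree, but in practice SPLUNK_DB would point at the real index volume (hypothetical path).

```shell
# Throwaway tree mimicking an index layout; point SPLUNK_DB at the real
# volume (e.g. /splunk) in practice.
SPLUNK_DB=$(mktemp -d)
mkdir -p "$SPLUNK_DB/audit/db/hot_v1_63/rawdata"   # intact hot bucket
mkdir -p "$SPLUNK_DB/audit/db/hot_v1_64"           # rawdata missing

# Print hot bucket directories that do NOT contain a rawdata/ directory --
# the condition behind the StreamGroup/HotDBManager errors above.
MISSING=$(find "$SPLUNK_DB" -mindepth 3 -maxdepth 3 -type d -name 'hot_v1_*' \
    ! -exec test -d '{}/rawdata' \; -print)
echo "$MISSING"
```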
I've run `splunk fsck repair --all-buckets-all-indexes` more than once, but these issues persist.
Can the underlying issues be corrected, or should we cut our losses and start our collections fresh? Fortunately, starting fresh is an option, though only as a last resort.
Our cold storage is on a Data Domain backed by HDD. This was the design before I took over administration of this deployment. It may have an impact in other areas and could conceivably be interfering with the repair of the existing data.
Our hot/warm NFS share is backed by SSD on a NetApp. It is split per peer, with one volume for each of the four indexers, so each indexer writes to its own share while the NetApp still consolidates the data.
It is possible that, despite the speed of the SSD storage, other bottlenecks are causing slowdowns. I don't see that as the cause of these errors, though; they look like the result of data corruption that `splunk fsck` is not fixing.
As for the storage type recommended for each role, we're unfortunately still running 7.2.5. Realigning the data according to the original architecture was part of the process of upgrading; in fact, it was the first step in our process. That said, our storage configuration does meet the recommendations in the table you linked.
Looking at the reference, though, I should check network latency as well. I doubt we're near or beyond the recommended latency threshold, but it won't hurt to check.
All that said, I'm still looking for some help with the errors and why they aren't being corrected by `splunk fsck`.
Hi
First: don't use NFS for hot/warm data; it doesn't perform well for that purpose. Personally, I don't use NFS even for cold data, only for frozen if needed. https://docs.splunk.com/Documentation/Splunk/8.2.3/Capacity/Referencehardware#What_storage_type_shou...
I'm not sure I understand your layout correctly. Does each indexer have its own storage area/directories on the NFS system, or is there a single directory for all indexers with the data stored only once? How many IOPS can you get from the NFS server when one node is using it, and when all nodes are? With all nodes active, the minimum per server should be more than 800.
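For a quick sanity check of a share's write path, something like the sketch below works anywhere (dd measures sequential synchronous throughput, not random IOPS; fio's random-read/write modes are the right tool for verifying the per-indexer IOPS floor). The TARGET path is an assumption; point it at one indexer's NFS mount in practice.

```shell
# TARGET is hypothetical -- use the indexer's hot/warm NFS mount in practice;
# it falls back to a temp directory here so the sketch is runnable anywhere.
TARGET="${TARGET:-$(mktemp -d)}"

# oflag=dsync forces each 1 MiB write out to storage so page-cache hits
# don't inflate the number; dd's final line reports achieved throughput.
RESULT=$(dd if=/dev/zero of="$TARGET/ddtest" bs=1M count=32 oflag=dsync 2>&1 | tail -n 1)
echo "$RESULT"
rm -f "$TARGET/ddtest"
```

Run it once from a single indexer, then from all four concurrently, and compare; a large drop under concurrent load points at the NFS server rather than the clients.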
r. Ismo