Hi all,
I configured SmartStore on 2 new Splunk Core infrastructures. I didn't encounter errors while setting up the indexers and multisite, but when I configured SmartStore I started to receive these errors multiple times:
ERROR CacheManager [1721417 cachemanagerUploadExecutorWorker-3] - action=upload, cache_id="*THE*BUCKET*", status=failed, reason="HTTP Error 14: Retry policy exhausted in Read(): PerformWork() - CURL error [6]=Couldn't resolve host name [UNAVAILABLE]", elapsed_ms=881841
ERROR CacheManager [1721414 cachemanagerUploadExecutorWorker-0] - action=upload, cache_id="*THE*BUCKET*", status=failed, reason="HTTP Error 9: Permanent error in ComposeObject: {\n "error": {\n "code": 412,\n "message": "At least one of the pre-conditions you specified did not hold.",\n "errors": [\n {\n "message": "At least one of the pre-conditions you specified did not hold.",\n "domain": "global",\n "reason": "conditionNotMet",\n "locationType": "header",\n "location": "If-Match"\n }\n ]\n }\n}\n [FAILED_PRECONDITION]", elapsed_ms=327982
I checked the content of the GCS folder with the command:
splunk cmd splunkd rfs ls index:my_index | grep *THE*BUCKET*IN*ERROR*
I checked, and the bucket is in the folder.
I tried a restart of the CM and a rolling restart of the indexers, but the errors persist.
Here are the relevant .conf files:
server.conf:
[cachemanager]
max_cache_size = 250000
hotlist_recency_secs = 604800
max_concurrent_downloads = 4
hotlist_bloom_filter_recency_hours = 168
indexes.conf:
[volume:remote_store]
storageType = remote
path = gs://bucket
remote.gs.credential_file=cred
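For reference, the index stanzas reference that volume in the usual SmartStore way; a representative one (the index name and paths below are placeholders, not my actual config) looks roughly like this:
[my_index]
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb
# SmartStore: point the index at the remote volume defined above
remotePath = volume:remote_store/$_index_name
# indexer cluster + SmartStore: replication is managed automatically
repFactor = auto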
This issue has been identified in Splunk bug:
SPL-191595: "GCP S2: Multipart upload is endlessly trying to upload already uploaded file for one of the buckets"
Fixed in on-prem releases:
9.0+
How to identify affected bucket files and indexes:
note: host IN (*idx*) should be a host filter to search across all of your indexers
index=_internal host IN (*idx*) sourcetype=splunkd source=*/splunkd.log component=CacheManager "At least one of the pre-conditions you specified did not hold" OR "Failed to copy localId" | rex field=cache_id "bid\|(?<idx>\S+)\~\S+\~\S+\|" | stats values(localId) values(remoteId) by idx cache_id
For example:
10-31-2022 16:57:52.396 +0000 ERROR CacheManager [149202 cachemanagerUploadExecutorWorker-20602] - action=upload, cache_id="bid|nginx~328~F40E062E-ED80-4BAC-B7EE-B60558832CE1|", status=failed, reason="HTTP Error 9: Permanent error in ComposeObject: {\n "error": {\n "code": 412,\n "message": "At least one of the pre-conditions you specified did not hold.",\n "errors": [\n {\n "message": "At least one of the pre-conditions you specified did not hold.",\n "domain": "global",\n "reason": "conditionNotMet",\n "locationType": "header",\n "location": "If-Match"\n }\n ]\n }\n}\n [FAILED_PRECONDITION]", elapsed_ms=11127 |
10-31-2022 16:57:52.396 +0000 WARN CacheManager [149202 cachemanagerUploadExecutorWorker-20602] - cache_id="bid|nginx~328~F40E062E-ED80-4BAC-B7EE-B60558832CE1|", issue="Failed to copy localId=/opt/splunk/var/lib/splunk/nginx/db/db_1665123401_1665073273_328_F40E062E-ED80-4BAC-B7EE-B60558832CE1/1665104962-1665103890-8745906459075435722.tsidx to remoteId=nginx/db/aa/28/328~F40E062E-ED80-4BAC-B7EE-B60558832CE1/guidSplunk-F40E062E-ED80-4BAC-B7EE-B60558832CE1/1665104962-1665103890-8745906459075435722.tsidx(0,-1,)" |
10-31-2022 16:57:52.396 +0000 ERROR GCSClient [149202 cachemanagerUploadExecutorWorker-20602] - action=GCSPutFromFileJob bucket=shopify-smartstore-production inputFile=/opt/splunk/var/lib/splunk/nginx/db/db_1665123401_1665073273_328_F40E062E-ED80-4BAC-B7EE-B60558832CE1/1665104962-1665103890-8745906459075435722.tsidx remoteObject=nginx/db/aa/28/328~F40E062E-ED80-4BAC-B7EE-B60558832CE1/guidSplunk-F40E062E-ED80-4BAC-B7EE-B60558832CE1/1665104962-1665103890-8745906459075435722.tsidx size=1055137502 parallelize=1 status=failed code='FAILED_PRECONDITION' msg='Permanent error in ComposeObject: {\n "error": {\n "code": 412,\n "message": "At least one of the pre-conditions you specified did not hold.",\n "errors": [\n {\n "message": "At least one of the pre-conditions you specified did not hold.",\n "domain": "global",\n "reason": "conditionNotMet",\n "locationType": "header",\n "location": "If-Match"\n }\n ]\n }\n}\n [FAILED_PRECONDITION]' |
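If you just want a quick per-index count of affected buckets rather than the full file list, a small variation of the search above should do it (same assumptions as above; adjust the host filter to match your indexers):
index=_internal host IN (*idx*) sourcetype=splunkd source=*/splunkd.log component=CacheManager "At least one of the pre-conditions you specified did not hold"
| rex field=cache_id "bid\|(?<idx>\S+)\~\S+\~\S+\|"
| stats dc(cache_id) AS affected_buckets by idx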
Workaround (pre-9.0):
indexes.conf
[default]
maxDataSize = 750
[<index>]
maxDataSize = 750
[volume:<volume>]
remote.s3.supports_versioning = false
remote.gs.upload_chunk_size = 0
Note: It is not advised to set remote.gs.upload_chunk_size = 0 if <index> maxDataSize is larger than 750MB; instead, either upgrade to 9.0+ or reduce <index> maxDataSize to "auto", which is 750MB.
If remote.gs.upload_chunk_size = 0 is used, revert to the default after upgrading to 9.0+. The default is remote.gs.upload_chunk_size = 33554432.
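Once the workaround (and later the revert) is in place, it is worth confirming what each indexer actually resolves at runtime. One way to do that, assuming CLI access on the indexers, is btool:
splunk btool indexes list --debug | grep -E "upload_chunk_size|supports_versioning|maxDataSize"
The --debug flag shows which .conf file each value comes from, which helps catch a setting being overridden by another app.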
The following parameter from indexes.conf controls the bucket upload chunk size (for GCS):
remote.gs.upload_chunk_size = <unsigned integer>
* Specifies the maximum size, in bytes, for file chunks in a parallel upload.
* A value of 0 disables uploading in multiple chunks. Files are uploaded as a
  single (large) chunk.
* Minimum value: 5242880 (5 MB)
* Default: 33554432 (32 MB)
Bucket size is configured per <index> in indexes.conf:
maxDataSize = <positive integer>|auto|auto_high_volume
* The maximum size, in megabytes, that a hot bucket can reach before splunkd
triggers a roll to warm.
* Specifying "auto" or "auto_high_volume" will cause Splunk to autotune this
setting (recommended).
* You should use "auto_high_volume" for high-volume indexes (such as the
main index); otherwise, use "auto". A "high volume index" would typically
be considered one that gets over 10GB of data per day.
* "auto_high_volume" sets the size to 10GB on 64-bit, and 1GB on 32-bit
systems.
* Although the maximum value you can set this is 1048576 MB, which
corresponds to 1 TB, a reasonable number ranges anywhere from 100 to
50000. Before proceeding with any higher value, please seek approval of
Splunk Support.
* If you specify an invalid number or string, maxDataSize will be auto
tuned.
* NOTE: The maximum size of your warm buckets might slightly exceed
'maxDataSize', due to post-processing and timing issues with the rolling
policy.
* For remote storage enabled indexes, consider setting this value to "auto"
(750MB) or lower.
* Default: "auto" (sets the size to 750 megabytes)
Note: With SmartStore-enabled indexes, the recommended bucket size is the default "auto" (750MB):
https://docs.splunk.com/Documentation/Splunk/latest/Indexer/ConfigureSmartStore
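If you are not sure whether any existing index actually has buckets above that size before relying on the single-chunk workaround, a rough check with dbinspect (run from a search head; the index filter here is just an example) could look like:
| dbinspect index=*
| stats max(sizeOnDiskMB) AS max_bucket_mb by index
| where max_bucket_mb > 750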
So, by default, buckets are uploaded from the indexer to the remote storage in 32MB chunks (multipart uploads); the ~1GB tsidx file in the example above, for instance, would be split into roughly 32 parts.
It appears this condition is hit when a multipart upload fails (e.g. because of a Splunk restart): some chunks of the bucket may have been uploaded to the remote store, but not all of them, so Splunk tries to chunk and upload the bucket again.
Once a multipart upload is aborted, any parts still in the process of being uploaded fail, and future requests that use the relevant upload ID fail.
Splunk may be trying to reuse the old ETag or uploadId of a failed multipart upload, which is why we run into this issue.
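To confirm that the workaround (or the 9.0+ upgrade) has actually stopped the failing uploads, you can watch the same internal logs over time, for example:
index=_internal host IN (*idx*) sourcetype=splunkd component=CacheManager action=upload status=failed "FAILED_PRECONDITION"
| timechart span=1h count by host
The count should drop to zero once the affected buckets upload successfully.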