Solved: Smartstore GCS: Why is there upload error status 9 and 14?

crazyTauron · 06-20-2022

Hi all,

I configured SmartStore on two new Splunk core infrastructures. I didn't encounter any errors while setting up the indexers and multisite clustering, but once I configured SmartStore I started to receive these errors repeatedly:

ERROR CacheManager [1721417 cachemanagerUploadExecutorWorker-3] - action=upload, cache_id="*THE*BUCKET*", status=failed, reason="HTTP Error 14: Retry policy exhausted in Read(): PerformWork() - CURL error [6]=Couldn't resolve host name [UNAVAILABLE]", elapsed_ms=881841

ERROR CacheManager [1721414 cachemanagerUploadExecutorWorker-0] - action=upload, cache_id="*THE*BUCKET*", status=failed, reason="HTTP Error 9: Permanent error in ComposeObject: {\n "error": {\n "code": 412,\n "message": "At least one of the pre-conditions you specified did not hold.",\n "errors": [\n {\n "message": "At least one of the pre-conditions you specified did not hold.",\n "domain": "global",\n "reason": "conditionNotMet",\n "locationType": "header",\n "location": "If-Match"\n }\n ]\n }\n}\n [FAILED_PRECONDITION]", elapsed_ms=327982

I checked the contents of the GCS folder with this command:

splunk cmd splunkd rfs ls index:my_index | grep *THE*BUCKET*IN*ERROR*

The bucket is present in the folder.

I tried restarting the CM and doing a rolling restart of the indexers, but the errors persist.

Here are the relevant .conf settings:
server.conf:

[cachemanager]
max_cache_size = 250000
hotlist_recency_secs = 604800
max_concurrent_downloads = 4
hotlist_bloom_filter_recency_hours = 168

indexes.conf:

[volume:remote_store]
storageType = remote
path = gs://bucket
remote.gs.credential_file=cred

rphillips_splk · 11-20-2022

This issue has been identified in Splunk bug:
SPL-191595: "GCP S2: Multipart upload is endlessly trying to upload already uploaded file for one of the buckets"

Fixed in on-prem releases:
9.0+

How to identify affected bucket files and indexes:
Note: replace host IN (*idx*) with a host filter that matches all of your indexers.

index=_internal host IN (*idx*) sourcetype=splunkd source=*/splunkd.log component=CacheManager "At least one of the pre-conditions you specified did not hold" OR "Failed to copy localId" | rex field=cache_id "bid\|(?<idx>\S+)\~\S+\~\S+\|" | stats values(localId) values(remoteId) by idx cache_id

For example:

10-31-2022 16:57:52.396 +0000 ERROR CacheManager [149202 cachemanagerUploadExecutorWorker-20602] - action=upload, cache_id="bid|nginx~328~F40E062E-ED80-4BAC-B7EE-B60558832CE1|", status=failed, reason="HTTP Error 9: Permanent error in ComposeObject: {\n "error": {\n "code": 412,\n "message": "At least one of the pre-conditions you specified did not hold.",\n "errors": [\n {\n "message": "At least one of the pre-conditions you specified did not hold.",\n "domain": "global",\n "reason": "conditionNotMet",\n "locationType": "header",\n "location": "If-Match"\n }\n ]\n }\n}\n [FAILED_PRECONDITION]", elapsed_ms=11127

10-31-2022 16:57:52.396 +0000 WARN CacheManager [149202 cachemanagerUploadExecutorWorker-20602] - cache_id="bid|nginx~328~F40E062E-ED80-4BAC-B7EE-B60558832CE1|", issue="Failed to copy localId=/opt/splunk/var/lib/splunk/nginx/db/db_1665123401_1665073273_328_F40E062E-ED80-4BAC-B7EE-B60558832CE1/1665104962-1665103890-8745906459075435722.tsidx to remoteId=nginx/db/aa/28/328~F40E062E-ED80-4BAC-B7EE-B60558832CE1/guidSplunk-F40E062E-ED80-4BAC-B7EE-B60558832CE1/1665104962-1665103890-8745906459075435722.tsidx(0,-1,)"

10-31-2022 16:57:52.396 +0000 ERROR GCSClient [149202 cachemanagerUploadExecutorWorker-20602] - action=GCSPutFromFileJob bucket=shopify-smartstore-production inputFile=/opt/splunk/var/lib/splunk/nginx/db/db_1665123401_1665073273_328_F40E062E-ED80-4BAC-B7EE-B60558832CE1/1665104962-1665103890-8745906459075435722.tsidx remoteObject=nginx/db/aa/28/328~F40E062E-ED80-4BAC-B7EE-B60558832CE1/guidSplunk-F40E062E-ED80-4BAC-B7EE-B60558832CE1/1665104962-1665103890-8745906459075435722.tsidx size=1055137502 parallelize=1 status=failed code='FAILED_PRECONDITION' msg='Permanent error in ComposeObject: {\n "error": {\n "code": 412,\n "message": "At least one of the pre-conditions you specified did not hold.",\n "errors": [\n {\n "message": "At least one of the pre-conditions you specified did not hold.",\n "domain": "global",\n "reason": "conditionNotMet",\n "locationType": "header",\n "location": "If-Match"\n }\n ]\n }\n}\n [FAILED_PRECONDITION]'
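
If you are post-processing exported splunkd.log lines outside of Splunk, the rex from the search above can be mirrored in a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes cache_id always has the bid|<index>~<id>~<guid>| shape seen in the sample events.

```python
import re

# Same pattern as the rex in the search: bid|<index>~<local id>~<guid>|
CACHE_ID_RE = re.compile(r"bid\|(?P<idx>\S+)~\S+~\S+\|")


def index_from_cache_id(cache_id):
    """Return the index name embedded in a CacheManager cache_id, or None."""
    match = CACHE_ID_RE.search(cache_id)
    return match.group("idx") if match else None


# cache_id taken from the sample ERROR event above
print(index_from_cache_id("bid|nginx~328~F40E062E-ED80-4BAC-B7EE-B60558832CE1|"))  # nginx
```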

workaround (pre 9.0):

indexes.conf

[default]
maxDataSize = 750

[<index>]
maxDataSize = 750

[volume:<volume>]
remote.s3.supports_versioning = false
remote.gs.upload_chunk_size = 0

Note: Setting remote.gs.upload_chunk_size = 0 is not advised if the <index> maxDataSize is larger than 750MB. In that case, either upgrade to 9.0+ or reduce the <index> maxDataSize to "auto", which is 750MB.
If you do set remote.gs.upload_chunk_size = 0, revert to the default after upgrading to 9.0+. The default is remote.gs.upload_chunk_size = 33554432.

The following parameter in indexes.conf controls the bucket upload chunk size (for GCS):

remote.gs.upload_chunk_size = <unsigned integer>
* Specifies the maximum size, in bytes, for file chunks in a parallel upload.
* A value of 0 disables uploading in multiple chunks. Files are uploaded
  as a single (large) chunk.
* Minimum value: 5242880 (5 MB)
* Default: 33554432 (32MB)
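
To get a feel for how many chunks a single file turns into, here is a rough back-of-the-envelope calculation. It is a sketch under assumptions: the helper names are hypothetical, and it assumes a file is simply split into pieces of at most remote.gs.upload_chunk_size bytes, as the spec text describes. The file size is the size=1055137502 reported for the .tsidx file in the GCSClient error above.

```python
import math

DEFAULT_CHUNK_SIZE = 33554432  # 32 MB, the documented default
MIN_CHUNK_SIZE = 5242880       # 5 MB, the documented minimum


def is_valid_chunk_size(chunk_size):
    """0 disables chunking; any other value must meet the documented minimum."""
    return chunk_size == 0 or chunk_size >= MIN_CHUNK_SIZE


def chunk_count(file_size, chunk_size=DEFAULT_CHUNK_SIZE):
    """Estimate how many chunks a file of file_size bytes is split into."""
    if not is_valid_chunk_size(chunk_size):
        # This helper's own choice; how splunkd treats such values is not covered here.
        raise ValueError("chunk size must be 0 or >= %d" % MIN_CHUNK_SIZE)
    if chunk_size == 0:
        return 1  # uploaded as a single (large) chunk
    return max(1, math.ceil(file_size / chunk_size))


print(chunk_count(1055137502))     # 32 chunks at the 32 MB default
print(chunk_count(1055137502, 0))  # 1 chunk with chunking disabled
```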

Bucket size is configured per <index> in indexes.conf:

maxDataSize = <positive integer>|auto|auto_high_volume
* The maximum size, in megabytes, that a hot bucket can reach before splunkd
  triggers a roll to warm.
* Specifying "auto" or "auto_high_volume" will cause Splunk to autotune this
  setting (recommended).
* You should use "auto_high_volume" for high-volume indexes (such as the
  main index); otherwise, use "auto". A "high volume index" would typically
  be considered one that gets over 10GB of data per day.
* "auto_high_volume" sets the size to 10GB on 64-bit, and 1GB on 32-bit
  systems.
* Although the maximum value you can set this is 1048576 MB, which
  corresponds to 1 TB, a reasonable number ranges anywhere from 100 to
  50000. Before proceeding with any higher value, please seek approval of
  Splunk Support.
* If you specify an invalid number or string, maxDataSize will be auto
  tuned.
* NOTE: The maximum size of your warm buckets might slightly exceed
  'maxDataSize', due to post-processing and timing issues with the rolling
  policy.
* For remote storage enabled indexes, consider setting this value to "auto"
  (750MB) or lower.
* Default: "auto" (sets the size to 750 megabytes)
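
The earlier note about not pairing remote.gs.upload_chunk_size = 0 with buckets larger than 750MB can be turned into a quick sanity check. This is a minimal sketch with hypothetical helper names. It assumes a 64-bit host, takes "10GB" to mean 10 * 1024 MB, and assumes an invalid value is tuned the same way as "auto". It only looks at values you pass in and does not parse indexes.conf.

```python
AUTO_MB = 750                    # "auto" per the spec text above
AUTO_HIGH_VOLUME_MB = 10 * 1024  # assumption: "10GB" on 64-bit taken as 10 * 1024 MB


def resolve_max_data_size_mb(value):
    """Translate a maxDataSize setting into megabytes."""
    text = str(value).strip()
    if text == "auto_high_volume":
        return AUTO_HIGH_VOLUME_MB
    if text.isdigit() and int(text) > 0:
        return int(text)
    # "auto", or an invalid value (assumed here to be tuned the same as "auto")
    return AUTO_MB


def chunking_can_be_disabled(max_data_size):
    """True if remote.gs.upload_chunk_size = 0 fits the 750MB guidance."""
    return resolve_max_data_size_mb(max_data_size) <= AUTO_MB


for setting in ("auto", 750, "auto_high_volume", 5000):
    print(setting, chunking_can_be_disabled(setting))
```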

Note: For SmartStore-enabled indexes, the recommended bucket size is the default "auto" (750MB):

maxDataSize. Do not change from default of auto (recommended).

https://docs.splunk.com/Documentation/Splunk/latest/Indexer/ConfigureSmartStore

So by default, buckets are uploaded from the indexer to the remote store as multipart uploads in 32MB chunks.

It appears this condition is hit when a multipart upload is interrupted (e.g., by a Splunk restart). Some chunks of the bucket may already have been uploaded to the remote store, but not all of them, so Splunk tries to chunk and upload the bucket again.

Once a multipart upload is aborted, any parts in the process of being uploaded fail, and future requests that use the relevant upload ID fail.

Splunk may be trying to reuse the old ETag or uploadId of the failed multipart upload, which is why we run into this issue.
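
To make the suspected failure mode concrete, here is a toy, in-memory model of an object store that enforces an If-Match style precondition. It is purely illustrative: ToyStore and its methods are invented for this sketch, it is not the GCS API, and it is not how Splunk's uploader is implemented. It only shows why a retry that keeps reusing a stale ETag can never succeed.

```python
import hashlib


class PreconditionFailed(Exception):
    """Stands in for the HTTP 412 / FAILED_PRECONDITION response."""


class ToyStore:
    """Hypothetical in-memory object store with ETag preconditions."""

    def __init__(self):
        self._objects = {}  # object name -> bytes

    def etag(self, name):
        data = self._objects.get(name)
        return None if data is None else hashlib.sha256(data).hexdigest()

    def put(self, name, data, if_match=None):
        """Write data; if if_match is given, it must equal the current ETag."""
        if if_match is not None and if_match != self.etag(name):
            raise PreconditionFailed(name)
        self._objects[name] = data


store = ToyStore()
store.put("bucket.tsidx", b"partial upload")
stale_etag = store.etag("bucket.tsidx")        # remembered by the uploader
store.put("bucket.tsidx", b"complete upload")  # the object changes afterwards

failures = 0
for _ in range(3):  # every retry reuses the stale ETag and fails the same way
    try:
        store.put("bucket.tsidx", b"complete upload", if_match=stale_etag)
    except PreconditionFailed:
        failures += 1
print(failures)  # 3
```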
