MM-62075: Ignore error from aws s3 rm command #864

agarciamontoro · 2024-12-04T10:52:57Z

Summary

When destroying a cluster that deploys an S3 bucket, we try to speed up its deletion by calling aws s3 rm s3://my-bucket concurrently while we run terraform destroy. This was added to fix https://mattermost.atlassian.net/browse/MM-47263, since we were seeing too many timeouts due to the S3 bucket not being properly destroyed in time. Emptying the bucket while other resources are being destroyed has worked to fix the timeout issue, but it introduces a new error: if the bucket is destroyed by Terraform before the aws s3 rm command finishes, this command, and thus the whole {comparison,deployment} destroy command, fails.

This is a spurious failure: all resources, including the S3 bucket, have been destroyed (that's exactly why the aws s3 rm command fails, actually), so we can simply ignore it.

In the case this command would error in any other way that we are not accounting for as of today, the {comparison,deployment} destroy command would still fail, since it would notice that the S3 bucket has not been destroyed.

Ticket Link

https://mattermost.atlassian.net/browse/MM-62075

This command is making the whole {comparison, deployment} destroy to fail when there is an S3 bucket deployed. That error is valid in the local context of the aws s3 rm command, but not in the global sense: since we run the aws s3 rm command concurrently with terraform destroy, aws s3 rm fails when the bucket is destroyed before the command has finished. This is a spurious failure: all resources, including the S3 bucket, have been destroyed (that's exactly why the aws s3 rm command fails, actually), so we can simply skip it. In the case this command would error in any other way that we are not accounting for as of today, the {comparison, deployment} destroy command would still fail, since it would notice that the S3 bucket has not been destroyed.

toninis · 2024-12-04T12:35:29Z

Check https://github.com/mattermost/delivery-platform/pull/222#issuecomment-2517224229

agarciamontoro · 2024-12-04T12:40:50Z

Thanks for taking a look, @toninis! Let me answer your comment here:

I think you should use force_destroy option . Check here

The problem here was that the whole thing was timing out because it was incredibly slow. I don't like the solution of using a concurrent aws s3 rm, but we found it sped things up quite a lot. That said, I believe we've never tried the force_destroy option, so maybe that would solve it as well. I'm a bit concerned about doing that now, because I'd need to run more tests and delay this further.

That's why I'm proposing getting this PR merged. If this one gets accepted, I'll open a ticket to test the force_destroy option and properly test it with time.

agnivade · 2024-12-04T13:57:01Z

deployment/terraform/destroy.go

+			// since it introduces spurious failures when the bucket is
+			// destroyed before the command finishes.
+			// See https://mattermost.atlassian.net/browse/MM-62075
+			_ = t.runAWSCommand(emptyBucketCtx, emptyS3BucketArgs, nil)


Is it possible at all to know what error is returned when the bucket is destroyed before the command finishes, and ignore that exact error rather than all errors? If that is too complicated, this is fine as well. Just curious.

We could manually grep the returned error, looking for something like failed to empty. I don't know if that same string would show up in other errors, though. That would need some more testing.

I actually ran into this today. It seems like the error is NoSuchBucket:

info [2024-12-06 14:26:13.837 +05:30] delete failed: s3://agnivalocallt.s3bucket/users/baw7cx384i8ppdka1h1zj3ntdr/profile.png An error occurred (NoSuchBucket) when calling the DeleteObject operation: The specified bucket does not exist caller="terraform/engine.go:71" info [2024-12-06 14:26:13.851 +05:30] fatal error: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist caller="terraform/engine.go:71"

agarciamontoro · 2024-12-04T15:12:54Z

Oh, it turns out we already haveforce_destroy!

mattermost-load-test-ng/deployment/terraform/assets/cluster.tf

Line 228 in deadfc9

force_destroy = true

It was added way before the original timeout issue was filed and the aws s3 rm fix was added. cc/@toninis

toninis · 2024-12-04T15:16:47Z

Oh, it turns out we already haveforce_destroy!

mattermost-load-test-ng/deployment/terraform/assets/cluster.tf

Line 228 in deadfc9

force_destroy = true

It was added way before the original timeout issue was filed and the aws s3 rm fix was added. cc/@toninis

That answers my question . Because the rm command should not fail with file does not exist unless another delete operation is in place. Perfect 🆗

cpoile

Thanks for the detailed summary!

agarciamontoro added the 2: Dev Review Requires review by a core committer label Dec 4, 2024

agarciamontoro requested review from cpoile and agnivade December 4, 2024 10:52

toninis approved these changes Dec 4, 2024

View reviewed changes

agnivade reviewed Dec 4, 2024

View reviewed changes

agarciamontoro requested a review from agnivade December 4, 2024 15:15

agnivade approved these changes Dec 4, 2024

View reviewed changes

cpoile approved these changes Dec 4, 2024

View reviewed changes

agarciamontoro added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer labels Dec 5, 2024

agarciamontoro merged commit 6a7c726 into master Dec 5, 2024
1 check passed

agarciamontoro deleted the MM-62075.ignore.aws.s3.rm.command branch December 5, 2024 08:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MM-62075: Ignore error from aws s3 rm command #864

MM-62075: Ignore error from aws s3 rm command #864

agarciamontoro commented Dec 4, 2024 •

edited

Loading

toninis commented Dec 4, 2024

agarciamontoro commented Dec 4, 2024

agnivade Dec 4, 2024

agarciamontoro Dec 4, 2024

agnivade Dec 6, 2024

agarciamontoro commented Dec 4, 2024

toninis commented Dec 4, 2024

cpoile left a comment

MM-62075: Ignore error from aws s3 rm command #864

MM-62075: Ignore error from aws s3 rm command #864

Conversation

agarciamontoro commented Dec 4, 2024 • edited Loading

Summary

Ticket Link

toninis commented Dec 4, 2024

agarciamontoro commented Dec 4, 2024

agnivade Dec 4, 2024

Choose a reason for hiding this comment

agarciamontoro Dec 4, 2024

Choose a reason for hiding this comment

agnivade Dec 6, 2024

Choose a reason for hiding this comment

agarciamontoro commented Dec 4, 2024

toninis commented Dec 4, 2024

cpoile left a comment

Choose a reason for hiding this comment

agarciamontoro commented Dec 4, 2024 •

edited

Loading