Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MM-62075: Ignore error from aws s3 rm command #864

Merged
merged 1 commit into from
Dec 5, 2024

Conversation

agarciamontoro
Copy link
Member

@agarciamontoro agarciamontoro commented Dec 4, 2024

Summary

When destroying a cluster that deploys an S3 bucket, we try to speed up its deletion by calling aws s3 rm s3://my-bucket concurrently while we run terraform destroy. This was added to fix https://mattermost.atlassian.net/browse/MM-47263, since we were seeing too many timeouts due to the S3 bucket not being properly destroyed in time. Emptying the bucket while other resources are being destroyed has worked to fix the timeout issue, but it introduces a new error: if the bucket is destroyed by Terraform before the aws s3 rm command finishes, this command, and thus the whole {comparison,deployment} destroy command, fails.

This is a spurious failure: all resources, including the S3 bucket, have been destroyed (that's exactly why the aws s3 rm command fails, actually), so we can simply ignore it.

In the case this command would error in any other way that we are not accounting for as of today, the {comparison,deployment} destroy command would still fail, since it would notice that the S3 bucket has not been destroyed.

Ticket Link

https://mattermost.atlassian.net/browse/MM-62075

This command is making the whole {comparison, deployment} destroy to
fail when there is an S3 bucket deployed. That error is valid in the
local context of the aws s3 rm command, but not in the global sense:
since we run the aws s3 rm command concurrently with terraform destroy,
aws s3 rm fails when the bucket is destroyed before the command has
finished.

This is a spurious failure: all resources, including the S3 bucket, have
been destroyed (that's exactly why the aws s3 rm command fails,
actually), so we can simply skip it.

In the case this command would error in any other way that we are
not accounting for as of today, the {comparison, deployment} destroy
command would still fail, since it would notice that the S3 bucket has
not been destroyed.
@agarciamontoro agarciamontoro added the 2: Dev Review Requires review by a core committer label Dec 4, 2024
@toninis
Copy link

toninis commented Dec 4, 2024

@agarciamontoro
Copy link
Member Author

Thanks for taking a look, @toninis! Let me answer your comment here:

I think you should use force_destroy option . Check here

The problem here was that the whole thing was timing out because it was incredibly slow. I don't like the solution of using a concurrent aws s3 rm, but we found it sped things up quite a lot. That said, I believe we've never tried the force_destroy option, so maybe that would solve it as well. I'm a bit concerned about doing that now, because I'd need to run more tests and delay this further.

That's why I'm proposing getting this PR merged. If this one gets accepted, I'll open a ticket to test the force_destroy option and properly test it with time.

// since it introduces spurious failures when the bucket is
// destroyed before the command finishes.
// See https://mattermost.atlassian.net/browse/MM-62075
_ = t.runAWSCommand(emptyBucketCtx, emptyS3BucketArgs, nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible at all to know what error is returned when the bucket is destroyed before the command finishes, and ignore that exact error rather than all errors? If that is too complicated, this is fine as well. Just curious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could manually grep the returned error, looking for something like failed to empty. I don't know if that same string would show up in other errors, though. That would need some more testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually ran into this today. It seems like the error is NoSuchBucket:

info  [2024-12-06 14:26:13.837 +05:30] delete failed: s3://agnivalocallt.s3bucket/users/baw7cx384i8ppdka1h1zj3ntdr/profile.png An error occurred (NoSuchBucket) when calling the DeleteObject operation: The specified bucket does not exist caller="terraform/engine.go:71"
info  [2024-12-06 14:26:13.851 +05:30] fatal error: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist caller="terraform/engine.go:71"

@agarciamontoro
Copy link
Member Author

Oh, it turns out we already haveforce_destroy!

It was added way before the original timeout issue was filed and the aws s3 rm fix was added. cc/@toninis

@toninis
Copy link

toninis commented Dec 4, 2024

Oh, it turns out we already haveforce_destroy!

It was added way before the original timeout issue was filed and the aws s3 rm fix was added. cc/@toninis

That answers my question . Because the rm command should not fail with file does not exist unless another delete operation is in place. Perfect 🆗

Copy link
Member

@cpoile cpoile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed summary!

@agarciamontoro agarciamontoro added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer labels Dec 5, 2024
@agarciamontoro agarciamontoro merged commit 6a7c726 into master Dec 5, 2024
1 check passed
@agarciamontoro agarciamontoro deleted the MM-62075.ignore.aws.s3.rm.command branch December 5, 2024 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4: Reviews Complete All reviewers have approved the pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants