
fix: improve rescan image task to prevent too many requests #501

Open
wants to merge 4 commits into main

Conversation

youngjun0627

  • Problem

If we use a public Docker registry (not the stable images in cr.backend.ai), we request image info from Docker Hub. During execution, Docker Hub drops our connection in the rescan-image task because too many requests occur. The following output shows the problem.

Traceback (most recent call last):
  File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/cli/etcd.py", line 293, in _impl
    await shared_config.rescan_images(registry)
  File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/config.py", line 681, in rescan_images
    tg.create_task(scanner.rescan_single_registry(reporter))
  File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiotools/taskgroup.py", line 199, in __aexit__
    raise me from None
ai.backend.common.logging.PickledException: TaskGroupError('unhandled errors in a TaskGroup; 1 sub errors: (TaskGroupError)\n + TaskGroupError: unhandled errors in a TaskGroup; 1 sub errors: (TaskGroupError)\n + TaskGroupError: unhandled errors in a TaskGroup; 2 sub errors: (ClientResponseError)\n + ClientResponseError: 429, message=\'Too Many Requests\', url=URL(\'https://registry-1.docker.io/v2/lablup/kernel-nodejs/manifests/10-alpine\')\n |   File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/container_registry/base.py", line 160, in _scan_tag\n |     resp.raise_for_status()\n |   File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1004, in raise_for_status\n |     raise ClientResponseError(\n\n\n + ClientResponseError: 429, message=\'Too Many Requests\', url=URL(\'https://registry-1.docker.io/v2/lablup/kernel-nodejs/manifests/6-alpine\')\n |   File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/container_registry/base.py", line 160, in _scan_tag\n |     resp.raise_for_status()\n |   File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1004, in raise_for_status\n |     raise ClientResponseError(\n\n\n |   File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/container_registry/base.py", line 139, in _scan_image\n |     tg.create_task(self._scan_tag(sess, rqst_args, image, tag))\n |   File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiotools/taskgroup.py", line 199, in __aexit__\n |     raise me from None\n\n\n |   File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/container_registry/base.py", line 95, in rescan_single_registry\n |     tg.create_task(self._scan_image(sess, image))\n |   File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiotools/taskgroup.py", line 199, in __aexit__\n |     raise me from None\n\n')
(base) uchanlee@iyuchan-ui-MacBookPro manager % backend.ai mgr etcd put config/docker/registry/cr.lablup.ai "https://registry-1.docker.io"


  • Solution
    So I inserted asyncio.sleep() into the fetch loop; a sketch of the idea follows below. This solved the problem, but I suspect a better approach exists. Any other ideas are welcome!
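
For illustration only, here is a minimal sketch of the sleep-based workaround (not the actual patch): each manifest request is followed by a fixed pause so that calls to Docker Hub are spread out instead of fired all at once. The function name, the 3-second interval, and the omission of auth headers are all assumptions of the sketch.

import asyncio
import aiohttp

async def scan_tags(base_url: str, image: str, tags: list[str]) -> None:
    # Fetch each tag's manifest sequentially, pausing between requests
    # so that Docker Hub's rate limit is not tripped.
    async with aiohttp.ClientSession() as sess:
        for tag in tags:
            url = f"{base_url}/v2/{image}/manifests/{tag}"
            async with sess.get(url) as resp:
                resp.raise_for_status()
                await resp.json(content_type=None)  # manifest media type is not application/json
            await asyncio.sleep(3)  # arbitrary pause; criticized in the review below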

Member

@achimnol achimnol left a comment


Sleeping 3 seconds is too arbitrary.
How about using asyncio.Semaphore?
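
As a sketch of this suggestion (assumed structure, not the actual patch), an asyncio.Semaphore bounds how many manifest requests are in flight at once; the limit of 5 and the function names are illustrative:

import asyncio
import aiohttp

async def scan_all_tags(urls: list[str]) -> None:
    # Allow at most 5 concurrent manifest requests; the rest wait their turn.
    sema = asyncio.Semaphore(5)

    async def scan_one(sess: aiohttp.ClientSession, url: str) -> None:
        async with sema:
            async with sess.get(url) as resp:
                resp.raise_for_status()
                await resp.json(content_type=None)

    async with aiohttp.ClientSession() as sess:
        await asyncio.gather(*(scan_one(sess, u) for u in urls))

Note that a semaphore alone caps concurrency but not the request rate: 5 fast requests at a time can still exceed Docker Hub's per-minute quota.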

@achimnol
Member

achimnol commented Mar 2, 2022

Let's rewrite this using https://aiolimiter.readthedocs.io/en/latest/ to limit both the number of concurrent requests at any moment and the number of requests within a unit period.
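
A sketch of the suggested combination, assuming illustrative limits: aiolimiter's AsyncLimiter caps the request rate (here, at most 100 requests per 60-second window), while a semaphore caps concurrency.

import asyncio
import aiohttp
from aiolimiter import AsyncLimiter

async def scan_all_tags(urls: list[str]) -> None:
    rate_limit = AsyncLimiter(100, 60.0)  # at most 100 requests per 60-second window
    sema = asyncio.Semaphore(10)          # at most 10 requests in flight at once

    async def scan_one(sess: aiohttp.ClientSession, url: str) -> None:
        async with sema, rate_limit:
            async with sess.get(url) as resp:
                resp.raise_for_status()
                await resp.json(content_type=None)

    async with aiohttp.ClientSession() as sess:
        await asyncio.gather(*(scan_one(sess, u) for u in urls))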

@codecov

codecov bot commented Mar 3, 2022

Codecov Report

Patch coverage has no change and project coverage change: -0.01% ⚠️

Comparison is base (aa580cc) 48.87% compared to head (b2faf6c) 48.86%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #501      +/-   ##
==========================================
- Coverage   48.87%   48.86%   -0.01%     
==========================================
  Files          54       54              
  Lines        9025     9024       -1     
==========================================
- Hits         4411     4410       -1     
  Misses       4614     4614              

see 1 file with indirect coverage changes



* Also apply different rate limit configs for Docker Hub
@achimnol
Member

achimnol commented Mar 3, 2022

We need to make the rate limiter instances persistent to cope with multiple rescan requests.
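
A minimal sketch of that idea within a single manager process: keep limiter instances at module level, keyed by registry host, so that repeated rescans share the same rate-limit state instead of each creating a fresh limiter. The per-registry rates (stricter for Docker Hub, per the commit above) are illustrative.

from aiolimiter import AsyncLimiter

_rate_limiters: dict[str, AsyncLimiter] = {}

def get_rate_limiter(registry_host: str) -> AsyncLimiter:
    # Reuse one limiter per registry host for the lifetime of the process.
    if registry_host not in _rate_limiters:
        if registry_host == "registry-1.docker.io":
            _rate_limiters[registry_host] = AsyncLimiter(100, 60.0)
        else:
            _rate_limiters[registry_host] = AsyncLimiter(1000, 60.0)
    return _rate_limiters[registry_host]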

@achimnol
Member

achimnol commented Mar 4, 2022

This requires the following design to cope with the HA setup (multi-node, multi-process architecture) of managers:

  • The image rescan operation should be run inside a global lock per registry URL.
    • If it is difficult to define individual global locks for different registry URLs, we could trade off cross-registry concurrency and use a single global lock.
  • The rate limiter state must be saved to and loaded from a file so that restarting the manager preserves the rate-limiting counters. We could use pickle and /tmp for a simple implementation (a sketch follows below).
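
A sketch of the persistence idea under the stated assumptions (pickle and /tmp). Since AsyncLimiter holds asyncio primitives that do not pickle cleanly, this snapshots a plain token-bucket state instead, which would be used to re-seed a limiter after restart; all names here are hypothetical.

import pickle
import time
from dataclasses import dataclass
from pathlib import Path

@dataclass
class BucketState:
    level: float       # capacity currently consumed from the bucket
    last_check: float  # timestamp of the last bucket update

STATE_DIR = Path("/tmp/backend.ai-rate-limits")

def save_state(registry_host: str, state: BucketState) -> None:
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    with open(STATE_DIR / f"{registry_host}.pickle", "wb") as f:
        pickle.dump(state, f)

def load_state(registry_host: str) -> BucketState:
    try:
        with open(STATE_DIR / f"{registry_host}.pickle", "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return BucketState(level=0.0, last_check=time.time())

This does not address the cross-node global lock, which would need a distributed primitive (e.g. the etcd the manager already uses) rather than a local file.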

* TODO: we need a per-registry global lock support