
fix: improve rescan image task to prevent too many requests #501

Open
wants to merge 4 commits into main

Conversation

youngjun0627

  • Problem

If we use a public Docker registry (not the stable images in cr.backend.ai), we request image info from Docker Hub. During execution, Docker Hub drops our connection in the rescan-image task because too many requests occur. The following output shows the problem.

Traceback (most recent call last):
  File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/cli/etcd.py", line 293, in _impl
    await shared_config.rescan_images(registry)
  File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/config.py", line 681, in rescan_images
    tg.create_task(scanner.rescan_single_registry(reporter))
  File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiotools/taskgroup.py", line 199, in __aexit__
    raise me from None
ai.backend.common.logging.PickledException: TaskGroupError('unhandled errors in a TaskGroup; 1 sub errors: (TaskGroupError)\n + TaskGroupError: unhandled errors in a TaskGroup; 1 sub errors: (TaskGroupError)\n + TaskGroupError: unhandled errors in a TaskGroup; 2 sub errors: (ClientResponseError)\n + ClientResponseError: 429, message=\'Too Many Requests\', url=URL(\'https://registry-1.docker.io/v2/lablup/kernel-nodejs/manifests/10-alpine\')\n |   File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/container_registry/base.py", line 160, in _scan_tag\n |     resp.raise_for_status()\n |   File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1004, in raise_for_status\n |     raise ClientResponseError(\n\n\n + ClientResponseError: 429, message=\'Too Many Requests\', url=URL(\'https://registry-1.docker.io/v2/lablup/kernel-nodejs/manifests/6-alpine\')\n |   File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/container_registry/base.py", line 160, in _scan_tag\n |     resp.raise_for_status()\n |   File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1004, in raise_for_status\n |     raise ClientResponseError(\n\n\n |   File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/container_registry/base.py", line 139, in _scan_image\n |     tg.create_task(self._scan_tag(sess, rqst_args, image, tag))\n |   File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiotools/taskgroup.py", line 199, in __aexit__\n |     raise me from None\n\n\n |   File "/Users/uchanlee/backend.ai/backend.ai-dev/manager/src/ai/backend/manager/container_registry/base.py", line 95, in rescan_single_registry\n |     tg.create_task(self._scan_image(sess, image))\n |   File "/Users/uchanlee/.pyenv/versions/3.9.5/envs/venv-bbmn3abm-manager/lib/python3.9/site-packages/aiotools/taskgroup.py", line 199, in __aexit__\n |     raise me from None\n\n')
(base) uchanlee@iyuchan-ui-MacBookPro manager % backend.ai mgr etcd put config/docker/registry/cr.lablup.ai "https://registry-1.docker.io"


  • Solution
    So I inserted asyncio.sleep() into the fetch loop; a sketch of the idea follows below. This solved the problem, but I suspect a better approach exists. Any other ideas are welcome!
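
For illustration only, here is a minimal sketch of the sleep-based workaround (not the actual patch): each manifest request is followed by a fixed pause so that calls to Docker Hub are spread out instead of fired all at once. The function name, the 3-second interval, and the omission of auth headers are all assumptions of the sketch.

import asyncio
import aiohttp

async def scan_tags(base_url: str, image: str, tags: list[str]) -> None:
    # Fetch each tag's manifest sequentially, pausing between requests
    # so that Docker Hub's rate limit is not tripped.
    async with aiohttp.ClientSession() as sess:
        for tag in tags:
            url = f"{base_url}/v2/{image}/manifests/{tag}"
            async with sess.get(url) as resp:
                resp.raise_for_status()
                await resp.json(content_type=None)  # manifest media type is not application/json
            await asyncio.sleep(3)  # arbitrary pause; criticized in the review below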

Member

@achimnol achimnol left a comment


Sleeping 3 seconds is too arbitrary.
How about using asyncio.Semaphore?
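
As a sketch of this suggestion (assumed structure, not the actual patch), an asyncio.Semaphore bounds how many manifest requests are in flight at once; the limit of 5 and the function names are illustrative:

import asyncio
import aiohttp

async def scan_all_tags(urls: list[str]) -> None:
    # Allow at most 5 concurrent manifest requests; the rest wait their turn.
    sema = asyncio.Semaphore(5)

    async def scan_one(sess: aiohttp.ClientSession, url: str) -> None:
        async with sema:
            async with sess.get(url) as resp:
                resp.raise_for_status()
                await resp.json(content_type=None)

    async with aiohttp.ClientSession() as sess:
        await asyncio.gather(*(scan_one(sess, u) for u in urls))

Note that a semaphore alone caps concurrency but not the request rate: 5 fast requests at a time can still exceed Docker Hub's per-minute quota.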

@achimnol
Member

achimnol commented Mar 2, 2022

Let's rewrite this using https://aiolimiter.readthedocs.io/en/latest/ to limit both the number of concurrent requests at any moment and the number of requests within a unit period.
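
A sketch of the suggested combination, assuming illustrative limits: aiolimiter's AsyncLimiter caps the request rate (here, at most 100 requests per 60-second window), while a semaphore caps concurrency.

import asyncio
import aiohttp
from aiolimiter import AsyncLimiter

async def scan_all_tags(urls: list[str]) -> None:
    rate_limit = AsyncLimiter(100, 60.0)  # at most 100 requests per 60-second window
    sema = asyncio.Semaphore(10)          # at most 10 requests in flight at once

    async def scan_one(sess: aiohttp.ClientSession, url: str) -> None:
        async with sema, rate_limit:
            async with sess.get(url) as resp:
                resp.raise_for_status()
                await resp.json(content_type=None)

    async with aiohttp.ClientSession() as sess:
        await asyncio.gather(*(scan_one(sess, u) for u in urls))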

@codecov

codecov bot commented Mar 3, 2022

Codecov Report

Patch coverage has no change and project coverage change: -0.01% ⚠️

Comparison is base (aa580cc) 48.87% compared to head (b2faf6c) 48.86%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #501      +/-   ##
==========================================
- Coverage   48.87%   48.86%   -0.01%     
==========================================
  Files          54       54              
  Lines        9025     9024       -1     
==========================================
- Hits         4411     4410       -1     
  Misses       4614     4614              

see 1 file with indirect coverage changes



* Also apply different rate limit configs for Docker Hub
@achimnol
Member

achimnol commented Mar 3, 2022

We need to make the rate limiter instances persistent to cope with multiple rescan requests.
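
A minimal sketch of that idea within a single manager process: keep limiter instances at module level, keyed by registry host, so that repeated rescans share the same rate-limit state instead of each creating a fresh limiter. The per-registry rates (stricter for Docker Hub, per the commit above) are illustrative.

from aiolimiter import AsyncLimiter

_rate_limiters: dict[str, AsyncLimiter] = {}

def get_rate_limiter(registry_host: str) -> AsyncLimiter:
    # Reuse one limiter per registry host for the lifetime of the process.
    if registry_host not in _rate_limiters:
        if registry_host == "registry-1.docker.io":
            _rate_limiters[registry_host] = AsyncLimiter(100, 60.0)
        else:
            _rate_limiters[registry_host] = AsyncLimiter(1000, 60.0)
    return _rate_limiters[registry_host]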

@achimnol
Member

achimnol commented Mar 4, 2022

This requires the following design to cope with the HA setup (multi-node, multi-process architecture) of managers:

  • The image rescan operation should be run inside a global lock per registry URL.
    • If it is difficult to define individual global locks for different registry URLs, we could trade off cross-registry concurrency and use a single global lock.
  • The rate limiter state must be saved to and loaded from a file so that restarting the manager preserves the rate-limiting counters. We could use pickle and /tmp for a simple implementation (a sketch follows below).
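
A sketch of the persistence idea under the stated assumptions (pickle and /tmp). Since AsyncLimiter holds asyncio primitives that do not pickle cleanly, this snapshots a plain token-bucket state instead, which would be used to re-seed a limiter after restart; all names here are hypothetical.

import pickle
import time
from dataclasses import dataclass
from pathlib import Path

@dataclass
class BucketState:
    level: float       # capacity currently consumed from the bucket
    last_check: float  # timestamp of the last bucket update

STATE_DIR = Path("/tmp/backend.ai-rate-limits")

def save_state(registry_host: str, state: BucketState) -> None:
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    with open(STATE_DIR / f"{registry_host}.pickle", "wb") as f:
        pickle.dump(state, f)

def load_state(registry_host: str) -> BucketState:
    try:
        with open(STATE_DIR / f"{registry_host}.pickle", "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return BucketState(level=0.0, last_check=time.time())

This does not address the cross-node global lock, which would need a distributed primitive (e.g. the etcd the manager already uses) rather than a local file.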

* TODO: we need a per-registry global lock support