Restructured the fetch stage to be a Composite Beam transform and separated the request & download stage of fetching #148

mahrsee1997 · 2022-04-20T20:29:28Z

No description provided.


alxmrs

Overall, the pipeline improvements look good! Here's my early feedback for the current draft.

alxmrs · 2022-04-20T21:15:18Z

weather_dl/download_pipeline/clients.py

+    def fetch(self, dataset: str, selection: t.Dict) -> None:
+        pass
+
+    def download(self, dataset: str, result: t.Dict, output: str) -> None:
+        pass


These should raise a NotImplementedError.
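To illustrate the suggestion, here is a minimal sketch of a base class whose `fetch` and `download` methods fail loudly instead of silently passing. The class names `Client` and `FakeClient` are stand-ins for illustration, not necessarily the names used in `clients.py`.

```python
import typing as t


class Client:
    """Hypothetical stand-in for the abstract client base class."""

    def fetch(self, dataset: str, selection: t.Dict) -> None:
        # Raising makes a missing override visible at call time.
        raise NotImplementedError('Subclasses must implement fetch().')

    def download(self, dataset: str, result: t.Dict, output: str) -> None:
        raise NotImplementedError('Subclasses must implement download().')


class FakeClient(Client):
    """Overrides only fetch(), so download() still raises."""

    def fetch(self, dataset: str, selection: t.Dict) -> None:
        return None
```

With `pass`, calling `FakeClient().download(...)` would return `None` and the pipeline would carry on as if the download had succeeded.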

alxmrs · 2022-04-20T21:16:06Z

weather_dl/download_pipeline/clients.py

@@ -152,6 +170,90 @@ def __exit__(self, exc_type, exc_value, traceback):
        self._redirector.__exit__(exc_type, exc_value, traceback)


+class APIRequestExtended(api.APIRequest):


How about we include "MARS" in the name of this class?

Naming discussion. I kind of like "extended", but is there a better name? What about "Adapter"? Any other ideas?

On second thought, adding "MARS" to the name of the class may not be necessary.

Named SplitMARSRequest.

alxmrs · 2022-04-20T21:16:25Z

weather_dl/download_pipeline/clients.py

+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)


You can omit the constructor if it just calls super.
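A small sketch of why the constructor is redundant: when a subclass defines no `__init__`, Python uses the parent's. `Base` and `Extended` are hypothetical names, not the classes in this PR.

```python
class Base:
    def __init__(self, url: str, verbose: bool = False):
        self.url = url
        self.verbose = verbose


class Extended(Base):
    # No __init__ here: Base.__init__ is inherited and receives the
    # same positional and keyword arguments.
    def describe(self) -> str:
        return f'{self.url} (verbose={self.verbose})'


request = Extended('https://example.invalid', verbose=True)
```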

alxmrs · 2022-04-20T21:21:43Z

weather_dl/download_pipeline/clients.py

+            tries = 0
+            while size != result["size"] and tries < 10:
+                size = self._transfer(
+                    urljoin(self.url, result["href"]), target, result["size"]
+                )
+                if size != result["size"] and tries < 10:
+                    tries += 1
+                    self.log("Transfer interrupted, resuming in 60s...")
+                    time.sleep(60)
+                else:
+                    break


Ideally, we could update or replace this block with code that is capable of downloading more than 1 MB of data at a time. It'd be nice if we could do something like this, for example:

weather-tools/weather_mv/loader_pipeline/sinks.py

Line 90 in 4aa9819

shutil.copyfileobj(source_file, dest_file, DEFAULT_READ_BUFFER_SIZE)

Yes, the necessary code changes have been made.
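For reference, a minimal sketch of the `shutil.copyfileobj` approach with an explicit buffer size. The value of `READ_BUFFER_SIZE` is an assumption for illustration; the constant in `sinks.py` may be different.

```python
import io
import shutil

# Assumed value for illustration only.
READ_BUFFER_SIZE = 16 * 1024 * 1024  # 16 MiB


def copy_stream(source_file, dest_file, buffer_size=READ_BUFFER_SIZE):
    """Copy between binary file objects in chunks of `buffer_size` bytes."""
    shutil.copyfileobj(source_file, dest_file, buffer_size)


# In-memory streams stand in for the HTTP response and the target file.
src = io.BytesIO(b'x' * 1000)
dst = io.BytesIO()
copy_stream(src, dst, buffer_size=256)
```

The third argument is the chunk length passed to each `read()` call, so a larger value means fewer read and write calls per file.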

alxmrs · 2022-04-20T21:22:33Z

weather_dl/download_pipeline/clients.py

+                open(target, "w").close()
+
+            size = -1
+            tries = 0


Following up with the comment below: We could use Beam's retry logic with exponential backoff (via the decorator) instead of re-using ECMWF's implementation. WDYT?

Yes, the necessary code changes have been made.
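As a rough illustration of the idea, the sketch below replaces the fixed 60-second sleep loop with a decorator that doubles the delay after each failure. It is a standard-library stand-in, not Beam's decorator or the project's `retry_with_exponential_backoff`; the name `with_backoff` and its parameters are hypothetical.

```python
import functools
import time


def with_backoff(num_retries=3, initial_delay_secs=1.0, factor=2.0, sleep=time.sleep):
    """Retry the wrapped function, multiplying the delay by `factor` each time."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = initial_delay_secs
            for attempt in range(num_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == num_retries:
                        raise  # Out of retries: surface the last error.
                    sleep(delay)
                    delay *= factor
        return wrapper
    return decorator


# Record the delays instead of sleeping so the example runs instantly.
delays = []
calls = {'n': 0}


@with_backoff(num_retries=3, initial_delay_secs=1.0, sleep=delays.append)
def flaky_transfer():
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('Transfer interrupted')
    return 'done'


outcome = flaky_transfer()
```

Keeping the retry policy in a decorator also leaves the transfer function free of counters and sleeps.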

alxmrs · 2022-04-20T21:34:44Z

weather_dl/download_pipeline/fetcher.py

@@ -78,22 +171,30 @@ def fetch_data(self, config: Config, *, worker_name: str = 'default') -> None:

        with self.manifest.transact(config.selection, target, config.user_id):
            with tempfile.NamedTemporaryFile() as temp:
-                logger.info(f'[{worker_name}] Fetching data for {target!r}.')
+                logger.info(f'[{worker_name}] Fetching and Downloading data for {target!r}.')


Petty nit: can you use "&" instead of "and" here? :)

alxmrs · 2022-04-20T21:39:27Z

weather_dl/download_pipeline/fetcher.py

+            with tempfile.NamedTemporaryFile() as temp:
+                logger.info(f'[{worker_name}] Fetching data for {target!r}.')
+                result = self.fetch(client, config.dataset, config.selection)
+                yield (result, config, worker_name, temp.name, target)


Consider creating an internal data class for passing along this result.

Also: Are you sure that using the temp.name is safe? It's possible that the temporary file will disappear.

Furthermore, looking at the code: do you need to create a temporary file here? If you don't need it for the fetch, it's probably safer to move this to the next stage.

Maybe we don't need the dataclass, since we can probably simplify what's returned. For example, we don't need to pass the target, since that can be derived from the config. So I see the tuple consisting of three parts: config, worker_name, result.

(you can choose your favorite order for these).
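A sketch of both points, under stated assumptions: the fetch stage yields a three-part tuple, and the temporary file is created only inside the stage that writes to it. `FetchResult`, `fetch_stage`, and `download_stage` are hypothetical names, and a plain dict stands in for the pipeline's `Config`.

```python
import os
import tempfile
import typing as t


class FetchResult(t.NamedTuple):
    config: t.Dict       # Stand-in for the pipeline's Config object.
    worker_name: str
    result: t.Dict


def fetch_stage(config, worker_name='default'):
    # No temporary file here; only the request result moves downstream.
    result = {'href': 'placeholder', 'size': 4}
    yield FetchResult(config, worker_name, result)


def download_stage(item: FetchResult) -> int:
    # The temp file is created and removed inside the stage that uses it.
    with tempfile.NamedTemporaryFile() as temp:
        temp.write(b'\0' * item.result['size'])
        temp.flush()
        return os.path.getsize(temp.name)


item = next(fetch_stage({'dataset': 'placeholder'}))
size = download_stage(item)

# Why passing temp.name downstream is risky: the file is deleted as soon
# as the `with` block that created it exits.
with tempfile.NamedTemporaryFile() as temp:
    stale_name = temp.name
exists_after = os.path.exists(stale_name)
```

A `NamedTuple` still unpacks like the plain tuple suggested above, so `config, worker_name, result = item` works unchanged.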

alxmrs · 2022-04-20T21:40:34Z

weather_dl/download_pipeline/fetcher.py

+        client = CLIENTS[self.client_name](config)
+        target = prepare_target_name(config)
+
+        with self.manifest.transact(config.selection, target, config.user_id):


A complication that I didn't anticipate until now: It would probably be best if we updated the manifest to distinguish between retrieved and downloaded. WDYT?

Made necessary changes. Added a new class variable in DownloadStatus named stage which represents the current stage of the request.
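A minimal sketch of what a stage-aware manifest record could look like. Only the `stage` attribute on `DownloadStatus` comes from the reply above; the `Stage` enum, the other fields, and the `advance` helper are assumptions for illustration.

```python
import dataclasses
import enum
import typing as t


class Stage(enum.Enum):
    """Hypothetical stage names; the real manifest may use others."""
    FETCH = 'fetch'
    DOWNLOAD = 'download'
    UPLOAD = 'upload'


@dataclasses.dataclass
class DownloadStatus:
    location: str                      # Assumed field.
    status: str = 'scheduled'          # Assumed field.
    stage: t.Optional[Stage] = None    # Current stage of the request.


def advance(record: DownloadStatus, stage: Stage, status: str) -> DownloadStatus:
    """Return a copy of the record updated for the given stage."""
    return dataclasses.replace(record, stage=stage, status=status)


record = DownloadStatus(location='placeholder-target')
fetched = advance(record, Stage.FETCH, 'success')
downloading = advance(fetched, Stage.DOWNLOAD, 'in-progress')
```

With a stage recorded next to the status, a retrieved-but-not-downloaded request can be told apart from one that never reached the server.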

alxmrs · 2022-04-20T21:45:09Z

weather_dl/download_pipeline/fetcher.py

+    @retry_with_exponential_backoff
+    def upload(self, src: str, dest: str) -> None:
+        """Upload blob to cloud storage, with retries."""
+        with io.FileIO(src, 'rb') as src_:


can this be with open(src, 'rb')?
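For context, a small sketch comparing the two calls. Both read the same bytes; the built-in `open(src, 'rb')` returns a buffered reader, while `io.FileIO` is the raw, unbuffered layer beneath it. The temporary file is only there to make the example runnable.

```python
import io
import os
import tempfile

# Create a small file to read back.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(b'blob')

with io.FileIO(path, 'rb') as raw:
    raw_type = type(raw).__name__
    raw_bytes = raw.read()

with open(path, 'rb') as buffered:
    buffered_type = type(buffered).__name__
    buffered_bytes = buffered.read()

os.remove(path)
```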

alxmrs · 2022-04-20T21:47:05Z

weather_dl/download_pipeline/pipeline.py

@@ -129,6 +130,8 @@ def run(argv: t.List[str], save_main_session: bool = True) -> PipelineArgs:
                        help='Number of concurrent requests to make per API key. '
                             'Default: make an educated guess per client & config. '
                             'Please see the client documentation for more details.')
+    parser.add_argument('-o', '--optimise-download', action='store_true', default=False,
+                        help="Optimised the downloads.")


Make sure you cross apply the description of what this does here.
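One possible shape for a more descriptive flag, as a sketch only. The help wording below is an assumption based on this PR's title, not the text that was eventually committed.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '-o', '--optimise-download', action='store_true', default=False,
    # Hypothetical wording; replace with the agreed description.
    help='Run the request (fetch) and download steps as separate stages '
         'so that each can be parallelised independently. Default: off.')

args = parser.parse_args(['-o'])
defaults = parser.parse_args([])
```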


I'm taking a leaf from @mahrsee1997's PR #148 so that we can copy data from the MARS server faster (using a larger buffer size). Thanks for the primary contribution here, Rahul.

* restructured the fetch stage to be a Composite Beam transform and separated the request & download stage of fetching
* retry logic of downloads for MARS client & other cosmetic changes.
* Remove fetch / dl split
* retrieve in two steps.
* rm fetch + dl methods.
* Fix: `nim_requests_per_key` does not require class construction.
* fix lint: removed unused import.
* add support for aria2 for faster download
* code changes as per Alex feedback.

Co-authored-by: mahrsee1997 <rahul@infocusp.in>

restructured the fetch stage to be a Composite Beam transform and sep…

dc280a8

…arated the request & download stage of fetching

mahrsee1997 requested a review from alxmrs April 20, 2022 20:29

alxmrs reviewed Apr 20, 2022


mahrsee1997 added 2 commits April 22, 2022 16:31

refactored transfer & retry logic of downloads for MARS client & othe…

398f125

…r cosmetic changes.

added beam.Reshuffle after fetch & download stages for adjusting para…

2c15074

…llelism and updated the manifest.

mahrsee1997 requested a review from alxmrs April 22, 2022 18:39

mahrsee1997 added 5 commits April 25, 2022 19:38

removal of client_name from class attributes of Upload and updated th…

400f56d

…e docstring in RetrieveData

combine the fetch+download step to check the performance

fc7d7b3

added a log at the start of the download operation.

e0df8f7

added beam.Reshuffle after 'EachPartition' transform to adjusting par…

dcf6a3a

…allelism

updated the test cases

9ab0f61

alxmrs mentioned this pull request Sep 19, 2022

Faster data transfers from MARS. #235

Merged
