Commit

docs: clean up (#7)

kartikey-vyas authored Dec 11, 2023
1 parent 1b24337 commit 1064b0c
Showing 7 changed files with 88 additions and 138 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/predict.yml
@@ -44,7 +44,7 @@ jobs:

- name: Get input library
run: |
- aws s3 cp s3://isaura-bucket/reference_library.csv reference_library.csv
+ aws s3 cp s3://precalculations-bucket/reference_library.csv reference_library.csv
- name: Split library by partition variables
run: |
@@ -110,4 +110,4 @@ jobs:
echo "${{ inputs.model-id }} successfully fetched and served"
ersilia api -i $(printf "partition_%04d.csv" $index) -o ../${{ inputs.SHA }}$(printf "_%04d.csv" $numerator)
- aws s3 cp ../${{ inputs.SHA }}$(printf "_%04d.csv" $numerator) s3://isaura-bucket/out/${{ inputs.model-id }}/${{ inputs.SHA }}$(printf "_%04d.csv" $numerator)
+ aws s3 cp ../${{ inputs.SHA }}$(printf "_%04d.csv" $numerator) s3://precalculations-bucket/out/${{ inputs.model-id }}/${{ inputs.SHA }}$(printf "_%04d.csv" $numerator)
83 changes: 82 additions & 1 deletion README.md
@@ -1,20 +1,101 @@
# Ersilia Model Precalculation Pipeline
### A collaboration between [GDI](https://github.com/good-data-institute) and Ersilia

[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?logo=Python&logoColor=white)](https://github.com/psf/black)



This repository contains code and GitHub Actions workflows for precalculating and storing Ersilia model predictions in AWS.

See [CONTRIBUTING.md](CONTRIBUTING.md) to get started working on this repo.

## Using the Batch Inference Pipeline

### Triggering a pipeline run

The "Run Inference in Parallel" workflow can be triggered from the GitHub UI: `Actions` > `Run Inference in Parallel` > `Run workflow`. Then enter the ID of the Ersilia Model Hub model you want to run.
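The same trigger can be issued from the command line with the GitHub CLI. A hypothetical invocation, assuming the workflow defines an input named `model-id` (adjust to the actual input name) and that `gh` is authenticated against this repository:

```shell
# Dispatch the "Run Inference in Parallel" workflow with a model ID input.
gh workflow run "Run Inference in Parallel" -f model-id=eos92sw
```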

### Querying the precalculation database

Predictions are written to DynamoDB, where they can be retrieved via the precalculations API endpoint. The endpoint URL can be found in the API Gateway console on AWS.

To query the endpoint, we need:

1. an API key
2. the Ersilia model ID of the desired model
3. the InChIKey(s) of the desired inputs


The request body has the following schema:
```
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "modelId": {
      "type": "string"
    },
    "inputKeyArray": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": ["modelId", "inputKeyArray"]
}
```
Example:
```
{
  "modelId": "eos92sw",
  "inputKeyArray": [
    "PCQFQFRJSWBMEL-UHFFFAOYSA-N",
    "MRSBJIAZTHGJAP-UHFFFAOYSA-N"
  ]
}
```
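A minimal sketch of querying the endpoint from Python. This assumes the API key is passed in an `x-api-key` header (the usual API Gateway convention) and that the endpoint URL has been copied from the API Gateway console; neither detail is confirmed by this README, so adjust to match the deployed API.

```python
import json
import urllib.request


def build_request_body(model_id: str, input_keys: list[str]) -> bytes:
    """Serialise a request body matching the schema above."""
    return json.dumps({"modelId": model_id, "inputKeyArray": input_keys}).encode()


def query_precalculations(endpoint_url: str, api_key: str,
                          model_id: str, input_keys: list[str]) -> dict:
    """POST the query and return the decoded JSON response."""
    req = urllib.request.Request(
        endpoint_url,
        data=build_request_body(model_id, input_keys),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```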

## Architecture and Cloud Infrastructure

![architecture diagram](docs/architecture-diagram.png)

Key components:
- inference and serving compute: GitHub Actions workers
- prediction bulk storage: S3 bucket
- prediction database: DynamoDB
- serverless API: Lambda + API Gateway

All AWS components are managed via IaC with [AWS CDK](https://aws.amazon.com/cdk/). See [infra/precalculator](infra/precalculator/README.md) for details on how to validate and deploy infrastructure for this project.
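The typical CDK loop for validating and deploying the stacks looks roughly like this (run from `infra/precalculator`, assuming the CDK CLI is installed and AWS credentials are configured; see that README for the authoritative steps):

```shell
cdk synth    # render CloudFormation templates and catch errors early
cdk diff     # compare against the currently deployed stack
cdk deploy   # apply the changes
```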

## Github Actions Workflows

### Prediction

During this workflow, we call the Ersilia Model Hub for a given model ID and generate predictions on the reference library. The predictions are saved as CSV files in S3.

This works by pulling the [Ersilia Model Hub](https://github.com/ersilia-os/ersilia) onto a GitHub Ubuntu worker and running inference for a slice of the reference library. Predictions are saved to S3 via the AWS CLI.
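A sketch of what each worker runs, based on the workflow steps above; the model ID and file names are illustrative:

```shell
ersilia fetch eos92sw                                  # pull the model from the hub
ersilia serve eos92sw                                  # start a local model server
ersilia api -i partition_0001.csv -o predictions.csv   # run inference on one slice
aws s3 cp predictions.csv s3://precalculations-bucket/out/eos92sw/predictions.csv
```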

### Serving

This workflow reads the generated predictions from S3, validates and formats the data, then finally writes it in batches to DynamoDB.

This uses the Python package `precalculator` developed in this repo. The package includes:

- validation of input data with `pydantic` and `pandera`
- testing with `pytest`
- batch writing to DynamoDB with `boto3`
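A minimal sketch of the batch-writing step (not the actual `precalculator` API — record fields and table attribute names here are hypothetical). DynamoDB's `BatchWriteItem` accepts at most 25 items per request; `boto3`'s `batch_writer` handles that chunking and retries automatically, but the limit is shown explicitly for illustration:

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    model_id: str
    input_key: str   # InChIKey of the input molecule
    output: float


def to_batches(items: list, size: int = 25) -> list[list]:
    """Chunk items into DynamoDB-sized batches (max 25 per BatchWriteItem)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_predictions(table_name: str, predictions: list[Prediction]) -> None:
    """Hypothetical writer; requires boto3 and AWS credentials."""
    import boto3  # deferred import so the sketch loads without AWS configured
    table = boto3.resource("dynamodb").Table(table_name)
    with table.batch_writer() as writer:  # batches and retries automatically
        for p in predictions:
            writer.put_item(Item={
                "ModelId": p.model_id,
                "InputKey": p.input_key,
                "Output": str(p.output),  # DynamoDB numbers are sent as strings
            })
```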

### Full Precalculation Pipeline

The full pipeline calls the predict and serve actions in sequence. Both jobs are parallelised across up to 50 workers, since both are compute-intensive.

`predict-parallel.yml` implements this full pipeline ("Run Inference in Parallel") in a way that avoids the 6-hour timeout limit on individual workflows.
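In GitHub Actions terms, this fan-out is naturally expressed with a matrix strategy. A hypothetical sketch (job and input names are illustrative, not the actual workflow file):

```yaml
jobs:
  predict:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 50
      matrix:
        partition: [0, 1, 2]   # in practice, one entry per library slice
    steps:
      - run: echo "running inference on partition ${{ matrix.partition }}"
```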


---

##### A collaboration between [GDI](https://github.com/good-data-institute) and [Ersilia](https://github.com/ersilia-os)

<div id="top"></div>
<img src="https://avatars.githubusercontent.com/u/75648991?s=200&v=4" height="50" style="margin-right: 20px">
<img src="https://raw.githubusercontent.com/ersilia-os/ersilia/master/assets/Ersilia_Plum.png" height="50">
Binary file added docs/architecture-diagram.png
4 changes: 2 additions & 2 deletions infra/precalculator/README.md
@@ -1,7 +1,7 @@

- # Welcome to your CDK Python project!
+ # Ersilia Precalculator - CDK Python project

- This is a blank project for CDK development with Python.
+ This is a project for CDK development with Python.

The `cdk.json` file tells the CDK Toolkit how to execute your app.

104 changes: 0 additions & 104 deletions workflows/run-inference-instance.yml

This file was deleted.

29 changes: 0 additions & 29 deletions workflows/run-inference-pipeline.yml

This file was deleted.

@@ -1,3 +1,5 @@
# ARCHIVED - replaced by predict-parallel.yml

name: Run Full Pre-calculation Pipeline
on:
workflow_dispatch:
