Commit

docs: clean up (#7)

kartikey-vyas authored Dec 11, 2023
1 parent 1b24337 commit 1064b0c
Showing 7 changed files with 88 additions and 138 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/predict.yml
@@ -44,7 +44,7 @@ jobs:

- name: Get input library
run: |
- aws s3 cp s3://isaura-bucket/reference_library.csv reference_library.csv
+ aws s3 cp s3://precalculations-bucket/reference_library.csv reference_library.csv
- name: Split library by partition variables
run: |
@@ -110,4 +110,4 @@ jobs:
echo "${{ inputs.model-id }} successfully fetched and served"
ersilia api -i $(printf "partition_%04d.csv" $index) -o ../${{ inputs.SHA }}$(printf "_%04d.csv" $numerator)
- aws s3 cp ../${{ inputs.SHA }}$(printf "_%04d.csv" $numerator) s3://isaura-bucket/out/${{ inputs.model-id }}/${{ inputs.SHA }}$(printf "_%04d.csv" $numerator)
+ aws s3 cp ../${{ inputs.SHA }}$(printf "_%04d.csv" $numerator) s3://precalculations-bucket/out/${{ inputs.model-id }}/${{ inputs.SHA }}$(printf "_%04d.csv" $numerator)
83 changes: 82 additions & 1 deletion README.md
@@ -1,20 +1,101 @@
# Ersilia Model Precalculation Pipeline
### A collaboration between [GDI](https://github.com/good-data-institute) and Ersilia

[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?logo=Python&logoColor=white)](https://github.com/psf/black)



This repository contains code and GitHub Actions workflows for precalculating and storing Ersilia model predictions in AWS.

See [CONTRIBUTING.md](CONTRIBUTING.md) to get started working on this repo.

## Using the Batch Inference Pipeline

### Triggering a pipeline run

The "Run Inference in Parallel" workflow can be triggered from the GitHub UI: `Actions` > `Run Inference in Parallel` > `Run workflow`. Then enter the ID of the Ersilia Model Hub model you want to run.
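The same trigger can be issued from the command line with the GitHub CLI. A hypothetical invocation, assuming the workflow defines an input named `model-id` (adjust to the actual input name) and that `gh` is authenticated against this repository:

```shell
# Dispatch the "Run Inference in Parallel" workflow with a model ID input.
gh workflow run "Run Inference in Parallel" -f model-id=eos92sw
```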

### Querying the precalculation database

Predictions are written to DynamoDB, where they can be retrieved via the precalculations API endpoint. The endpoint URL can be found in the API Gateway console on AWS.

To query the endpoint, we need:

1. an API key
2. the Ersilia model ID of the desired model
3. the InChIKey(s) of the desired inputs


The request body has the following schema:
```
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "modelId": {
      "type": "string"
    },
    "inputKeyArray": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": ["modelId", "inputKeyArray"]
}
```
Example:
```
{
  "modelId": "eos92sw",
  "inputKeyArray": [
    "PCQFQFRJSWBMEL-UHFFFAOYSA-N",
    "MRSBJIAZTHGJAP-UHFFFAOYSA-N"
  ]
}
```
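A minimal sketch of querying the endpoint from Python. This assumes the API key is passed in an `x-api-key` header (the usual API Gateway convention) and that the endpoint URL has been copied from the API Gateway console; neither detail is confirmed by this README, so adjust to match the deployed API.

```python
import json
import urllib.request


def build_request_body(model_id: str, input_keys: list[str]) -> bytes:
    """Serialise a request body matching the schema above."""
    return json.dumps({"modelId": model_id, "inputKeyArray": input_keys}).encode()


def query_precalculations(endpoint_url: str, api_key: str,
                          model_id: str, input_keys: list[str]) -> dict:
    """POST the query and return the decoded JSON response."""
    req = urllib.request.Request(
        endpoint_url,
        data=build_request_body(model_id, input_keys),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```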

## Architecture and Cloud Infrastructure

![architecture diagram](docs/architecture-diagram.png)

Key components:
- inference and serving compute: GitHub Actions workers
- prediction bulk storage: S3 bucket
- prediction database: DynamoDB
- serverless API: Lambda + API Gateway

All AWS components are managed via IaC with [AWS CDK](https://aws.amazon.com/cdk/). See [infra/precalculator](infra/precalculator/README.md) for details on how to validate and deploy infrastructure for this project.
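The typical CDK loop for validating and deploying the stacks looks roughly like this (run from `infra/precalculator`, assuming the CDK CLI is installed and AWS credentials are configured; see that README for the authoritative steps):

```shell
cdk synth    # render CloudFormation templates and catch errors early
cdk diff     # compare against the currently deployed stack
cdk deploy   # apply the changes
```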

## Github Actions Workflows

### Prediction

During this workflow, we call the Ersilia Model Hub for a given model ID and generate predictions on the reference library. The predictions are saved as CSV files in S3.

This works by pulling the [Ersilia Model Hub](https://github.com/ersilia-os/ersilia) onto a GitHub Ubuntu worker and running inference for a slice of the reference library. Predictions are saved to S3 via the AWS CLI.
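A sketch of what each worker runs, based on the workflow steps above; the model ID and file names are illustrative:

```shell
ersilia fetch eos92sw                                  # pull the model from the hub
ersilia serve eos92sw                                  # start a local model server
ersilia api -i partition_0001.csv -o predictions.csv   # run inference on one slice
aws s3 cp predictions.csv s3://precalculations-bucket/out/eos92sw/predictions.csv
```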

### Serving

This workflow reads the generated predictions from S3, validates and formats the data, then finally writes it in batches to DynamoDB.

This uses the Python package `precalculator` developed in this repo. The package includes:

- validation of input data with `pydantic` and `pandera`
- testing with `pytest`
- batch writing to DynamoDB with `boto3`
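A minimal sketch of the batch-writing step (not the actual `precalculator` API — record fields and table attribute names here are hypothetical). DynamoDB's `BatchWriteItem` accepts at most 25 items per request; `boto3`'s `batch_writer` handles that chunking and retries automatically, but the limit is shown explicitly for illustration:

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    model_id: str
    input_key: str   # InChIKey of the input molecule
    output: float


def to_batches(items: list, size: int = 25) -> list[list]:
    """Chunk items into DynamoDB-sized batches (max 25 per BatchWriteItem)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_predictions(table_name: str, predictions: list[Prediction]) -> None:
    """Hypothetical writer; requires boto3 and AWS credentials."""
    import boto3  # deferred import so the sketch loads without AWS configured
    table = boto3.resource("dynamodb").Table(table_name)
    with table.batch_writer() as writer:  # batches and retries automatically
        for p in predictions:
            writer.put_item(Item={
                "ModelId": p.model_id,
                "InputKey": p.input_key,
                "Output": str(p.output),  # DynamoDB numbers are sent as strings
            })
```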

### Full Precalculation Pipeline

The full pipeline calls the predict and serve actions in sequence. Both jobs are parallelised across up to 50 workers, since both are compute-intensive.

`predict-parallel.yml` implements this full pipeline ("Run Inference in Parallel") in a way that avoids the 6-hour timeout limit on individual workflows.
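In GitHub Actions terms, this fan-out is naturally expressed with a matrix strategy. A hypothetical sketch (job and input names are illustrative, not the actual workflow file):

```yaml
jobs:
  predict:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 50
      matrix:
        partition: [0, 1, 2]   # in practice, one entry per library slice
    steps:
      - run: echo "running inference on partition ${{ matrix.partition }}"
```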


---

##### A collaboration between [GDI](https://github.com/good-data-institute) and [Ersilia](https://github.com/ersilia-os)

<div id="top"></div>
<img src="https://avatars.githubusercontent.com/u/75648991?s=200&v=4" height="50" style="margin-right: 20px">
<img src="https://raw.githubusercontent.com/ersilia-os/ersilia/master/assets/Ersilia_Plum.png" height="50">
Binary file added docs/architecture-diagram.png
4 changes: 2 additions & 2 deletions infra/precalculator/README.md
@@ -1,7 +1,7 @@

- # Welcome to your CDK Python project!
+ # Ersilia Precalculator - CDK Python project

- This is a blank project for CDK development with Python.
+ This is a project for CDK development with Python.

The `cdk.json` file tells the CDK Toolkit how to execute your app.

104 changes: 0 additions & 104 deletions workflows/run-inference-instance.yml

This file was deleted.

29 changes: 0 additions & 29 deletions workflows/run-inference-pipeline.yml

This file was deleted.

@@ -1,3 +1,5 @@
# ARCHIVED - replaced by predict-parallel.yml

name: Run Full Pre-calculation Pipeline
on:
workflow_dispatch:
