All commands from this README should be run from the project's root directory.
Start the dev server for local development:
sportal up
Run a command inside the docker container:
sportal run-api [command]
Or run the tests:
sportal test-api
Note that the tests are run with the Django unittest runner, so specific modules, classes, or methods may be specified in the standard unittest manner: https://docs.python.org/3/library/unittest.html#unittest-test-discovery. For example:
sportal test-api scpca_portal.test.serializers.test_project.TestProjectSerializer
will run all the tests in the TestProjectSerializer class.
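You can narrow the run further; for example (the module path is taken from the example above, while the test method name is purely illustrative):
# run every test in one module
sportal test-api scpca_portal.test.serializers.test_project
# run a single (hypothetical) test method
sportal test-api scpca_portal.test.serializers.test_project.TestProjectSerializer.test_example_method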
See
sportal -h
for more commands.
By default, the dev server runs on port 8000 and the docs are served on port 8001. If these ports are already in use on your local machine, you can run them on different ports with:
HTTP_PORT=8002 DOCS_PORT=8003 sportal up
A postgres command line client can be started by running:
sportal postgres-cli
With the dev server running, you can make a curl request to the API like so:
curl http://0.0.0.0:8000/v1/projects/
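If you would like the response pretty-printed, you can pipe it through a JSON formatter (assuming jq is installed locally):
curl -s http://0.0.0.0:8000/v1/projects/ | jq .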
Computed files won't provide a download_url unless an API token is provided.
To get and activate an API token, make a request similar to:
curl http://0.0.0.0:8000/v1/tokens/ -X POST -d '{"is_activated": true}' -H "Content-Type: application/json"
which should return something like:
{
"id": "30e429fd-ded5-4c7d-84a7-84c702f596c1",
"is_activated": true,
"terms_and_conditions": "PLACEHOLDER"
}
This id can then be provided as the value for the API-KEY header in a request to the /v1/computed-files/ endpoint like so:
curl http://0.0.0.0:8000/v1/computed-files/1/ -H 'API-KEY: 658f859a-b9d0-4b44-be3d-dad9db57164a'
Note that a download_url can only be retrieved for one ComputedFile at a time.
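Putting the two requests together, here is a minimal sketch that creates a token and uses it in one go (it assumes jq is installed and that a computed file with id 1 exists locally):
# create and activate a token, capturing its id
TOKEN_ID=$(curl -s http://0.0.0.0:8000/v1/tokens/ -X POST -d '{"is_activated": true}' -H "Content-Type: application/json" | jq -r '.id')
# request a single computed file, which should now include a download_url
curl http://0.0.0.0:8000/v1/computed-files/1/ -H "API-KEY: $TOKEN_ID"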
Before data can be processed, the OriginalFile table must be populated and synced via the sync-original-files command. This command builds a local representation of all objects available in the default (or passed) S3 input bucket, and is considered the single source of truth for input files throughout the codebase.
Syncing is carried out as follows:
sportal sync-original-files
By default, the sync-original-files command uses the default bucket defined in the config file associated with the environment calling the command. This can be overridden by passing the --bucket bucket-name flag to sync the files of an alternative bucket.
In the rare case where all files have been deleted from the requested bucket, the --allow-bucket-wipe flag must be explicitly passed in order for all of that bucket's files in the OriginalFile table to be wiped.
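For example, to sync against a non-default bucket (the bucket name here is a placeholder), with or without allowing a wipe:
# sync original files from an alternative bucket
sportal sync-original-files --bucket my-alternative-input-bucket
# additionally allow OriginalFile records to be wiped if the bucket has been emptied
sportal sync-original-files --bucket my-alternative-input-bucket --allow-bucket-wipe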
There are two independent workflows carried out within the data processing pipeline:
- Loading metadata and populating the database
- Generating computed files and populating s3
To exclusively run the load metadata workflow, call:
sportal load-metadata
To exclusively run the generate computed files workflow, call:
sportal generate-computed-files
To run both workflows successively, call:
sportal load-data
Calling just sportal load-data will populate your local database by pulling metadata from the scpca-portal-inputs bucket and generate computed files locally. To save time, by default it will not package up the actual data in that bucket and upload it to scpca-local-data.
If you would like to update the data in the scpca-local-data bucket, you can do so with the following command:
sportal load-data --update-s3
By default, the command will also only look for new projects. If you would like to reimport existing projects, you can run:
sportal load-data --reload-existing
or to reimport and upload all projects:
sportal load-data --reload-existing --update-s3
If you would like to update a specific project, use the --scpca-project-id flag:
sportal load-data --scpca-project-id SCPCP000001
If you would like to purge a project and remove its files from the S3 bucket, you can use:
sportal manage-api purge_project --scpca-project-id SCPCP000001 --delete-from-s3
The --clean-up-input-data flag can help you control the size of a project's input data. If this flag is set, the input data cleanup process will be run for each project right after its processing is over.
sportal load-data --clean-up-input-data --reload-all --update-s3
The --clean-up-output-data flag can help you control the size of a project's output data. If this flag is set, the cleanup process for no-longer-needed output data will be run for each project right after its processing is over.
sportal load-data --clean-up-output-data --reload-all --update-s3
The --max-workers flag can be used to set the number of simultaneously processed projects/samples in order to speed up the data loading process. The provided number will be used to spawn threads within two separate thread pool executors, one for project processing and one for sample processing.
sportal load-data --max-workers 10 --reload-existing --update-s3
A subset of the flags mentioned above can be used with the load-metadata command, while another subset can be used with the generate-computed-files command. Below is a list of which flags are compatible with each command.
load_metadata flags
- input-bucket-name
- clean-up-input-data
- reload-existing
- scpca-project-id
- update-s3
generate_computed_files flags
- clean-up-input-data
- clean-up-output-data
- max-workers
- scpca-project-id
- update-s3
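For example, one way to split the work across the two commands, using only the flags listed above (the project id is just an example), would be:
# load metadata for a single project, reimporting it if it already exists
sportal load-metadata --scpca-project-id SCPCP000001 --reload-existing
# then generate its computed files with several workers and upload them to S3
sportal generate-computed-files --scpca-project-id SCPCP000001 --max-workers 10 --update-s3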
There are two options available for processing data in the Cloud:
- Running load_data on the API instance (or a combination of load_metadata and generate_computed_files)
- Running dispatch_to_batch on the API instance, which kicks off processing on AWS Batch resources
Because processing on Batch is ~10x faster than processing on the API, we recommend using Batch for processing.
To run a command in production, there is a run_command.sh script that is created on the API instance. It passes any arguments through to the manage.py script, making the following acceptable:
./run_command.sh load_data --reload-all
As mentioned in the Local Data Management - Syncing the OriginalFile Table section above, the OriginalFile table must be populated via the sync_original_files command before data can be processed.
Syncing is carried out as follows:
./run_command.sh sync_original_files
Details of the sync_original_files command can be found under the Syncing the OriginalFile Table header in the Local Data Management section above.
The following code can be used to process projects on the API, one by one, with a minimum disk space footprint:
for i in $(seq -f "%02g" 1 25); do
./run_command.sh load_data --clean-up-input-data --clean-up-output-data --reload-existing --scpca-project-id SCPCP0000$i
done
Alternatively, for a more granular approach, first run load_metadata, and thereafter generate_computed_files, as follows:
./run_command.sh load_metadata --clean-up-input-data --reload-existing
for i in $(seq -f "%02g" 1 25); do
./run_command.sh generate_computed_files --clean-up-input-data --clean-up-output-data --scpca-project-id SCPCP0000$i
done
Note: Running load_data in production defaults to uploading completed computed files to S3. This is to help prevent the S3 bucket data from accidentally becoming out of sync with the database.
The following code is used for processing projects via AWS Batch:
./run_command.sh dispatch_to_batch
By default, the dispatch_to_batch command will look at all projects and filter out any that already have at least one computed file. AWS Batch jobs will then be dispatched to create valid computed files for the projects that were not filtered out.
You can override this filter by passing the --regenerate-all flag. This will dispatch jobs independent of existing computed files. Any existing computed files will be purged before new ones are generated to replace them.
You can limit the scope of this command to only apply to a specific project by passing the --project-id <SCPCP999999> flag. This can be used in conjunction with --regenerate-all if you want to ignore existing computed files for that project.
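For example, to regenerate the computed files for a single project regardless of what already exists (the project id is illustrative):
./run_command.sh dispatch_to_batch --project-id SCPCP000001 --regenerate-all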
To purge a project from the database (and from S3 if so desired), run the following command:
./run_command.sh purge_project --scpca-id SCPCP000001 --delete-from-s3
To deploy the API to AWS, follow the directions in the infrastructure README.
Once you have completed a deploy, you can replace 0.0.0.0:8000 in the requests above with the elastic_ip_address output by terraform.
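For example, following the replacement described above (with <elastic_ip_address> standing in for the terraform output):
curl http://<elastic_ip_address>/v1/projects/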