Merging develop into master (#3)
* Updated README.md

* Base extractor - Initial cut (#1)

* Initial commit of base for extractors

* Fixed dockerfile

* Added README

* Changed extractor to transformer in readme

* Updated README

* Fleshing out functionality

* Updates for file list support

* Variable name clarification

* Bug fixes

* Optimizing Dockerfile

* Fixed typo

* Merging into Develop (#2)

* Initial commit of base for extractors

* Fixed dockerfile

* Added README

* Changed extractor to transformer in readme

* Updated README

* Fleshing out functionality

* Updates for file list support

* Variable name clarification

* Bug fixes

* Optimizing Dockerfile

* Fixed typo

* Debugging

* Removed TERRA REF left-over commented out code

* Adding command line parameter storage

* Added text about return codes
Chris-Schnaufer authored Oct 30, 2019
1 parent 05ad4eb commit 52dfa72
Showing 7 changed files with 523 additions and 2 deletions.
26 changes: 24 additions & 2 deletions README.md
@@ -1,2 +1,24 @@
# docker-support
Provides basic docker support
# Docker support
This repo is used for developing base Docker images that serve as starting points for further code development.

We are providing these images to promote the development of code/container templates and to reduce the cost of adding functionality to the processing pipeline.

It is expected that all derived docker images will have their own repositories instead of residing here.
See [Contributing](#contributing) below for more information on how to name your derived repos.

## Contributing <a name="contributing" />
We welcome the addition of other base docker images to this repo.

**But first**, if you are finding that the code provided in the `base-image` folder is not meeting your needs, please file a [feature request](https://github.com/AgPipeline/computing-pipeline/issues/new/choose) so that we can try to address your needs.

Please be sure to clearly label your folders for the environment you are targeting; for example, start folder names with 'aws', 'clowder', or 'cyverse'.
If you are thinking of creating an environment-specific folder, please consider putting it into its own repository first, using the naming convention just mentioned, to keep this one as clean as possible.

Folders beginning with 'base' are reserved for images that are not particular to any single environment.

Be sure to read the [organization documentation](https://github.com/AgPipeline/Organization-info) on how to contribute.

## Documenting
Every folder in this repo must have a README.md that clearly explains the interface for derived images, how to create a derived image, and how to use the resulting images.
Providing a quick start guide with links to more detailed information is a good approach in some situations.
The goal is to give users of these base images documentation that makes the images easy to use.
29 changes: 29 additions & 0 deletions base-image/Dockerfile
@@ -0,0 +1,29 @@
FROM ubuntu:18.04
LABEL maintainer="Chris Schnaufer <schnaufer@email.arizona.edu>"

# Create the user the transformers run as
RUN useradd -u 49044 extractor \
&& mkdir /home/extractor

RUN chown -R extractor /home/extractor \
&& chgrp -R extractor /home/extractor

# Install any programs needed
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y --no-install-recommends \
python3 \
python3-pip && \
apt-get autoremove -y && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --upgrade --no-cache-dir pip

RUN python3 -m pip install --upgrade --no-cache-dir setuptools

COPY *.py /home/extractor/
RUN chmod +x /home/extractor/entrypoint.py

USER extractor
ENTRYPOINT ["/home/extractor/entrypoint.py"]
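A derived image, as described in the base-image README below, would start from this image and overwrite the stub files. A hypothetical sketch (the image name and tag are assumptions, not taken from this repo):

```dockerfile
# Hypothetical derived image; base image name/tag is an assumption
FROM agpipeline/base-image:latest

# Add any environment-specific executables and libraries here, e.g.:
# RUN apt-get update && apt-get install -y --no-install-recommends <packages>

# Overwrite the stubs with environment- or transformer-specific versions
COPY transformer_class.py /home/extractor/transformer_class.py
COPY transformer.py /home/extractor/transformer.py
```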

115 changes: 115 additions & 0 deletions base-image/README.md
@@ -0,0 +1,115 @@
# Base Image
This code is intended to be used as the basis for derived transformers and Docker images.

- The file named entrypoint.py is expected to be kept for all transformers.

- For each environment (such as Clowder, TERRA REF, CyVerse) the transformer_class.py file is replaced.

- For each transformer the transformer.py file is replaced.

- Additionally, the entrypoint.py script can be called from a different script, allowing pre- and post-processing (see [entrypoint.py](#entrypoint) below).

It is expected that this arrangement will provide reusable code not only within a single environment, but across transformers in different environments as well.

## Quick Start
Create a new repository to hold the code specific to your environment or transformer.

For a new environment:
1. create a new transformer_class.py file specific to your environment
2. create and fill in any methods and data needed to support transformers
3. if using Docker images, create a new Dockerfile that uses the base-image Docker image as its starting point, adds the needed executables and libraries, and overwrites the existing transformer_class.py file in your new image

For a new transformer:
1. create a new transformer.py file specific to your transformer with the needed function signatures
2. add the code to do your work
3. if using Docker images, create a new Dockerfile that uses the appropriate starting Docker image, adds the needed executables and libraries, and overwrites the existing transformer.py file in your new image
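For step 1, a minimal transformer.py stub might look like the following. The exact signatures are assumptions inferred from the descriptions later in this README, and the meaning of `0` as "continue" is also an assumption:

```python
"""Minimal transformer.py sketch (hypothetical signatures)"""

def check_continue(transformer, **kwargs):
    """Optional hook: return a code indicating whether processing should continue.

    0 is assumed here to mean "continue"; see the Conventions section for
    the error code ranges used by the different components.
    """
    return 0

def perform_process(transformer, **kwargs):
    """Required function: do the actual work and return a result dict.

    A 'code' of 0 indicates success; an 'error' key would signal a problem.
    """
    return {'code': 0, 'message': 'processing completed'}
```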

## Meet the Files
- Dockerfile: contains the build instructions for a docker image
- configuration.py: contains configuration information for transformers. Can be overridden by derived code as long as existing variables aren't lost
- entrypoint.py: entrypoint for the transformers and docker images. More on this file below
- transformer.py: stub of expected transformer interface. More on this file below as well
- transformer_class.py: stub of the class used to provide the environment for code in transformer.py

### configuration.py
Unless documented here, the contents of this file are required by `entrypoint.py`.
If you are replacing this file with your own version, be sure to keep existing code (and its associated comments).

### entrypoint.py <a name="entrypoint" />
This file can be executed as an independent script, or called by other Python code.
If calling into this script, the entry point is a function named `do_work`.
The `do_work` function expects to get an instance of `argparse.ArgumentParser` passed in as its first parameter.
Additional named parameters can also be passed in as kwargs; these are then passed to the new instance of transformer_class.Transformer at initialization.

Calling `do_work` returns a dict of the result.
Briefly, the 'code' key of the return value indicates the result of the call, and the presence of an 'error' key indicates that an error occurred.
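Under that convention, a caller of `do_work` might interpret the result as sketched below. The helper is hypothetical; only the 'code' and 'error' keys come from this README:

```python
def interpret_result(result: dict) -> str:
    """Classify a do_work-style result dict by its 'code' and 'error' keys"""
    if 'error' in result:
        # An 'error' key indicates an error occurred
        return 'error: %s (code %s)' % (result['error'], result.get('code'))
    # Otherwise the 'code' key indicates the result of the call
    return 'success (code %s)' % result.get('code')

# Example results following the documented convention
ok = interpret_result({'code': 0})
bad = interpret_result({'code': -1, 'error': 'metadata file not found'})
```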

To provide environmental context to a transformer, the transformer_class.py file can be replaced with something more meaningful.
The transformer_class.py file in this repo defines a class that has methods that will be called by entrypoint.py if they're defined.
The class methods are not required but can provide convenient hooks for customization.
An instance of this class is passed to the transformer code in [transformer.py](#transformer).

### transformer.py <a name="transformer" />
This is the file that performs all the work.
It is expected that this file will be replaced with a meaningful one for particular transformers.
The transformer.py file in this repo contains the functions that can be called by the main transformer script [entrypoint.py](#entrypoint).
The only required function in this file is the `perform_process` function.

### transformer_class.py <a name="transformer_class" />
This is the file that provides the environment for transformers.
It is expected that for different environments, this file will be replaced with a meaningful one.
For example, in the CyVerse environment this file could be replaced with one containing iRODS support for any files generated by the transformer.

It is the responsibility of this class to appropriately handle any command line arguments for the transformer instance.
The easiest way to achieve this is to store the parameters as part of the class instance.
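For example, a bare-bones transformer_class.py might store the parsed arguments on the instance. Method names other than `get_transformer_params`, and the shape of the returned dict, are assumptions for illustration:

```python
import argparse

class Transformer:
    """Minimal environment class sketch; stores command line arguments on the instance"""

    def __init__(self, **kwargs):
        self.args = None

    def add_parameters(self, parser: argparse.ArgumentParser) -> None:
        """Hypothetical hook for adding environment-specific arguments"""
        parser.add_argument('--example_option', help='hypothetical environment option')

    def get_transformer_params(self, args: argparse.Namespace, metadata: dict) -> dict:
        """Store the parsed arguments and build the parameters passed to transformer.py"""
        self.args = args
        # 'transformer_md' is an invented key name, not from this repo
        return {'transformer_md': metadata}
```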

## Transformer Control Flow
In this section we cover the flow of control for a transformer.
We assume that this transformer is started by running the [entrypoint.py](#entrypoint) script.

1. Initialization of Parameters:
The first thing that happens is the initialization of an `argparse.ArgumentParser` instance and the creation of a `transformer_class.Transformer` instance.
The entrypoint.py script adds its parameters first, the transformer_class.Transformer instance adds its own next, and finally the transformer adds any parameters it needs.
The parse_args() method is then called on the ArgumentParser instance and the resulting argument values are stored in memory.

2. Loading of Metadata:
One of the parameters required by entrypoint is the path to a JSON file containing metadata.
After the parameters are parsed, the entire contents of the JSON file are loaded and stored in memory.

3. Getting Parameters for transformer function calls:
If the transformer_class.Transformer instance has a method named `get_transformer_params()` it is called with the command line arguments and the loaded metadata.
The dictionary returned by get_transformer_params() is used to pass parameters to the functions defined in [transformer.py](#transformer).
This allows the customization of parameters between an environment and a transformer.
If get_transformer_params() is not defined by transformer_class.Transformer, no additional parameters are passed to the transformer functions.
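The optional nature of `get_transformer_params` can be sketched as the kind of check entrypoint.py would perform. This is a simplified illustration, not the actual entrypoint code:

```python
def resolve_params(transformer, args, metadata):
    """Return extra parameters for the transformer functions, or an empty dict
    when the environment class does not define get_transformer_params."""
    if hasattr(transformer, 'get_transformer_params'):
        return transformer.get_transformer_params(args, metadata)
    return {}

class _WithParams:
    """Stand-in environment class that customizes parameters"""
    def get_transformer_params(self, args, metadata):
        return {'metadata': metadata}

class _Without:
    """Stand-in environment class with no customization"""

with_result = resolve_params(_WithParams(), None, {'k': 'v'})
without_result = resolve_params(_Without(), None, {'k': 'v'})
```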

4. Check to Continue:
If the transformer.py file has a function named `check_continue`, it is called with the transformer_class.Transformer instance and any parameters defined in the step above.
The return from the check_continue() function is used to determine if processing should continue.
If the function is not defined, processing will continue automatically.

5. Processing:
The `perform_process` function in transformer.py is called with the transformer_class.Transformer instance and any parameters previously defined.

6. Result Handling:
The result of the above steps may produce warnings, errors, or successful results.
These results can be stored in a file, printed to standard output, and/or returned to the caller of `do_work`.
In the default case that we're exploring here, the return value from do_work is ignored.

## Defined Command Line Parameters
The following command line parameters are defined for all transformers.

* --debug, -d: (optional parameter) enable debug level logging messages
* -h: (optional parameter) display help message (automatically defined by argparse)
* --info, -i: (optional parameter) enable info level logging messages
* --result: (optional parameter) how to handle the result of processing; one or more comma-separated strings of: all, file, print
* --metadata: mandatory path to file containing JSON metadata
* --working_space: path to folder to use as a working space and file store
* all additional positional parameters are collected into the "file_list" argument (they are assumed to be file names but need not be)
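Based on the list above, the common argument definitions can be approximated with `argparse`. This is a sketch; the actual defaults, help text, and handling in entrypoint.py may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Approximate the command line interface described above"""
    parser = argparse.ArgumentParser(description='transformer sketch')
    parser.add_argument('--debug', '-d', action='store_true',
                        help='enable debug level logging messages')
    parser.add_argument('--info', '-i', action='store_true',
                        help='enable info level logging messages')
    parser.add_argument('--result',
                        help='comma-separated strings of: all, file, print')
    parser.add_argument('--metadata', required=True,
                        help='path to file containing JSON metadata')
    parser.add_argument('--working_space',
                        help='path to folder to use as a working space and file store')
    parser.add_argument('file_list', nargs='*',
                        help='additional parameters, usually file names')
    return parser

args = build_parser().parse_args(['--metadata', 'md.json', 'a.tif', 'b.tif'])
```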

*Pro Tip* - Use the `-h` parameter against the script or docker container to see all the command line options for a transformer.

## Conventions
**error return code ranges**:
- [entrypoint.py](#entrypoint) returns error values in the range of `-1` to `-99`
- [transformer_class.py](#transformer_class) returns error values in the range of `-100` to `-999`
- [transformer.py](#transformer) returns error values of `-1000` and below
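These ranges make it possible to tell which component reported an error. A small hypothetical helper (not part of this repo) illustrates the mapping:

```python
def error_source(code: int) -> str:
    """Map a negative return code to the component that produced it,
    following the documented error code ranges."""
    if -99 <= code <= -1:
        return 'entrypoint.py'
    if -999 <= code <= -100:
        return 'transformer_class.py'
    if code <= -1000:
        return 'transformer.py'
    return 'not an error code'
```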
8 changes: 8 additions & 0 deletions base-image/configuration.py
@@ -0,0 +1,8 @@
"""Contains transformer configuration information
"""

# The version number of the transformer
TRANSFORMER_VERSION = "1.0"

# The transformer description
TRANSFORMER_DESCRIPTION = ""