Skip to content
agorajek edited this page Mar 8, 2011 · 58 revisions

Contents

  1. Introduction
  2. Naming Conventions
  3. Functions and Languages
  4. Function Name Overloadin
  5. Guide to Driver UDFs

1. Introduction

The purpose of this document is to define the SQL Interface for MADlib algorithms.

2. Naming Conventions

Names should use lower case characters separated with underscores.

This is applicable to all database objects (tables, views, functions, function parameters, datatypes, operators, etc).

Schema

All database objects should be created in the default MADlib schema. Use MADLIB_SCHEMA as the schema prefix for your tables/views/functions etc in any scripts. This literal will be replaced during the installation with the target schema name (configured by the user in Config.yml). Code examples below use prefix madlib for illustration purposes only.

Functions & Aggregates

All non-user facing routines should be named with a "__" (double underscore) prefix to make the catalog easier to read.

Module specific routines ideally should have a common prefix based on the module they belong to, for example:

  • Multi-Linear Regression function could start with mregr_:
madlib.mregr_coef(...)
  • Naive-Bayes classification function could start with nb_classify_:
madlib.nb_classify_create_view(...)

General purpose routines should be named without a reference to any module and should be written inside one of the MADlib support modules (TO BE DEFINED), for example:

  • Function returning the key of the row for which value is maximal:
madlib.__argmax (integer key, float8 value)

3. Functions and Languages

First of all, to simplify this guide, we'd like to divide the user defined functions into three categories:

  • UDAs - user defined aggregates, which perform a single scan of the data source and return an aggregated value for a group of rows. All UDA component functions should be written in PL/C (C/C++) for performance and portability reasons.

  • Row Level UDFs - functions that operate on the arguments only and do not generate any SQL statements. These functions generate a result for each argument set, or for each tuple they are executed on. Recommended language is the same as for UDAs.

  • Driver UDFs - functions that usually drive an execution of an data processing algorithm, and may perform multiple SQL operations including data modification. In order to make this part of the code portable we suggest using PL/Python wrapper functions based on plan Python modules. The DB access inside the Python modules should be implemented using plpy.py module that will serve as the abstraction layer.

NOTE: This topic is cover in much more details in Design Patterns & Best Practices.

4. Function Name Overloading

Below suggestions on name overloading apply to all above mentioned types of user defined functions.

Data Types

Some platforms (like PostgreSQL) allow to use ANYELEMENT/ANYARRAY datatype which can be used by MADlib routines (whenever makes sense) in order to minimize the code duplication.

If ANYELEMENT/ANYARRAY functionality is not available or not feasible function name overloading can be used for different argument data types. For example, function F1 from module M1 can have the following versions:

  • TEXT data type example:
madlib.m1_f1( arg1 TEXT)
  • NUMERIC data type example:
madlib.m1_f1( arg1 BIGINT/FLOAT/etc.)

Argument Sets

Overloading mechanism should also be used for different sets of parameters. For example, if (reqarg1, ..., reqargN) is a set of required parameters for function F1 from module M1, then the following definitions would be correct:

  • A version for required arguments only:
madlib.m1_f1( reqarg1, ..., reqargN)
  • A version for both required and optional arguments:
madlib.m1_f1( reqarg1, ..., reqargN, optarg1, ..., optargN)

5. Guide to Driver UDFs

  • Should follow the naming conventions described in section 2.
  • Should follow the function overloading rules as described in section 4. On Greenplum and PostgreSQL this can be achieved via pl/Python wrapper UDFs based on the same main Python code.

5.1. Input Definition


Parameters of the execution should be supplied directly in the function call (as opposed to passing a reference ID to a parameter set stored in a table), for example:

SELECT madlib.m1_f1( par1 TEXT/INT/etc, par2 TEXT[]/INT[]/etc, ...)

Data should be passed to the function by in a form of a text argument schema.table representing an existing table or a view, which:

  • Can be located in any schema as long as database user executing the function has read permissions.
  • Should be defined in the method documentation, for example:
TABLE|VIEW (
  col_x INT, 
  col_y FLOAT, 
  col_z TEXT
)
  • The input relation and it's attributes needed by the function should be validated using helper primitives. See section 5.4. for more information.

5.2. Output Definition


Summary

Each Driver UDF should return some summary output in the form of a predefined record/row. Each attribute of the result should be clearly defined in the method documentation. For example, output of k-means clustering algorithm could look like this:

 clusters | pct_input_used | output_schema | cluster_table | point_table 
----------+----------------+---------------+---------------+-------------
       10 |            100 | my_schema     | my_centroids  | my_points

The above output can be achieved in the following way:

  1. Create data type for the return set madlib.results
CREATE TYPE madlib.kmeans_result AS (
  clusters          INTEGER,
  pct_input_used    FLOAT,
  output_schema     TEXT,
  cluster_table     TEXT,
  point_table       TEXT
);
  1. Assuming the implementation is done in the recommended language (see section 3 for recommended implementation) use yield and return to generate a single row of output inside a Python routine:
CREATE OR REPLACE FUNCTION madlib.kmeans_dummy()
  RETURNS SETOF madlib.kmeans_result
AS $$
  yield ( [ 10, 100.0, 'my_schema', 'my_centroids', 'my_points' ]);
  return;
$$ LANGUAGE plpythonu;

Returning Small Data Set

When dealing with small output it is recommended to store it in predefined table(s) located in madlib schema. These tables should be created during the installation of the module and should have a PK column set defined to make the retrieval of the results possible and easy. The PK value corresponding to the current run should be returned in the output summary.

Returning Large Data Set

When dealing with large output it is better to direct it into a separate user defined table. These tables should be created during the execution of the function. Their names should be provided by the user as function arguments and don't need to be returned in the output. In a rare occurrence of returning variable number of tables from a method, they can be auto generated based on a prefix given by the user.

5.3. Logging Style


  • ERROR

If the function encounters a problem it should raise an error using plpy.error( message) function. This will ensure the proper end of the execution and error propagation to the calling environment.

  • INFO

During SQL command line execution of long running functions it may be useful to generate some log output to indicate the current state of the process. This log information could be potentially lost if the SQL calls are generated from automated scripts or GUI tools, so it should not carry any critical information which is not included in execution summary. Use plpy.info( message) function to properly generate information log. Example log output:

SQL> select madlib.kmeans_run( 'my_schema.data_set_1', 10, 1, 'run1', 'my_schema');
INFO: Parameters:
INFO:  * k = 10 (number of centroids)
INFO:  * input_table = my_schema.data_set_1
INFO:  * goodness = 1 (GOF test on)
INFO:  * run_id = run1
INFO:  * output_schema = my_schema
INFO: Seeding 10 centroids...
INFO: Using sample data set for analysis... (9200 out of 10000 points)
INFO: ...Iteration 1
INFO: ...Iteration 2
INFO: Exit reason: fraction of reassigned nodes is smaller than the limit: 0.001
INFO: Expanding cluster assignment to all points...
INFO: Calculating goodness of fit...
...

5.4. Parameter Validation


Parameter validation should be performed in each MADlib function with the help of support module (TO BE DEFINED). Below is a sample list of common validation tests:

For simple arguments (scalar, array):

  • data type check
def Boolean check_type( arg_value, expected_type) 
  • value range check
def Boolean check_value_range( arg_value, expected_type, min, max) 

For table/view reference arguments:

  • existence check (including schema)
def Boolean check_rel_existence( relation_name) 
  • row count sanity check (in rare cases)
def Boolean check_rel_rowcount( relation_name, min, max) 
  • expected structure check
def Boolean check_rel_column( relation_name, column_name, data_type)

5.5. Multi-User and Multi-Session Execution


In order to avoid unpleasant situations of overwriting or losing results MADlib functions should be ready for execution in multi-session or multi-user environment. Hence the following requirements should be met:

  • Input relations (tables or views) should be used for read only purposes.

  • Any common results table should be locked in ACCESS SHARE MODE.

LOCK TABLE madlib.sv_model IN ACCESS SHARE MODE;

Any execution specific tables should be locked in EXCLUSIVE MODE after creation.

LOCK TABLE my_schema.my_output_1 IN EXCLUSIVE MODE;

There is no need to release LOCKS as they will persist anyway until the end of the main UDF.

  • Common output tables should have a PK column set to ensure data safety. The unique ID can be supplied by the user as an argument or, if not provided, should be auto-generated from a sequence. Separate sequences should be created for each table during module installation, for example to create a sequence for madlib.sv_model table you can run:
CREATE SEQUENCE madlib_sv_model START 1;

Then the unique ID can be generated in the following way:

SELECT nextval('madlib_sv_model');
 nextval 
---------
       1
SELECT nextval('madlib_sv_model');
 nextval 
---------
       2