agorajek edited this page Mar 9, 2011 · 58 revisions

Contents:

  1. Introduction
  2. Naming Conventions
  3. Functions and Languages
  4. Function Name Overloading
  5. Guide to Driver UDFs
  6. Support Modules

1. Introduction

The purpose of this document is to define the SQL Interface for MADlib algorithms.

2. Naming Conventions

Names should use lower case characters separated with underscores.

This is applicable to all database objects (tables, views, functions, function parameters, datatypes, operators, etc).
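As an illustration, the convention above can be checked mechanically. The helper below is purely hypothetical and not part of MADlib; it is a minimal sketch assuming the rule "lower case, digits and underscores, starting with a letter":

```python
import re

# Hypothetical helper: checks whether a proposed object name follows the
# lower-case-with-underscores convention described above.
NAME_RE = re.compile(r'^[a-z][a-z0-9_]*$')

def is_valid_name(name):
    """Return True if `name` uses only lower case letters, digits and
    underscores, and starts with a letter."""
    return bool(NAME_RE.match(name))
```

For example, `is_valid_name('mregr_coef')` passes while `is_valid_name('MregrCoef')` does not.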

2.1. Schema

All database objects should be created in the default MADlib schema. Use MADLIB_SCHEMA as the schema prefix for your tables/views/functions, etc. in any scripts. This literal will be replaced during installation with the target schema name (configured by the user in Config.yml). Code examples below use the prefix madlib for illustration purposes only.
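The install-time substitution described above amounts to a plain token replacement. The sketch below is hypothetical (the function name `expand_schema` is not part of the installer) and only illustrates the idea:

```python
# Sketch of the install-time substitution: every occurrence of the
# MADLIB_SCHEMA placeholder in a SQL script is replaced with the target
# schema name configured by the user (e.g. in Config.yml).
def expand_schema(sql_script, target_schema):
    return sql_script.replace('MADLIB_SCHEMA', target_schema)

sql = "CREATE FUNCTION MADLIB_SCHEMA.mregr_coef(...) ..."
print(expand_schema(sql, 'madlib'))
# → CREATE FUNCTION madlib.mregr_coef(...) ...
```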

2.2. Functions & Aggregates

All non-user facing routines should be named with a "__" (double underscore) prefix to make the catalog easier to read.

Module-specific routines should have a SHORT and COMMON prefix based on the module they belong to, for example:

  • Multi-Linear Regression functions could start with mregr_:
madlib.mregr_coef(...)
  • Naive-Bayes classification functions could start with nb_:
madlib.nb_create_view(...)

See the current function catalog for more examples (\dfa madlib.*).

General-purpose routines should be named without a reference to any module and should be written inside one of the MADlib support modules (TO BE DEFINED), for example:

  • A function returning the key of the row for which the value is maximal:
madlib.__argmax (integer key, float8 value)
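The semantics of such an argmax routine can be sketched in plain Python (this is only an illustration of the intended behavior, not the actual implementation):

```python
# Pure-Python sketch of the semantics of madlib.__argmax: given
# (key, value) pairs, return the key of the row whose value is maximal.
def argmax(pairs):
    best_key, best_val = None, None
    for key, value in pairs:
        if best_val is None or value > best_val:
            best_key, best_val = key, value
    return best_key
```

For example, `argmax([(1, 0.5), (2, 3.0), (3, 1.0)])` returns `2`.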

3. Functions and Languages

First of all, to simplify this guide, we'd like to divide user-defined functions into three categories:

  • UDAs - user-defined aggregates, which perform a single scan of the data source and return an aggregated value for a group of rows. All UDA component functions should be written in PL/C (C/C++) for performance and portability reasons.

  • Row Level UDFs - functions that operate on their arguments only and do not generate any SQL statements. These functions generate a result for each argument set, or for each tuple they are executed on. The recommended language is the same as for UDAs.

  • Driver UDFs - functions that usually drive the execution of a data processing algorithm and may perform multiple SQL operations, including data modification. In order to make this part of the code portable we suggest using PL/Python wrapper functions based on plain Python modules. The DB access inside the Python modules should be implemented using the "classic" PyGreSQL interface (http://www.pygresql.org/pg.html).
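The Driver UDF pattern above can be sketched as a thin PL/Python wrapper delegating to a plain Python module. The module name m1 and function f1 are hypothetical placeholders:

```sql
-- Hypothetical sketch: a thin PL/Python wrapper UDF that delegates to a
-- plain Python module (m1), which holds the portable logic and performs
-- its DB access via the classic PyGreSQL interface.
CREATE OR REPLACE FUNCTION madlib.m1_f1(input_table TEXT)
  RETURNS TEXT
AS $$
  import m1           -- plain Python module holding the actual logic
  return m1.f1(input_table)
$$ LANGUAGE plpythonu;
```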

NOTE: This topic is covered in much more detail in Design Patterns & Best Practices.

4. Function Name Overloading

The suggestions below on name overloading apply to all of the above-mentioned types of user-defined functions.

Data Types

Some platforms (like PostgreSQL) allow the use of the ANYELEMENT/ANYARRAY data types, which MADlib routines can use (wherever it makes sense) to minimize code duplication.

If ANYELEMENT/ANYARRAY functionality is not available or not feasible, function name overloading can be used for different argument data types. For example, function F1 from module M1 can have the following versions:

  • TEXT data type example:
madlib.m1_f1( arg1 TEXT)
  • NUMERIC data type example:
madlib.m1_f1( arg1 BIGINT/FLOAT/etc.)

Argument Sets

The overloading mechanism should also be used for different sets of parameters. For example, if (reqarg1, ..., reqargN) is the set of required parameters for function F1 from module M1, then the following definitions would be correct:

  • A version for required arguments only:
madlib.m1_f1( reqarg1, ..., reqargN)
  • A version for both required and optional arguments:
madlib.m1_f1( reqarg1, ..., reqargN, optarg1, ..., optargN)
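One common way to implement this pair is to have the required-arguments-only version delegate to the full version with defaults filled in. The sketch below is hypothetical (a single required TEXT argument and a default value of 10 for the optional argument are assumptions for illustration):

```sql
-- Hypothetical sketch: the required-arguments-only overload delegates to
-- the full overload, supplying a default for the optional argument.
CREATE OR REPLACE FUNCTION madlib.m1_f1(reqarg1 TEXT)
  RETURNS TEXT
AS $$
  SELECT madlib.m1_f1($1, 10);   -- 10 is a hypothetical default for optarg1
$$ LANGUAGE sql;
```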

5. Guide to Driver UDFs

  • Should follow the naming conventions described in section 2.
  • Should follow the function overloading rules described in section 4. On Greenplum and PostgreSQL this can be achieved via PL/Python wrapper UDFs based on the same main Python code.

5.1. Input Definition


Parameters of the execution should be supplied directly in the function call (as opposed to passing a reference ID to a parameter set stored in a table), for example:

SELECT madlib.m1_f1( par1 TEXT/INT/etc, par2 TEXT[]/INT[]/etc, ...)

Data should be passed to the function in the form of a text argument schema.table representing an existing table or view, which:

  • Can be located in any schema, as long as the database user executing the function has read permissions.
  • Should be defined in the method documentation, for example:
TABLE|VIEW (
  col_x INT, 
  col_y FLOAT, 
  col_z TEXT
)
  • The input relation and its attributes needed by the function should be validated using helper primitives. See section 5.4. for more information.
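Before validation, the schema.table text argument has to be split into its parts. A minimal sketch, assuming a hypothetical helper name and an assumed default schema of 'public' for unqualified names:

```python
# Hypothetical sketch: split a 'schema.table' text argument into its
# (schema, table) parts, falling back to a default schema when the
# argument is unqualified.
def parse_relation_name(name, default_schema='public'):
    parts = name.split('.')
    if len(parts) == 1:
        return (default_schema, parts[0])
    if len(parts) == 2:
        return (parts[0], parts[1])
    raise ValueError("invalid relation name: %s" % name)
```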

5.2. Output Definition


Summary

Each Driver UDF should return a summary output in the form of a predefined record/row. Each attribute of the result should be clearly defined in the method documentation. If any tables/views are created or populated during the execution, their full names should be returned in the summary. For example, the output of the k-means clustering algorithm could look like this:

 clusters | pct_input_used | output_schema | cluster_table | point_table 
----------+----------------+---------------+---------------+-------------
       10 |            100 | my_schema     | my_centroids  | my_points

The above output can be achieved in the following way:

  1. Create a data type for the result set, e.g. madlib.kmeans_result:
CREATE TYPE madlib.kmeans_result AS (
  clusters          INTEGER,
  pct_input_used    FLOAT,
  output_schema     TEXT,
  cluster_table     TEXT,
  point_table       TEXT
);
  2. Assuming the implementation is done in the recommended language (see section 3), use yield and return to generate a single row of output inside the Python routine:
CREATE OR REPLACE FUNCTION madlib.kmeans_dummy()
  RETURNS SETOF madlib.kmeans_result
AS $$
  yield (10, 100.0, 'my_schema', 'my_centroids', 'my_points')
  return
$$ LANGUAGE plpythonu;

Returning Model Coefficients

...

Returning Data Sets

...

5.3. Logging Style


  • ERROR

If the function encounters a problem it should raise an error using the plpy.error( message) function (see section 6.1). This will ensure a proper end of execution and error propagation to the calling environment.

  • INFO

During SQL command-line execution of long-running functions it may be useful to generate some log output to indicate the current state of the process. This log information could potentially be lost if the SQL calls are generated from automated scripts or GUI tools, so it should not carry any critical information that is not included in the execution summary. Use the plpy.info( message) function (see section 6.1) to generate the information log properly. Example log output:

SQL> select madlib.kmeans_run( 'my_schema.data_set_1', 10, 1, 'run1', 'my_schema');
INFO: Parameters:
INFO:  * k = 10 (number of centroids)
INFO:  * input_table = my_schema.data_set_1
INFO:  * goodness = 1 (GOF test on)
INFO:  * run_id = run1
INFO:  * output_schema = my_schema
INFO: Seeding 10 centroids...
INFO: Using sample data set for analysis... (9200 out of 10000 points)
INFO: ...Iteration 1
INFO: ...Iteration 2
INFO: Exit reason: fraction of reassigned nodes is smaller than the limit: 0.001
INFO: Expanding cluster assignment to all points...
INFO: Calculating goodness of fit...
...

5.4. Parameter Validation


Parameter validation should be performed in each function to avoid any preventable errors.

For simple arguments (scalar, array), sanity checks should be done by the author. Some common parameters with known value domains should be validated using SQL domains, for example:

  • Percent
CREATE DOMAIN percent AS INTEGER
CHECK(
   VALUE >= 0 AND VALUE <= 100
);
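Inside a Driver UDF the equivalent author-side sanity check can be done in Python. The helper name below is hypothetical; it simply mirrors the percent domain above:

```python
# Hypothetical sketch of an author-side sanity check for a simple scalar
# argument, mirroring the percent domain above (0 <= value <= 100).
def validate_percent(value):
    if not (0 <= value <= 100):
        raise ValueError(
            "percent argument must be between 0 and 100, got %s" % value)
    return value
```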

For table/view and column arguments please see section 6.2 (describing usage of the helper.py module).

5.5. Multi-User and Multi-Session Execution


In order to avoid unpleasant situations of overwriting or losing results, MADlib functions should be ready for execution in a multi-session or multi-user environment. Hence the following requirements should be met:

  • Input relations (tables or views) should be used for read only purposes.

  • Any common results table should be locked in ACCESS SHARE MODE.

LOCK TABLE madlib.sv_model IN ACCESS SHARE MODE;

  • Any execution-specific tables should be locked in EXCLUSIVE MODE after creation.

LOCK TABLE my_schema.my_output_1 IN EXCLUSIVE MODE;

  • There is no need to release locks, as they will persist until the end of the main UDF anyway.

  • Common output tables should have a PK column set to ensure data safety. The unique ID can be supplied by the user as an argument or, if not provided, should be auto-generated from a sequence. Separate sequences should be created for each table during module installation. For example, to create a sequence for the madlib.sv_model table you can run:
CREATE SEQUENCE madlib_sv_model START 1;

Then the unique ID can be generated in the following way:

SELECT nextval('madlib_sv_model');
 nextval 
---------
       1
SELECT nextval('madlib_sv_model');
 nextval 
---------
       2
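Inside a Driver UDF, choosing between the user-supplied ID and a sequence-generated one can be sketched as follows. The helper name is hypothetical, and `execute` stands for an injected DB access function (e.g. plpy.execute), passed in here only so the sketch stays self-contained:

```python
# Hypothetical sketch: use the user-supplied run ID if given, otherwise
# draw a fresh one from the table's sequence. `execute` is a stand-in
# for a DB access function returning a list of row dictionaries.
def get_run_id(user_id, execute, sequence='madlib_sv_model'):
    if user_id is not None:
        return user_id
    rows = execute("SELECT nextval('%s')" % sequence)
    return rows[0]['nextval']
```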

6. Support Modules

A set of Python functions to make coding easier.

6.1. Module: plpy.py

This module serves as the database access abstraction layer. Even though it is currently not used, this module will provide easy portability between various MADlib platforms and interfaces. To clarify: the PostgreSQL PL/Python language currently uses an internal plpy.py module to implement seamless DB access (using the "classic" PyGreSQL interface - see http://www.pygresql.org/pg.html). By adding a MADlib version of plpy.py we will be able to port code written for MADlib more easily.

Currently implemented basic functionality:

def connect ( dbname, host, port, user, passwd)
def close()
def execute( sql)
def info( msg)
def error( msg)
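A minimal sketch of the logging half of this shim, assuming only the function names listed above (the bodies are illustrative, not the actual implementation): info() writes a log line, error() raises so that execution ends and the error propagates to the caller.

```python
import sys

# Hypothetical sketch of the info/error pair from the list above.
def info(msg):
    # Write an informational log line to standard output.
    sys.stdout.write("INFO: %s\n" % msg)

def error(msg):
    # Abort execution and propagate the message to the caller.
    raise Exception("ERROR: %s" % msg)
```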

6.2. Module: helper.py (TO DO)

This module consists of a set of functions to support common data validation and database object management tasks.

Current functions:

  • table/view existence check (including schema)
def check_rel_exist( relation_name)  # returns Boolean
  • expected structure check
def check_rel_column( relation_name, column_name, data_type)  # returns Boolean
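A possible shape for the existence check, sketched under the assumption that DB access is injected as an `execute` function (e.g. plpy.execute) returning a list of rows; the query against information_schema and the default schema are illustrative choices, not the actual helper.py implementation:

```python
# Hypothetical sketch of check_rel_exist(): look the relation up in
# information_schema via an injected `execute` function.
def check_rel_exist(relation_name, execute):
    schema, _, table = relation_name.rpartition('.')
    sql = ("SELECT 1 FROM information_schema.tables "
           "WHERE table_schema = '%s' AND table_name = '%s'"
           % (schema or 'public', table))
    return len(execute(sql)) > 0
```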