-
Notifications
You must be signed in to change notification settings - Fork 0
SQL API Guide
- Introduction
- Naming Conventions
- Functions and Languages
- Function Name Overloadin
- Guide to Driver UDFs
- Support Modules
The purpose of this document is to define the SQL Interface for MADlib algorithms.
Names should use lower case characters separated with underscores.
This is applicable to all database objects (tables, views, functions, function parameters, datatypes, operators, etc).
All database objects should be created in the default MADlib
schema. Use MADLIB_SCHEMA
as the schema prefix for your tables/views/functions etc in any scripts.
This literal will be replaced during the installation with the target schema name
(configured by the user in Config.yml
). Code examples below use prefix madlib
for illustration purposes only.
All non-user facing routines should be named with a "__" (double underscore) prefix to make the catalog easier to read.
Module specific routines should have a SHORT and COMMON prefix based on the module they belong to, for example:
- Multi-Linear Regression functions could start with
mregr_
:
madlib.mregr_coef(...)
- Naive-Bayes classification functions could start with
nb_
:
madlib.nb_create_view(...)
See the current function catalog for more examples (\dfa madlib.*
).
General purpose routines should be named without a reference to any module and should be written inside one of the MADlib support modules (TO BE DEFINED), for example:
- Function returning the key of the row for which value is maximal:
madlib.__argmax (integer key, float8 value)
First of all, to simplify this guide, we'd like to divide the user defined functions into three categories:
-
UDAs - user defined aggregates, which perform a single scan of the data source and return an aggregated value for a group of rows. All UDA component functions should be written in PL/C (C/C++) for performance and portability reasons.
-
Row Level UDFs - functions that operate on the arguments only and do not generate any SQL statements. These functions generate a result for each argument set, or for each tuple they are executed on. Recommended language is the same as for UDAs.
-
Driver UDFs - functions that usually drive an execution of an data processing algorithm, and may perform multiple SQL operations including data modification. In order to make this part of the code portable we suggest using PL/Python wrapper functions based on plan Python modules. The DB access inside the Python modules should be implemented using using "classic" PyGreSQL interface (http://www.pygresql.org/pg.html).
NOTE: This topic is cover in much more details in Design Patterns & Best Practices.
Below suggestions on name overloading apply to all above mentioned types of user defined functions.
Some platforms (like PostgreSQL) allow to use ANYELEMENT/ANYARRAY datatype which can be used by MADlib routines (whenever makes sense) in order to minimize the code duplication.
If ANYELEMENT/ANYARRAY functionality is not available or not feasible function name overloading can be used for different argument data types. For example, function F1 from module M1 can have the following versions:
- TEXT data type example:
madlib.m1_f1( arg1 TEXT)
- NUMERIC data type example:
madlib.m1_f1( arg1 BIGINT/FLOAT/etc.)
Overloading mechanism should also be used for different sets of parameters.
For example, if (reqarg1, ..., reqargN)
is a set of required parameters for
function F1 from module M1, then the following definitions would be correct:
- A version for required arguments only:
madlib.m1_f1( reqarg1, ..., reqargN)
- A version for both required and optional arguments:
madlib.m1_f1( reqarg1, ..., reqargN, optarg1, ..., optargN)
- Should follow the naming conventions described in section 2.
- Should follow the function overloading rules as described in section 4. On Greenplum and PostgreSQL this can be achieved via pl/Python wrapper UDFs based on the same main Python code.
Parameters of the execution should be supplied directly in the function call (as opposed to passing a reference ID to a parameter set stored in a table), for example:
SELECT madlib.m1_f1( par1 TEXT/INT/etc, par2 TEXT[]/INT[]/etc, ...)
Data should be passed to the function by in a form of a text argument schema.table
representing an existing table or a view, which:
- Can be located in any schema as long as database user executing the function has read permissions.
- Should be defined in the method documentation, for example:
TABLE|VIEW (
col_x INT,
col_y FLOAT,
col_z TEXT
)
- The input relation and it's attributes needed by the function should be validated using helper primitives. See section 5.4. for more information.
Summary
Each Driver UDF should return a summary output in the form of a predefined record/row. Each attribute of the result should be clearly defined in the method documentation. If any tables/views are created or populated during the execution their full name should be returned in the summary. For example, output of k-means clustering algorithm could look like this:
clusters | pct_input_used | output_schema | cluster_table | point_table
----------+----------------+---------------+---------------+-------------
10 | 100 | my_schema | my_centroids | my_points
The above output can be achieved in the following way:
- Create data type for the return set
madlib.results
CREATE TYPE madlib.kmeans_result AS (
clusters INTEGER,
pct_input_used FLOAT,
output_schema TEXT,
cluster_table TEXT,
point_table TEXT
);
- Assuming the implementation is done in the recommended language (see section 3 for recommended implementation) use
yield
andreturn
to generate a single row of output inside a Python routine:
CREATE OR REPLACE FUNCTION madlib.kmeans_dummy()
RETURNS SETOF madlib.kmeans_result
AS $$
yield ( [ 10, 100.0, 'my_schema', 'my_centroids', 'my_points' ]);
return;
$$ LANGUAGE plpythonu;
Returning Model Coeficients
...
Returning Data Sets
...
- ERROR
If the function encounters a problem it should raise an error using plpy.error( message)
function (see section 6.1). This will ensure
the proper end of the execution and error propagation to the calling environment.
- INFO
If desired, long-running methods can send runtime status to the log. But be aware that this information may not be propagated to clients in many cases, and it will enlarge the stored log file. Informational loggin should be turned off by default, and activated only with explicit user command (verbose parameter). Use plpy.info( message)
function (see section 6.1) to properly generate information log. Example log output:
SQL> select madlib.kmeans_run( 'my_schema.data_set_1', 10, 1, 'run1', 'my_schema');
INFO: Parameters:
INFO: * k = 10 (number of centroids)
INFO: * input_table = my_schema.data_set_1
INFO: * goodness = 1 (GOF test on)
INFO: * run_id = run1
INFO: * output_schema = my_schema
INFO: Seeding 10 centroids...
INFO: Using sample data set for analysis... (9200 out of 10000 points)
INFO: ...Iteration 1
INFO: ...Iteration 2
INFO: Exit reason: fraction of reassigned nodes is smaller than the limit: 0.001
INFO: Expanding cluster assignment to all points...
INFO: Calculating goodness of fit...
...
Parameter validation should be performed in each function to avoid any preventable errors.
For simple arguments (scalar, array) sanity checks should be done by the author. Some common parameters with known value domains should be validated using SQL domains, for example:
- Percent
CREATE DOMAIN percent AS INTEGER
CHECK(
VALUE >= 0 AND VALUE <= 100
);
For table/view and column arguments please see section 6.2 (describing usage of the helper.py module).
In order to avoid unpleasant situations of overwriting or losing results MADlib functions should be ready for execution in multi-session or multi-user environment. Hence the following requirements should be met:
-
Input relations (tables or views) should be used for read only purposes.
-
Any common results table should be locked in ACCESS SHARE MODE.
LOCK TABLE madlib.sv_model IN ACCESS SHARE MODE;
Any execution specific tables should be locked in EXCLUSIVE MODE after creation.
LOCK TABLE my_schema.my_output_1 IN EXCLUSIVE MODE;
There is no need to release LOCKS as they will persist anyway until the end of the main UDF.
- Common output tables should have a PK column set to ensure data safety. The unique ID can be supplied by the user as an
argument or, if not provided, should be auto-generated from a sequence. Separate sequences should be created for each table during module installation, for example to create a sequence for
madlib.sv_model
table you can run:
CREATE SEQUENCE madlib_sv_model START 1;
Then the unique ID can be generated in the following way:
SELECT nextval('madlib_sv_model');
nextval
---------
1
SELECT nextval('madlib_sv_model');
nextval
---------
2
A set of Python functions to make coding easier.
This module serves as the database access abstraction layer. Even though currently not used this module will provide easy portability between various MADlib platforms and interfaces. To clarify: PostrgreSQL PL/Python language currently uses internal plpy.py module to implement seamless DB access (using "classic" PyGreSQL interface - see http://www.pygresql.org/pg.html). By adding MADlib version of plpy.py we'll be able to easier port code written for MADlib.
Currently implemented basic functionality:
def connect ( dbname, host, port, user, passwd)
def close()
def execute( sql)
def info( msg)
def error( msg)
This module consists of a set of functions to support common data validation and database object management tasks.
Current functions:
- table/view existence check (including schema)
def Boolean __check_rel_exist( relation_name)
- expected structure check
def Boolean __check_rel_column( relation_name, column_name, data_type)