SQL API Guide
- Introduction
- Naming Conventions
- Functions and Languages
- Function Name Overloading
- Guide to Driver UDFs
The purpose of this document is to define the SQL Interface for MADlib algorithms.
Names should use lower case characters separated with underscores.
This is applicable to all database objects (tables, views, functions, function parameters, datatypes, operators, etc).
All database objects should be created in the default MADlib schema. Use MADLIB_SCHEMA as the schema prefix for your tables/views/functions etc. in any scripts. This literal will be replaced during the installation with the target schema name (configured by the user in Config.yml). Code examples below use the prefix madlib for illustration purposes only.
All non-user-facing routines should be named with a "__" (double underscore) prefix to make the catalog easier to read.
Module specific routines should have a SHORT and COMMON prefix based on the module they belong to, for example:
- Multi-Linear Regression functions could start with mregr_: madlib.mregr_coef(...)
- Naive-Bayes classification functions could start with nb_: madlib.nb_create_view(...)
See the current function catalog for more examples (\dfa madlib.*).
General purpose routines should be named without a reference to any module and should be written inside one of the MADlib support modules (TO BE DEFINED), for example:
- Function returning the key of the row for which the value is maximal: madlib.__argmax(integer key, float8 value)
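For illustration, the row-level logic behind such an argmax routine can be sketched in plain Python; the helper name and signature here are hypothetical, mirroring the (key, value) arguments above:

```python
def argmax(rows):
    """Return the key of the row whose value is maximal.

    rows: iterable of (key, value) pairs, mirroring the
    (integer key, float8 value) arguments of madlib.__argmax.
    """
    best_key = None
    best_value = None
    for key, value in rows:
        if best_value is None or value > best_value:
            best_key, best_value = key, value
    return best_key

# Key 3 carries the maximal value 9.5:
print(argmax([(1, 2.0), (3, 9.5), (2, 7.1)]))  # prints 3
```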
To simplify this guide, we divide user-defined functions into three categories:
- UDAs - user-defined aggregates, which perform a single scan of the data source and return an aggregated value for a group of rows. All UDA component functions should be written in PL/C (C/C++) for performance and portability reasons.
- Row Level UDFs - functions that operate on their arguments only and do not generate any SQL statements. These functions generate a result for each argument set or, equivalently, for each tuple they are executed on. The recommended language is the same as for UDAs.
- Driver UDFs - functions that usually drive the execution of a data processing algorithm and may perform multiple SQL operations, including data modification. In order to make this part of the code portable we suggest using PL/Python wrapper functions based on plain Python modules. DB access inside the Python modules should be implemented using the plpy.py module, which will serve as the abstraction layer.
NOTE: This topic is covered in much more detail in Design Patterns & Best Practices.
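As a sketch of the Driver UDF pattern, a plain Python module that performs all DB access through a plpy-like abstraction object might look as follows; the module name, function and SQL here are hypothetical illustrations, not MADlib code:

```python
# kmeans_driver.py - a plain Python module. All DB access goes through a
# plpy-like object handed in by a thin PL/Python wrapper UDF, so the
# module itself stays free of driver-specific database APIs.

def run(plpy, input_table, k):
    """Hypothetical driver routine: validates input via SQL, logs
    progress, and raises through the abstraction layer on error."""
    rows = plpy.execute("SELECT count(*) AS n FROM %s" % input_table)
    n = rows[0]["n"]
    if n < k:
        plpy.error("need at least %d rows, got %d" % (k, n))
    plpy.info("clustering %d rows into %d clusters" % (n, k))
    return n
```

In the database, the wrapper UDF would simply import this module and pass in the real plpy; in unit tests, a stub object with the same execute/info/error methods can stand in for it.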
The suggestions below on name overloading apply to all of the above-mentioned types of user-defined functions.
Some platforms (like PostgreSQL) allow the use of the ANYELEMENT/ANYARRAY datatypes, which MADlib routines can use (whenever it makes sense) in order to minimize code duplication.
If ANYELEMENT/ANYARRAY functionality is not available or not feasible, function name overloading can be used for different argument data types. For example, function F1 from module M1 can have the following versions:
- TEXT data type example:
madlib.m1_f1( arg1 TEXT)
- NUMERIC data type example:
madlib.m1_f1( arg1 BIGINT/FLOAT/etc.)
The overloading mechanism should also be used for different sets of parameters. For example, if (reqarg1, ..., reqargN) is a set of required parameters for function F1 from module M1, then the following definitions would be correct:
- A version for required arguments only:
madlib.m1_f1( reqarg1, ..., reqargN)
- A version for both required and optional arguments:
madlib.m1_f1( reqarg1, ..., reqargN, optarg1, ..., optargN)
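On the Python side of such a PL/Python wrapper pair, both SQL signatures typically funnel into a single routine, with the optional-argument version mapping onto default parameter values. A minimal sketch; all names and default values here are hypothetical:

```python
def m1_f1(reqarg1, reqarg2, optarg1=0.001, optarg2=100):
    """Both SQL signatures of madlib.m1_f1 can delegate here: the
    required-arguments-only version relies on the defaults, the full
    version supplies the optional arguments explicitly."""
    return {"req": (reqarg1, reqarg2), "opt": (optarg1, optarg2)}

# Required-arguments-only call picks up the documented defaults:
m1_f1("my_schema.points", 10)
# Full call overrides them:
m1_f1("my_schema.points", 10, 0.01, 500)
```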
- Should follow the naming conventions described in section 2.
- Should follow the function overloading rules as described in section 4. On Greenplum and PostgreSQL this can be achieved via PL/Python wrapper UDFs based on the same main Python code.
Parameters of the execution should be supplied directly in the function call (as opposed to passing a reference ID to a parameter set stored in a table), for example:
SELECT madlib.m1_f1( par1 TEXT/INT/etc, par2 TEXT[]/INT[]/etc, ...)
Data should be passed to the function in the form of a text argument schema.table representing an existing table or a view, which:
- Can be located in any schema as long as the database user executing the function has read permissions.
- Should be defined in the method documentation, for example:
TABLE|VIEW (
col_x INT,
col_y FLOAT,
col_z TEXT
)
- The input relation and its attributes needed by the function should be validated using helper primitives. See section 5.4 for more information.
Summary
Each Driver UDF should return some summary output in the form of a predefined record/row. Each attribute of the result should be clearly defined in the method documentation. For example, output of k-means clustering algorithm could look like this:
clusters | pct_input_used | output_schema | cluster_table | point_table
----------+----------------+---------------+---------------+-------------
10 | 100 | my_schema | my_centroids | my_points
The above output can be achieved in the following way:
- Create a data type madlib.kmeans_result for the return set:
CREATE TYPE madlib.kmeans_result AS (
clusters INTEGER,
pct_input_used FLOAT,
output_schema TEXT,
cluster_table TEXT,
point_table TEXT
);
- Assuming the implementation is done in the recommended language (see section 3), use yield and return to generate a single row of output inside a Python routine:
CREATE OR REPLACE FUNCTION madlib.kmeans_dummy()
RETURNS SETOF madlib.kmeans_result
AS $$
yield (10, 100.0, 'my_schema', 'my_centroids', 'my_points')
return
$$ LANGUAGE plpythonu;
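Stripped of the PL/Python wrapper, the yield/return pattern above is ordinary Python generator behaviour, which can be checked outside the database (function name kept from the example above):

```python
def kmeans_dummy():
    # Yield exactly one result row, then stop - the same shape the
    # SETOF-returning PL/Python function produces.
    yield (10, 100.0, 'my_schema', 'my_centroids', 'my_points')
    return

rows = list(kmeans_dummy())
# rows == [(10, 100.0, 'my_schema', 'my_centroids', 'my_points')]
```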
Returning Small Data Set
When dealing with small output it is recommended to store it in predefined table(s) located in madlib
schema.
These tables should be created during the installation of the module and should have a PK column set defined
to make the retrieval of the results possible and easy. The PK value corresponding to the current run should be
returned in the output summary.
Returning Large Data Set
When dealing with large output it is better to direct it into separate user-defined tables. These tables should be created during the execution of the function. Their names should be provided by the user as function arguments and don't need to be returned in the output. In the rare case of a method returning a variable number of tables, their names can be auto-generated based on a prefix given by the user.
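The prefix-based auto-generation mentioned above can be sketched in plain Python; the helper name and the numeric-suffix naming scheme are hypothetical:

```python
def output_table_names(prefix, count):
    """Generate deterministic output table names from a user-supplied
    prefix, for the rare case of a variable number of result tables."""
    return ["%s_%d" % (prefix, i) for i in range(1, count + 1)]

# A prefix of 'my_schema.run1_out' and 3 tables yields:
# ['my_schema.run1_out_1', 'my_schema.run1_out_2', 'my_schema.run1_out_3']
```

Because the names are deterministic, the caller can reconstruct them from the prefix alone, so they need not appear in the output summary.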
- ERROR
If the function encounters a problem it should raise an error using the plpy.error(message) function. This ensures proper termination of the execution and error propagation to the calling environment.
- INFO
During SQL command line execution of long-running functions it may be useful to generate some log output to indicate the current state of the process. This log information could potentially be lost if the SQL calls are generated from automated scripts or GUI tools, so it should not carry any critical information that is not included in the execution summary. Use the plpy.info(message) function to generate informational log output. Example log output:
SQL> select madlib.kmeans_run( 'my_schema.data_set_1', 10, 1, 'run1', 'my_schema');
INFO: Parameters:
INFO: * k = 10 (number of centroids)
INFO: * input_table = my_schema.data_set_1
INFO: * goodness = 1 (GOF test on)
INFO: * run_id = run1
INFO: * output_schema = my_schema
INFO: Seeding 10 centroids...
INFO: Using sample data set for analysis... (9200 out of 10000 points)
INFO: ...Iteration 1
INFO: ...Iteration 2
INFO: Exit reason: fraction of reassigned nodes is smaller than the limit: 0.001
INFO: Expanding cluster assignment to all points...
INFO: Calculating goodness of fit...
...
Parameter validation should be performed in each MADlib function with the help of the support module (TO BE DEFINED). Below is a sample list of common validation tests:
For simple arguments (scalar, array):
- data type check
def Boolean check_type( arg_value, expected_type)
- value range check
def Boolean check_value_range( arg_value, expected_type, min, max)
For table/view reference arguments:
- existence check (including schema)
def Boolean check_rel_existence( relation_name)
- row count sanity check (in rare cases)
def Boolean check_rel_rowcount( relation_name, min, max)
- expected structure check
def Boolean check_rel_column( relation_name, column_name, data_type)
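As a rough illustration, the scalar-argument helpers and the existence check could look like the following in plain Python; since the support module is yet to be defined, the bodies, parameter names, and the catalog query below are assumptions:

```python
def check_type(arg_value, expected_type):
    """Simple argument data type check."""
    return isinstance(arg_value, expected_type)

def check_value_range(arg_value, expected_type, min_val, max_val):
    """Type check plus inclusive range check on a scalar argument."""
    return (isinstance(arg_value, expected_type)
            and min_val <= arg_value <= max_val)

def check_rel_existence(plpy, relation_name):
    """Existence check for a schema-qualified table/view, issued through
    the plpy abstraction layer (catalog query is an assumption)."""
    schema, rel = relation_name.split(".", 1)
    rows = plpy.execute(
        "SELECT count(*) AS n FROM information_schema.tables "
        "WHERE table_schema = '%s' AND table_name = '%s'" % (schema, rel))
    return rows[0]["n"] > 0
```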
In order to avoid unpleasant situations of overwriting or losing results, MADlib functions should be ready for execution in a multi-session or multi-user environment. Hence the following requirements should be met:
- Input relations (tables or views) should be used for read-only purposes.
- Any common results table should be locked in ACCESS SHARE MODE:
LOCK TABLE madlib.sv_model IN ACCESS SHARE MODE;
- Any execution-specific tables should be locked in EXCLUSIVE MODE after creation:
LOCK TABLE my_schema.my_output_1 IN EXCLUSIVE MODE;
There is no need to release the locks, as they will persist until the end of the main UDF.
- Common output tables should have a PK column set to ensure data safety. The unique ID can be supplied by the user as an argument or, if not provided, should be auto-generated from a sequence. A separate sequence should be created for each table during module installation; for example, to create a sequence for the madlib.sv_model table you can run:
CREATE SEQUENCE madlib_sv_model START 1;
Then the unique ID can be generated in the following way:
SELECT nextval('madlib_sv_model');
nextval
---------
1
SELECT nextval('madlib_sv_model');
nextval
---------
2