pydqc.infer_schema.infer_schema(_data, fname, output_root='', sample_size=1.0, type_threshold=0.5, n_jobs=1, base_schema=None, base_schema_feature_colname='column', base_schema_dtype_colname='type')
function: infer data types for all columns of the input table
Parameters:
- _data: pandas DataFrame
- data table to infer
- fname: string
- the output file name
- output_root: string, default=''
- the root directory for the output file
- sample_size: int or float(<=1.0), default=1.0
- int: number of rows to sample for inferring data types (useful for large tables)
- float: fraction of rows to sample
- type_threshold: float(<= 1.0), default=0.5
- threshold for inferring data type
- n_jobs: int, default=1
- the number of jobs to run in parallel
- base_schema: pandas DataFrame, default=None
- an existing data schema to use as a base for inference
- base_schema_feature_colname: string, default='column'
- name of the feature column in the base schema
- base_schema_dtype_colname: string, default='type'
- name of the dtype column in the base schema
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare
data_2016 = pd.read_csv('data/properties_2016.csv')
infer_schema.infer_schema(_data=data_2016, fname='properties_2016', output_root='output/',
sample_size=1.0, type_threshold=0.5, n_jobs=1,
base_schema=None, base_schema_feature_colname='column', base_schema_dtype_colname='type')
# with base schema
data_2016_schema = pd.read_excel('output/data_schema_properties_2016_mdf.xlsx')
data_2017 = pd.read_csv('data/properties_2017.csv')
infer_schema.infer_schema(_data=data_2017, fname='properties_2017_sample', output_root='output/',
sample_size=0.1, type_threshold=0.5, n_jobs=1,
base_schema=data_2016_schema, base_schema_feature_colname='column',
base_schema_dtype_colname='type')
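The sample_size parameter accepts either an absolute row count or a fraction. A minimal sketch of that interpretation with plain pandas (an assumption based on the parameter description above, not pydqc's actual implementation):

```python
import pandas as pd

def sample_rows(df, sample_size=1.0):
    # Assumption: a float <= 1.0 is a fraction of rows,
    # while a larger int is an absolute row count.
    if sample_size <= 1.0:
        return df.sample(frac=sample_size, random_state=0)
    return df.sample(n=int(sample_size), random_state=0)

df = pd.DataFrame({'a': range(100)})
print(len(sample_rows(df, 0.1)))  # 10
print(len(sample_rows(df, 25)))   # 25
```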
pydqc.data_summary.data_summary(table_schema, _table, fname, sample_size=1.0, sample_rows=100, feature_colname='column', dtype_colname='type', output_root='', n_jobs=1)
function: summarize basic information of all columns in a data table, based on the provided data schema
Parameters:
- table_schema: pandas DataFrame
- schema of the table, should contain data types of each column
- _table: pandas DataFrame
- the data table
- fname: string
- the output file name
- sample_size: int or float(<=1.0), default=1.0
- int: number of rows to sample for the summary (useful for large tables)
- float: fraction of rows to sample
- sample_rows: int, default=100
- number of rows to display as data samples
- feature_colname: string, default='column'
- name of the column for feature
- dtype_colname: string, default='type'
- name of the column for data type
- output_root: string, default=''
- the root directory for the output file
- n_jobs: int, default=1
- the number of jobs to run in parallel
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare
data_2016 = pd.read_csv('data/properties_2016.csv')
# we should use the modified data schema
data_2016_schema = pd.read_excel('output/data_schema_properties_2016_mdf.xlsx')
data_summary.data_summary(table_schema=data_2016_schema, _table=data_2016, fname='properties_2016',
sample_size=1.0, feature_colname='column', dtype_colname='type',
output_root='output/', n_jobs=1)
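If you prefer not to modify the generated Excel schema, a table_schema can also be built by hand. A minimal sketch (assumption: the schema is a DataFrame with one row per feature, using the 'column'/'type' names that match the feature_colname/dtype_colname defaults on this page; check a generated schema file for the exact type vocabulary):

```python
import pandas as pd

# Hand-built schema: one row per feature, with its name and inferred type.
schema = pd.DataFrame({
    'column': ['parcelid', 'bathroomcnt', 'transactiondate'],
    'type': ['key', 'numeric', 'date'],
})
print(schema)
```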
pydqc.data_summary.data_summary_notebook(table_schema, _table, fname, sample=False, feature_colname='column', dtype_colname='type', output_root='')
function: automatically generate ipynb for data summary
Parameters:
- table_schema: pandas DataFrame
- schema of the table, should contain data types of each column
- _table: pandas DataFrame
- the data table
- fname: string
- the output file name
- sample: boolean, default=False
- whether to do sampling on the original data
- feature_colname: string
- name of the column for feature
- dtype_colname: string
- name of the column for data type
- output_root: string, default=''
- the root directory for the output file
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare
data_2016 = pd.read_csv('data/properties_2016.csv')
# we should use the modified data schema
data_2016_schema = pd.read_excel('output/data_schema_properties_2016_mdf.xlsx')
data_summary.data_summary_notebook(table_schema=data_2016_schema, _table=data_2016, fname='properties_2016',
sample=False, feature_colname='column', dtype_colname='type', output_root='output/')
pydqc.data_summary.distribution_summary_pretty(_value_df, col, figsize=None, date_flag=False)
function: draw pretty distribution graph for a column
Parameters:
- _value_df: pandas DataFrame
- slice of dataframe containing enough information to check
- col: string
- name of column to check
- figsize: tuple, default=None
- figure size
- date_flag: bool, default=False
- whether it is checking date features
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare
table = pd.read_csv('../data/properties_2016.csv')
col = "basementsqft"
value_df = table[[col]].copy()
data_summary.distribution_summary_pretty(value_df, col, figsize=None, date_flag=False)
pydqc.data_compare.data_compare(_table1, _table2, _schema1, _schema2, fname, sample_size=1.0, feature_colname1='column', feature_colname2='column', dtype_colname1='type', dtype_colname2='type', output_root='', n_jobs=1)
function: compare values of the same columns between two tables
Parameters:
- _table1: pandas DataFrame
- one of the two tables to compare
- _table2: pandas DataFrame
- one of the two tables to compare
- _schema1: pandas DataFrame
- data schema (contains column names and corresponding data types) for _table1
- _schema2: pandas DataFrame
- data schema (contains column names and corresponding data types) for _table2
- fname: string
- the output file name
- sample_size: int or float(<=1.0), default=1.0
- int: number of rows to sample for the comparison (useful for large tables)
- float: fraction of rows to sample
- feature_colname1: string, default='column'
- name of the column for feature of _table1
- feature_colname2: string, default='column'
- name of the column for feature of _table2
- dtype_colname1: string, default='type'
- name of the column for data type of _table1
- dtype_colname2: string, default='type'
- name of the column for data type of _table2
- output_root: string, default=''
- the root directory for the output file
- n_jobs: int, default=1
- the number of jobs to run in parallel
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare
data_2016 = pd.read_csv('data/properties_2016.csv')
data_2017 = pd.read_csv('data/properties_2017.csv')
# we should use the modified data schema
data_2016_schema = pd.read_excel('output/data_schema_properties_2016_mdf.xlsx')
data_2017_schema = pd.read_excel('output/data_schema_properties_2017_mdf.xlsx')
data_compare.data_compare(_table1=data_2016, _table2=data_2017, _schema1=data_2016_schema, _schema2=data_2017_schema,
fname='properties_2016', sample_size=1.0, feature_colname1='column', feature_colname2='column',
dtype_colname1='type', dtype_colname2='type', output_root='output/', n_jobs=1)
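data_compare only has something to report for columns present in both tables, so it can help to see that overlap up front. A quick check with plain pandas (an illustration, not part of pydqc's API):

```python
import pandas as pd

df1 = pd.DataFrame({'parcelid': [1, 2], 'bathroomcnt': [2.0, 1.0], 'old_col': [0, 0]})
df2 = pd.DataFrame({'parcelid': [1, 2], 'bathroomcnt': [2.0, 1.5], 'new_col': [1, 1]})

# Columns shared by both tables, and columns unique to each side.
shared = sorted(set(df1.columns) & set(df2.columns))
only_1 = sorted(set(df1.columns) - set(df2.columns))
print(shared)  # ['bathroomcnt', 'parcelid']
print(only_1)  # ['old_col']
```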
pydqc.data_compare.data_compare_notebook(_table1, _table2, _schema1, _schema2, fname, sample=False, feature_colname1='column', feature_colname2='column', dtype_colname1='type', dtype_colname2='type', output_root='')
function: automatically generate ipynb for data comparison
Parameters:
- _table1: pandas DataFrame
- one of the two tables to compare
- _table2: pandas DataFrame
- one of the two tables to compare
- _schema1: pandas DataFrame
- data schema (contains column names and corresponding data types) for _table1
- _schema2: pandas DataFrame
- data schema (contains column names and corresponding data types) for _table2
- fname: string
- the output file name
- sample: boolean, default=False
- whether to do sampling on the original data
- feature_colname1: string, default='column'
- name of the column for feature of _table1
- feature_colname2: string, default='column'
- name of the column for feature of _table2
- dtype_colname1: string, default='type'
- name of the column for data type of _table1
- dtype_colname2: string, default='type'
- name of the column for data type of _table2
- output_root: string, default=''
- the root directory for the output file
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare
data_2016 = pd.read_csv('data/properties_2016.csv')
data_2017 = pd.read_csv('data/properties_2017.csv')
# we should use the modified data schema
data_2016_schema = pd.read_excel('output/data_schema_properties_2016_mdf.xlsx')
data_2017_schema = pd.read_excel('output/data_schema_properties_2017_mdf.xlsx')
data_compare.data_compare_notebook(_table1=data_2016, _table2=data_2017, _schema1=data_2016_schema, _schema2=data_2017_schema,
fname='properties_2016', sample=False, feature_colname1='column', feature_colname2='column',
dtype_colname1='type', dtype_colname2='type', output_root='output/')
pydqc.data_compare.distribution_compare_pretty(_df1, _df2, col, figsize=None, date_flag=False)
function: draw pretty distribution graph for comparing a column between two tables
Parameters:
- _df1: pandas DataFrame
- slice of table1 containing enough information to check
- _df2: pandas DataFrame
- slice of table2 containing enough information to check
- col: string
- name of column to check
- figsize: tuple, default=None
- figure size
- date_flag: bool, default=False
- whether it is checking date features
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare
table1 = pd.read_csv('data/properties_2016.csv')
table2 = pd.read_csv('data/properties_2017.csv')
col = "bathroomcnt"
df1 = table1[[col]].copy()
df2 = table2[[col]].copy()
data_compare.distribution_compare_pretty(df1, df2, col, figsize=None, date_flag=False)
pydqc.data_consist.data_consist(_table1, _table2, _key1, _key2, _schema1, _schema2, fname, sample_size=1.0, feature_colname1='column', feature_colname2='column', dtype_colname1='type', dtype_colname2='type', output_root='', keep_images=False, n_jobs=1)
function: check consistency of the same columns between two tables, with rows matched on the given keys
Parameters:
- _table1: pandas DataFrame
- one of the two tables to compare
- _table2: pandas DataFrame
- one of the two tables to compare
- _key1: string
- key for table1
- _key2: string
- key for table2
- _schema1: pandas DataFrame
- data schema (contains column names and corresponding data types) for _table1
- _schema2: pandas DataFrame
- data schema (contains column names and corresponding data types) for _table2
- fname: string
- the output file name
- sample_size: int or float(<=1.0), default=1.0
- int: number of rows to sample for the consistency check (useful for large tables)
- float: fraction of rows to sample
- feature_colname1: string, default='column'
- name of the column for feature of _table1
- feature_colname2: string, default='column'
- name of the column for feature of _table2
- dtype_colname1: string, default='type'
- name of the column for data type of _table1
- dtype_colname2: string, default='type'
- name of the column for data type of _table2
- output_root: string, default=''
- the root directory for the output file
- keep_images: boolean, default=False
- whether to keep all generated images
- n_jobs: int, default=1
- the number of jobs to run in parallel
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare, data_consist
data_2016 = pd.read_csv('data/properties_2016.csv')
data_2017 = pd.read_csv('data/properties_2017.csv')
# we should use the modified data schema
data_2016_schema = pd.read_excel('output/data_schema_properties_2016_mdf.xlsx')
data_2017_schema = pd.read_excel('output/data_schema_properties_2017_mdf.xlsx')
data_consist.data_consist(_table1=data_2016, _table2=data_2017, _schema1=data_2016_schema, _schema2=data_2017_schema,
_key1='parcelid', _key2='parcelid',
fname='properties_2016', sample_size=1.0, feature_colname1='column', feature_colname2='column',
dtype_colname1='type', dtype_colname2='type', output_root='output/', n_jobs=1)
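Conceptually, a consistency check aligns rows of the two tables on their keys before comparing values. A sketch of that alignment with plain pandas (an illustration of the idea, not pydqc's implementation):

```python
import pandas as pd

df1 = pd.DataFrame({'parcelid': [1, 2, 3], 'bathroomcnt': [2.0, 1.0, 3.0]})
df2 = pd.DataFrame({'parcelid': [2, 3, 4], 'bathroomcnt': [1.0, 2.5, 2.0]})

# Match rows on the key, then compare the shared column value by value.
merged = df1.merge(df2, on='parcelid', suffixes=('_1', '_2'))
merged['consistent'] = merged['bathroomcnt_1'] == merged['bathroomcnt_2']
print(merged)
```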
pydqc.data_consist.data_consist_notebook(_table1, _table2, _key1, _key2, _schema1, _schema2, fname, sample=False, feature_colname1='column', feature_colname2='column', dtype_colname1='type', dtype_colname2='type', output_root='')
function: automatically generate ipynb for data consistency check
Parameters:
- _table1: pandas DataFrame
- one of the two tables to compare
- _table2: pandas DataFrame
- one of the two tables to compare
- _key1: string
- key for table1
- _key2: string
- key for table2
- _schema1: pandas DataFrame
- data schema (contains column names and corresponding data types) for _table1
- _schema2: pandas DataFrame
- data schema (contains column names and corresponding data types) for _table2
- fname: string
- the output file name
- sample: boolean, default=False
- whether to do sampling on the original data
- feature_colname1: string, default='column'
- name of the column for feature of _table1
- feature_colname2: string, default='column'
- name of the column for feature of _table2
- dtype_colname1: string, default='type'
- name of the column for data type of _table1
- dtype_colname2: string, default='type'
- name of the column for data type of _table2
- output_root: string, default=''
- the root directory for the output file
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare, data_consist
data_2016 = pd.read_csv('data/properties_2016.csv')
data_2017 = pd.read_csv('data/properties_2017.csv')
# we should use the modified data schema
data_2016_schema = pd.read_excel('output/data_schema_properties_2016_mdf.xlsx')
data_2017_schema = pd.read_excel('output/data_schema_properties_2017_mdf.xlsx')
data_consist.data_consist_notebook(_table1=data_2016, _table2=data_2017, _key1='parcelid', _key2='parcelid',
_schema1=data_2016_schema, _schema2=data_2017_schema,
fname='properties', feature_colname1='column', feature_colname2='column',
dtype_colname1='type', dtype_colname2='type', output_root='output/')
pydqc.data_consist.numeric_consist_pretty(_df1, _df2, _key1, _key2, col, figsize=None, date_flag=False)
function: draw pretty consistency graph for numeric columns
Parameters:
- _df1: pandas DataFrame
- slice of table1 containing enough information to check
- _df2: pandas DataFrame
- slice of table2 containing enough information to check
- _key1: string
- key for table1
- _key2: string
- key for table2
- col: string
- name of column to check
- figsize: tuple, default=None
- figure size
- date_flag: bool, default=False
- whether it is checking date features
Example:
import pandas as pd
from pydqc import infer_schema, data_summary, data_compare, data_consist
table1 = pd.read_csv('data/properties_2016.csv')
table2 = pd.read_csv('data/properties_2017.csv')
key = "parcelid"
col = "bathroomcnt"
df1 = table1[[key, col]].copy()
df2 = table2[[key, col]].copy()
data_consist.numeric_consist_pretty(df1, df2, key, key, col, figsize=None, date_flag=False)