From 14e6994daf5fcd259f57a0668ac0b06bf936e352 Mon Sep 17 00:00:00 2001
From: zhu0619 <zhu0619@users.noreply.github.com>
Date: Fri, 2 Aug 2024 17:53:20 +0000
Subject: [PATCH] Deployed 3b8792a to main with MkDocs 1.6.0 and mike 2.1.2

---
 main/index.html               | 21 ++++++++++++++++++++-
 main/search/search_index.json |  2 +-
 2 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/main/index.html b/main/index.html
index 03f1500..c111c4a 100644
--- a/main/index.html
+++ b/main/index.html
@@ -964,7 +964,26 @@ <h1 id="introduction">Introduction</h1>
 <p>Welcome to the Auroris - Simplifying Drug Discovery Data Curation</p>
 <hr />
 <h2 id="what-is-auroris">What is Auroris?</h2>
-<p>Auroris is a comprehensive Python library designed to assist researchers and scientists in managing, cleaning, and preparing data relevant to drug discovery. Our mission is to implement a range of techniques to handle, transform, filter, analyze, or visualize the diverse data types commonly encountered in drug discovery.</p>
+<p>Auroris is a Python library designed to assist researchers and scientists in managing, cleaning, and preparing data relevant to drug discovery. Auroris will implement a range of techniques to handle, transform, filter, analyze, or visualize the diverse data types commonly encountered in drug discovery.</p>
+<p>Currently, Auroris supports curation for small molecules, with plans to extend to other modalities in drug discovery. The curation module for small molecules includes:</p>
+<ul>
+<li>
+<p>🗄️ Molecule Standardization: Ensures that each molecule is represented in a uniform and unambiguous form.</p>
+</li>
+<li>
+<p>🏷️ Detection of Duplicate Molecules with Contradictory Labels: Identifies and resolves inconsistencies in activity data for each molecule.</p>
+</li>
+<li>
+<p>⛰️ Detection of Activity Cliffs Between Stereoisomers: Identifies significant differences in activity between stereoisomers.</p>
+</li>
+<li>
+<p>🔍 Outlier Detection and Visualization: Detects and visualizes outliers in molecular activity data.</p>
+</li>
+<li>
+<p>📽️ Visualization of Molecular Distribution in Chemical Space: Provides graphical representations of molecular distributions.</p>
+</li>
+</ul>
+<p>Reproducibility and transparency are core to the mission of Polaris. That’s why Auroris can also automatically generate detailed reports summarizing the changes made to a dataset during curation. Through an intuitive API, you can easily define complex curation workflows. Once defined, a workflow is serializable and thus reproducible, so you can transparently share how you curated the dataset.</p>
 <h2 id="where-to-next">Where to next?</h2>
 <hr />
 <p><strong><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><!--! Font Awesome Free 6.6.0 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2024 Fonticons, Inc.--><path d="M156.6 384.9 125.7 354c-8.5-8.5-11.5-20.8-7.7-32.2 3-8.9 7-20.5 11.8-33.8H24c-8.6 0-16.6-4.6-20.9-12.1s-4.2-16.7.2-24.1l52.5-88.5c13-21.9 36.5-35.3 61.9-35.3H200c2.4-4 4.8-7.7 7.2-11.3C289.1-4.1 411.1-8.1 483.9 5.3c11.6 2.1 20.6 11.2 22.8 22.8 13.4 72.9 9.3 194.8-111.4 276.7-3.5 2.4-7.3 4.8-11.3 7.2v82.3c0 25.4-13.4 49-35.3 61.9l-88.5 52.5c-7.4 4.4-16.6 4.5-24.1.2S224 496.7 224 488V380.8c-14.1 4.9-26.4 8.9-35.7 11.9-11.2 3.6-23.4.5-31.8-7.8zM384 168a40 40 0 1 0 0-80 40 40 0 1 0 0 80z"/></svg></span>  Quickstart</strong></p>
diff --git a/main/search/search_index.json b/main/search/search_index.json
index ae9879b..eb3c2db 100644
--- a/main/search/search_index.json
+++ b/main/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Introduction","text":"<p>Welcome to the Auroris - Simplifying Drug Discovery Data Curation</p>"},{"location":"index.html#what-is-auroris","title":"What is Auroris?","text":"<p>Auroris is a comprehensive Python library designed to assist researchers and scientists in managing, cleaning, and preparing data relevant to drug discovery. Our mission is to implement a range of techniques to handle, transform, filter, analyze, or visualize the diverse data types commonly encountered in drug discovery.</p>"},{"location":"index.html#where-to-next","title":"Where to next?","text":"<p>  Quickstart</p> <p>Dive deeper into the Auroris code and learn how to curate data for your ML-powered drug discovery program. </p> <p> Let's get started</p> <p>  API Reference</p> <p>Explore the technical documentation here to delve into the inner workings of the code. Gain insights into the intricate details of how different methods and classes function.</p> <p> Let's get started</p> <p>  Community</p> <p>We're excited to have you join us in revolutionizing drug discovery data curation! Explore Auroris and the broader Polaris ecosystem it is part of, provide feedback, share your use cases, and collaborate with us to enhance and expand the capabilities of Auroris for the benefit of the drug discovery community.</p> <p> Let's get started</p>"},{"location":"api/actions.html","title":"Actions","text":""},{"location":"api/actions.html#auroris.curation.actions.BaseAction","title":"auroris.curation.actions.BaseAction","text":"<p>               Bases: <code>BaseModel</code>, <code>ABC</code></p> <p>An action in the curation process.</p> The importance of reproducibility <p>One of the main goals in designing <code>auroris</code> is to make it easy to reproduce the curation process. Reproducibility is key to scientific research. 
This is why a BaseAction needs to be serializable and uniquely identified by a <code>name</code>.</p> <p>Attributes:</p> Name Type Description <code>name</code> <code>str</code> <p>The name that uniquely identifies the action. This is used to serialize and deserialize the action.</p> <code>prefix</code> <code>str</code> <p>This prefix is used when an action adds columns to a dataset. If not set, it defaults to the name in uppercase.</p>"},{"location":"api/actions.html#auroris.curation.actions.StereoIsomerACDetection","title":"StereoIsomerACDetection","text":"<p>               Bases: <code>BaseAction</code></p> <p>Automatic detection of activity shift between stereoisomers.</p> <p>See <code>auroris.curation.functional.detect_streoisomer_activity_cliff</code> for the docs of the <code>stereoisomer_id_col</code>, <code>y_cols</code> and <code>threshold</code> attributes</p> <p>Attributes:</p> Name Type Description <code>mol_col</code> <code>Optional[str]</code> <p>Column with the SMILES or RDKit Molecule objects. 
If specified, will be used to render an image for the activity cliffs.</p>"},{"location":"api/actions.html#auroris.curation.actions.Deduplication","title":"Deduplication","text":"<p>               Bases: <code>BaseAction</code></p> <p>Automatic detection of outliers.</p> <p>See <code>auroris.curation.functional.deduplicate</code> for the docs of the <code>deduplicate_on</code>, <code>y_cols</code>, <code>keep</code> and <code>method</code> attributes</p>"},{"location":"api/actions.html#auroris.curation.actions.Discretization","title":"Discretization","text":"<p>               Bases: <code>BaseAction</code></p> <p>Thresholding bioactivity columns to binary or multiclass labels.</p> <p>See <code>auroris.curation.functional.discretize</code> for the docs of the <code>thresholds</code>, <code>inplace</code>, <code>allow_nan</code> and <code>label_order</code> attributes</p> <p>Attributes:</p> Name Type Description <code>input_column</code> <code>str</code> <p>The column to discretize.</p> <code>log_scale</code> <code>bool</code> <p>Whether a visual depiction of the discretization should be on a log scale.</p>"},{"location":"api/actions.html#auroris.curation.actions.ContinuousDistributionVisualization","title":"ContinuousDistributionVisualization","text":"<p>               Bases: <code>BaseAction</code></p> <p>Visualize one or more continuous distribution(s).</p> <p>See <code>auroris.visualization.visualize_continuous_distribution</code> for the docs of the <code>log_scale</code> and <code>bins</code> attributes</p> <p>Attributes:</p> Name Type Description <code>y_cols</code> <code>List[str]</code> <p>The columns whose distributions should be visualized.</p>"},{"location":"api/actions.html#auroris.curation.actions.MoleculeCuration","title":"MoleculeCuration","text":"<p>               Bases: <code>BaseAction</code></p> <p>Automated molecule curation and chemistry space distribution.</p> <p>See <code>auroris.curation.functional.curate_molecules</code> for the docs of the 
<code>remove_stereo</code>, <code>fix_mol</code>, <code>count_stereoisomers</code>, and <code>count_stereocenters</code> attributes</p> <p>Attributes:</p> Name Type Description <code>input_column</code> <code>str</code> <p>The name of the column that has the molecules (either <code>dm.Mol</code> objects or SMILES).</p> <code>X_col</code> <code>Optional[str]</code> <p>Column with custom features for each of the molecules. If None, will use ECFP.</p> <code>y_cols</code> <code>Optional[Union[str, List[str]]]</code> <p>Column names for bioactivities, which will be used to colorcode the chemical space visualization.</p>"},{"location":"api/actions.html#auroris.curation.actions.OutlierDetection","title":"OutlierDetection","text":"<p>               Bases: <code>BaseAction</code></p> <p>Automatic detection of outliers.</p> <p>See <code>auroris.curation.functional.detect_outliers</code> for the docs of the <code>method</code> and <code>kwargs</code> attributes</p> <p>Attributes:</p> Name Type Description <code>columns</code> <code>List[str]</code> <p>The columns for which to detect outliers.</p>"},{"location":"api/curator.html","title":"Curator","text":""},{"location":"api/curator.html#auroris.curation.Curator","title":"auroris.curation.Curator","text":"<p>               Bases: <code>BaseModel</code></p> <p>A curator is a serializable collection of actions that are applied to a dataset.</p> <p>Attributes:</p> Name Type Description <code>steps</code> <code>List[BaseAction]</code> <p>Ordered list of curation actions to apply to the dataset.</p> <code>src_dataset_path</code> <code>Optional[str]</code> <p>An optional path to load the source dataset from. 
Can be used to specify a reproducible workflow.</p> <code>verbosity</code> <code>VerbosityLevel</code> <p>Verbosity level for logging.</p> <code>parallelized_kwargs</code> <code>dict</code> <p>Keyword arguments to affect parallelization in the steps.</p>"},{"location":"api/curator.html#auroris.curation.Curator.transform","title":"transform","text":"<pre><code>transform(dataset: Optional[pd.DataFrame] = None) -&gt; Tuple[pd.DataFrame, CurationReport]\n</code></pre> <p>Runs the curation process.</p> <p>Parameters:</p> Name Type Description Default <code>dataset</code> <code>Optional[DataFrame]</code> <p>The dataset to be curated. If <code>src_dataset_path</code> is set, this parameter is ignored.</p> <code>None</code> <p>Returns:</p> Type Description <code>Tuple[DataFrame, CurationReport]</code> <p>A tuple of the curated dataset and a report summarizing the changes made.</p>"},{"location":"api/curator.html#auroris.curation.Curator.load_dataset","title":"load_dataset  <code>staticmethod</code>","text":"<pre><code>load_dataset(path: str)\n</code></pre> <p>Loads a dataset, to be curated, from a path.</p> File-format support <p>This currently only supports CSV and Parquet files and uses the default parameters for <code>pd.read_csv</code> and <code>pd.read_parquet</code>. 
If you need more flexibility, consider loading the data yourself and passing it directly to <code>Curator.transform(dataset=...)</code>.</p>"},{"location":"api/curator.html#auroris.curation.Curator.from_json","title":"from_json  <code>classmethod</code>","text":"<pre><code>from_json(path: str)\n</code></pre> <p>Loads a curation workflow from a JSON file.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>str</code> <p>The path to load from</p> required"},{"location":"api/curator.html#auroris.curation.Curator.to_json","title":"to_json","text":"<pre><code>to_json(path: str)\n</code></pre> <p>Saves the curation workflow to a JSON file.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>str</code> <p>The destination to save to.</p> required"},{"location":"api/functional.html","title":"Curation","text":""},{"location":"api/functional.html#auroris.curation.functional.detect_streoisomer_activity_cliff","title":"detect_streoisomer_activity_cliff","text":"<pre><code>detect_streoisomer_activity_cliff(dataset: pd.DataFrame, stereoisomer_id_col: str, y_cols: List[str], threshold: float = 2.0, prefix: str = 'AC_') -&gt; pd.DataFrame\n</code></pre> <p>Detect activity cliff among stereoisomers based on classification label or pre-defined threshold for continuous values.</p> <p>Parameters:</p> Name Type Description Default <code>dataset</code> <code>DataFrame</code> <p>Dataframe</p> required <code>stereoisomer_id_col</code> <code>str</code> <p>Column which identifies the stereoisomers</p> required <code>y_cols</code> <code>List[str]</code> <p>List of columns for bioactivities</p> required <code>threshold</code> <code>float</code> <p>Threshold to identify the activity cliff. 
Currently, the difference of zscores between isomers are used for identification.</p> <code>2.0</code> <code>prefix</code> <code>str</code> <p>Prefix for the adding columns</p> <code>'AC_'</code>"},{"location":"api/functional.html#auroris.curation.functional.deduplicate","title":"deduplicate","text":"<pre><code>deduplicate(dataset: pd.DataFrame, deduplicate_on: Optional[Union[str, List[str]]] = None, y_cols: Optional[Union[str, List[str]]] = None, keep: Literal['first', 'last'] = 'first', method: Literal['mean', 'median'] = 'median') -&gt; pd.DataFrame\n</code></pre> <p>Deduplicate a dataframe.</p> <p>If <code>deduplicate_on</code> specifies a subset of all columns in the dataset and <code>y_cols</code> specifies a set of non-overlapping columns, data will be grouped by <code>deduplicate_on</code> and the <code>y_cols</code> will be aggregated to a single value per group according to <code>method</code>.</p> <p>Parameters:</p> Name Type Description Default <code>dataset</code> <code>DataFrame</code> <p>The dataset to deduplicate.</p> required <code>deduplicate_on</code> <code>Optional[Union[str, List[str]]]</code> <p>A subset of the columns to deduplicate on (can be default).</p> <code>None</code> <code>y_cols</code> <code>Optional[Union[str, List[str]]]</code> <p>The columns to aggregate.</p> <code>None</code> <code>keep</code> <code>Literal['first', 'last']</code> <p>Whether to keep the first or last copy of the duplicates.</p> <code>'first'</code> <code>method</code> <code>Literal['mean', 'median']</code> <p>The method to aggregate the data.</p> <code>'median'</code>"},{"location":"api/functional.html#auroris.curation.functional.discretize","title":"discretize","text":"<pre><code>discretize(X: np.ndarray, thresholds: Union[np.ndarray, list], inplace: bool = False, allow_nan: bool = True, label_order: Literal['ascending', 'descending'] = 'ascending') -&gt; np.ndarray\n</code></pre> <p>Thresholding of array-like or scipy.sparse matrix into binary or multiclass 
labels.</p> <p>Parameters:</p> Name Type Description Default <code>X</code> <p>The data to discretize, element by element. scipy.sparse matrices should be in CSR or CSC format to avoid an un-necessary copy.</p> required <code>thresholds</code> <code>Union[ndarray, list]</code> <p>Interval boundaries that include the right bin edge.</p> required <code>inplace</code> <code>bool</code> <p>Set to True to perform inplace discretization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR / CSC matrix and if axis is 1).</p> <code>False</code> <code>allow_nan</code> <code>bool</code> <p>Set to True to allow nans in the array for discretization. Otherwise, an error will be raised instead.</p> <code>True</code> <code>label_order</code> <code>Literal['ascending', 'descending']</code> <p>The continuous values are discretized to labels 0, 1, 2, .., N with respect to given threshold bins [threshold_1, threshold_2,.., threshould_n]. When set to 'ascending', the class label is in ascending order with the threshold bins that <code>0</code> represents negative class or lower class, while 1, 2, 3 are for higher classes. When set to 'descending' the class label is in ascending order with the threshold bins. Sometimes the positive labels are on the left side of provided threshold. E.g. For binarization with threshold [0.5],  the positive label is defined by<code>X &lt; 0.5</code>. 
In this case, <code>label_order</code> should be <code>descending</code>.</p> <code>'ascending'</code> <p>Returns:</p> Name Type Description <code>X_tr</code> <code>ndarray</code> <p>The transformed data.</p>"},{"location":"api/functional.html#auroris.curation.functional.curate_molecules","title":"curate_molecules","text":"<pre><code>curate_molecules(mols: List[Union[str, dm.Mol]], progress: bool = True, remove_stereo: bool = False, fix_mol: bool = True, count_stereoisomers: bool = True, count_stereocenters: bool = True, **parallelized_kwargs) -&gt; Tuple\n</code></pre> <p>Curate a list of molecules.</p> <p>Parameters:</p> Name Type Description Default <code>mols</code> <code>List[Union[str, Mol]]</code> <p>List of molecules.</p> required <code>progress</code> <code>bool</code> <p>Whether show curation progress.</p> <code>True</code> <code>fix_mol</code> <code>bool</code> <p>Whether fix the error in molecule.</p> <code>True</code> <code>remove_stereo</code> <code>bool</code> <p>Whether remove stereo chemistry information from molecule.</p> <code>False</code> <code>count_stereoisomers</code> <code>bool</code> <p>Whether count the number of stereoisomers of molecule.</p> <code>True</code> <code>count_stereocenters</code> <code>bool</code> <p>Whether count the number of stereocenters of molecule.</p> <code>True</code> <p>Returns:</p> Name Type Description <code>mol_dict</code> <code>Tuple</code> <p>Dictionary of molecule and additional metadata</p> <code>num_invalid</code> <code>Tuple</code> <p>Number of inv\u00df\u00dfalid molecules</p>"},{"location":"api/functional.html#auroris.curation.functional.detect_outliers","title":"detect_outliers","text":"<pre><code>detect_outliers(X: np.ndarray, method: OutlierDetectionMethod = 'zscore', **kwargs: Any)\n</code></pre> <p>Functional interface for detecting outliers</p> <p>Parameters:</p> Name Type Description Default <code>X</code> <code>ndarray</code> <p>The observations that we want to classify as inliers or outliers.</p> 
required <code>method</code> <code>OutlierDetectionMethod</code> <p>The method to use for outlier detection.</p> <code>'zscore'</code> <code>**kwargs</code> <code>Any</code> <p>Keyword arguments for the outlier detection method.</p> <code>{}</code>"},{"location":"api/types.html","title":"Types","text":""},{"location":"api/types.html#auroris.types","title":"auroris.types","text":""},{"location":"api/types.html#auroris.types.VerbosityLevel","title":"VerbosityLevel","text":"<p>               Bases: <code>IntEnum</code></p> <p>The different verbosity levels</p>"},{"location":"api/utils.html","title":"Utils","text":""},{"location":"api/utils.html#auroris.utils.is_regression","title":"is_regression","text":"<pre><code>is_regression(values: np.ndarray) -&gt; bool\n</code></pre> <p>Whether the input values are for regreesion</p>"},{"location":"api/utils.html#auroris.utils.fig2img","title":"fig2img","text":"<pre><code>fig2img(fig: Figure) -&gt; ImageType\n</code></pre> <p>Convert a Matplotlib figure to a PIL Image</p>"},{"location":"api/utils.html#auroris.utils.img2bytes","title":"img2bytes","text":"<pre><code>img2bytes(image: ImageType)\n</code></pre> <p>Convert png image to bytes</p>"},{"location":"api/utils.html#auroris.utils.bytes2img","title":"bytes2img","text":"<pre><code>bytes2img(image_bytes: ByteString)\n</code></pre> <p>Convert bytes to PIL image</p>"},{"location":"api/utils.html#auroris.utils.save_image","title":"save_image","text":"<pre><code>save_image(image: ImageType, path: str)\n</code></pre> <p>Save an image to a fsspec-compatible path</p>"},{"location":"api/utils.html#auroris.utils.is_parquet_file","title":"is_parquet_file","text":"<pre><code>is_parquet_file(path)\n</code></pre> <p>Verify parquet file without actually loading it.</p>"},{"location":"api/visualization.html","title":"Visualization","text":""},{"location":"api/visualization.html#auroris.visualization.visualize_chemspace","title":"visualize_chemspace","text":"<pre><code>visualize_chemspace(X: 
np.ndarray, y: Optional[Union[List[np.ndarray], np.ndarray]] = None, labels: Optional[List[str]] = None, n_cols: int = 2, fig_base_size: float = 8, w_h_ratio: float = 0.5, dpi: int = 150, seaborn_theme: Optional[str] = 'whitegrid', plot_kwargs: dict = None, umap_kwargs: dict = None)\n</code></pre> <p>Plot the coverage in chemical space. Also, color based on the target values.</p> <p>Parameters:</p> Name Type Description Default <code>X</code> <code>ndarray</code> <p>Array the molecular features.</p> required <code>y</code> <code>Optional[Union[List[ndarray], ndarray]]</code> <p>A list of arrays with the target values.</p> <code>None</code> <code>labels</code> <code>Optional[List[str]]</code> <p>Optional list of labels for each set of features.</p> <code>None</code> <code>n_cols</code> <code>int</code> <p>Number of columns in the subplots.</p> <code>2</code> <code>fig_base_size</code> <code>float</code> <p>Base size of the plots.</p> <code>8</code> <code>w_h_ratio</code> <code>float</code> <p>Width/height ratio.</p> <code>0.5</code> <code>dpi</code> <code>int</code> <p>DPI value of the figure.</p> <code>150</code> <code>seaborn_theme</code> <code>Optional[str]</code> <p>Seaborn theme.</p> <code>'whitegrid'</code> <code>plot_kwargs</code> <code>dict</code> <p>seaborn plot arguments.</p> <code>None</code> <code>umap_kwargs</code> <code>dict</code> <p>Keyword arguments for the UMAP algorithm.</p> <code>None</code>"},{"location":"api/visualization.html#auroris.visualization.visualize_continuous_distribution","title":"visualize_continuous_distribution","text":"<pre><code>visualize_continuous_distribution(data: np.ndarray, log_scale: bool = False, bins: Optional[Sequence[float]] = None)\n</code></pre> <p>KDE plot the distribution of the column in <code>data</code> with colored sections under the KDE curve.</p> <p>Parameters:</p> Name Type Description Default <code>data</code> <code>ndarray</code> <p>A 1D numpy array with the values to plot the distribution for.</p> 
required <code>log_scale</code> <code>bool</code> <p>Whether to plot the x-axis in log scale.</p> <code>False</code> <code>bins</code> <code>Optional[Sequence[float]]</code> <p>The bin boundaries to color the area under the KDE curve.</p> <code>None</code>"},{"location":"api/visualization.html#auroris.visualization.visualize_distribution_with_outliers","title":"visualize_distribution_with_outliers","text":"<pre><code>visualize_distribution_with_outliers(values: np.ndarray, is_outlier: Optional[List[bool]] = None, title: str = 'Probability Plot')\n</code></pre> <p>Visualize the distribution of the data and highlight the potential outliers.</p> <p>Parameters:</p> Name Type Description Default <code>values</code> <code>ndarray</code> <p>Values for visulization.</p> required <code>is_outlier</code> <code>Optional[List[bool]]</code> <p>List of outlier flag.</p> <code>None</code> <code>title</code> <code>str</code> <p>Title of plot</p> <code>'Probability Plot'</code>"},{"location":"tutorials/getting_started.html","title":"Getting Started","text":"<p>In short</p> <p>This tutorial gives an overview of the basic concepts in the `auroris` library.</p> <p>On the nuances of curation</p> <p>How to best curate a dataset is highly situation-dependent. The `auroris` library includes some useful tools, but blindly applying them won't necessarily lead to good datasets. To learn more, visit the Polaris Hub for extensive resources and documentation on dataset curation and more.</p> <p>Data curation is concerned with analyzing and processing an existing dataset to maximize its quality. Within drug discovery, this can imply many things, such as filtering out outliers or flagging activity-cliffs. High-quality, well-curated datasets are the foundation upon which we can build realistic, impactful benchmarks for drug discovery. This notebook demonstrates how to curate your dataset with the Polaris data curation API for small molecules.</p> In\u00a0[3]: Copied! 
<pre>import datamol as dm\n</pre> import datamol as dm In\u00a0[4]: Copied! <pre># Load your data set\n# See more details of the dataset at https://docs.datamol.io/stable/api/datamol.data.html\ndata = dm.data.solubility()\ndata.head(5)\n</pre> # Load your data set # See more details of the dataset at https://docs.datamol.io/stable/api/datamol.data.html data = dm.data.solubility() data.head(5) Out[4]: mol ID NAME SOL SOL_classification smiles split 0 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c2e0&gt; 1 n-pentane -3.18 (A) low CCCCC train 1 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c430&gt; 2 cyclopentane -2.64 (B) medium C1CCCC1 train 2 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c4a0&gt; 3 n-hexane -3.84 (A) low CCCCCC train 3 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c510&gt; 4 2-methylpentane -3.74 (A) low CCCC(C)C train 4 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c580&gt; 6 2,2-dimethylbutane -3.55 (A) low CCC(C)(C)C train In\u00a0[5]: Copied! <pre>from auroris.curation import Curator\nfrom auroris.curation.actions import MoleculeCuration, OutlierDetection, Discretization\n\n# Define the curation workflow\ncurator = Curator(\n    steps=[\n        MoleculeCuration(input_column=\"smiles\"),\n        OutlierDetection(method=\"zscore\", columns=[\"SOL\"]),\n        Discretization(input_column=\"SOL\", thresholds=[-3]),\n    ],\n    parallelized_kwargs={\"n_jobs\": -1},\n)\n\n# Run the curation\ndataset, report = curator(data)\n</pre> from auroris.curation import Curator from auroris.curation.actions import MoleculeCuration, OutlierDetection, Discretization  # Define the curation workflow curator = Curator(     steps=[         MoleculeCuration(input_column=\"smiles\"),         OutlierDetection(method=\"zscore\", columns=[\"SOL\"]),         Discretization(input_column=\"SOL\", thresholds=[-3]),     ],     parallelized_kwargs={\"n_jobs\": -1}, )  # Run the curation dataset, report = curator(data) <pre>2024-08-02 12:26:54.316 | INFO     | 
auroris.curation._curator:transform:106 - Performing step: mol_curation\n2024-08-02 12:27:12.343 | INFO     | auroris.curation._curator:transform:106 - Performing step: outlier_detection\n2024-08-02 12:27:12.400 | INFO     | auroris.curation._curator:transform:106 - Performing step: discretize\n</pre> <p>The report can be exported (\"broadcaster\") to a variety of different formats. Let's simply log it to the CLI for now.</p> In\u00a0[6]: Copied! <pre>from auroris.report.broadcaster import LoggerBroadcaster\n\nbroadcaster = LoggerBroadcaster(report)\nbroadcaster.broadcast()\n</pre> from auroris.report.broadcaster import LoggerBroadcaster  broadcaster = LoggerBroadcaster(report) broadcaster.broadcast() <pre>===== Curation Report =====\nTime: 2024-08-02 12:26:54\nVersion: 0.1.4.dev0+g7127343.d20240707\n===== mol_curation =====\n[LOG]: Couldn't preprocess 18 / 1282 molecules.\n[LOG]: New column added: MOL_smiles\n[LOG]: New column added: MOL_molhash_id\n[LOG]: New column added: MOL_molhash_id_no_stereo\n[LOG]: New column added: MOL_num_stereoisomers\n[LOG]: New column added: MOL_num_undefined_stereoisomers\n[LOG]: New column added: MOL_num_defined_stereo_center\n[LOG]: New column added: MOL_num_undefined_stereo_center\n[LOG]: New column added: MOL_num_stereo_center\n[LOG]: New column added: MOL_undefined_E_D\n[LOG]: New column added: MOL_undefined_E/Z\n[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.\n[LOG]: Molecules with undefined stereocenter detected: 253.\n[IMG]: Dimensions 1200 x 600\n[IMG]: Dimensions 1200 x 2400\n===== outlier_detection =====\n[LOG]: New column added: OUTLIER_SOL\n[LOG]: Found 7 potential outliers with respect to the SOL column for review.\n[IMG]: Dimensions 1200 x 600\n===== discretize =====\n[LOG]: New column added: CLS_SOL\n[IMG]: Dimensions 1200 x 600\n===== Curation Report END =====\n</pre> <p>We can see that there is also images in the report! 
More advanced broadcasters will display these, such as the <code>HTMLBroadcaster</code>.</p> In\u00a0[7]: Copied! <pre>from auroris.report.broadcaster import HTMLBroadcaster\nimport tempfile\n\ntemp_dir = tempfile.TemporaryDirectory().name\n\nbroadcaster = HTMLBroadcaster(report=report, destination=temp_dir, embed_images=True)\nbroadcaster.broadcast()\n</pre> from auroris.report.broadcaster import HTMLBroadcaster import tempfile  temp_dir = tempfile.TemporaryDirectory().name  broadcaster = HTMLBroadcaster(report=report, destination=temp_dir, embed_images=True) broadcaster.broadcast() Out[7]: <pre>'/var/folders/_7/ffxc1f251dbb5msn977xl4sm0000gr/T/tmps2tt3jrb/index.html'</pre> <p>One can review the above HTML report with embedded visualizations and share it with collaborators.</p> <p>Let's also look at a single row of the new curated dataset!</p> In\u00a0[8]: Copied! <pre>dataset.iloc[0]\n</pre> dataset.iloc[0] Out[8]: <pre>mol                                &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c2e0&gt;\nID                                                                             1\nNAME                                                                   n-pentane\nSOL                                                                        -3.18\nSOL_classification                                                       (A) low\nsmiles                                                                     CCCCC\nsplit                                                                      train\nMOL_smiles                                                                 CCCCC\nMOL_molhash_id                          3cb2e0cf1b50d8f954891abc5dcce90d543cd3d7\nMOL_molhash_id_no_stereo                36551d628217a351e720cdbe676fca3067730a91\nMOL_num_stereoisomers                                                        1.0\nMOL_num_undefined_stereoisomers                                              1.0\nMOL_num_defined_stereo_center                                             
   0.0\nMOL_num_undefined_stereo_center                                              0.0\nMOL_num_stereo_center                                                        0.0\nMOL_undefined_E_D                                                          False\nMOL_undefined_E/Z                                                              0\nOUTLIER_SOL                                                                False\nCLS_SOL                                                                      0.0\nName: 0, dtype: object</pre> In\u00a0[9]: Copied! <pre>from auroris.curation.functional import detect_outliers\nfrom auroris.visualization import visualize_distribution_with_outliers\n\ny = dataset[\"SOL\"].values\nis_outlier = detect_outliers(y, method=\"zscore\")\nvisualize_distribution_with_outliers(y, is_outlier);\n</pre> from auroris.curation.functional import detect_outliers from auroris.visualization import visualize_distribution_with_outliers  y = dataset[\"SOL\"].values is_outlier = detect_outliers(y, method=\"zscore\") visualize_distribution_with_outliers(y, is_outlier); <p>Depending on the type of bioactivity and its distribution, the above plot helps to highlight data points that are potential outliers (data outside the acceptable range) or strong signals.</p> <p>Reviewing these data points, and removing them if they are truely outliers, can be beneficial for QSAR modeling.</p> <p>The End.</p>"},{"location":"tutorials/getting_started.html#curating-a-toy-dataset","title":"Curating a toy dataset\u00b6","text":"<p>Let's learn about the basic concepts of the <code>auroris</code> library by curating a toy dataset. For the sake of simplicity, we will use the solubility dataset from Datamol. It is worth noting that this dataset is only meant to be used as a toy dataset for pedagogic and testing purposes. It is not a dataset for benchmarking, analysis or model training. Curation can only take us so far. 
For impactful benchmarks, we rely on high-quality data sources to begin with.</p>"},{"location":"tutorials/getting_started.html#using-the-curator-api","title":"Using the <code>Curator</code> API\u00b6","text":"<p>The recommended way to specify curation workflows is through the <code>Curator</code> API:</p> <ul> <li>A <code>Curator</code> object defines a number of curation steps.</li> <li>Each step should inherit from <code>auroris.curation.actions.BaseAction</code>.</li> <li>The <code>Curator</code> object is serializable. You can thus easily save and load it from JSON, which makes it easy to reproduce a curation workflow.</li> <li>Finally, the <code>Curator</code> produces a <code>CurationReport</code> which summarizes the changes made to a dataset.</li> </ul> <p>Let's define a simple workflow with three steps:</p> <ol> <li>Curate the chemical structures</li> <li>Detect outliers</li> <li>Bin the regression column</li> </ol>"},{"location":"tutorials/getting_started.html#using-the-functional-api","title":"Using the functional API\u00b6","text":"<p><code>auroris</code> provides a functional API to easily and quickly run some curation steps. Let's look at an oulier detection example.</p>"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Introduction","text":"<p>Welcome to the Auroris - Simplifying Drug Discovery Data Curation</p>"},{"location":"index.html#what-is-auroris","title":"What is Auroris?","text":"<p>Auroris is a Python library designed to assist researchers and scientists in managing, cleaning, and preparing data relevant to drug discovery. Auroris will implement a range of techniques to handle, transform, filter, analyze, or visualize the diverse data types commonly encountered in drug discovery. </p> <p>Currently, Auroris supports curation for small molecules, with plans to extend to other modalities in drug discovery. The curation module for small molecules includes:</p> <ul> <li> <p>\ud83d\uddc4\ufe0f Molecule Standardization: Ensures that each molecule is represented in a uniform and unambiguous form.</p> </li> <li> <p>\ud83c\udff7\ufe0f Detection of Duplicate Molecules with Contradictory Labels: Identifies and resolves inconsistencies in activity data for each molecule.</p> </li> <li> <p>\u26f0\ufe0f Detection of Activity Cliffs Between Stereoisomers: Identifies significant differences in activity between stereoisomers.</p> </li> <li> <p>\ud83d\udd0dOutlier Detection and Visualization: Detects and visualizes outliers in molecular activity data.</p> </li> <li> <p>\ud83d\udcfd\ufe0f Visualization of Molecular Distribution in Chemical Space: Provides graphical representations of molecular distributions.</p> </li> </ul> <p>Reproducibility and transparency are core to the mission of Polaris. That\u2019s why with Auroris, you can also automatically generate detailed reports summarizing the changes that happened to a dataset during curation. Through an intuitive API, you can easily define complex curation workflows. 
Once defined, that workflow is serializable and thus reproducible so you can transparently share how you curated the dataset.</p>"},{"location":"index.html#where-to-next","title":"Where to next?","text":"<p>  Quickstart</p> <p>Dive deeper into the Auroris code and learn how to curate data for your ML-powered drug discovery program. </p> <p> Let's get started</p> <p>  API Reference</p> <p>Explore the technical documentation here to delve into the inner workings of the code. Gain insights into the intricate details of how different methods and classes function.</p> <p> Let's get started</p> <p>  Community</p> <p>We're excited to have you join us in revolutionizing drug discovery data curation! Explore Auroris and the broader Polaris ecosystem it is part of, provide feedback, share your use cases, and collaborate with us to enhance and expand the capabilities of Auroris for the benefit of the drug discovery community.</p> <p> Let's get started</p>"},{"location":"api/actions.html","title":"Actions","text":""},{"location":"api/actions.html#auroris.curation.actions.BaseAction","title":"auroris.curation.actions.BaseAction","text":"<p>               Bases: <code>BaseModel</code>, <code>ABC</code></p> <p>An action in the curation process.</p> The importance of reproducibility <p>One of the main goals in designing <code>auroris</code> is to make it easy to reproduce the curation process. Reproducibility is key to scientific research. This is why a BaseAction needs to be serializable and uniquely identified by a <code>name</code>.</p> <p>Attributes:</p> Name Type Description <code>name</code> <code>str</code> <p>The name that uniquely identifies the action. This is used to serialize and deserialize the action.</p> <code>prefix</code> <code>str</code> <p>This prefix is used when an action adds columns to a dataset. 
If not set, it defaults to the name in uppercase.</p>"},{"location":"api/actions.html#auroris.curation.actions.StereoIsomerACDetection","title":"StereoIsomerACDetection","text":"<p>               Bases: <code>BaseAction</code></p> <p>Automatic detection of activity shift between stereoisomers.</p> <p>See <code>auroris.curation.functional.detect_streoisomer_activity_cliff</code> for the docs of the <code>stereoisomer_id_col</code>, <code>y_cols</code> and <code>threshold</code> attributes</p> <p>Attributes:</p> Name Type Description <code>mol_col</code> <code>Optional[str]</code> <p>Column with the SMILES or RDKit Molecule objects. If specified, will be used to render an image for the activity cliffs.</p>"},{"location":"api/actions.html#auroris.curation.actions.Deduplication","title":"Deduplication","text":"<p>               Bases: <code>BaseAction</code></p> <p>Automatic detection of outliers.</p> <p>See <code>auroris.curation.functional.deduplicate</code> for the docs of the <code>deduplicate_on</code>, <code>y_cols</code>, <code>keep</code> and <code>method</code> attributes</p>"},{"location":"api/actions.html#auroris.curation.actions.Discretization","title":"Discretization","text":"<p>               Bases: <code>BaseAction</code></p> <p>Thresholding bioactivity columns to binary or multiclass labels.</p> <p>See <code>auroris.curation.functional.discretize</code> for the docs of the <code>thresholds</code>, <code>inplace</code>, <code>allow_nan</code> and <code>label_order</code> attributes</p> <p>Attributes:</p> Name Type Description <code>input_column</code> <code>str</code> <p>The column to discretize.</p> <code>log_scale</code> <code>bool</code> <p>Whether a visual depiction of the discretization should be on a log scale.</p>"},{"location":"api/actions.html#auroris.curation.actions.ContinuousDistributionVisualization","title":"ContinuousDistributionVisualization","text":"<p>               Bases: <code>BaseAction</code></p> <p>Visualize one or more continuous 
distribution(s).</p> <p>See <code>auroris.visualization.visualize_continuous_distribution</code> for the docs of the <code>log_scale</code> and <code>bins</code> attributes</p> <p>Attributes:</p> Name Type Description <code>y_cols</code> <code>List[str]</code> <p>The columns whose distributions should be visualized.</p>"},{"location":"api/actions.html#auroris.curation.actions.MoleculeCuration","title":"MoleculeCuration","text":"<p>               Bases: <code>BaseAction</code></p> <p>Automated molecule curation and chemistry space distribution.</p> <p>See <code>auroris.curation.functional.curate_molecules</code> for the docs of the <code>remove_stereo</code>, <code>fix_mol</code>, <code>count_stereoisomers</code>, and <code>count_stereocenters</code> attributes</p> <p>Attributes:</p> Name Type Description <code>input_column</code> <code>str</code> <p>The name of the column that has the molecules (either <code>dm.Mol</code> objects or SMILES).</p> <code>X_col</code> <code>Optional[str]</code> <p>Column with custom features for each of the molecules. 
If None, will use ECFP.</p> <code>y_cols</code> <code>Optional[Union[str, List[str]]]</code> <p>Column names for bioactivities, which will be used to colorcode the chemical space visualization.</p>"},{"location":"api/actions.html#auroris.curation.actions.OutlierDetection","title":"OutlierDetection","text":"<p>               Bases: <code>BaseAction</code></p> <p>Automatic detection of outliers.</p> <p>See <code>auroris.curation.functional.detect_outliers</code> for the docs of the <code>method</code> and <code>kwargs</code> attributes</p> <p>Attributes:</p> Name Type Description <code>columns</code> <code>List[str]</code> <p>The columns for which to detect outliers.</p>"},{"location":"api/curator.html","title":"Curator","text":""},{"location":"api/curator.html#auroris.curation.Curator","title":"auroris.curation.Curator","text":"<p>               Bases: <code>BaseModel</code></p> <p>A curator is a serializable collection of actions that are applied to a dataset.</p> <p>Attributes:</p> Name Type Description <code>steps</code> <code>List[BaseAction]</code> <p>Ordered list of curation actions to apply to the dataset.</p> <code>src_dataset_path</code> <code>Optional[str]</code> <p>An optional path to load the source dataset from. Can be used to specify a reproducible workflow.</p> <code>verbosity</code> <code>VerbosityLevel</code> <p>Verbosity level for logging.</p> <code>parallelized_kwargs</code> <code>dict</code> <p>Keyword arguments to affect parallelization in the steps.</p>"},{"location":"api/curator.html#auroris.curation.Curator.transform","title":"transform","text":"<pre><code>transform(dataset: Optional[pd.DataFrame] = None) -&gt; Tuple[pd.DataFrame, CurationReport]\n</code></pre> <p>Runs the curation process.</p> <p>Parameters:</p> Name Type Description Default <code>dataset</code> <code>Optional[DataFrame]</code> <p>The dataset to be curated. 
If <code>src_dataset_path</code> is set, this parameter is ignored.</p> <code>None</code> <p>Returns:</p> Type Description <code>Tuple[DataFrame, CurationReport]</code> <p>A tuple of the curated dataset and a report summarizing the changes made.</p>"},{"location":"api/curator.html#auroris.curation.Curator.load_dataset","title":"load_dataset  <code>staticmethod</code>","text":"<pre><code>load_dataset(path: str)\n</code></pre> <p>Loads a dataset, to be curated, from a path.</p> File-format support <p>This currently only supports CSV and Parquet files and uses the default parameters for <code>pd.read_csv</code> and <code>pd.read_parquet</code>. If you need more flexibility, consider loading the data yourself and passing it directly to <code>Curator.transform(dataset=...)</code>.</p>"},{"location":"api/curator.html#auroris.curation.Curator.from_json","title":"from_json  <code>classmethod</code>","text":"<pre><code>from_json(path: str)\n</code></pre> <p>Loads a curation workflow from a JSON file.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>str</code> <p>The path to load from</p> required"},{"location":"api/curator.html#auroris.curation.Curator.to_json","title":"to_json","text":"<pre><code>to_json(path: str)\n</code></pre> <p>Saves the curation workflow to a JSON file.</p> <p>Parameters:</p> Name Type Description Default <code>path</code> <code>str</code> <p>The destination to save to.</p> required"},{"location":"api/functional.html","title":"Curation","text":""},{"location":"api/functional.html#auroris.curation.functional.detect_streoisomer_activity_cliff","title":"detect_streoisomer_activity_cliff","text":"<pre><code>detect_streoisomer_activity_cliff(dataset: pd.DataFrame, stereoisomer_id_col: str, y_cols: List[str], threshold: float = 2.0, prefix: str = 'AC_') -&gt; pd.DataFrame\n</code></pre> <p>Detect activity cliff among stereoisomers based on classification label or pre-defined threshold for continuous values.</p> 
<p>Parameters:</p> Name Type Description Default <code>dataset</code> <code>DataFrame</code> <p>Dataframe</p> required <code>stereoisomer_id_col</code> <code>str</code> <p>Column which identifies the stereoisomers</p> required <code>y_cols</code> <code>List[str]</code> <p>List of columns for bioactivities</p> required <code>threshold</code> <code>float</code> <p>Threshold to identify the activity cliff. Currently, the difference of zscores between isomers are used for identification.</p> <code>2.0</code> <code>prefix</code> <code>str</code> <p>Prefix for the adding columns</p> <code>'AC_'</code>"},{"location":"api/functional.html#auroris.curation.functional.deduplicate","title":"deduplicate","text":"<pre><code>deduplicate(dataset: pd.DataFrame, deduplicate_on: Optional[Union[str, List[str]]] = None, y_cols: Optional[Union[str, List[str]]] = None, keep: Literal['first', 'last'] = 'first', method: Literal['mean', 'median'] = 'median') -&gt; pd.DataFrame\n</code></pre> <p>Deduplicate a dataframe.</p> <p>If <code>deduplicate_on</code> specifies a subset of all columns in the dataset and <code>y_cols</code> specifies a set of non-overlapping columns, data will be grouped by <code>deduplicate_on</code> and the <code>y_cols</code> will be aggregated to a single value per group according to <code>method</code>.</p> <p>Parameters:</p> Name Type Description Default <code>dataset</code> <code>DataFrame</code> <p>The dataset to deduplicate.</p> required <code>deduplicate_on</code> <code>Optional[Union[str, List[str]]]</code> <p>A subset of the columns to deduplicate on (can be default).</p> <code>None</code> <code>y_cols</code> <code>Optional[Union[str, List[str]]]</code> <p>The columns to aggregate.</p> <code>None</code> <code>keep</code> <code>Literal['first', 'last']</code> <p>Whether to keep the first or last copy of the duplicates.</p> <code>'first'</code> <code>method</code> <code>Literal['mean', 'median']</code> <p>The method to aggregate the data.</p> 
<code>'median'</code>"},{"location":"api/functional.html#auroris.curation.functional.discretize","title":"discretize","text":"<pre><code>discretize(X: np.ndarray, thresholds: Union[np.ndarray, list], inplace: bool = False, allow_nan: bool = True, label_order: Literal['ascending', 'descending'] = 'ascending') -&gt; np.ndarray\n</code></pre> <p>Thresholding of array-like or scipy.sparse matrix into binary or multiclass labels.</p> <p>Parameters:</p> Name Type Description Default <code>X</code> <p>The data to discretize, element by element. scipy.sparse matrices should be in CSR or CSC format to avoid an unnecessary copy.</p> required <code>thresholds</code> <code>Union[ndarray, list]</code> <p>Interval boundaries that include the right bin edge.</p> required <code>inplace</code> <code>bool</code> <p>Set to True to perform inplace discretization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR / CSC matrix and if axis is 1).</p> <code>False</code> <code>allow_nan</code> <code>bool</code> <p>Set to True to allow nans in the array for discretization. Otherwise, an error will be raised instead.</p> <code>True</code> <code>label_order</code> <code>Literal['ascending', 'descending']</code> <p>The continuous values are discretized to labels 0, 1, 2, .., N with respect to given threshold bins [threshold_1, threshold_2,.., threshold_n]. When set to 'ascending', the class labels are in ascending order with the threshold bins, so that <code>0</code> represents the negative or lower class, while 1, 2, 3 are for higher classes. When set to 'descending', the class labels are in descending order with the threshold bins. Sometimes the positive labels are on the left side of the provided threshold. E.g. for binarization with threshold [0.5], the positive label is defined by <code>X &lt; 0.5</code>. 
In this case, <code>label_order</code> should be <code>descending</code>.</p> <code>'ascending'</code> <p>Returns:</p> Name Type Description <code>X_tr</code> <code>ndarray</code> <p>The transformed data.</p>"},{"location":"api/functional.html#auroris.curation.functional.curate_molecules","title":"curate_molecules","text":"<pre><code>curate_molecules(mols: List[Union[str, dm.Mol]], progress: bool = True, remove_stereo: bool = False, fix_mol: bool = True, count_stereoisomers: bool = True, count_stereocenters: bool = True, **parallelized_kwargs) -&gt; Tuple\n</code></pre> <p>Curate a list of molecules.</p> <p>Parameters:</p> Name Type Description Default <code>mols</code> <code>List[Union[str, Mol]]</code> <p>List of molecules.</p> required <code>progress</code> <code>bool</code> <p>Whether to show curation progress.</p> <code>True</code> <code>fix_mol</code> <code>bool</code> <p>Whether to fix errors in the molecule.</p> <code>True</code> <code>remove_stereo</code> <code>bool</code> <p>Whether to remove stereochemistry information from the molecule.</p> <code>False</code> <code>count_stereoisomers</code> <code>bool</code> <p>Whether to count the number of stereoisomers of the molecule.</p> <code>True</code> <code>count_stereocenters</code> <code>bool</code> <p>Whether to count the number of stereocenters of the molecule.</p> <code>True</code> <p>Returns:</p> Name Type Description <code>mol_dict</code> <code>Tuple</code> <p>Dictionary of molecule and additional metadata</p> <code>num_invalid</code> <code>Tuple</code> <p>Number of invalid molecules</p>"},{"location":"api/functional.html#auroris.curation.functional.detect_outliers","title":"detect_outliers","text":"<pre><code>detect_outliers(X: np.ndarray, method: OutlierDetectionMethod = 'zscore', **kwargs: Any)\n</code></pre> <p>Functional interface for detecting outliers</p> <p>Parameters:</p> Name Type Description Default <code>X</code> <code>ndarray</code> <p>The observations that we want to classify as inliers or outliers.</p> 
required <code>method</code> <code>OutlierDetectionMethod</code> <p>The method to use for outlier detection.</p> <code>'zscore'</code> <code>**kwargs</code> <code>Any</code> <p>Keyword arguments for the outlier detection method.</p> <code>{}</code>"},{"location":"api/types.html","title":"Types","text":""},{"location":"api/types.html#auroris.types","title":"auroris.types","text":""},{"location":"api/types.html#auroris.types.VerbosityLevel","title":"VerbosityLevel","text":"<p>               Bases: <code>IntEnum</code></p> <p>The different verbosity levels</p>"},{"location":"api/utils.html","title":"Utils","text":""},{"location":"api/utils.html#auroris.utils.is_regression","title":"is_regression","text":"<pre><code>is_regression(values: np.ndarray) -&gt; bool\n</code></pre> <p>Whether the input values are for regression</p>"},{"location":"api/utils.html#auroris.utils.fig2img","title":"fig2img","text":"<pre><code>fig2img(fig: Figure) -&gt; ImageType\n</code></pre> <p>Convert a Matplotlib figure to a PIL Image</p>"},{"location":"api/utils.html#auroris.utils.img2bytes","title":"img2bytes","text":"<pre><code>img2bytes(image: ImageType)\n</code></pre> <p>Convert png image to bytes</p>"},{"location":"api/utils.html#auroris.utils.bytes2img","title":"bytes2img","text":"<pre><code>bytes2img(image_bytes: ByteString)\n</code></pre> <p>Convert bytes to PIL image</p>"},{"location":"api/utils.html#auroris.utils.save_image","title":"save_image","text":"<pre><code>save_image(image: ImageType, path: str)\n</code></pre> <p>Save an image to a fsspec-compatible path</p>"},{"location":"api/utils.html#auroris.utils.is_parquet_file","title":"is_parquet_file","text":"<pre><code>is_parquet_file(path)\n</code></pre> <p>Verify parquet file without actually loading it.</p>"},{"location":"api/visualization.html","title":"Visualization","text":""},{"location":"api/visualization.html#auroris.visualization.visualize_chemspace","title":"visualize_chemspace","text":"<pre><code>visualize_chemspace(X: 
np.ndarray, y: Optional[Union[List[np.ndarray], np.ndarray]] = None, labels: Optional[List[str]] = None, n_cols: int = 2, fig_base_size: float = 8, w_h_ratio: float = 0.5, dpi: int = 150, seaborn_theme: Optional[str] = 'whitegrid', plot_kwargs: dict = None, umap_kwargs: dict = None)\n</code></pre> <p>Plot the coverage in chemical space. Also, color based on the target values.</p> <p>Parameters:</p> Name Type Description Default <code>X</code> <code>ndarray</code> <p>Array of the molecular features.</p> required <code>y</code> <code>Optional[Union[List[ndarray], ndarray]]</code> <p>A list of arrays with the target values.</p> <code>None</code> <code>labels</code> <code>Optional[List[str]]</code> <p>Optional list of labels for each set of features.</p> <code>None</code> <code>n_cols</code> <code>int</code> <p>Number of columns in the subplots.</p> <code>2</code> <code>fig_base_size</code> <code>float</code> <p>Base size of the plots.</p> <code>8</code> <code>w_h_ratio</code> <code>float</code> <p>Width/height ratio.</p> <code>0.5</code> <code>dpi</code> <code>int</code> <p>DPI value of the figure.</p> <code>150</code> <code>seaborn_theme</code> <code>Optional[str]</code> <p>Seaborn theme.</p> <code>'whitegrid'</code> <code>plot_kwargs</code> <code>dict</code> <p>seaborn plot arguments.</p> <code>None</code> <code>umap_kwargs</code> <code>dict</code> <p>Keyword arguments for the UMAP algorithm.</p> <code>None</code>"},{"location":"api/visualization.html#auroris.visualization.visualize_continuous_distribution","title":"visualize_continuous_distribution","text":"<pre><code>visualize_continuous_distribution(data: np.ndarray, log_scale: bool = False, bins: Optional[Sequence[float]] = None)\n</code></pre> <p>KDE plot the distribution of the column in <code>data</code> with colored sections under the KDE curve.</p> <p>Parameters:</p> Name Type Description Default <code>data</code> <code>ndarray</code> <p>A 1D numpy array with the values to plot the distribution for.</p> 
required <code>log_scale</code> <code>bool</code> <p>Whether to plot the x-axis in log scale.</p> <code>False</code> <code>bins</code> <code>Optional[Sequence[float]]</code> <p>The bin boundaries to color the area under the KDE curve.</p> <code>None</code>"},{"location":"api/visualization.html#auroris.visualization.visualize_distribution_with_outliers","title":"visualize_distribution_with_outliers","text":"<pre><code>visualize_distribution_with_outliers(values: np.ndarray, is_outlier: Optional[List[bool]] = None, title: str = 'Probability Plot')\n</code></pre> <p>Visualize the distribution of the data and highlight the potential outliers.</p> <p>Parameters:</p> Name Type Description Default <code>values</code> <code>ndarray</code> <p>Values for visualization.</p> required <code>is_outlier</code> <code>Optional[List[bool]]</code> <p>List of outlier flags.</p> <code>None</code> <code>title</code> <code>str</code> <p>Title of plot</p> <code>'Probability Plot'</code>"},{"location":"tutorials/getting_started.html","title":"Getting Started","text":"<p>In short</p> <p>This tutorial gives an overview of the basic concepts in the `auroris` library.</p> <p>On the nuances of curation</p> <p>How to best curate a dataset is highly situation-dependent. The `auroris` library includes some useful tools, but blindly applying them won't necessarily lead to good datasets. To learn more, visit the Polaris Hub for extensive resources and documentation on dataset curation and more.</p> <p>Data curation is concerned with analyzing and processing an existing dataset to maximize its quality. Within drug discovery, this can imply many things, such as filtering out outliers or flagging activity-cliffs. High-quality, well-curated datasets are the foundation upon which we can build realistic, impactful benchmarks for drug discovery. This notebook demonstrates how to curate your dataset with the Polaris data curation API for small molecules.</p> In\u00a0[3]: Copied! 
<pre>import datamol as dm\n</pre> import datamol as dm In\u00a0[4]: Copied! <pre># Load your data set\n# See more details of the dataset at https://docs.datamol.io/stable/api/datamol.data.html\ndata = dm.data.solubility()\ndata.head(5)\n</pre> # Load your data set # See more details of the dataset at https://docs.datamol.io/stable/api/datamol.data.html data = dm.data.solubility() data.head(5) Out[4]: mol ID NAME SOL SOL_classification smiles split 0 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c2e0&gt; 1 n-pentane -3.18 (A) low CCCCC train 1 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c430&gt; 2 cyclopentane -2.64 (B) medium C1CCCC1 train 2 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c4a0&gt; 3 n-hexane -3.84 (A) low CCCCCC train 3 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c510&gt; 4 2-methylpentane -3.74 (A) low CCCC(C)C train 4 &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c580&gt; 6 2,2-dimethylbutane -3.55 (A) low CCC(C)(C)C train In\u00a0[5]: Copied! <pre>from auroris.curation import Curator\nfrom auroris.curation.actions import MoleculeCuration, OutlierDetection, Discretization\n\n# Define the curation workflow\ncurator = Curator(\n    steps=[\n        MoleculeCuration(input_column=\"smiles\"),\n        OutlierDetection(method=\"zscore\", columns=[\"SOL\"]),\n        Discretization(input_column=\"SOL\", thresholds=[-3]),\n    ],\n    parallelized_kwargs={\"n_jobs\": -1},\n)\n\n# Run the curation\ndataset, report = curator(data)\n</pre> from auroris.curation import Curator from auroris.curation.actions import MoleculeCuration, OutlierDetection, Discretization  # Define the curation workflow curator = Curator(     steps=[         MoleculeCuration(input_column=\"smiles\"),         OutlierDetection(method=\"zscore\", columns=[\"SOL\"]),         Discretization(input_column=\"SOL\", thresholds=[-3]),     ],     parallelized_kwargs={\"n_jobs\": -1}, )  # Run the curation dataset, report = curator(data) <pre>2024-08-02 12:26:54.316 | INFO     | 
auroris.curation._curator:transform:106 - Performing step: mol_curation\n2024-08-02 12:27:12.343 | INFO     | auroris.curation._curator:transform:106 - Performing step: outlier_detection\n2024-08-02 12:27:12.400 | INFO     | auroris.curation._curator:transform:106 - Performing step: discretize\n</pre> <p>The report can be exported (\"broadcaster\") to a variety of different formats. Let's simply log it to the CLI for now.</p> In\u00a0[6]: Copied! <pre>from auroris.report.broadcaster import LoggerBroadcaster\n\nbroadcaster = LoggerBroadcaster(report)\nbroadcaster.broadcast()\n</pre> from auroris.report.broadcaster import LoggerBroadcaster  broadcaster = LoggerBroadcaster(report) broadcaster.broadcast() <pre>===== Curation Report =====\nTime: 2024-08-02 12:26:54\nVersion: 0.1.4.dev0+g7127343.d20240707\n===== mol_curation =====\n[LOG]: Couldn't preprocess 18 / 1282 molecules.\n[LOG]: New column added: MOL_smiles\n[LOG]: New column added: MOL_molhash_id\n[LOG]: New column added: MOL_molhash_id_no_stereo\n[LOG]: New column added: MOL_num_stereoisomers\n[LOG]: New column added: MOL_num_undefined_stereoisomers\n[LOG]: New column added: MOL_num_defined_stereo_center\n[LOG]: New column added: MOL_num_undefined_stereo_center\n[LOG]: New column added: MOL_num_stereo_center\n[LOG]: New column added: MOL_undefined_E_D\n[LOG]: New column added: MOL_undefined_E/Z\n[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.\n[LOG]: Molecules with undefined stereocenter detected: 253.\n[IMG]: Dimensions 1200 x 600\n[IMG]: Dimensions 1200 x 2400\n===== outlier_detection =====\n[LOG]: New column added: OUTLIER_SOL\n[LOG]: Found 7 potential outliers with respect to the SOL column for review.\n[IMG]: Dimensions 1200 x 600\n===== discretize =====\n[LOG]: New column added: CLS_SOL\n[IMG]: Dimensions 1200 x 600\n===== Curation Report END =====\n</pre> <p>We can see that there are also images in the report! 
More advanced broadcasters will display these, such as the <code>HTMLBroadcaster</code>.</p> In\u00a0[7]: Copied! <pre>from auroris.report.broadcaster import HTMLBroadcaster\nimport tempfile\n\ntemp_dir = tempfile.TemporaryDirectory().name\n\nbroadcaster = HTMLBroadcaster(report=report, destination=temp_dir, embed_images=True)\nbroadcaster.broadcast()\n</pre> from auroris.report.broadcaster import HTMLBroadcaster import tempfile  temp_dir = tempfile.TemporaryDirectory().name  broadcaster = HTMLBroadcaster(report=report, destination=temp_dir, embed_images=True) broadcaster.broadcast() Out[7]: <pre>'/var/folders/_7/ffxc1f251dbb5msn977xl4sm0000gr/T/tmps2tt3jrb/index.html'</pre> <p>One can review the above HTML report with embedded visualizations and share it with collaborators.</p> <p>Let's also look at a single row of the new curated dataset!</p> In\u00a0[8]: Copied! <pre>dataset.iloc[0]\n</pre> dataset.iloc[0] Out[8]: <pre>mol                                &lt;rdkit.Chem.rdchem.Mol object at 0x173b7c2e0&gt;\nID                                                                             1\nNAME                                                                   n-pentane\nSOL                                                                        -3.18\nSOL_classification                                                       (A) low\nsmiles                                                                     CCCCC\nsplit                                                                      train\nMOL_smiles                                                                 CCCCC\nMOL_molhash_id                          3cb2e0cf1b50d8f954891abc5dcce90d543cd3d7\nMOL_molhash_id_no_stereo                36551d628217a351e720cdbe676fca3067730a91\nMOL_num_stereoisomers                                                        1.0\nMOL_num_undefined_stereoisomers                                              1.0\nMOL_num_defined_stereo_center                                             
   0.0\nMOL_num_undefined_stereo_center                                              0.0\nMOL_num_stereo_center                                                        0.0\nMOL_undefined_E_D                                                          False\nMOL_undefined_E/Z                                                              0\nOUTLIER_SOL                                                                False\nCLS_SOL                                                                      0.0\nName: 0, dtype: object</pre> In\u00a0[9]: Copied! <pre>from auroris.curation.functional import detect_outliers\nfrom auroris.visualization import visualize_distribution_with_outliers\n\ny = dataset[\"SOL\"].values\nis_outlier = detect_outliers(y, method=\"zscore\")\nvisualize_distribution_with_outliers(y, is_outlier);\n</pre> from auroris.curation.functional import detect_outliers from auroris.visualization import visualize_distribution_with_outliers  y = dataset[\"SOL\"].values is_outlier = detect_outliers(y, method=\"zscore\") visualize_distribution_with_outliers(y, is_outlier); <p>Depending on the type of bioactivity and its distribution, the above plot helps to highlight data points that are potential outliers (data outside the acceptable range) or strong signals.</p> <p>Reviewing these data points, and removing them if they are truly outliers, can be beneficial for QSAR modeling.</p> <p>The End.</p>"},{"location":"tutorials/getting_started.html#curating-a-toy-dataset","title":"Curating a toy dataset\u00b6","text":"<p>Let's learn about the basic concepts of the <code>auroris</code> library by curating a toy dataset. For the sake of simplicity, we will use the solubility dataset from Datamol. It is worth noting that this dataset is only meant to be used as a toy dataset for pedagogic and testing purposes. It is not a dataset for benchmarking, analysis or model training. Curation can only take us so far. 
For impactful benchmarks, we rely on high-quality data sources to begin with.</p>"},{"location":"tutorials/getting_started.html#using-the-curator-api","title":"Using the <code>Curator</code> API\u00b6","text":"<p>The recommended way to specify curation workflows is through the <code>Curator</code> API:</p> <ul> <li>A <code>Curator</code> object defines a number of curation steps.</li> <li>Each step should inherit from <code>auroris.curation.actions.BaseAction</code>.</li> <li>The <code>Curator</code> object is serializable. You can thus easily save and load it from JSON, which makes it easy to reproduce a curation workflow.</li> <li>Finally, the <code>Curator</code> produces a <code>CurationReport</code> which summarizes the changes made to a dataset.</li> </ul> <p>Let's define a simple workflow with three steps:</p> <ol> <li>Curate the chemical structures</li> <li>Detect outliers</li> <li>Bin the regression column</li> </ol>"},{"location":"tutorials/getting_started.html#using-the-functional-api","title":"Using the functional API\u00b6","text":"<p><code>auroris</code> provides a functional API to easily and quickly run some curation steps. Let's look at an outlier detection example.</p>"}]}
\ No newline at end of file