README.Rmd

---
output: 
  github_document: 
    toc: true
    toc_depth: 5
editor_options: 
  markdown: 
    wrap: 72
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%", 
  dpi = 300
)
```

# cytofin

CytofIn (**CyTOF** **In**tegration) is an R package for homogenizing and
normalizing heterogeneous [mass cytometry
(CyTOF)](https://pubmed.ncbi.nlm.nih.gov/21551058/) data from diverse
data sources. Specifically, `CytofIn` provides functions that perform the
following tasks:

-   **Dataset homogenization** - CyTOF datasets that were collected
    separately may differ in which markers were included in their
    antibody panels; in addition, they may use different naming
    conventions for their panels' shared markers. Thus, data mining
    across multiple CyTOF datasets requires **homogenization,** the
    process of aligning each dataset's antibody panels so that they can
    be analyzed together. In `CytofIn`, data homogenization (i.e. panel
    alignment) is performed with the `cytofin_homogenize` function that
    leverages user-provided panel information to combine datasets.
-   **Dataset normalization** - Combined analysis of multiple CyTOF
    datasets is likely to be confounded by dataset-to-dataset batch
    effects due to differences in instrumentation and experimental
    protocols between groups. To normalize multiple CyTOF datasets with
    respect to these batch effects, `CytofIn` provides 3 functions:
    `cytofin_prep_anchors`, `cytofin_normalize`, and
    `cytofin_normalize_nrs`.
-   **Visualization** - After batch normalization, the means and
    standard deviations for each of the input .fcs files (as well as
    their associated anchors) can be visualized using the
    `cytofin_make_plots` function.

The general CytofIn workflow unfolds in 3 steps. First, users align the
panels of the CyTOF datasets being integrated using
`cytofin_homogenize()`. Second, users generate reference statistics from
"generalized anchors" identified on each CyTOF plate (see below) using
`cytofin_prep_anchors()`. Finally, users can then normalize/batch
correct the datasets relative to one another using their choice of
`cytofin_normalize()` or `cytofin_normalize_nrs()`, each of which
performs the normalization procedure differently (see below).

## Installation

To install CytofIn, run the following code:

```{r, eval = FALSE}
library(devtools)
install_github("bennyyclo/Cytofin")
```

To attach the CytofIn package to your current R session, run the
following line:

```{r}
library(cytofin)
```

## Data for this vignette

### Establishing a root directory

For the sake of this vignette, we will work within a single folder,
where we will store the input data, the output data, and all
intermediate files from the CytofIn pipeline. We will default to using
the current working directory, but feel free to modify the following
line of code to change which path you want to use.

```{r}
# change this path to wherever you want this vignette to find and store
# its input and output files
base_path <- getwd()
```

```{r, include = FALSE}
base_path <- file.path("~", "Desktop", "cytofin_tests")
```

### Downloading the data

Now that we've identified the root directory we'll use for this
vignette, we will create two folders in which we will store the raw
input data and the validation (bead-normalized) data used in this
vignette:

```{r}
dir.create(file.path(base_path, "raw_data"), showWarnings = FALSE)
dir.create(file.path(base_path, "validation_data"), showWarnings = FALSE)
```

To fill each of these folders with the .fcs files we're analyzing in
this vignette, please download the raw input files
[here](https://flowrepository.org/id/FR-FCM-Z427) and the validation
files [here](https://flowrepository.org/id/FR-FCM-Z42C) on
[FlowRepository](https://flowrepository.org/). Once the files are
downloaded, unzip them. Finally, move all of the unzipped .fcs files from each repository into the `raw_data` and
`validation_data` folders that we just created, respectively.

## Usage

### CyTOF data homogenization (cytofin_homogenize)

Here, the term "homogenization" refers to the process of aligning the
antigen panels of multiple CyTOF experiments by (1) removing all
channels that are not shared across all cohorts and (2) standardizing
the antigen names used to refer to each channel so that existing
analysis tools (like the `flowCore` and `tidyverse` packages) can be
applied in later analytical steps. In CytofIn, dataset homogenization is
performed using the `cytofin_homogenize()` function.

The `cytofin_homogenize()` function takes several arguments. The first
of these is `metadata_path`, a string that specifies the file path to a
.csv or .xlsx metadata file containing information about each of the
.fcs files being analyzed. Specifically, the metadata file will have one
row for each .fcs file being analyzed and must contain the following
columns (all of which will be converted to character vectors):

-   **filename -** Required. The name of the .fcs file within its local
    directory.
-   **cohort -** Required. The name of the cohort (i.e. experimental
    source) of each .fcs file.
-   **plate_number -** Required. The name of the CyTOF plate (e.g.
    "plate1", "plate2", etc.) on which the sample corresponding to each
    .fcs file was analyzed during data acquisition.
-   **patient_id -** Optional. The name of the patient to whom each .fcs
    file corresponds.
-   **condition -** Optional. The stimulation condition corresponding to
    each .fcs file (i.e. "basal", "IL-3", etc.).
-   **is_anchor -** Required. A numeric column indicating whether or not
    each sample should be used as an "anchor" for the batch correction
    procedure (1 if yes; 0 if no). Exactly one anchor should be
    identified for each CyTOF plate being analyzed.
-   **validation -** Optional. The name of the
    [bead-normalized](https://pubmed.ncbi.nlm.nih.gov/23512433/) .fcs
    file corresponding to each input file listed in the `filename`
    column (per gold-standard batch normalization procedure in CyTOF
    batch correction). Most users will ignore this column because
    bead-normalized data will not be available, but it can be used to
    validate the results of the CytofIn batch normalization algorithms
    if bead-normalized data are available.

Importantly, only the fields marked as "required" are needed for
`cytofin_homogenize()` to work; "NA" can be recorded for any/all
optional columns that don't apply to the experimental design of the
files being analyzed (for example, if no stimulation conditions were
used in the studies being integrated, enter "NA" for each element of the
`condition` column). Alternatively, these columns can be omitted from
the metadata table entirely. The following image provides a visual summary of the metadata table used throughout the `CytofIn` pipeline. 

![](./inst/images/image2.png)

For the user's convenience, the `cytofin_generate_metadata_template`
function is provided to generate an example metadata .csv file filled
with dummy example data in a location specified by the user:

```{r, eval = FALSE}
# specify the path where you'd like to store the template file
my_path <- file.path(base_path, "template_folder")

# generate the template file, which then can be edited manually 
cytofin_generate_metadata_template(template_path = my_path)
```

The second argument for `cytofin_homogenize` is `panel_path`, a string
that specifies the file path to a .csv or .xlsx file containing
information about the panel(s) of each of the .fcs files being analyzed.
Each row represents a channel (i.e. a protein measurement) to be
included in the final, homogenized panel. This file must contain the
following columns:

-   **metal_name -** A character vector representing the name of the
    metal isotope measured by each channel.
-   **antigen_name -** A character vector representing the name of the
    antigen associated with a given metal isotope in the consensus panel
    (the final antigen name to assign to a given channel during
    homogenization).
-   **antigen_pattern -** A regular expression used to match antigen
    names that may differ slightly across different .fcs files. For
    example, the regular expression "(C\|c)(D\|d)45" will detect all of
    the following channel names: "cd45", "CD45", "Cd45", or "cD45".
-   **lineage -** A numeric vector representing whether or not a marker
    is a lineage marker (1 if yes; 0 otherwise).
-   **functional -** A numeric vector representing whether or not a
    marker is a functional marker (1 if yes; 0 otherwise).
-   **general -** A numeric vector representing whether or not a marker
    is a "general" (i.e. neither a lineage nor a functional) marker (1
    if yes; 0 otherwise).

The layout of this antigen table (and how it's used during .fcs file homogenization) is displayed in the picture below.

![](./inst/images/image1.png)

As in `cytofin_generate_metadata_template`, the `cytofin_generate_panel_template` function is provided to
generate an example metadata .csv file filled with dummy example data:

```{r, eval = FALSE}
# generate the template file, which then can be edited manually 
cytofin_generate_panel_template(template_path = my_path)
```

For many users, the most difficult part of filling out the consensus
panel information table will be designing the regular expressions for
the `antigen_pattern` column. However, in most cases the required
regular expressions will be quite simple; for a primer on regular
expressions (and their use in the
[`stringr`](https://stringr.tidyverse.org/) package) written by
[RStudio](https://www.rstudio.com/about/), install the `stringr` package
and read the following vignette:

```{r, eval = FALSE}
vignette(topic = "regular-expressions", package = "stringr")
```

The next two arguments for `cytofin_homogenize` are `input_data_path`
and `output_data_path`, two strings that indicate which directory input
.fcs files should be read from and which directory homogenized .fcs
files should be written to, respectively. Lastly, the final two
arguments are optional: `prefix` allows the user to specify the prefix
appended to each input .fcs file name to get the name of the
corresponding output (i.e. homogenized) .fcs file name, and `verbose` is
a boolean value (default = FALSE) specifying if chatty print statements
should be made while the homogenization is performed.

Using these arguments, `cytofin_homogenize` can homogenize a set of
CyTOF files with distinct antigen naming conventions. Specifically, the
program performs a regular expression search to match the synonymous
term in the panel and correct the antigen name with standardized names
in the panel.

Example function call:

```{r, warning = FALSE}
# define input paths 
metadata_path <- 
  system.file(
    file.path("extdata", "test_metadata_raw.csv"), 
    package = "cytofin"
  )

panel_path <- 
  system.file(
    file.path("extdata", "test_panel.csv"), 
    package = "cytofin"
  )

input_data_path <- 
  file.path(base_path, "raw_data")

validation_data_path <- 
  file.path(base_path, "validation_data")

# define output path
# --Change this line to wherever you want the output files saved!--
output_data_path <- file.path(base_path, "homogenization_output")

# call homogenization function
cytofin_homogenize(
  metadata_path = metadata_path, 
  panel_path = panel_path, 
  input_data_path = input_data_path, 
  output_data_path = output_data_path
)
```

This function call will save homogenized .fcs files to the directory
located at `output_data_path`. These files will be different from the
input .fcs files in the `input_data_path` directory in that they will
only contain channels whose antigen names match the `antigen_pattern`
column of the reference panel located at `panel_path`. All other
channels will be removed, and the names of the channels with matches in
`antigen_pattern` will be standardized to the names given in the
`antigen_name` column of the reference panel.

The input files for this homogenization run were as follows:

```{r}
list.files(input_data_path, pattern = ".fcs$")
```

...and the corresponding output file saved in the `output_data_path`
directory are now as follows:

```{r}
list.files(output_data_path, pattern = ".fcs$")
```

### CyTOF batch normalization

After dataset homogenization, **batch correction** (or **batch
normalization**) can be performed across datasets.

In short, `CytofIn` performs batch normalization though the use of
user-identified **generalized anchors** - which are non-identical references assumed to have low variability across batches - that can be used to estimate batch effects from samples collated from heterogeneous sources. To batch normalize using healthy control samples (one per plate) as generalized anchors (which
is ideal when such samples are available), use `cytofin_normalize`. To
batch normalize using the antigen channels with the lowest variability across samples as generalized anchors (which is ideal when healthy samples are unavailable on all plates being analyzed), use `cytofin_normalize_nrs`.

The use of both of these functions is detailed below.

#### Batch normalization using external anchors (cytofin_normalize)

##### Overview 

The `cytofin_normalize` uses user-identified external anchors on each
CyTOF plate being integrated to correct batch effects on a
plate-to-plate basis. One sample on each CyTOF barcoding plate should be
chosen as that plate's external anchor. In general, external anchors
should be chosen based on which samples are the most biologically
similar to one another from plate to plate. For example, if healthy,
non-stimulated samples are included on each CyTOF plate being
integrated, the only expected variability between these samples other
than batch effects would be person-to-person variability. Thus, these
samples are likely to be biologically similar to one another and are
suitable to be chosen as external anchors. Alternatively, if a single
patient or cell line was included on every CyTOF plate being integrated,
the samples corresponding to that patient or cell line on each plate are
would also be suitable as external anchor choices.

Once users have identified 1 external anchor per plate for `CytofIn`
data integration, users must mark its row in the metadata table with a
"1" in the `is_anchor` column (all other samples should be marked with
"0"). `CytofIn` then uses these anchors to define a **universal mean**
and **universal variance** that represent the central tendency and
dispersion, respectively, of the target distribution to which all
samples will be batch corrected. This correction will be performed with
the user's choice from one of five batch correction functions.

In short, `CytofIn`'s batch normalization procedure using external
anchors has two steps:

1.  Preparation of external anchors  
2.  Application of a transformation function that performs the batch
    correction (of which `CytofIn` provides 5 options)

We detail function calls for each of these steps below.

##### Step 1 - Anchor preparation

The `cytofin_prep_anchors` function concatenates the identified anchor
files and then calculates summary statistics that are used for batch
correction in later steps of the pipeline. First, `CytofIn` calculates
the mean and standard deviation of each channel in the homogenized
dataset across all cells from samples identified as external anchors.
These values represent the overall central tendency and dispersion,
respectively, of each channel among the anchor samples on each CyTOF
plate; thus, we call them the **universal means** and **universal
variances** of the `CytofIn` integration. Accordingly, the universal
mean and universal variance vectors will each have *g* elements, where
*g* is the number of channels in the consensus antigen panel in the
panel information table. The universal mean and universal variance
vectors are used in the `meanshift`, `variance`, `z-score`, and
`beadlike` methods of batch correction (see below).

In addition, the mean of all of the elements of the universal mean
vector (i.e. the mean of all channel means) and the mean of all of the
elements of the universal variance vector (i.e. the mean of all channel
variances) are calculated. These values represent the central tendency
and dispersion of antigen measurements in general among the healthy
control samples on each CyTOF plate and are thus no longer
channel-specific. Thus, we call them the *bulk mean* and *bulk
variance*, and they are used in the `meanshift_bulk` batch correction
method implemented in `cytofin_homogenize`.

To calculate these values, we use the `cytofin_prep_anchors` function.
`cytofin_prep_anchors` returns the universal mean vector, universal
variance vector, bulk mean, and bulk variance as a `list()`. In
addition, users are given an option to save these statistics as an .rds
file in a specified directory in order to avoid performing redundant
calculations in future analyses.

Specifically, `cytofin_prep_anchors` takes 4 required arguments:

-   `metadata_path`: A connection leading to an .xlsx or .csv file
    containing a metadata table with information about each file to be
    analyzed. This file should be identical to that used for
    `cytofin_homogenize`.
-   `panel_path`: A connection leading to an .xlsx or .csv file
    containing information about the standardized antigen panel in the
    homogenized dataset. This file should be identical to that used for
    `cytofin_homogenize`.
-   `input_data_path`: A connection to a directory containing the input
    .FCS files from which to draw summary statistics
-   `output_path`: A connection to a directory where the output .rds and
    .FCS files will be saved. The default is "none", in which case no
    output files will be stored (and the only effect of the function
    will be to return the calculated statistics as a `list()`).

In addition, `cytofin_prep_anchors` also takes 2 optional arguments
relating to the conventional arcsinh transformation performed on the raw
ion counts of the input data. These optional arguments are as follows:

-   `shift_factor`: The scalar value `a` in the following equation used
    to transform CyTOF raw data ion counts using the hyperbolic arcsinh
    function: `new_x <- asinh(a + b * x)`. Defaults to 0.

-   `scale_factor`: The scalar value `b` in the following equation used
    to transform CyTOF raw data ion counts using the hyperbolic arcsinh
    function: `new_x <- asinh(a + b * x)`. Defaults to 0.2.

Finally, here is an example functional call of `cytofin_prep_anchors`:

```{r}
input_data_path <- file.path(base_path, "homogenization_output")
output_path <- file.path(base_path, "anchor_prep_output")

anchor_statistics <- 
  cytofin_prep_anchors(
    metadata_path = metadata_path, 
    panel_path = panel_path, 
    input_data_path = input_data_path, 
    output_path = output_path
  )

print(anchor_statistics)
```

As shown above, the returned value is a list with 4 items in it: the
universal variance vector (`universal_var`), the universal mean vector
(`universal_mean`), the bulk variance (`bulk_var`) and the bulk mean
(`bulk_mean`). Note that the elements of `universal_var` and
`universal_mean` are named with their corresponding metal names (not
antigen names), as this interfaces a bit more conveniently with the
`flowCore` functions that `CytofIn` uses under-the-hood.

Importantly, you only need to use `cytofin_prep_anchors` if you plan to
batch normalize your .fcs files using external anchors identified on
each plate (using `cytofin_normalize`). If you plan to batch normalize
your .fcs files using non-redundancy scores from each sample's most
stable channels (using `cytofin_normalize_nrs`), you do not need to run
`cytofin_prep_anchors` first.

##### Step 2 - Batch normalization

After the anchors' summary statistics are computed, batch correction
using external anchors can be performed using either
`cytofin_normalize`. This function can perform batch correction using 5
different normalizations functions (which we call "modes"). Specifically, the options are called the "meanshift", "meanshift_bulk", "variance", "z-score", and "beadlike" normalization functions. Which of
these is most applicable to a given analysis will differ from user to
user. We recommended that users try using both and then manually
inspect/visualize the batch-corrected data in order to determine which
method they prefer.

To perform batch normalization using external anchors identified on each
plate, use `cytofin_normalize`. This batch normalization strategy
assumes that the anchors on each plate are relatively similar to one
another, and it uses this similarity to adjust the marker expression
measurements on each plate based on how much each plate's anchor differs
from the other anchors. The `cytofin_normalize` function takes several
required arguments:

-   `metadata_path`: A connection leading to an .xlsx or .csv file
    containing a metadata table with information about each file to be
    analyzed. This file should be identical to that used for
    `cytofin_homogenize`.
-   `panel_path`: A connection leading to an .xlsx or .csv file
    containing information about the standardized antigen panel in the
    homogenized dataset. This file should be identical to that used for
    `cytofin_homogenize`.
-   `anchor_statistics`: Either a list of numeric values produced by the
    `cytofin_prep_anchors` function or a connection leading to an .rds
    object containing anchor statistics.
-   `input_data_path`: A connection to a directory containing the input
    .fcs files to be batch normalized. In most cases, this will be the
    directory to which the output .FCS files from `cytofin_homogenize`
    were written.
-   `output_data_path`: A connection to a directory where the output
    (i.e. batch normalized) .FCS files will be written.
-   `mode`: A string indicating which transformation function should be
    used for batch normalization ("meanshift", "meanshift_bulk",
    "variance", "z-score", or "beadlike").

In addition to these required arguments, `cytofin_normalize` takes
several optional arguments:

-   `input_prefix`: The string that was appended to the name of the raw
    input .fcs files of `cytofin_homogenize` to create their
    corresponding output file names. Defaults to "homogenized\_".

-   `output_prefix`: The string to be appended to the name of each input
    .fcs file to create the name of the corresponding output file
    (post-homogenization). Defaults to "normalized\_".

-   `shift_factor` and `scale_factor`: The scalar values *a* and *b*,
    respectively, to be used in the hyperbolic arc-sine function used to
    transform CyTOF ion counts according to the following equation:
    `new_x <- asinh(a + b * x)`. `shift_factor` defaults to 0 and
    `scale_factor` defaults to 0.2, which are customary values used by
    most scientists in the CyTOF community.

Using these arguments, a call to `cytofin_normalize` will perform the
batch correction and save the output (i.e. batch normalized) .fcs files
to the directory specified by `output_data_path`. An example function
call is given here:

```{r}
output_data_path <- 
  file.path(base_path, "normalization_results")

norm_result <- 
  cytofin_normalize(
    metadata_path = metadata_path, 
    panel_path = panel_path, 
    anchor_statistics = anchor_statistics, 
    input_data_path = input_data_path, 
    output_data_path = output_data_path, 
    mode = "meanshift"
  )
```

When this function is called, it has two effects. The first is to save
the batch-normalized output .fcs files to the `output_data_path`
directory. The second is to return a data.frame that stores mean and
variance information about each input file (as well as its associated
anchor) both before and after normalization. This data.frame can be
passed directly into the `cytofin_make_plots` function to return 8
diagnostic plots per sample illustrating the quality of the
normalization:

```{r}
# we make only the plot for the first input .fcs file
# for illustrative purposes
cytofin_make_plots(
  normalization_result = norm_result,
  which_rows = 1,
  val_path = "none"
)
```

#### Batch normalization using internal anchors (cytofin_normalize_nrs)


In the event that external anchors are not available, `CytofIn` can use
"internal anchors" within each sample for batch normalization.
Specifically, instead of defining a single external anchor for all the
samples on a given plate like `cytofin_normalize`, the
`cytofin_normalize_nrs` function identifies the most stable channels in
the dataset overall and uses them as internal anchors that are used to
batch normalize all other channels from sample-to-sample. A schematic diagram of how `cytofin_normalize_nrs` works is provided below: 

![](./inst/images/image3.png)

In words, to identify
the most stable channels in the combined dataset, `CytofIn` uses a
PCA-based non-redundancy score (NRS) as described before (see
[here](https://pubmed.ncbi.nlm.nih.gov/26095251/)). A minimum of 3
channels should be selected to establish an internal reference from
which signals can be calibrated between CyTOF files.

To do this, `cytofin_normalize_nrs` takes several of the same arguments as
`cytofin_normalize`, defined as above: `metadata_path`, `panel_path`,
`input_data_path`, `output_data_path`, `input_prefix`, `output_prefix`,
`shift_factor`, and `scale_factor`. In addition, it takes the following
optional arguments:

-   `nchannels`: An integer representing the number of "most stable"
    (i.e. with the lowest non-redundancy scores) channels that should be
    used for batch normalization. Defaults to 3.

-   `make_plot`: A boolean value representing if, in addition to its
    other effects, `cytofin_normalize_nrs` should return a plot
    illustrating the distribution of non-redundancy scores for each
    channel among all .fcs files being batch normalized. Defaults to
    FALSE.

These arguments can be used in a function call as follows:

```{r}
# path to save the normalized .fcs files
output_data_path <- 
  file.path(base_path, "normalization_nrs_results")

# call function
norm_result_nrs <- 
  cytofin_normalize_nrs(
    metadata_path = metadata_path, 
    panel_path = panel_path, 
    input_data_path = input_data_path, 
    output_data_path = output_data_path, 
    nchannels = 3, 
    make_plot = FALSE
  )
```

Just like `cytofin_normalize` above, `cytofin_normalize_nrs` has several
effects. First, it writes batch-normalized .fcs files to
`output_data_path` and makes a plot depicting sample-wise and
channel-wise non-redundancy scores according to the value of
`make_plot`. In addition, it returns a data.frame that can be passed
into `cytofin_make_plots` to make diagnostic plots regarding the batch
normalization procedure:

```{r}
# show only 1 set of plots for illustrative purposes
cytofin_make_plots(
  normalization_result = norm_result_nrs, 
  which_rows = 7, 
  val_path = validation_data_path
)
```

## Additional Information

For questions about the `cytofin` R package, please email
[kardavis\@stanford.edu](mailto:kardavis@stanford.edu) or open a GitHub
issue [here](https://github.com/bennyyclo/Cytofin).

```{r}
# session information for rendering this README file
sessionInfo()
```