methods.qmd

---
title: "Methods"
author: [Shixiang Wang, Yankun Zhao]
date: 2024-12-04
draft: false
toc: true
toc-depth: 3
toc-expand: 2

format:
  html:
    code-fold: true
    code-overflow: wrap
    code-tools: false
    code-link: true
---


## Gene Fusions identification and differential expression analysis

We employed two Gene Fusions(GFs) detection tools, including arriba, STAR-Fusion, to identify, parse, and annotate GFs junctions within each sample, using the human genome version hg38 as the reference. In our pursuit of biologically robust GFs identification within each cohort, we undertook an approach metafusion that amalgamated results from two GFs detection tools. Firstly, we exclusively retained Fusions located on chromosomes 1 to 22, as well as the X and Y chromosomes. Secondly, GFs were mandated to exhibit genomic overlap with gene regions as defined in the reference genome annotation file, specifically "gencode.v34.annotation.gtf.". Thirdly to ensure the precision and recall in GFs prediction with different tools, one can either rely on consensus predictions or all predctions and choosing junction-crossing reads and reads mates which flank (span) the breakpoint. Fourthly, we offer FusionAnnotator to identify and prioritize GFs that have been previously reported in cancer or normal samples. Lastly, the detected data can be divided into Cis-SAGes and other categories to further explore the debate on what should be considered true fusion RNAs. This meticulous methodological approach was instrumental in ensuring the acquisition of biologically sound and reliable GFs datasets, thereby fortifying the integrity of our analysis within each cohort. This analysis facilitated comparisons between patient groups, such as those who responded to checkpoint immunotherapy versus non-responders, as well as between samples collected before and during ICB treatment.

## Response and Response2

We gathered and processed patient 'Response' data from various cohorts. However, due to inconsistencies among these cohorts, such as the usage of different terms like CR, PR, SD, PD, R, NR, DCB, and NDB, we harmonized the response variable based on the timing of treatment and treatment response. As a result, we introduced a new variable called 'Response2' with four distinct categories: Pre-R, Pre-NR, On-R, and On-NR. In cases where response data was not available, we assigned the category NE.

## Cancer Type Abbreviation


| Abbr. | Description                                                   |
| :----------------- | :----------------------------------------------------------- |
| BLCA               | Bladder urothelial carcinoma                               |
| SKCM               | Skin cutaneous melanoma                                    |
| KIRC & CCRCC       | Kidney renal clear cell carcinoma                            |
| HNSC & HNSCC       | Head and neck squamous cell carcinoma                       |
| NSCLC              | Non-small cell lung cancer, including lung adenocarcinoma and lung squamous cell carcinoma |
| SCLC | Small cell lung cancer |
| SARC | Sarcoma |
| BRCA | Breast cancer |
| GBM | Glioblastoma Multiforme |
| STAD | Stomach Adenocarcinoma |
| SGC | Salivary gland cancer |
| LUSC | Lung Squamous Cell Carcinoma |
| MESO | Mesothelioma |
| PCPG | Pheochromocytoma and Paraganglioma |
| LIHC | Liver Hepatocellular Carcinoma |
| PRAD | Prostate Adenocarcinoma |
| ACC | Adrenocortical Carcinoma |
| TGCT | Testicular Germ Cell Tumors |
| PAAD | Pancreatic Adenocarcinoma |
| UCEC | Uterine Corpus Endometrial Carcinoma |
| THCA | Thyroid carcinoma |
| CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma |
| ESCA | Esophageal carcinoma |
| READ | Rectum adenocarcinoma |
| OV | Ovarian Serous Cystadenocarcinoma |
| CHOL | Cholangiocarcinoma |
| UCS | Uterine Carcinosarcoma |
| LUAD | Lung Adenocarcinoma |
| LGG | Brain Lower Grade Glioma |
| HGSC | High-Grade Serous Carcinoma |
| LSCC | Lung Squamous Cell Carcinoma |
| PDAC | Pancreatic Ductal Adenocarcinoma |
| THYM | Thymoma |
| KICH | Kidney Chromophobe |
| KIRP | Kidney Renal Papillary Cell Carcinoma |
| COAD | Colon Adenocarcinoma |
| LAML & AML | Acute Myeloid Leukemia |
| DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma |
| ALL | Acute lymphocytic leukemia |
| AML-IF | Acute Myeloid Leukemia, Induction Failure Subproject |
| NBL | Neuroblastoma |
| OS | Osteosarcoma |
| ES | Ewing sarcoma |
| WT | Wilms tumor |
| CCSK | Clear Cell Sarcoma of the Kidney |
| RT | Kidney, Rhabdoid Tumor |
| CCSK | CNS, ependymoma |
| MRT | CNS, rhabdoid tumor |
| CCSK | CNS, medulloblastoma |
| CCSK | NHL, anaplastic large cell lymphoma |
| CCSK | NHL, Burkitt lymphoma (BL) |
| CCSK | Rhabdomyosarcoma |
| CCSK | Soft tissue sarcoma, non-rhabdomyosarcoma |
| CCSK | CNS, other |
| B-ALL | Precursor B-cell lymphoblastic leukemia |
| GNB | Ganglioneuroblastoma |


## Clinical data preprocessing

### TCGA

```r
library(data.table)
library(dplyr)

# load("/home/data2/Projects/Fusion/unclean/TCGA_ALL_clinical.rdata")
#
# TCGA_ALL_clinical |>
#   rename(Patient_ID = bcr_patient_barcode,
#          Age = age_at_initial_pathologic_diagnosis,
#          Sex = gender,
#          Race = race,
#          )

# [Discrepancy] [Not Applicable]  [Not Available]
# [Discrepancy] [Not Available]       [Unknown]
# # [Discrepancy]                                        [Not Applicable]
# 29                                                     105
# [Not Available]                                         [Not Evaluated]
# 5140                                                       4
# [Unknown]


library(UCSCXenaShiny)

table(tcga_clinical_fine$Stage_ajcc)
tcga_clinical

tcga_clinical_fine
tcga_genome_instability
tcga_purity
tcga_subtypes
tcga_surv

nrow(tcga_clinical_fine)
nrow(unique(tcga_clinical_fine))

tcga_clinical = unique(tcga_clinical_fine) |>
  left_join(tcga_surv, by = c("Sample"="sample")) |>
  left_join(tcga_purity, by = c("Sample"="sample"))

tcga_clinical[551, ]$Sample

sum(startsWith(tcga_genome_instability$sample, "TCGA"))

tcga_clinical = tcga_clinical |>
  left_join(tcga_genome_instability[startsWith(tcga_genome_instability$sample, "TCGA") & !duplicated(tcga_genome_instability$sample), ],  by = c("Sample"="sample")) |>
  #left_join(tcga_genome_instability[duplicated(tcga_genome_instability$sample), ],  by = c("Sample"="sample")) |>
  left_join(tcga_subtypes,  by = c("Sample"="sampleID"))


sum(is.na(tcga_clinical$Sample))
sum(duplicated(tcga_clinical$Sample))

tcga_clinical[duplicated(tcga_clinical$Sample), ]$Sample

tcga_clinical.final = tcga_clinical[!duplicated(tcga_clinical$Sample), ] |>
  rename(Sample_ID = Sample, Sex = Gender, OS_Time = OS.time, OS_Status = OS, DSS_Time = DSS.time, DSS_Status = DSS, DFS_Time = DFI.time, DFS_Status = DFI, PFS_Time = PFI.time, PFS_Status = PFI) |>
  select(-cancer_type) |>
  mutate(Sex = ifelse(Sex == "FEMALE", "F", "M"),
         Patient_ID = substr(Sample_ID, 1, 12))

sum(is.na(tcga_clinical.final$Sample_ID))
sum(is.na(tcga_clinical.final$Cancer))
sum(duplicated(tcga_clinical.final$Sample_ID))

tcga_clinical.final = tcga_clinical.final |>
  select(Patient_ID, Sample_ID, everything()) |>
  as.data.table()

tcga_clinical.final

fwrite(tcga_clinical.final, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TCGA_info.tsv", sep = "\t")
```

### CPTAC

```r
library(data.table)


# Clinical data from CPTAC paper ------------------------------------------


cli_list = fs::dir_ls("CPTAC_Clinical_meta_data_v1/")

cli = purrr::map(cli_list, fread) |> rbindlist(fill = TRUE, use.names = TRUE)
colnames(cli)
# 问题，这个第 2 行要跳过才行
# 另外这个同时包含 CPTAC-2/3 的样本，但总数少一些

cli2 = purrr::map(cli_list, function(x) {
  # 由于列名也不同，这里读两次，一次读 colname，一次读数据
  data1 = fread(x)
  data2 = fread(x, skip = 2, header = FALSE)
  # 更新列名
  colnames(data2) = colnames(data1)
  data2$Cancer = stringr::str_remove(basename(x), "_meta.txt")
  data2 |> as.data.table()
}) |> rbindlist(fill = TRUE, use.names = TRUE)
#colnames(cli2) = colnames(cli)
colnames(cli2)

cli2[Tumor == "Yes" & Normal == "Yes"] # 这个是标记患者是否同时 tumor / normal sample


cli_list2 = fs::dir_ls("meta_data_BCM/")

cli3 = purrr::map(cli_list2, function(x) {
  # 由于列名也不同，这里读两次，一次读 colname，一次读数据
  data1 = fread(x)
  data2 = fread(x, skip = 2, header = FALSE)
  # 更新列名
  colnames(data2) = colnames(data1)
  data2$Cancer = stringr::str_remove(basename(x), "_meta.txt")
  data2 |> as.data.table()
}) |> rbindlist(fill = TRUE, use.names = TRUE)
colnames(cli3)

all.equal(cli2, cli3)
all.equal(sort(colnames(cli2)), sort(colnames(cli3)))

cli4 = cli3[, colnames(cli2), with = FALSE]
all.equal(cli2, cli4)
identical(cli2, cli4)

dplyr::setdiff(cli2, cli4) |> View()
dplyr::setdiff(cli4, cli2) |> nrow()

# rbind(cli2[idx == "01OV007"],
#       cli4[idx == "01OV007"]) |> View()

# 2 个数据源目录的文件基本没有区别

cli.final = cli2 |>
  dplyr::rename(
    Patient_ID = idx,
    OS_Status = OS_event,
    OS_Time = OS_days,
    PFS_Status = PFS_event,
    PFS_Time = PFS_days
  ) |>
  dplyr::mutate(
    Sample_ID = Patient_ID,
    Sex = ifelse(Sex == "Female", "F", "M")
  ) |>
  dplyr::select(
    Patient_ID, Sample_ID, Cancer, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status, dplyr::everything()
  )
  # dplyr::select(
  #   - dplyr::starts_with("CIBERSORT"), - dplyr::starts_with("xCell")
  # )


fwrite(cli.final, file = "CPTAC_cli_paper_combined.tsv", sep = "\t")

# Clinical data from GDC --------------------------------------------------


system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/CPTAC
mkdir CPTAC-3 CPTAC-2
tar zxvf clinical.project-cptac-2.2024-08-07.tar.gz -C CPTAC-2
tar zxvf clinical.project-cptac-3.2024-08-07.tar.gz -C CPTAC-3
       ")

cptac2 = fread("gdc_cli/CPTAC/CPTAC-2/clinical.tsv")
cptac3 = fread("gdc_cli/CPTAC/CPTAC-3/clinical.tsv")

which(colnames(cptac2) %in% "residual_disease")
which(colnames(cptac3) %in% "residual_disease")
all.equal(cptac2[[119]], cptac2[[192]])
all.equal(cptac3[[119]], cptac3[[192]])
cptac2[[192]] = NULL
cptac3[[192]] = NULL

setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# > setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
# [1] "01OV049"   "26OV010"   "100004028" "05BR058"   "05BR055"   "17OV019"   "11BR069"   "02OV042"   "100004012" "05BR051"   "03BR012"   "02OV045"   "04OV041"
# [14] "05BR052"   "05BR031"   "1488"      "11OV009"   "02OV040"   "01OV002"   "100002921" "22OV001"   "13OV004"   "100003304" "02OV035"   "01OV046"   "01OV045"
# [27] "17OV034"
# > setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# [1] "C3N-02672"                    "C3L-01929"                    "C3N-02682"                    "C3N-02012"                    "C3N-02070"
# [6] "C3N-02696"                    "C3L-02354"                    "C3N-03026"                    "C3L-01739"                    "C3N-02088"
# [11] "C3L-04213"                    "C3N-03205"                    "C3L-04081"                    "C3N-03791"                    "C3L-03679"
# [16] "C3L-02643"                    "C3N-04686"                    "C3L-01558"                    "C3L-03268"                    "C3L-01633"
# [21] "C3L-02220"                    "C3N-02439"                    "C3N-02028"                    "C3N-03800"                    "C3L-04392"
# [26] "C3L-02556"                    "C3N-02721"                    "C3L-01672"                    "C3L-00938"                    "C3L-02201"
# [31] "C3N-02146"                    "C3N-02639"                    "C3N-03788"                    "C3L-03733"                    "C3L-04037"
# [36] "C3N-01091"                    "C3L-02747"                    "C3N-01879"                    "C3N-02996"                    "C3N-02296"
# [41] "C3L-02553"                    "C3L-03632"                    "C3N-03755"                    "GTEX-NPJ7-0011-R10A-SM-HAKXW" "C3L-02746"
length(setdiff(cptac3$case_submitter_id, cli.final$Sample_ID))  # 586


View(cptac2[case_submitter_id %in% setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)])
View(cptac3[case_submitter_id %in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)])

data.table::fwrite(
  cptac3[case_submitter_id %in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)],
  file = "CPTAC3_non_included.tsv", sep = "\t"
)

cptac3


# 根据 GDC 已有信息补全 paper 缺少的样本信息 ---------------------------------------------

# 可能标记了 cancer type 的列
# primary_diagnosis site_of_resection_or_biopsy tissue_or_organ_of_origin

cptac = rbind(cptac2, cptac3)[, .(case_submitter_id, primary_diagnosis, site_of_resection_or_biopsy, tissue_or_organ_of_origin)] |> unique()
cptac = dplyr::left_join(cptac, cli.final[, .(Sample_ID, Cancer)], by = c("case_submitter_id"="Sample_ID"))

setdiff(cli.final$Sample_ID, cptac$case_submitter_id)
# 有 29 例不在 GDC 数据中

table(cptac$Cancer, cptac$primary_diagnosis) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$site_of_resection_or_biopsy) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$tissue_or_organ_of_origin) |> pheatmap::pheatmap()

table(cptac$Cancer,
      paste(
        cptac$primary_diagnosis,
        cptac$site_of_resection_or_biopsy,
        cptac$tissue_or_organ_of_origin, sep = "_"
      )) |> pheatmap::pheatmap(show_colnames = FALSE)

# 一个个处理
cat(sub("_meta.txt", "", basename(names(cli_list))))

typelist = sub("_meta.txt", "", basename(names(cli_list)))
# BRCA CCRCC COAD GBM HGSC HNSCC LSCC LUAD PDAC UCEC
for (i in typelist) {
  message("check type ", i)
  cptac_part = cptac |> dplyr::filter(Cancer == i)
  message("  primary_diagnosis")
  table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
  message("  site_of_resection_or_biopsy")
  table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
  message("  tissue_or_organ_of_origin")
  table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}
# BRCA (122): tissue_or_organ_of_origin "Breast, NOS"
# CCRCC (103): primary_diagnosis "Renal cell carcinoma, NOS"
#        tissue_or_organ_of_origin "Kidney, NOS"
# COAD (106): tissue_or_organ_of_origin "Colon, NOS", "Rectum, NOS"
# GBM (99): primary_diagnosis "Glioblastoma"
# HGSC (87): primary_diagnosis "Serous adenocarcinoma, NOS" AND
#  tissue_or_organ_of_origin Fallopian tube Ovary Peritoneum, NOS
# HNSCC (108): 其他情况 primary_diagnosis "Squamous cell carcinoma, NOS"
# Base of tongue, NOS Cheek mucosa Floor of mouth, NOS Gum, NOS
# HNSCC                   1            2                  18        3
#
# Head, face or neck, NOS Larynx, NOS Lip, NOS Oropharynx, NOS
# HNSCC                       2          47        4               4
#
# Overlapping lesion of lip, oral cavity and pharynx Tongue, NOS Tonsil, NOS
# HNSCC
# LSCC (108): primary_diagnosis "Squamous cell carcinoma, NOS" AND
#   tissue_or_organ_of_origin "lung" pattern
# LUAD (110): primary_diagnosis "Adenocarcinoma, NOS" AND
#   tissue_or_organ_of_origin "lung" pattern
# PDAC (105): tissue_or_organ_of_origin "pancreas" pattern
# UCEC (95): primary_diagnosis "Endometrioid adenocarcinoma, NOS"
sum(!is.na(cptac$Cancer))

# rbind(cptac2, cptac3)[,192] # 有重复列，需要回到前面处理 (fixed)

# 实际测试和更新
cptac_updated = rbind(cptac2, cptac3) |>
  unique() |>
  dplyr::mutate(
    Cancer = dplyr::case_when(
      tissue_or_organ_of_origin %in% "Breast, NOS" ~ "BRCA",
      primary_diagnosis %in% "Renal cell carcinoma, NOS" ~ "CCRCC",
      tissue_or_organ_of_origin %in% c("Colon, NOS", "Rectum, NOS") ~ "COAD",
      grepl("Glioblastoma", primary_diagnosis, ignore.case = TRUE) | grepl("Oligodendroglioma", primary_diagnosis, ignore.case = TRUE) ~ "GBM",
      grepl("GTEX", case_submitter_id, ignore.case = TRUE) ~ "GTEX-Brain",
      case_submitter_id %in% c("C3L-06912", "C3L-07212") ~ "GBM",
      #primary_diagnosis %in% "Glioblastoma" ~ "GBM",
      primary_diagnosis %in% "Serous adenocarcinoma, NOS" & tissue_or_organ_of_origin %in% c("Fallopian tube", "Ovary", "Peritoneum, NOS") ~ "HGSC",
      primary_diagnosis %in% "Squamous cell carcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LSCC",
      primary_diagnosis %in% "Adenocarcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LUAD",
      grepl("pancreas", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "PDAC",
      primary_diagnosis %in% "Endometrioid adenocarcinoma, NOS" ~ "UCEC",
      TRUE ~ "HNSCC"
    )
  ) |> as.data.table()

for (i in "HNSCC") { # typelist
  message("check type ", i)
  cptac_part = cptac_updated |> dplyr::filter(Cancer == i)
  message("  primary_diagnosis")
  table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
  message("  site_of_resection_or_biopsy")
  table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
  message("  tissue_or_organ_of_origin")
  table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}

table(cptac$Cancer)
table(cptac_updated$Cancer)

cptac_updated |>
  dplyr::filter(
    primary_diagnosis == "'--" | site_of_resection_or_biopsy == "'--" |
    tissue_or_organ_of_origin %in% c("'--", "Unknown")
  )
# 1-9 from GTEX-Brain
# C3L-06912 C3L-07212 Gliomas # https://portal.gdc.cancer.gov/cases/e6027623-a5b2-4d82-a9ca-f37c948dd3ed
cptac_updated$case_submitter_id[grepl("GTEX", cptac_updated$case_submitter_id)]

nrow(cptac_updated)

fwrite(cptac_updated, file = "CPTAC_combined_with_cancer_annotated.tsv", sep = "\t")

cptac_unmatched = cptac_updated[! case_submitter_id %in% cli.final$Sample_ID]
colnames(cptac_unmatched)
colnames(cli.final)[1:20]

cptac_unmatched2 = cptac_unmatched |>
  dplyr::select(
    case_submitter_id, age_at_diagnosis, gender,
    days_to_death, vital_status,
    dplyr::starts_with("ajcc"),
    dplyr::starts_with("days"),
    last_known_disease_status, progression_or_recurrence, residual_disease,
    tumor_grade, Cancer
  )

sum(cptac_unmatched2$days_to_last_follow_up != "'--")
sum(cptac_unmatched2$days_to_death != "'--")
sum(cptac_unmatched2$days_to_recurrence != "'--")
sum(cptac_unmatched2$days_to_last_known_disease_status != "'--")

sum(cptac_unmatched2$progression_or_recurrence != "'--")


sum(!is.na(cli.final$OS_Time))
sum(!is.na(cli.final$PFS_Time))
sum(cli.final$OS_Time >= cli.final$PFS_Time, na.rm = TRUE)
cli.final[cli.final$OS_Time < cli.final$PFS_Time] |> View()

sum(as.integer(cptac_unmatched2$days_to_last_follow_up) >= as.integer(cptac_unmatched2$days_to_last_known_disease_status), na.rm = TRUE)

cptac_unmatched3 = cptac_unmatched2 |>
  dplyr::rename(
    Sample_ID = case_submitter_id,
    Age = age_at_diagnosis,
    Sex = gender,
    OS_Time = days_to_last_follow_up,
    OS_Status = vital_status,
    PFS_Time = days_to_last_known_disease_status,
    PFS_Status = progression_or_recurrence,
    Histologic_Grade = tumor_grade,
    Path_Stage_pT = ajcc_pathologic_t,
    Path_Stage_pN = ajcc_pathologic_n,
    Stage = ajcc_pathologic_stage
  ) |>
  dplyr::select(
    Sample_ID, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status,
    Path_Stage_pN:Path_Stage_pT,
    Histologic_Grade, Cancer
  ) |>
  dplyr::mutate(
    Patient_ID = Sample_ID,
    Age = as.integer(round(as.integer(Age) / 365)),
    Sex = fcase(Sex %in% "female", "F",
                Sex %in% "male", "M", default = NA_character_),
    OS_Time = as.integer(OS_Time),
    OS_Status = fcase(OS_Status %in% "Alive", 0L,
                      OS_Status %in% "Dead", 1L, default = NA_integer_),
    PFS_Time = as.integer(PFS_Time),
    PFS_Status = fcase(PFS_Status %in% "no", 0L,
                       PFS_Status %in% "yes", 1L, default = NA_integer_),
    OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time),
    Path_Stage_pN = fcase(
      Path_Stage_pN %in% "N0", "pN0",
      Path_Stage_pN %in% c("N1", "N1a"), "pN1",
      Path_Stage_pN %in% c("N2", "N2a"), "pN2",
      default = NA_character_
    ),
    Path_Stage_pT = fcase(
      Path_Stage_pT %in% "T0", "pT0",
      Path_Stage_pT %in% c("T1", "T1a", "T1b", "T1c"), "pT1",
      Path_Stage_pT %in% c("T2", "T2a", "T2b"), "pT2",
      Path_Stage_pT %in% c("T3", "T3a", "T3b"), "pT3",
      Path_Stage_pT %in% "T4", "pT4",
      default = NA_character_
    ),
    Stage = fcase(
      Stage %in% c("Stage I", "Stage IA", "Stage IA1", "Stage IA2", "Stage IA3", "Stage IB"), "Stage I",
      Stage %in% c("Stage II", "Stage IIA", "Stage IIB"), "Stage II",
      Stage %in% c("Stage III", "Stage IIIA", "Stage IIIB", "Stage IIIC", "Stage IIIC1", "Stage IIIC2"), "Stage III",
      Stage %in% c("Stage IV", "Stage IVB"), "Stage IV",
      default = NA_character_
    ),
    Histologic_Grade = fcase(
      Histologic_Grade %in% "G1", "G1 Well differentiated",
      Histologic_Grade %in% "G2", "G2 Moderately differentiated",
      Histologic_Grade %in% "G3", "G3 Poorly differentiated",
      Histologic_Grade %in% "G4", "G4 Undifferentiated",
      default = NA_character_
    )
  )

cptac_unmatched3

# X: cannot be assessed

table(cptac_unmatched3$Path_Stage_pN)
table(cli.final$Path_Stage_pN)

table(cptac_unmatched3$Path_Stage_pT)
table(cli.final$Path_Stage_pT)

table(cptac_unmatched3$Stage)
table(cli.final$Stage)

table(cptac_unmatched3$Histologic_Grade)
table(cli.final$Histologic_Grade)


# Final output ------------------------------------------------------------

cli.final.2 = dplyr::bind_rows(
  cli.final, cptac_unmatched3
) |> as.data.table()

table(cli.final.2$Cancer)
sum(table(cli.final.2$Cancer))

# Race 全部加进去？
cli.final.2 = dplyr::left_join(
  cli.final.2, unique(rbind(cptac2, cptac3)[, .(case_submitter_id, race)]),
  by = c("Sample_ID"="case_submitter_id")
) |>
  dplyr::mutate(
    race = ifelse(race %in% c("Unknown", "not reported"), NA_character_,
                  race)
  ) |>
  as.data.table()

sum(is.na(cli.final.2$race))

fwrite(cli.final.2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/CPTAC_info.tsv", sep = "\t")
```


### TARGET

```r
library(data.table)

cli1 = fread("TARGET_donor_allprojects_transfer_to_sample.gz")
cli2 = fread("TARGET_phenotype.gz")

target_cli = cli2 |>
  dplyr::select(-`_cohort`) |>
  dplyr::rename(Patient_ID = `_PATIENT`,
                Sample_ID = sample_id,
                disease_code = primary_disease_code,
                disease = `_primary_disease`,
                sample_type = `_sample_type`) |>
  dplyr::full_join(
    cli1 |> dplyr::select(
      -`_OS_UNIT`, -`_EVENT`, -`_TIME_TO_EVENT`
    ) |>
      dplyr::rename(
        Sample_ID = xena_sample,
        Sex = `_gender`,
        Age = `_age_at_diagnosis`,
        OS_Status = `_OS_IND`,
        OS_Time = `_OS`
      ), by = "Sample_ID"
  ) |>
  dplyr::mutate(
    Sex = ifelse(Sex == "Male", "M", "F"),
    OS_Status = ifelse(OS_Status == "Alive", 0L, 1L)
  ) |>
  dplyr::select(Patient_ID, Sample_ID, Age, Sex, OS_Time, OS_Status, dplyr::everything()) |>
  as.data.table()

cli1

target_cli

fwrite(target_cli, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_xena_info.tsv", sep = "\t")


# Using cli from GDC ------------------------------------------------------

system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/TARGET
mkdir ALL_P1 ALL_P2 ALL_P3 AML CCSK NBL OS RT WT
tar zxvf clinical.project-target-all-p1.2024-08-07.tar.gz -C ALL_P1
tar zxvf clinical.project-target-all-p2.2024-08-07.tar.gz -C ALL_P2
tar zxvf clinical.project-target-all-p3.2024-08-07.tar.gz -C ALL_P3
tar zxvf clinical.project-target-aml.2024-08-07.tar.gz -C AML
tar zxvf clinical.project-target-ccsk.2024-08-07.tar.gz -C CCSK
tar zxvf clinical.project-target-nbl.2024-08-07.tar.gz -C NBL
tar zxvf clinical.project-target-os.2024-08-07.tar.gz -C OS
tar zxvf clinical.project-target-rt.2024-08-07.tar.gz -C RT
tar zxvf clinical.project-target-wt.2024-08-07.tar.gz -C WT
       ")

gdc_target = rbindlist(
  list(
    fread("gdc_cli/TARGET/ALL_P1/clinical.tsv"),
    fread("gdc_cli/TARGET/ALL_P2/clinical.tsv"),
    fread("gdc_cli/TARGET/ALL_P3/clinical.tsv"),
    fread("gdc_cli/TARGET/AML/clinical.tsv"),
    fread("gdc_cli/TARGET/CCSK/clinical.tsv"),
    fread("gdc_cli/TARGET/NBL/clinical.tsv"),
    fread("gdc_cli/TARGET/OS/clinical.tsv"),
    fread("gdc_cli/TARGET/RT/clinical.tsv"),
    fread("gdc_cli/TARGET/WT/clinical.tsv")
  ),
  use.names = TRUE, fill = TRUE
)

table(gdc_target$project_id)
any(duplicated(gdc_target))
gdc_target = gdc_target[!duplicated(gdc_target)]

all.equal(gdc_target[[119]], gdc_target[[192]])
gdc_target[[192]] = NULL

sum(gdc_target$days_to_last_follow_up != "'--")
sum(gdc_target$days_to_death != "'--")

gdc_target2 = gdc_target |>
  dplyr::rename(
    Sample_ID = case_submitter_id,
    Age = age_at_diagnosis,
    Sex = gender,
    OS_Time = days_to_death,
    OS_Status = vital_status
  ) |>
  dplyr::select(
    Sample_ID, Age, Sex, OS_Time, OS_Status,
    project_id, race, primary_diagnosis
  ) |>
  dplyr::mutate(
    Patient_ID = Sample_ID,
    Age = as.integer(round(as.integer(Age) / 365)),
    Sex = fcase(Sex %in% "female", "F",
                Sex %in% "male", "M", default = NA_character_),
    OS_Time = as.integer(OS_Time),
    OS_Status = fcase(OS_Status %in% "Alive", 0L,
                      OS_Status %in% "Dead", 1L, default = NA_integer_),
    OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time)
  ) |>
  as.data.table()

#unique()

# 存在大量 case level 重复
gdc_target2[duplicated(gdc_target2), ]
duplicated(gdc_target[case_submitter_id == "TARGET-20-PANTIV"])
gdc_target[case_submitter_id == "TARGET-20-PANTIV"] |> View() # 存在少量记录的更新？

gdc_target2 = gdc_target2[!duplicated(gdc_target2)]

#table(gdc_target2$PFS_Time)
any(duplicated(gdc_target2))
any(duplicated(gdc_target2$Sample_ID))

gdc_target2[duplicated(gdc_target2$Sample_ID)]
# 会存在不同 ALL 的 Project 中
gdc_dup = gdc_target2[Sample_ID %in% Sample_ID[duplicated(Sample_ID)]]
table(gdc_dup$project_id)
# TARGET-ALL-P1 TARGET-ALL-P2 TARGET-ALL-P3
# 10            57            47
#
# 这种先不处理，如果 ALL 合并分析考虑去重

colnames(gdc_target2)
gdc_target2 = gdc_target2[, c("Patient_ID", "Sample_ID", "Age", "Sex", "race", "primary_diagnosis", "OS_Time", "OS_Status", "project_id")]

fwrite(gdc_target2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_info.tsv", sep = "\t")

sum(!is.na(gdc_target2$OS_Time))
sum(!is.na(cli1$`_OS`))

nrow(na.omit(unique(target_cli[, .(Patient_ID, OS_Time, OS_Status)])))
nrow(na.omit(unique(gdc_target2[, .(Patient_ID, OS_Time, OS_Status)])))
```