# Tutorial 1: Simple processing of snATAC-seq data

In this first tutorial we will process some example single-cell ATAC-seq data from one of the 10X-genomics datasets available online.
We will:

1. Download the single-cell MTX matrix (sparse matrix format) folder from the 10X genomics processed files obtained after cellRanger processing
2. Infer the Enhlinks "linkage" for all the cells and all the genes
3. Infer the Enhlinks "linkage" for each clusters
4. Infer the Enhlinks "linkage" and identify significant "cluster-specific" links

## Process all the cells

### Download the scATAC-seq MTX matrix

We first need to download the MTX matrix from the 10X example dataset: _10k Human PBMCs, ATAC v2, Chromium Controller_   that is accessible here:
[https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard)

```bash
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz

tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz
```

We then need to download a GTF file that describes the promoter locations in the HG38 :

```bash
wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed
```

### simple processing with Enhlink

We are now ready to launch a simple  linkage inference with Enhlink to link the enhancers of the MTX matrix to the target promoters defined in the GTF files.

First, we need to go to the folder containg the MTX matrix and the GFT files. We should have these files inside

```bash
tree

# output:

.
├── Homo_sapiens.GRCh38.99.TSS.2K.bed
└── filtered_peak_bc_matrix
    ├── barcodes.tsv
    ├── matrix.mtx
    └── peaks.bed
```

Since _Homo_sapiens.GRCh38.99.TSS.2K.bed_ has more than 60,000 genes, we will only process 500 random genes extracted from this file, as a proof of concept:

```bash
# Please use gshuf instead of shuf (frome coreutils) on OSX
cut -f4 Homo_sapiens.GRCh38.99.TSS.2K.bed|sort|uniq|shuf -n 500 > 500_rand_genes.txt
```

We will launch Enhlink with the following options:
- _-targets_ [optional] which defines the file containing the subset of promoters of interest. If not provided, Enhlink will process the 60K promoters defined in Homo_sapiens.GRCh38.99.TSS.2K.bed
- _-threads_ which defines the number of CPUs to use for each gene
- _-format_  which refers to the format of the input matrix
- _-rmPeaksInTargets_ which prevent inferring promoter-promoter linkage
- _-xgi_ is the barcode index of the MTX matrix (in the cellRanger format, this corresponds to the columns index)
- _-ygi_ is the peak index (as a BED file) of the MTX matrix (in the cellRanger format, this corresponds to the rows index)

```bash
enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
        -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
        -ygi ./filtered_peak_bc_matrix/peaks.bed \
        -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
        -out ./output_tutorial_1_1/ \
        -threads 6 \
        -targets 500_rand_genes.txt \
        -format cellRanger \
        -rmPeaksInTargets
```

This process should take about a minute to complete and the results are stored in the _/output_tutorial_1_1/_ folder:

```
tree /output_tutorial_1_1/

# output
./output_tutorial_1_1/
└── all.bedpe

head ./output_tutorial_1_1/all.bedpe

# EnhLink (version: 0.18.6)
##### PARAMETERS ####
#  maxDepth: 2 n_boot: 100 maxNbFeatures: 4 minMatSize: 100 threshold: 0.050000 minLeafSize: 10 downsample: 0 maxFeatType:all neighborhood: 250000 merging_cutoff: 1500 rmPeaksInPromoter: true  isFloatMat: false onlyPositiveLink: false secondOrder: false threads: 6 secondOrderMaxFeat: 2
# time: 2023-02-17 16:21:10
# input matrix: ./filtered_peak_bc_matrix/matrix.mtx
# xgi index: ./filtered_peak_bc_matrix/barcodes.tsv
# ygi index: ./filtered_peak_bc_matrix/peaks.bed
# mat file: ./filtered_peak_bc_matrix/matrix.mtx
# cluster file file:
# promoter file: Homo_sapiens.GRCh38.99.TSS.2K.bed
# cluster ID: all
#chrID1 start   stop    chrID2  start   stop    FDR.Adj.Pval    Score   Adj.Score       geneID
chr17   75077203        75078080        chr17   75058620        75062620        1.36e-04        2.19e-06        1.07e-05        ac111186.2
chr17   74981894        74982787        chr17   75058620        75062620        6.97e-08        7.57e-06        6.12e-05        ac111186.2
chr17   75154195        75155099        chr17   75058620        75062620        5.60e-08        9.77e-06        2.25e-05        ac111186.2
chr17   75036398        75037311        chr17   75058620        75062620        0.00e+00        1.16e-04        3.56e-04        ac111186.2
chr21   41973808        41974736        chr21   41765089        41769089        1.82e-02        7.43e-07        3.78e-07        ripk4
chr21   41816424        41817163        chr21   41765089        41769089        1.67e-10        3.25e-06        9.46e-07        ripk4
chr21   41895604        41896499        chr21   41765089        41769089        0.00e+00        2.11e-05        4.96e-05        ripk4
chr21   41926204        41927091        chr21   41765089        41769089        0.00e+00        1.23e-05        1.48e-05        ripk4

```

### Processing for different clusters

We can now perform an independent linkage analysis for each clusters. As an example, let's download the default clustering analysis performed by cell ranger:

```bash
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz

tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz
```

We will use the K-mean clusters for K=3 _analysis/clustering/kmeans_3_clusters/clusters.csv_. We first need to create a .tsv file instead of a .csv, which is straightforward:

```bash
# Option 1
sed 's/,/\t/g' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv

# option 2 (better for OSX)
awk '{FS=",";OFS="\t"}{print $1,$2}' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv
```

Next, we just have to add the _-clusters_ variable to Enhlink:

```bash
enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_1_2/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets
```

The results folder has one bedpe per cluster defined in the clusters.tsv file:

```bash
tree ./output_tutorial_1_2/

# output

./output_tutorial_1_2/
├── 1.bedpe
├── 2.bedpe
└── 3.bedpe

```

### Including covariates in the model

It is often judiciary to include, when possible, the covariates associated with each cells in the Enhlink analysis. This helps prevent batch effect and allows to identify covariate-specific links.
Cell-covariates of interest depend of the context but can be for example, the library IDs (if multiple runs are merged together), the sex, the condition, or even the cell-type/cluster ID. Since this dataset has no obvious cell covariates to download, we will use the previous cluster IDs as covariate as a proof of concept, and identify the cluster-specific links.

This analysis is different than the one above because it identified the links of each cluster using the subset of cells specific to each cluster. In the previous analysis, a link can be found in multiple clusters. Also, the last analysis is useful to find links specific to small clusters/cell-types among much larger subpopulations because it focused only on the subset of cells.

Please *verify* that the covariates file contains a *header* naming the different covariates. The covariates file should have the following structure:

```bash
covariateName1    covariateName2    ...    covariateNameJ
val_1_1           val_2_1                  val_J_1
val_1_2           val_2_2                  val_J_2
...

```

The values must not be *integer* or *float* but rather be categorical.
Let's launch this new analysis

```bash
enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -covariates ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -secondOrder \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_1_3/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets
```

This analysis should take couple of minutes. The _-covariate_ option allowed to add the binarized covariates into the model. Instead of inferring _p = f(P)_, with p the promoter vector and P the matrix formed with the surrounding peak vectors of p, we inferred _p = f(P,C)_ with C the matrix of binarized covariates. The option allowed _-secondOrder_ to perform additional analyzes to identify covariate-specific links at the expense of more computations.

The results folder has the following files:

```bash
tree ./output_tutorial_1_3/

# output

./output_tutorial_1_3/
├── all.2ndOrder.bedpe
├── all.batch.tsv
└── all.bedpe

```
The file _all.batch.tsv_ lists the covariates associated with each promoter and the file _all.2ndOrder.bedpe_contains an additional column: _Covariate_, defining the covariate associated with each link.

```bash
head -5 output_tutorial_1_3/all.batch.tsv

# output

Cluster_2       chr10   103394626       103398626       3.82e-04        8.77e-05        1.63e-04        pdcd11
Cluster_2       chr7    38176222        38180222        0.00e+00        8.27e-04        1.34e-03        stard3nl
Cluster_3       chr7    74888610        74892610        7.29e-03        5.99e-06        3.58e-06        stag3l2
Cluster_1       chr7    74888610        74892610        2.57e-05        1.75e-05        5.68e-06        stag3l2

head -20 output_tutorial_1_3/all.2ndOrder.bedpe

# Output

# EnhLink (version: 0.18.6)
##### PARAMETERS ####
#  maxDepth: 2 n_boot: 100 maxNbFeatures: 4 minMatSize: 100 threshold: 0.050000 minLeafSize: 10 downsample: 0 maxFeatType:all neighborhood: 250000 merging_cutoff: 1500 rmPeaksInPromoter: true  isFloatMat: false onlyPositiveLink: false secondOrder: true threads: 6 secondOrderMaxFeat: 2
# time: 2023-02-20 12:45:26
# input matrix: ./filtered_peak_bc_matrix/matrix.mtx
# xgi index: ./filtered_peak_bc_matrix/barcodes.tsv
# ygi index: ./filtered_peak_bc_matrix/peaks.bed
# mat file: ./filtered_peak_bc_matrix/matrix.mtx
# cluster file file:
# promoter file: Homo_sapiens.GRCh38.99.TSS.2K.bed
# metadata: ./analysis/clustering/kmeans_3_clusters/clusters.tsv
# cluster ID: all
##### Second order link-covariate interactions ####
#chrID1 start   stop    chrID2  start   stop    Covariate       FDR.Adj.Pval    Score   Adj.Score       geneID
chr10   103551405       103552275       chr10   103394626       103398626       Cluster_2       0.00e+00        5.63e-04        1.04e-03        pdcd11
chr10   103485434       103486330       chr10   103394626       103398626       Cluster_2       9.32e-06        1.67e-07        3.09e-07        pdcd11
chr7    38107239        38108115        chr7    38176222        38180222        Cluster_2       0.00e+00        1.77e-05        2.87e-05        stard3nl
chr7    38215499        38216377        chr7    38176222        38180222        Cluster_2       1.72e-02        2.00e-06        3.23e-06        stard3nl
chr7    38305247        38306139        chr7    38176222        38180222        Cluster_2       0.00e+00        6.68e-05        1.08e-04        stard3nl
chr7    38394425        38395335        chr7    38176222        38180222        Cluster_2       0.00e+00        1.61e-05        2.62e-05        stard3nl

```

Note that in this file, the FDR.Adj.Pval, Score, and Adj.Score represent scores pertaining to covariate-link inference rather than the links themselves

### Class imbalance with covariates

Enhlink provides an option called `-uniformSample` that can be used when analyzing datasets with class imbalance in covariates. This option helps to ensure that each bootstrap contains a more uniform distribution of cells from each class by taking into account the distribution of the covariates.

For instance, if one cluster represents 80% of cells and the other two clusters only represent 10%, the `-uniformSample` option will be beneficial to avoid a fraction of bootstrap samples having too few samples of the underrepresented clusters.

However, it is important to exercise caution when using this option, especially if one class is severely underrepresented. If a class represents less than `nbCellsTotal/nbCovariate`, the number of cells (`N`) sampled will be larger than the class size (`S`), resulting in a larger number of unique cells (`Nu`) being sampled than with a bootstrap procedure (66% on average). For instance, if `N` is twice the size of `S`, `Nu` will reach 90%, and 95% for three times the size of `S`, which could decrease the variability of the unique cells sampled.