# Tutorial 4: Processing multi-omics data with Enhlink

In this tutorial, we will show how to take advantage of multi-omics single-cell data with Enhlink. The advantage of these datasets is to generate features in multiple omics at the single-cell resolution. With regards to linkage inference, Enhlink allows to to first infer linkages in each omic independently and then to intersect the results and identify the common links. these shared links are logically better true candidates since the correlations is observed in multiple modalities. For example with scATAC/RNA-seq dataset, we can search enhancers coaccessible with promoters and also correalted with their respective gene-expression.

## Case study using a multi-omic RNA/ATAC [PBMC from a Healthy Donor](https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-3-k-1-standard-2-0-0) dataset

As an example, we will use the same dataset that in the tutorial 3, which is a [multi-omic RNA/ATAC](https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-3-k-1-standard-2-0-0) of the example available datasets of 10XGenomics.

### scRNA matrix
We first need to download  the [filtered_feature_bc_matrix](https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-3-k-1-standard-2-0-0) the cell x gene expression MTX matrix and the corresponding indexes

```bash
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz

# decompressing the file
tar -xvf pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz
```

The matrix is in the Cell-Ranger mtx format see [tutorial 3](tutorial_3_create_mat.md) which is a 3-columns numerical matrix with the feature indexes in the first column, the barcode index in the second columns and having the indexes starting at 1.

### scATAC matrix

Unfortunately, the Cell-Ranger processing of this dataset does not provide the matrix equivalent for the scATAC-seq data. However, it does provide the fragments and the peak file. As we have seen in [tutorial 3](tutorial_3_create_mat.md), these files are enough to create a cell x peak matrix:

```bash
# downloading fragments and peak files:
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_fragments.tsv.gz

# Creating a cell x peak matrix using the filtered barcodes of the scRNA-seq analysis using ATACMatUtils

ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
    -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi ./pbmc_unsorted_3k_atac_peaks.bed \
    -out ./filtered_feature_atac_matrix \
    -format cellRanger
```

## Inferring linkage
To infer enhancers-promoter linkage, we need to download a GTF file indicating where  the promoters are located in the reference genome (see [tutorial 1](tutorial_1_atac_only.md)).

```bash
wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed
```

Let's also download the secondary analysis of cell Ranger which provides clustering and also linkage.

```bash
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_analysis.tar.gz
tar -xvf pbmc_unsorted_3k_analysis.tar.gz
```

As an example, we will use the Gene expression K-Means clustering with K=3 as reference clustering that identified the cell types and convert it to tsv (required by Enhlink). Again, we don't necessarly need clusters but, in case of multiple cell-type, it is better to infer linkages for all the cell types independently.

First, we need to convert the CSV cluster file into a tab separated file (tsv). We can do it using the `sed` command.

```bash
# On Linux
sed 's/,/\t/g' analysis/clustering/gex/kmeans_3_clusters/clusters.csv > analysis/clustering/gex/kmeans_3_clusters/clusters.tsv

# On OSX
sed -e $'s/,/\t/g' analysis/clustering/gex/kmeans_3_clusters/clusters.csv > analysis/clustering/gex/kmeans_3_clusters/clusters.tsv
```

We will also use a random list of 500 genes instead of processing all of the 60K genes of Homo_sapiens.GRCh38.99.TSS.2K.bed as a proof of concept.

```bash
# Please use gshuf instead of shuf (frome coreutils) on OSX
cut -f4 Homo_sapiens.GRCh38.99.TSS.2K.bed|sort|uniq|shuf -n 500 > 500_rand_genes.txt
```

`-ygi` needs a bed file so we needs to convert the `features.tsv.gz` into a bed file:

```bash
# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat ./filtered_feature_atac_matrix/features.tsv.gz|cut -f1|sed 's/_/\t/g' > ./filtered_feature_atac_matrix/features.bed
```


Now we are ready to use Enhlink for linkage inference.

### ATAC-seq
First, we will infer enhancers-promoter coaccessibilities similarly than in [tutorial 1](tutorial_1_atac_only.md).

```bash
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_4_1/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets
```

here the option `-rmPeaksInTargets` is optional and prevent to infer promoter-promoter links.

### RNA-seq

We need to reformat the features.tsv.gz for the Gene expression matrix since we used the gene symbols as gene IDs in the promoter and genes files.

```bash
# the gene symbols are the second columns of the features file

# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat ./filtered_feature_bc_matrix/features.tsv.gz|cut -f2 > ./filtered_feature_bc_matrix/gene_symbols.tsv
```

We can now infers enhancers-gene links by identifying the ones with accessibility correlated with gene expression.

```bash
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
    -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_4_2/ \
    -threads 6 \
    -isExpr \
    -targets 500_rand_genes.txt \
    -format cellRanger
```

Note that the `-isExpr` flag is used to define the second matrix (`./filtered_feature_bc_matrix/matrix.mtx.gz`) as a float  matrix, rather than a boolean matrix. When using this option, Enhlink dichotomizes each values of a given feature using the mean of the non null elements of that given gene as threshold. If we would not have used this option, all the non-null zero would have been considered as 1 and the absent values as 0, which in the case of scRNA-seq could still have been enough to infer meaningfull links.

## Intersect ATAC-seq and RNA-seq linkages
 Now that we have linkages in both modalities, we can intersect them and identify the shared links. We can use `enhtools` for that purpose:

 ```bash
 enhtools -intersect \
    -in ./output_tutorial_4_1/ \
    -in2 ./output_tutorial_4_2/ \
    -out output_tutorial_4_3/
 ```

The `-intersect` command is designed to first find pairs of bedpe files from the `-in` and `-in2` folder having a matching file name and output the intersecting links for each pair. It will output the precision, recall and f1 scores in the `stdout` and will also create a `stats.tsv` file for each cluster. the stats.tsv files breakdown the scores for each gene. Let's have a look to how many links we captured in the three experiments:

```bash
for fil in `ls ./output_tutorial_4_*/*.bedpe`
do
	nb=`cat ${fil}|grep -v "#"|wc -l`
	echo "${fil} nb links: ${nb}"
done

## Output:
            # ATAC
./output_tutorial_4_1/1.bedpe nb links: 523
./output_tutorial_4_1/2.bedpe nb links: 585
./output_tutorial_4_1/3.bedpe nb links: 382
            # RNA
./output_tutorial_4_2/1.bedpe nb links: 222
./output_tutorial_4_2/2.bedpe nb links: 159
./output_tutorial_4_2/3.bedpe nb links: 114
            # shared
./output_tutorial_4_3/1.bedpe nb links: 35
./output_tutorial_4_3/2.bedpe nb links: 28
./output_tutorial_4_3/3.bedpe nb links: 11
```

As we can see, The highest number of links are captured with the ATAC data alone, while RNA+ATAC produce around a third of the number of links, depending of the set of genes used. This is not surprising since correlations between ATAC and RNA are noisier than enhancer / promoter accessibility correlation, since latency and other regulatory mechanisms will affect gene expression. In this case, the number of shared links between the two modalidies is only a fraction of the original numbers. These links indicate robust enhancer-promoter candidates since both promoter accessibility and gene expression are correlated to enhancer accessibility.

### increasing the sensitivity of ATAC+RNA linkage
To capture more RNA+ATAC links, we can try to increase the sensitivity of Enhlink by either increasing the pvalue-cutoff threshold (which is already at 0.05) or by increasing the expected number of explanatory features per models or the number of trees (the `-n_boot` option) in the ensemble model (see [Tutorial 5](tutorial_5_advanced_functionalities.md)). By doing so, we will decrease the specificity (.e. more noise) but increase the sensitivity (e.g. number of true positive) of the results. Since we wanto intersect with ATAC links, it might be OK to do so. We will describe how to change the hyperparameters and adapt the accuracy of Enhlink in the next section.

### Intersecting with 10XGenomics linkage
`enhtools -intersect` looks for perfect matches between pairs of files of two folders. Alternatively, we could have used `enhtools -intersect2` that look for perfect matches between two files or `enhtools -intersect3` that looks for intersection based on imperfect matches. In the latter case `enhtools` looks for intersection between each of the two regions of the bedpe files without necessary a perfect match. Let's use it to intersect our results with the Cell-RAnger linkage analysis:

```bash
mkdir ./output_tutorial_4_4/

enhtools -intersect3 -in ./output_tutorial_4_3/1.bedpe -in2 ./analysis/feature_linkage/feature_linkage.bedpe -out ./output_tutorial_4_4/1.bedpe
enhtools -intersect3 -in ./output_tutorial_4_3/2.bedpe -in2 ./analysis/feature_linkage/feature_linkage.bedpe -out ./output_tutorial_4_4/2.bedpe
enhtools -intersect3 -in ./output_tutorial_4_3/3.bedpe -in2 ./analysis/feature_linkage/feature_linkage.bedpe -out ./output_tutorial_4_4/3.bedpe
```


### Conclusion

Overall, ATAC-seq alone is enough to find robust co-accessibility links that have better association than random. To further refine these links, it is useful to infer links from a second omics and identify the fraction of shared linkage in both omics. However, it is good to keep in mind that inferring links from ATAC+RNA only might be noisier than ATAC alone.