# Tutorial 5: Enhlink advanced functionalities in this tutorial, we will describe some of the advanced functionalities of Enhlink that can be used to regularize the model and adjust the sensitivity / specificity, perform a distal analysis, and adjust computational speed. ## Inferring positively/negatively correlated linkage only Enhlink employs a modified information gain score to assess the relationship between enhancers and targets (promoters). The information gain scoring is designed to be symmetrical for both positively and negatively correlated vectors, enabling Enhlink to identify enhancers that negatively correlate with targets as well. By default, `enhlink` ( and `enhgrid`) will infer both positively and negaively correlated links. Furthermore the sign of the scores from the results bedpe file indicates the sense of the correlation. In addition, one can use the `-linkType` option, which accepts the following values: {"all", "positive", "negative"}, to either filter out the negative or the positive links. ## Using an external feature matrix with Enhlink. We saw in the previous tutorials, how to create an cell x peak matrix and how to directly use CellRanger to obtain an input matrix for Enhlink. Enhlink will iterate through the target regions of the file defined in the `-gtf` option, create a target vector for each of the target promoter defined in the GTF file, and use the surrounding features as explanatory variables. however, Enhlink can accept a second feature matrix that can be used to directly extract the target vectors. For example let's reprocess the example of the tutorial 4 and create a cell x feature matrix that will be used as a feature matrix instead. ```bash # 1rst Example from the tutorial 4 enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_4_1/ \ -threads 6 \ -targets 500_rand_genes.txt \ -format cellRanger \ -rmPeaksInTargets ``` ### Creation of a cell x feature matrix We will use `ATACMatUtils` from the [ATACDemultiplex](https://gitlab.com/Grouumf/ATACdemultiplex#atacmatutils-efficiently-construct-matrix-for-scatac-seq-datase) package (see tutorial 3) to create a cell x promoter matrix for this dataset: ```bash ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \ -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \ -ygi ./Homo_sapiens.GRCh38.99.TSS.2K.bed \ -use_symbol \ -norm_type count \ -out ./filtered_promoter_matrix \ -format cellRanger ``` Here, the `-use_norm count` defines the values to be simply the read count at each promoter (see [tutorial 3](tutorial_3_create_mat.md)). We can now use this second matrix as feature matrix with Enhlink: ```bash enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -mat2 ./filtered_promoter_matrix/matrix.mtx.gz \ -xgi2 ./filtered_promoter_matrix//barcodes.tsv.gz \ -ygi2 ./filtered_promoter_matrix/features.tsv.gz \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_1/ \ -threads 6 \ -targets 500_rand_genes.txt \ -format cellRanger \ -rmPeaksInTargets ``` In this command, the value of the second matrix doesn't matter since Enhlink dichotomize between non null (>=1) vs null (==0) values. To take into account the numerical values with Enhlink, we can use the `-isExpr` option which will dichotomize the values based on the mean of the non-null elements. For example let's try to create an promoter activity score matrix using the logRPM normalisation and use these scores as numerical values: ```bash ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \ -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \ -ygi ./Homo_sapiens.GRCh38.99.TSS.2K.bed \ -use_symbol \ -norm_type logrpm \ -out ./filtered_promoter_matrix_logrpm \ -format cellRanger ``` Now, let's process the values of this matrix a numerical values with the `-isExpr` ```bash enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -mat2 ./filtered_promoter_matrix_logrpm/matrix.mtx.gz \ -xgi2 ./filtered_promoter_matrix_logrpm/barcodes.tsv.gz \ -ygi2 ./filtered_promoter_matrix_logrpm/features.tsv.gz \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_2/ \ -threads 6 \ -isExpr \ -targets 500_rand_genes.txt \ -format cellRanger \ -rmPeaksInTargets ``` As a final note on using an additional matrix, since a gene can have multiple promoter region defined in the promoter file and Enhlink doesn't know which of these promoter regions refers to the value of a given gene defined in the feature matrix, Enhlink will indiscriminately assigns the same link for all the promoter region of the gene. ## Regularization and adjusting sensitivity / specificity In the second example of the tutorial 4, we saw that less linked were inferred from te ATAC+RNA matrices compared to the ATAC matrix alone (which was expected). We can try to increase these numbers by increasing the sensitivity of Enhlink at the expense of lowering the specificity. One way to do that is to increase the maximal number of explanatory features per models (`-max_features` which is 4 by default). We also need to increase the maximum depth of each tree (`-depth` which is 2 by default) since these two parameters are linked: at a depth d, there would be no more than 2^d features. ```bash enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \ -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \ -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_2/ \ -threads 6 \ -depth 3 \ -max_features 8 \ -isExpr \ -targets 500_rand_genes.txt \ -format cellRanger ``` ```bash # Number of lines in each output file: wc -l output_tutorial_4_2/*.bedpe # output 251 output_tutorial_4_2/1.bedpe 153 output_tutorial_4_2/2.bedpe 123 output_tutorial_4_2/3.bedpe 527 total wc -l output_tutorial_5_2/*.bedpe # output 345 output_tutorial_5_2/1.bedpe 258 output_tutorial_5_2/2.bedpe 167 output_tutorial_5_2/3.bedpe 770 total ``` Even though we doubled `-max_features`, the number of links captured didn't double since each link inferred by Enhlink has a p-value inferred from the neighborhood and this p-value needs to pass the cutoff threshold to be significant. We can also increase the forest size (`-n_boot` which is 100 per default) to be able to detect smaller associations at the expense of more computations: ```bash enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \ -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \ -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_3/ \ -n_boot 200 \ -threads 6 \ -depth 3 \ -max_features 8 \ -isExpr \ -targets 500_rand_genes.txt \ -format cellRanger $ wc -l output_tutorial_5_3/*.bedpe 510 output_tutorial_5_3/1.bedpe 428 output_tutorial_5_3/2.bedpe 296 output_tutorial_5_3/3.bedpe 1234 total ``` However, we need to insure that the number of background features is large enough to properly compute the p-values. The background features are defined as the set of neighbor features that goes into the model. We assume that the majority of them have do not have an association with the target region, which allows to infer a p-value using their score distribution. If this number is too low (defined by `-min_matsize` which is set as 100 by default), Enhlink will create artificial features from the existing ones. Thus, increasing `-min_matsize` will limit the amount of false positives at the expense of more computations. Another solution to increase the number of background features is to increase the neighborhood size (`-neighborhood` set as 2.5e5 as default) of each gene, defined in number of base pairs. however, the latter case will produce links between more distal regions. ```bash # More precise Enhlink analysis compared to the default parameters enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \ -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \ -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_4/ \ -n_boot 400 \ -min_matsize 200 \ -threads 6 \ -depth 3 \ -max_features 8 \ -isExpr \ -targets 500_rand_genes.txt \ -format cellRanger # This analysis took 186 seconds to complete (31 seconds with the defaults parameters) wc -l output_tutorial_5_4/*.bedpe 487 output_tutorial_5_4/1.bedpe 384 output_tutorial_5_4/2.bedpe 285 output_tutorial_5_4/3.bedpe 1156 total ``` ### Accuracy on the 2nd order analysis Enhlink can optionally perform a second order analysis ( `-secondOrder`) to identify the covariates-specific links. This process is controlled with the same parameters except `-secondOrderMaxFeat` replacing `-min_matsize` and set as 2 as default. ### Improving the speed of the method Processing 40,000 genes, for multiple cell types, for a large number of cells together with the 2nd order analysis can be quite a long process. The speed of Enhlink can be improved using the `-downsample` option, controlling the number of cells randomly sampled for each tree, and the `-maxFeatType` option controlling the maximum of features to be considered. By default, all the features and all the cells are used. However, we showed in the paper that as low as 1000 cells can produce good performances. `-maxFeatType` can take an integer value (number of features), a float (proportion of the total number of features), or can be set up as "log", "sqrt", or "all" (default). We recommand to not use less than 66% of the features, at least with the default forest size (100). Using a smaller subset of features can be usefull for very large forest size, similarly to the classic random forest approach which randomly picks log(nb. features) at each tree. ```bash # More precise Enhlink analysis with cell (1000 per tree) and feature (66% per tree) downsample. enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \ -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \ -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_5/ \ -n_boot 400 \ -min_matsize 200 \ -maxFeatType 0.66 \ -downsample 1000 \ -threads 6 \ -depth 3 \ -max_features 8 \ -isExpr \ -targets 500_rand_genes.txt \ -format cellRanger # This analysis took 103 seconds to complete wc -l output_tutorial_5_4/*.bedpe 487 output_tutorial_5_4/1.bedpe 384 output_tutorial_5_4/2.bedpe 285 output_tutorial_5_4/3.bedpe 1156 total ``` ### Conclusion Overall, `-max_features` in an intuitive parameter that controls Enhlink sensitivity / specificity and can be seen as the expected number of associations for a given target. `-n_boot` is less intuitive but can be seen as the number of trials Enhlink will perform. if enough trials are performed, Enhlink will have enough visibility to detect lower associations. However, due to the high-dimensionality of the data, a lot of features will present some degrees of associations without any biological implication. Thus, it is still important to keep some level of regulation and / or to merge multiple omics together. In that regards, `-min_matsize` can always be increased to insure enough background features but will slow down the process for a large number of targer regions. Also, the expected accuracy can be estimated using simulated data (See [tutorial 2](tutorial_2_simulated_variables.md)). Another regularisation mechanism is to randomly subsample the number of features / cells at each trial with `-downsample` and `-maxFeatType`. This will also increase the speed of the method. ## Distal interactions inference We saw that `-neighborhood` controls the neighborhood size in base pairs of each target region. If we put neighborhood to 0, Enhlink will include all the features, except those within the target region in the model. In this case, `-min_matsize` becomes useless since the number of features is large enough in theory. Also, it makes sense to increase the sensitivity of Enhlink by increasing `-max_features`, `-depth`, and `-n_boot`. This will also drastically increase the computational time of the model, so it may be worthwhile to only perform the distal analysis for a handful of genes. First, let's select 5 genes differentially expressed from cluster 1 using the Cell Ranger analysis: ```bash cat ./analysis/clustering/gex/kmeans_3_clusters/differential_expression.csv|sed 's/,/\t/g'|cut -f2,3,5|grep -v "Feature"|sort -n -k3|head -5|cut -f1 > top_5_genes_cluster1.txt ``` Now, let's perform a distal analysis on these 5 genes. We will increase `-n_boot` to 250 and `-max_features` to 8 to get more sensitivity and we will limit to the cluster 1 subsample. First let's isolate the barcodes from cluster 1 only. ```bash cat analysis/clustering/gex/kmeans_3_clusters/clusters.tsv|grep -E "(1$|Cluster$)" > cluster1_subset.tsv ``` Now, let's launch the analysis. Notice the `-xgi_subset` option. ```bash # Distal analysis enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -mat2 ./filtered_promoter_matrix/matrix.mtx.gz \ -xgi2 ./filtered_promoter_matrix//barcodes.tsv.gz \ -ygi2 ./filtered_promoter_matrix/features.tsv.gz \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_6/ \ -threads 6 \ -n_boot 250 \ -depth 3 \ -xgi_subset cluster1_subset.tsv \ -max_features 8 \ -neighborhood 0 \ -targets top_5_genes_cluster1.txt \ -format cellRanger \ -rmPeaksInTargets ``` The analysis took 411 seconds to finish with my computer and identified 55 links. Now, it would be interesting to find enhancers shared between multiple genes. So, let's see if we can find some: ```bash cat output_tutorial_5_6/1.bedpe|grep -v "#"|cut -f1,2,3,10|sort|uniq|cut -f1,2,3|sort|uniq -c|sort -k1|tail # The first grep remove the comment lines # the first cut identifies enhancer region with their geneID # The sort|uniq sort them and forces to have only one combination of enhancer / geneID # The second cut, extract the enhancer regions only # We then sort these regions one more time and count the number of occurences with uniq -c # We then sort by number of occurences and plot the tail of the sort, which corresponds to the highest occurences ## Output 1 chr7 98281064 98281964 1 chr8 100309714 100310630 1 chr8 140464272 140465151 2 chr1 24642630 24643558 2 chr19 2061075 2061954 2 chr19 42283565 42284564 2 chr19 45584194 45585165 2 chr19 56140532 56141465 2 chr19 680242 681195 3 chr5 179698324 179699220 ``` As we can see, we are already able to identify 7 enhancers shared by more than one gene of the top 5 signature genes of Cluster 1. Now, we have to be careful and examine these regions more in details to see if they are not experimental artifacts which could be widely accessibles and reflect for example the number of reads. These examinations can be done by visualizing the bigwig, the bedpe, and the peak files. We will cover that aspect in the next section. ### Discard overaccessible OCRs As described in the original publication, it may be necessary to remove overacessible OCRs from this type of analysis because thse regions may reflect some aspects of the experimental process such as batch ID or read depth. To do so, we can for example sort the features by mean and remove the top 5 percents. We can easily do it with Python and plot the histogram of the mean accessibility. ```python from scipy.sparse import load_npz import pandas as pd import numpy as np import pylab as plt import seaborn as sns npz = "./filtered_feature_atac_matrix/matrix.mtx.gz" ygi = "./filtered_feature_atac_matrix/features.bed" outfig = "./enh.mean.dist.pdf" out = "./peak.subset.mean.ygi" M = load_npz(npz) M = M.tocsr().T M.data = M.data.astype('bool') dfy = pd.read_csv(ygi, sep="\t", header=None) dfy.columns = ['chr', 'start', 'stop'] dfy['peak'] = dfy['chr'] + ":" + dfy['start'].astype('str') + "-" + dfy['stop'].astype('str') mean = np.asarray(M.mean(axis=0)).reshape(-1) meani = np.flip(np.argsort(mean)) fig, ax = plt.subplots(1, 1, figsize=(14, 8)) # Plot histogram to see the accessibility mean distribution sns.histplot( x=mean, element='step', common_norm=False, cumulative=False, log_scale=False, multiple='layer', stat='probability', palette="viridis", ax=ax, ) ax.grid(False) ax.set_facecolor("white") plt.show() fig.savefig(outfig) # Define the cutoff as 5% cutoff = int(float(len(meani)) * 0.05) # Remove the top 5 percents of most accessible features dfy.iloc[meani[cutoff:], [0,1,2]].to_csv(out, sep="\t", header=None, index=None) ``` We can now use `./peak.subset.mean.ygi` to ignore the top 5 percents of the most accessible features ```bash # Distal analysis enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -ygi_subset ./filtered_feature_atac_matrix/features.bed \ -mat2 ./filtered_promoter_matrix/matrix.mtx.gz \ -xgi2 ./filtered_promoter_matrix//barcodes.tsv.gz \ -ygi2 ./filtered_promoter_matrix/features.tsv.gz \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_7/ \ -threads 6 \ -n_boot 250 \ -depth 3 \ -xgi_subset cluster1_subset.tsv \ -max_features 8 \ -neighborhood 0 \ -targets top_5_genes_cluster1.txt \ -format cellRanger \ -rmPeaksInTargets ``` ## using Enhgrid to distribute computation, test the robustness of results, and improve speed As we saw in [tutorial 2](tutorial_2_simulated_variables.md) `enhgrid` allows to process arrays of hyperparameters. This is the same as launching `enhlink` for each combinations of parameters, however, `enhlink` would need to load the input matrices and indexes each time which can represent a tremendous amount of computational time. Also, `enhgrid` has the `-splitTargetList` option whicj allows to distribute the gene list on multiple processes and optimize the computational time. For example, if we have 12 CPUs, we could distribute the gene list on 3 processes with 4 threads each. This would be faster than running one process with 12 threads. Let's apply `enhgrid` to the example 1 of [tutorial 4](tutorial_4_multiomics.md): ```bash enhgrid -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_7/ \ -threads 3 \ -processes 2 \ -targets 500_rand_genes.txt \ -format cellRanger \ -splitTargetList \ -rmPeaksInTargets ``` We can also use enhgrid to see how robust are Enhlink results by excecuting multiple runs of the same analysis with the `-repetition` option. Let's apply this to the barcodes from cluster 1 ```bash enhgrid -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \ -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \ -ygi ./filtered_feature_atac_matrix/features.bed \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \ -out ./output_tutorial_5_8/ \ -threads 3 \ -xgi_subset cluster1_subset.tsv \ -repetition 6 \ -processes 3 \ -targets 500_rand_genes.txt \ -format cellRanger \ -splitTargetList \ -rmPeaksInTargets # The output folder contains the following files: ./output_tutorial_5_8 ├── rep0.1.bedpe ├── rep1.1.bedpe ├── rep2.1.bedpe ├── rep3.1.bedpe ├── rep4.1.bedpe ├── rep5.1.bedpe └── timestamp.log ``` Now, let see how much links from the first repetitions are retrieved in the other 5 repetitions: ```bash reference="./output_tutorial_5_8/rep0.1.bedpe" for bedpe in `ls ./output_tutorial_5_8/*.bedpe` do nbLinks=`enhtools -intersect3 -in ${bedpe} -in2 ${reference} -stdout|wc -l` echo "#### file: ${bedpe} #### number of links shared with rep0.1.bedpe: ${nbLinks}" done ### Output #### file: ./output_tutorial_5_8/rep0.1.bedpe #### number of links shared with rep0.1.bedpe: 522 #### file: ./output_tutorial_5_8/rep1.1.bedpe #### number of links shared with rep0.1.bedpe: 458 #### file: ./output_tutorial_5_8/rep2.1.bedpe #### number of links shared with rep0.1.bedpe: 436 #### file: ./output_tutorial_5_8/rep3.1.bedpe #### number of links shared with rep0.1.bedpe: 452 #### file: ./output_tutorial_5_8/rep4.1.bedpe #### number of links shared with rep0.1.bedpe: 452 #### file: ./output_tutorial_5_8/rep5.1.bedpe #### number of links shared with rep0.1.bedpe: 458 ``` Also not formal metrics, these numbers can be derived into similarity scores, such as the Rand score. ## Filter links based on TAD domains Topologically associated domains (TAD) define genomic regions whose DNA sequences preferentially contact each other. Thus, such regions can be provided to refine Enlink's results and exclude links which don't fall within the boundary of one of these regions. * Using `enhtools`, it is easy to filter a bedpe file with the `-filter` option by providing a reference BED file containing genomic regions: ```bash enhtools -filter -in ./output_tutorial_5_1/1.bedpe -bed TAD.bed -out ./output_tutorial_5_1/1.filtered.bedpe ``` ### Class imbalance with covariates To address class imbalance in covariate analysis (when the option `-covariates` is used), consider using the `-uniformSample` option in Enhlink. This option ensures a more uniform distribution of the covariates at each bootstrap by iteratively adding cells according to their distribution. For instance, if one cluster represents 80% of cells and the other two clusters only represent 10%, the `-uniformSample` option will be beneficial to avoid a fraction of bootstrap samples having too few samples of the underrepresented clusters. However, it is important to exercise caution when using this option, especially if one class is severely underrepresented. If a class represents less than `nbCellsTotal/nbCovariate`, the number of cells (`N`) sampled will be larger than the class size (`S`), resulting in a larger number of unique cells (`Nu`) being sampled than with a bootstrap procedure (66% on average). For instance, if `N` is twice the size of `S`, `Nu` will reach 90%, and 95% for three times the size of `S`, which could decrease the variability of the unique cells sampled.