Tutorial 5: Enhlink advanced functionalities

in this tutorial, we will describe some of the advanced functionalities of Enhlink that can be used to regularize the model and adjust the sensitivity / specificity, perform a distal analysis, and adjust computational speed.

Inferring positively/negatively correlated linkage only

Enhlink employs a modified information gain score to assess the relationship between enhancers and targets (promoters). The information gain scoring is designed to be symmetrical for both positively and negatively correlated vectors, enabling Enhlink to identify enhancers that negatively correlate with targets as well.

By default, enhlink ( and enhgrid) will infer both positively and negaively correlated links. Furthermore the sign of the scores from the results bedpe file indicates the sense of the correlation. In addition, one can use the -linkType option, which accepts the following values: {“all”, “positive”, “negative”}, to either filter out the negative or the positive links.

Using an external feature matrix with Enhlink.

We saw in the previous tutorials, how to create an cell x peak matrix and how to directly use CellRanger to obtain an input matrix for Enhlink. Enhlink will iterate through the target regions of the file defined in the -gtf option, create a target vector for each of the target promoter defined in the GTF file, and use the surrounding features as explanatory variables. however, Enhlink can accept a second feature matrix that can be used to directly extract the target vectors.

For example let’s reprocess the example of the tutorial 4 and create a cell x feature matrix that will be used as a feature matrix instead.

# 1rst Example from the tutorial 4

enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_4_1/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets

Creation of a cell x feature matrix

We will use ATACMatUtils from the ATACDemultiplex package (see tutorial 3) to create a cell x promoter matrix for this dataset:

ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
    -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi ./Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -use_symbol \
    -norm_type count \
    -out ./filtered_promoter_matrix \
    -format cellRanger

Here, the -use_norm count defines the values to be simply the read count at each promoter (see tutorial 3). We can now use this second matrix as feature matrix with Enhlink:

enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_promoter_matrix/matrix.mtx.gz \
    -xgi2 ./filtered_promoter_matrix//barcodes.tsv.gz \
    -ygi2 ./filtered_promoter_matrix/features.tsv.gz \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_1/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets

In this command, the value of the second matrix doesn’t matter since Enhlink dichotomize between non null (>=1) vs null (==0) values. To take into account the numerical values with Enhlink, we can use the -isExpr option which will dichotomize the values based on the mean of the non-null elements. For example let’s try to create an promoter activity score matrix using the logRPM normalisation and use these scores as numerical values:

ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
    -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi ./Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -use_symbol \
    -norm_type logrpm \
    -out ./filtered_promoter_matrix_logrpm \
    -format cellRanger

Now, let’s process the values of this matrix a numerical values with the -isExpr

enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_promoter_matrix_logrpm/matrix.mtx.gz \
    -xgi2 ./filtered_promoter_matrix_logrpm/barcodes.tsv.gz \
    -ygi2 ./filtered_promoter_matrix_logrpm/features.tsv.gz \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_2/ \
    -threads 6 \
    -isExpr \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets

As a final note on using an additional matrix, since a gene can have multiple promoter region defined in the promoter file and Enhlink doesn’t know which of these promoter regions refers to the value of a given gene defined in the feature matrix, Enhlink will indiscriminately assigns the same link for all the promoter region of the gene.

Regularization and adjusting sensitivity / specificity

In the second example of the tutorial 4, we saw that less linked were inferred from te ATAC+RNA matrices compared to the ATAC matrix alone (which was expected). We can try to increase these numbers by increasing the sensitivity of Enhlink at the expense of lowering the specificity. One way to do that is to increase the maximal number of explanatory features per models (-max_features which is 4 by default). We also need to increase the maximum depth of each tree (-depth which is 2 by default) since these two parameters are linked: at a depth d, there would be no more than 2^d features.

enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
    -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_2/ \
    -threads 6 \
    -depth 3 \
    -max_features 8 \
    -isExpr \
    -targets 500_rand_genes.txt \
    -format cellRanger

# Number of lines in each output file:

wc -l output_tutorial_4_2/*.bedpe
# output
  251 output_tutorial_4_2/1.bedpe
  153 output_tutorial_4_2/2.bedpe
  123 output_tutorial_4_2/3.bedpe
  527 total

wc -l output_tutorial_5_2/*.bedpe
# output
  345 output_tutorial_5_2/1.bedpe
  258 output_tutorial_5_2/2.bedpe
  167 output_tutorial_5_2/3.bedpe
  770 total

Even though we doubled -max_features, the number of links captured didn’t double since each link inferred by Enhlink has a p-value inferred from the neighborhood and this p-value needs to pass the cutoff threshold to be significant.

We can also increase the forest size (-n_boot which is 100 per default) to be able to detect smaller associations at the expense of more computations:

enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
    -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_3/ \
    -n_boot 200 \
    -threads 6 \
    -depth 3 \
    -max_features 8 \
    -isExpr \
    -targets 500_rand_genes.txt \
    -format cellRanger

$ wc -l output_tutorial_5_3/*.bedpe
   510 output_tutorial_5_3/1.bedpe
   428 output_tutorial_5_3/2.bedpe
   296 output_tutorial_5_3/3.bedpe
  1234 total

However, we need to insure that the number of background features is large enough to properly compute the p-values. The background features are defined as the set of neighbor features that goes into the model. We assume that the majority of them have do not have an association with the target region, which allows to infer a p-value using their score distribution. If this number is too low (defined by -min_matsize which is set as 100 by default), Enhlink will create artificial features from the existing ones. Thus, increasing -min_matsize will limit the amount of false positives at the expense of more computations. Another solution to increase the number of background features is to increase the neighborhood size (-neighborhood set as 2.5e5 as default) of each gene, defined in number of base pairs. however, the latter case will produce links between more distal regions.

# More precise Enhlink analysis compared to the default parameters

enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
    -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_4/ \
    -n_boot 400 \
    -min_matsize 200 \
    -threads 6 \
    -depth 3 \
    -max_features 8 \
    -isExpr \
    -targets 500_rand_genes.txt \
    -format cellRanger

# This analysis took 186 seconds to complete (31 seconds with the defaults parameters)

wc -l output_tutorial_5_4/*.bedpe
  487 output_tutorial_5_4/1.bedpe
  384 output_tutorial_5_4/2.bedpe
  285 output_tutorial_5_4/3.bedpe
 1156 total

Accuracy on the 2nd order analysis

Enhlink can optionally perform a second order analysis ( -secondOrder) to identify the covariates-specific links. This process is controlled with the same parameters except -secondOrderMaxFeat replacing -min_matsize and set as 2 as default.

Improving the speed of the method

Processing 40,000 genes, for multiple cell types, for a large number of cells together with the 2nd order analysis can be quite a long process. The speed of Enhlink can be improved using the -downsample option, controlling the number of cells randomly sampled for each tree, and the -maxFeatType option controlling the maximum of features to be considered. By default, all the features and all the cells are used. However, we showed in the paper that as low as 1000 cells can produce good performances. -maxFeatType can take an integer value (number of features), a float (proportion of the total number of features), or can be set up as “log”, “sqrt”, or “all” (default). We recommand to not use less than 66% of the features, at least with the default forest size (100). Using a smaller subset of features can be usefull for very large forest size, similarly to the classic random forest approach which randomly picks log(nb. features) at each tree.

# More precise Enhlink analysis with cell (1000 per tree) and feature (66% per tree) downsample.

enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
    -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_5/ \
    -n_boot 400 \
    -min_matsize 200 \
    -maxFeatType 0.66 \
    -downsample 1000 \
    -threads 6 \
    -depth 3 \
    -max_features 8 \
    -isExpr \
    -targets 500_rand_genes.txt \
    -format cellRanger


# This analysis took 103 seconds to complete

wc -l output_tutorial_5_4/*.bedpe
  487 output_tutorial_5_4/1.bedpe
  384 output_tutorial_5_4/2.bedpe
  285 output_tutorial_5_4/3.bedpe
 1156 total

Conclusion

Overall, -max_features in an intuitive parameter that controls Enhlink sensitivity / specificity and can be seen as the expected number of associations for a given target. -n_boot is less intuitive but can be seen as the number of trials Enhlink will perform. if enough trials are performed, Enhlink will have enough visibility to detect lower associations. However, due to the high-dimensionality of the data, a lot of features will present some degrees of associations without any biological implication. Thus, it is still important to keep some level of regulation and / or to merge multiple omics together. In that regards, -min_matsize can always be increased to insure enough background features but will slow down the process for a large number of targer regions. Also, the expected accuracy can be estimated using simulated data (See tutorial 2). Another regularisation mechanism is to randomly subsample the number of features / cells at each trial with -downsample and -maxFeatType. This will also increase the speed of the method.

Distal interactions inference

We saw that -neighborhood controls the neighborhood size in base pairs of each target region. If we put neighborhood to 0, Enhlink will include all the features, except those within the target region in the model. In this case, -min_matsize becomes useless since the number of features is large enough in theory. Also, it makes sense to increase the sensitivity of Enhlink by increasing -max_features, -depth, and -n_boot. This will also drastically increase the computational time of the model, so it may be worthwhile to only perform the distal analysis for a handful of genes.

First, let’s select 5 genes differentially expressed from cluster 1 using the Cell Ranger analysis:

cat ./analysis/clustering/gex/kmeans_3_clusters/differential_expression.csv|sed 's/,/\t/g'|cut -f2,3,5|grep -v "Feature"|sort -n -k3|head -5|cut -f1 > top_5_genes_cluster1.txt

Now, let’s perform a distal analysis on these 5 genes. We will increase -n_boot to 250 and -max_features to 8 to get more sensitivity and we will limit to the cluster 1 subsample.

First let’s isolate the barcodes from cluster 1 only.

cat analysis/clustering/gex/kmeans_3_clusters/clusters.tsv|grep -E "(1$|Cluster$)" > cluster1_subset.tsv

Now, let’s launch the analysis. Notice the -xgi_subset option.

# Distal analysis
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_promoter_matrix/matrix.mtx.gz \
    -xgi2 ./filtered_promoter_matrix//barcodes.tsv.gz \
    -ygi2 ./filtered_promoter_matrix/features.tsv.gz \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_6/ \
    -threads 6 \
    -n_boot 250 \
    -depth 3 \
    -xgi_subset cluster1_subset.tsv \
    -max_features 8 \
    -neighborhood 0 \
    -targets top_5_genes_cluster1.txt \
    -format cellRanger \
    -rmPeaksInTargets

The analysis took 411 seconds to finish with my computer and identified 55 links. Now, it would be interesting to find enhancers shared between multiple genes. So, let’s see if we can find some:

cat output_tutorial_5_6/1.bedpe|grep -v "#"|cut -f1,2,3,10|sort|uniq|cut -f1,2,3|sort|uniq -c|sort -k1|tail

# The first grep remove the comment lines
# the first cut identifies enhancer region with their geneID
# The sort|uniq sort them and forces to have only one combination of enhancer / geneID
# The second cut, extract the enhancer regions only
# We then sort these regions one more time and count the number of occurences with uniq -c
# We then sort by number of occurences and plot the tail of the sort, which corresponds to the highest occurences

## Output

1 chr7    98281064        98281964
1 chr8    100309714       100310630
1 chr8    140464272       140465151
2 chr1    24642630        24643558
2 chr19   2061075 2061954
2 chr19   42283565        42284564
2 chr19   45584194        45585165
2 chr19   56140532        56141465
2 chr19   680242  681195
3 chr5    179698324       179699220

As we can see, we are already able to identify 7 enhancers shared by more than one gene of the top 5 signature genes of Cluster 1. Now, we have to be careful and examine these regions more in details to see if they are not experimental artifacts which could be widely accessibles and reflect for example the number of reads. These examinations can be done by visualizing the bigwig, the bedpe, and the peak files. We will cover that aspect in the next section.

Discard overaccessible OCRs

As described in the original publication, it may be necessary to remove overacessible OCRs from this type of analysis because thse regions may reflect some aspects of the experimental process such as batch ID or read depth. To do so, we can for example sort the features by mean and remove the top 5 percents. We can easily do it with Python and plot the histogram of the mean accessibility.

from scipy.sparse import load_npz
import pandas as pd
import numpy as np
import pylab as plt
import seaborn as sns

npz = "./filtered_feature_atac_matrix/matrix.mtx.gz"
ygi = "./filtered_feature_atac_matrix/features.bed"

outfig = "./enh.mean.dist.pdf"
out = "./peak.subset.mean.ygi"


M = load_npz(npz)
M = M.tocsr().T
M.data = M.data.astype('bool')
dfy = pd.read_csv(ygi, sep="\t", header=None)
dfy.columns = ['chr', 'start', 'stop']
dfy['peak'] = dfy['chr'] + ":" + dfy['start'].astype('str') + "-" + dfy['stop'].astype('str')

mean = np.asarray(M.mean(axis=0)).reshape(-1)
meani = np.flip(np.argsort(mean))

fig, ax = plt.subplots(1, 1, figsize=(14, 8))

# Plot histogram to see the accessibility mean distribution
sns.histplot(
    x=mean,
    element='step',
    common_norm=False,
    cumulative=False,
    log_scale=False,
    multiple='layer',
    stat='probability',
    palette="viridis",
    ax=ax,
)

ax.grid(False)
ax.set_facecolor("white")
plt.show()

fig.savefig(outfig)

# Define the cutoff as 5%
cutoff = int(float(len(meani)) * 0.05)

# Remove the top 5 percents of most accessible features
dfy.iloc[meani[cutoff:], [0,1,2]].to_csv(out, sep="\t", header=None, index=None)

We can now use ./peak.subset.mean.ygi to ignore the top 5 percents of the most accessible features

# Distal analysis
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
	-ygi_subset ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_promoter_matrix/matrix.mtx.gz \
    -xgi2 ./filtered_promoter_matrix//barcodes.tsv.gz \
    -ygi2 ./filtered_promoter_matrix/features.tsv.gz \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_7/ \
    -threads 6 \
    -n_boot 250 \
    -depth 3 \
    -xgi_subset cluster1_subset.tsv \
    -max_features 8 \
    -neighborhood 0 \
    -targets top_5_genes_cluster1.txt \
    -format cellRanger \
    -rmPeaksInTargets

using Enhgrid to distribute computation, test the robustness of results, and improve speed

As we saw in tutorial 2 enhgrid allows to process arrays of hyperparameters. This is the same as launching enhlink for each combinations of parameters, however, enhlink would need to load the input matrices and indexes each time which can represent a tremendous amount of computational time. Also, enhgrid has the -splitTargetList option whicj allows to distribute the gene list on multiple processes and optimize the computational time. For example, if we have 12 CPUs, we could distribute the gene list on 3 processes with 4 threads each. This would be faster than running one process with 12 threads.

Let’s apply enhgrid to the example 1 of tutorial 4:

enhgrid -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_7/ \
    -threads 3 \
    -processes 2 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -splitTargetList \
    -rmPeaksInTargets

We can also use enhgrid to see how robust are Enhlink results by excecuting multiple runs of the same analysis with the -repetition option. Let’s apply this to the barcodes from cluster 1

enhgrid -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_5_8/ \
    -threads 3 \
    -xgi_subset cluster1_subset.tsv \
    -repetition 6 \
    -processes 3 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -splitTargetList \
    -rmPeaksInTargets

# The output folder contains the following files:

./output_tutorial_5_8
├── rep0.1.bedpe
├── rep1.1.bedpe
├── rep2.1.bedpe
├── rep3.1.bedpe
├── rep4.1.bedpe
├── rep5.1.bedpe
└── timestamp.log

Now, let see how much links from the first repetitions are retrieved in the other 5 repetitions:

reference="./output_tutorial_5_8/rep0.1.bedpe"

for bedpe in `ls ./output_tutorial_5_8/*.bedpe`
do
	nbLinks=`enhtools -intersect3 -in ${bedpe} -in2 ${reference} -stdout|wc -l`
    echo "#### file: ${bedpe} #### number of links shared with rep0.1.bedpe: ${nbLinks}"
done

### Output

#### file: ./output_tutorial_5_8/rep0.1.bedpe #### number of links shared with rep0.1.bedpe: 522
#### file: ./output_tutorial_5_8/rep1.1.bedpe #### number of links shared with rep0.1.bedpe: 458
#### file: ./output_tutorial_5_8/rep2.1.bedpe #### number of links shared with rep0.1.bedpe: 436
#### file: ./output_tutorial_5_8/rep3.1.bedpe #### number of links shared with rep0.1.bedpe: 452
#### file: ./output_tutorial_5_8/rep4.1.bedpe #### number of links shared with rep0.1.bedpe: 452
#### file: ./output_tutorial_5_8/rep5.1.bedpe #### number of links shared with rep0.1.bedpe: 458

Also not formal metrics, these numbers can be derived into similarity scores, such as the Rand score.

Filter links based on TAD domains

Topologically associated domains (TAD) define genomic regions whose DNA sequences preferentially contact each other. Thus, such regions can be provided to refine Enlink’s results and exclude links which don’t fall within the boundary of one of these regions.

Using enhtools, it is easy to filter a bedpe file with the -filter option by providing a reference BED file containing genomic regions:

enhtools -filter -in ./output_tutorial_5_1/1.bedpe -bed TAD.bed -out ./output_tutorial_5_1/1.filtered.bedpe

Class imbalance with covariates

To address class imbalance in covariate analysis (when the option -covariates is used), consider using the -uniformSample option in Enhlink. This option ensures a more uniform distribution of the covariates at each bootstrap by iteratively adding cells according to their distribution.

For instance, if one cluster represents 80% of cells and the other two clusters only represent 10%, the -uniformSample option will be beneficial to avoid a fraction of bootstrap samples having too few samples of the underrepresented clusters.

However, it is important to exercise caution when using this option, especially if one class is severely underrepresented. If a class represents less than nbCellsTotal/nbCovariate, the number of cells (N) sampled will be larger than the class size (S), resulting in a larger number of unique cells (Nu) being sampled than with a bootstrap procedure (66% on average). For instance, if N is twice the size of S, Nu will reach 90%, and 95% for three times the size of S, which could decrease the variability of the unique cells sampled.