Tutorial 2: Use of simulated variables to estimate expected accuracy

In this tutorial, we will reuse the data from the previous tutorial and see how to introduce simulated enhancers together with simulated promoters in order to estimate the expected accuracy for each promoter.

Estimating the expected accuracy for each target region and each cluster

We will use the -nb_sim_features option to introduce a number of simulated enhancers and one simulated target region (i.e. a simulated promoter) for each target region. Enhlink will then use the number of simulated enhancers and promoters it recovers to estimate the expected accuracy of the method for the chosen parameters.

In the experiment below, we set -nb_sim_features to 4, which is the number of simulated enhancers generated per target region.

enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_2_1/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -nb_sim_features 4 \
    -rmPeaksInTargets

As a result of the command above, a new file, simulated.tsv, is generated in the output folder. It contains the True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR), and False Negative Rate (FNR) estimated from the simulated enhancers and promoter for each target promoter and each cluster. The accuracy of Enhlink is correlated with the number of cells and with the mean accessibility of the target region (i.e. the average scATAC-seq accessibility of the promoter across cells). So, more accessible promoters and larger groups will produce more accurate results.

head ./output_tutorial_2_1/simulated.tsv

# output:

#SIMULATION results for -depth: 2 -n_boot: 100 -nb_sim_features: 4 -lambda1 :1.800000 -lambda2 :0.050000 -max_features: 4 -min_matsize: 100 -threshold: 0.050000 -downsample: 0 -maxFeatType: all
TPR     TNR     FPR     FNR     chr     start   stop    promoter        peakMean        nbCells group
0.8000  1.0000  0.0000  0.0000  chr6    33287247        33291247        wdr46   0.5188  3009    2
1.0000  1.0000  0.0000  0.0000  chr6    33287247        33291247        wdr46   0.5154  4697    1
0.5000  1.0000  0.0000  0.0000  chr6    33287247        33291247        wdr46   0.5154  2540    3
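
To relate these estimates to the group size and promoter accessibility mentioned above, the table can be loaded directly with pandas. The snippet below is a minimal sketch assuming the column names shown in the output above (TPR, TNR, FPR, FNR, peakMean, nbCells, group).

import pandas as pd

# Load the table; the commented header line (starting with "#") is skipped
sim = pd.read_csv("./output_tutorial_2_1/simulated.tsv", sep="\t", comment="#")

# Mean expected accuracy per cluster
print(sim.groupby("group")[["TPR", "TNR", "FPR", "FNR"]].mean())

# How the TPR relates to the number of cells and the promoter accessibility
print(sim[["TPR", "nbCells", "peakMean"]].corr())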

Mechanics of the simulation

The simulation's mechanics involve computing expected accuracy scores based on two hyperparameters: lambda1 and lambda2. These parameters correspond to Poisson distributions that control the level of similarity between a simulated enhancer and its corresponding simulated promoter. Specifically, lambda1 determines the amount of dropouts, i.e. the number of cells where a given simulated enhancer has a zero while the corresponding simulated promoter is accessible. Meanwhile, lambda2 controls the amount of false positives, i.e. the number of cells where the simulated enhancer has a one while the corresponding simulated promoter is not accessible.

To create a simulated promoter, the original promoter is shuffled. A simulated enhancer is then instantiated as a copy of the simulated promoter. Next, a random number is generated for each cell using two Poisson distributions. If the generated value is greater than 0, the corresponding cell value for the simulated enhancer is updated accordingly. Additional information can be found in the paper’s method section.
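
For illustration, here is a minimal NumPy sketch of the procedure as described above, assuming a binary accessibility vector for a single promoter. The function and variable names are hypothetical and not Enhlink's internal code; the exact implementation is described in the paper's method section.

import numpy as np

rng = np.random.default_rng(42)

def simulate_pair(promoter, lambda1=1.8, lambda2=0.05):
    """Return (simulated_promoter, simulated_enhancer) from a binary promoter vector."""
    # The simulated promoter is a shuffled copy of the original promoter
    sim_promoter = rng.permutation(promoter)
    # The simulated enhancer starts as an exact copy of the simulated promoter
    sim_enhancer = sim_promoter.copy()
    # Dropouts: where the simulated promoter is accessible (1), a Poisson(lambda1)
    # draw greater than 0 sets the enhancer value to 0
    dropouts = (rng.poisson(lambda1, size=sim_promoter.size) > 0) & (sim_promoter == 1)
    sim_enhancer[dropouts] = 0
    # False positives: where the simulated promoter is not accessible (0), a
    # Poisson(lambda2) draw greater than 0 sets the enhancer value to 1
    false_pos = (rng.poisson(lambda2, size=sim_promoter.size) > 0) & (sim_promoter == 0)
    sim_enhancer[false_pos] = 1
    return sim_promoter, sim_enhancer

# Example: a promoter accessible in roughly 50% of 1000 cells
promoter = rng.binomial(1, 0.5, size=1000)
sim_prom, sim_enh = simulate_pair(promoter)
print("agreement between simulated promoter and enhancer:", (sim_prom == sim_enh).mean())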

Real enhancer-promoter links from snATAC-seq data were used to estimate lambda1 and lambda2 values in the paper. Usually, promoters are more accessible than enhancers, so we would expect lambda1 to be greater than lambda2. However, testing additional values can help to broaden the expected accuracy range.

Computing a wider range of expected accuracy

To expand the range of expected accuracy, we can use the enhgrid software, which accepts a range of values for different hyperparameters and runs an analysis for each hyperparameter combination. However, given the computational resources required, we will apply this analysis to just one gene from the previous analysis, GLOD4, as a proof of concept. Specifically, we will test three values each for lambda1, lambda2, and nb_sim_features, yielding 27 combinations. With three clusters, we will obtain one value for each cluster at each combination. We will use the -onlySim option to perform only the simulations without inferring links. Since the procedure is stochastic, we will repeat each combination five times. Additionally, the analysis grid can optionally be parallelized using the -processes option; the effective number of CPUs required will then be -processes x -threads.

enhgrid -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_2_2/ \
    -target GLOD4 \
    -threads 3 \
    -format cellRanger \
    -lambda1 0.04,0.08,1.2 \
    -lambda2 1.2,1.8,2.4 \
    -nb_sim_features 2,4,8 \
    -rmPeaksInTargets \
    -processes 3 \
    -repetition 5 \
    -onlySim

This analysis should complete in less than a minute and should produce the following 5 x 27 = 135 output files:

ls ./output_tutorial_2_2/

# output
grid.nb0.rep0.simulated.tsv
grid.nb0.rep1.simulated.tsv
grid.nb0.rep2.simulated.tsv
grid.nb0.rep3.simulated.tsv
grid.nb0.rep4.simulated.tsv
grid.nb10.rep0.simulated.tsv
grid.nb10.rep1.simulated.tsv
grid.nb10.rep2.simulated.tsv
grid.nb10.rep3.simulated.tsv
grid.nb10.rep4.simulated.tsv
...

Example of postprocessing

The post-processing of these results depends on what one wants to do and can take many forms. Below, we provide a simple example of how to extract these values and create an informative plot using Python with the Pandas and Seaborn libraries.

import re
from glob import glob

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Collect the simulation results of every grid combination and repetition
files = glob("./output_tutorial_2_2/*simulated.tsv")

# Extract the hyperparameter values from the commented header of each file
regex = " -nb_sim_features: (?P<nbsim>.+) -lambda1 :(?P<lambda1>.+) -lambda2 :(?P<lambda2>.+) -max_features"

frames = []

for fil in files:
    header = open(fil).readline()
    match = re.search(regex, header).groupdict()
    tsv = pd.read_csv(fil, sep="\t", comment="#")
    tsv['lambda1'] = match['lambda1']
    tsv['lambda2'] = match['lambda2']
    tsv['NbSim'] = match['nbsim']
    frames.append(tsv)

results = pd.concat(frames, ignore_index=True)
results.fillna(0, inplace=True)

# F1-score computed from the rates: TPR / (TPR + 0.5 * (FPR + FNR))
results['f1-score'] = results['TPR'] / (results['TPR'] + 0.5 * (results['FPR'] + results['FNR']))

# Combine the two lambdas into a single label used as the plot's hue
results['lambda1 / lambda2'] = results['lambda1'].astype('str') + "/" + results['lambda2'].astype('str')

# One violin per cluster (group) and per lambda combination
violin = sns.violinplot(data=results, x='group', y='f1-score', hue='lambda1 / lambda2')
violin.set_title('GLOD4 expected accuracy', fontsize=14)
violin.set_xlabel('Cluster ID', fontsize=14)
violin.set_ylabel('F1-score', fontsize=14)
sns.move_legend(violin, "upper left", bbox_to_anchor=(1, 1))
plt.tight_layout()
plt.show()

The results of this analysis are plotted below. As we can see, the F1-score is overall very good (i.e. the mean is close to 1.0) for most of the lambda values, except for the most extreme ones. These results were expected, since GLOD4 has an overall high accessibility ratio among cells (its promoter is accessible in 38% of cells on average). Promoters with low accessibility will produce greater variance and less accurate links.

Conclusion

Several factors influence the accuracy of link-inference estimation for a given promoter and its enhancers, including the promoter accessibility ratio, the number of surrounding and linked enhancers, the number of cells, and the intrinsic correlations within the promoter and the enhancers, modeled by the lambda values. Based on experimental measurements and real snATAC-seq data, we estimated the lambda values for several promoters and enhancers, and speculate that other links may exhibit similar patterns. Except for extreme lambda values, the inferred score remains fairly robust to fluctuations in these parameters.