# Tutorial 2: Use of simulated variables to estimate expected accuracy In this tutorial, we will leverage the data from the previous tutorial and demonstrate how to generate simulated enhancers and promoters to estimate the expected precision for each target. ## create expected accuracy for each target region and each cluster We will use the variable _-nb_sim_features_ to introduce a number of simulated enhancers and one simulated target region (i.e. simulated promoter) for each target region. Enhlink will then use the number of recovered within the simulated enhancers and promoter to estimate the expected accuracy of the method with regards to the parameters. In the experiment below, we set _-nb_sim_features_ to 4 which is the number of simulated enhancers generated per target region. ```bash enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \ -xgi ./filtered_peak_bc_matrix/barcodes.tsv \ -ygi ./filtered_peak_bc_matrix/peaks.bed \ -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -out ./output_tutorial_2_1/ \ -threads 6 \ -targets 500_rand_genes.txt \ -format cellRanger \ -nb_sim_features 4 \ -rmPeaksInTargets ``` As a results of the command above, a new file: _simulated.tsv_ was generated in the output folder and contains the True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR), and the False Negative Rate (FNR) estimated from the simulated enhancers and promoter for each target promoter and for each cluster. The accuracy of Enhlink is correlated with the number of cells and the mean, which is the average accessibility for a given scATAC-seq promoter, of the target region. So, more accessible promoters and larger groups will produce more accurate results. ```bash head ./output_tutorial_2_1/simulated.tsv # output: #SIMULATION results for -depth: 2 -n_boot: 100 -nb_sim_features: 4 -lambda1 :1.800000 -lambda2 :0.050000 -max_features: 4 -min_matsize: 100 -threshold: 0.050000 -downsample: 0 -maxFeatType: all TPR TNR FPR FNR chr start stop promoter peakMean nbCells group 0.8000 1.0000 0.0000 0.0000 chr6 33287247 33291247 wdr46 0.5188 3009 2 1.0000 1.0000 0.0000 0.0000 chr6 33287247 33291247 wdr46 0.5154 4697 1 0.5000 1.0000 0.0000 0.0000 chr6 33287247 33291247 wdr46 0.5154 2540 3 ``` ## Mechanics of the simulation The simulation's mechanics involve computing expected accuracy scores based on two hyperparameters: _lambda1_ and _lambda2_. These parameters correspond to Poisson distributions that control the levels of similarity between a simulated enhancer and its corresponding simulated promoter. Specifically, _lambda1_ determines the amount of dropouts - which is the number of cells with a zero when the corresponding simulated promoter is accessible to a given simulated enhancer. Meanwhile, _lambda2_ controls the amount of false positives - or the number of cells with a one when the corresponding simulated promoter is not accessible. To create a simulated promoter, the original promoter is shuffled. A simulated enhancer is then instantiated as a copy of the simulated promoter. Next, a random number is generated for each cell using two Poisson distributions. If the generated value is greater than 0, the corresponding cell value for the simulated enhancer is updated accordingly. Additional information can be found in the paper's method section. Real enhancer-promoter links from snATAC-seq data were used to estimate _lambda1_ and _lambda2_ values in the paper. Usually, promoters are more accessible than enhancers, so we would expect _lambda1_ to be greater than _lambda2_. However, testing additional values can help to broaden the expected accuracy range. ## Computing a wider range of expected accuracy To perform this task, we can use the _enhgrid_ software which allows to provide a range of values for various hyperparameters and performs an analysis for each combination of hyperparameters. Since this can represent a lot of computation, we will perform this analysis for only one gene from the previous analysis: _GLOD4_. We will try 3 values for _lambda1_, _lambda2_, and for _nb_sim_features_ for a total of 27 combinations and, since we have 3 clusters, we will get a value for each of them at each combination. We also use the _-onlySim_ option to only perform the simulation and not infer the links. Also, since the procedure is stochastic, we will perform 5 repetitions of each combination. Finally, the grid of analysis can optionally be parallelized using the _processes_ option. The effective number of CPUs required will then be _-processes_ x _threads_ To expand the range of expected accuracy, we can use the _enhgrid_ software allowing to provide a range of values for different hyperparameters, with the software analyzing each hyperparameter combination. However, given the computational resources required, we will apply this analysis to just one gene from the previous analysis, _GLOD4_, as a proof of concept. Specifically, we will test three values each for _lambda1_, _lambda2_, and _nb_sim_features_, yielding 27 combinations. With three clusters, we will obtain one value for each cluster at each combination. We will use the _-onlySim_ option to avoid inferring links and perform only simulations. Since the procedure is stochastic, we will repeat each combination five times. Additionally, we can optionally parallelize the analysis grid using the _processes_ option. This will require an effective number of CPUs equal to _-processes_ x _threads_. ```bash enhgrid -mat ./filtered_peak_bc_matrix/matrix.mtx \ -xgi ./filtered_peak_bc_matrix/barcodes.tsv \ -ygi ./filtered_peak_bc_matrix/peaks.bed \ -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -out ./output_tutorial_2_2/ \ -target GLOD4 \ -threads 3 \ -format cellRanger \ -lambda1 0.04,0.08,1.2 \ -lambda2 1.2,1.8,2.4 \ -nb_sim_features 2,4,8 \ -rmPeaksInTargets \ -processes 3 \ -repetition 5 \ -onlySim ``` This analysis should be performed in less than a minute and should produce the following 5 * 27 output files: ```bash ls ./output_tutorial_2_2/ # output grid.nb0.rep0.simulated.tsv grid.nb0.rep1.simulated.tsv grid.nb0.rep2.simulated.tsv grid.nb0.rep3.simulated.tsv grid.nb0.rep4.simulated.tsv grid.nb10.rep0.simulated.tsv grid.nb10.rep1.simulated.tsv grid.nb10.rep2.simulated.tsv grid.nb10.rep3.simulated.tsv grid.nb10.rep4.simulated.tsv ... ``` ## Example of postprocessing The post processing of these results depends of what one wants to do and can take many forms. Below, we provide an example of how to extract these values and create an informative plot using Python with the Pandas, and Seaborn libraries. ```python import seaborn as sns import pandas as pd from glob import glob import re import pylab as plt results = pd.DataFrame() files = glob("./output_tutorial_2_2/*simulated.tsv") regex = " -nb_sim_features: (?P.+) -lambda1 :(?P.+) -lambda2 :(?P.+) -max_features" for fil in files: header = open(fil).readline() match = re.search(regex, header).groupdict() tsv = pd.read_csv(fil, sep="\t", comment="#") tsv['lambda1'] = match['lambda1'] tsv['lambda2'] = match['lambda2'] tsv['NbSim'] = match['nbsim'] results = pd.concat([results, tsv]) results.fillna(0, inplace=True) results['f1-score'] = results['TPR'] / (results['TPR'] + 0.5 * (results['FPR'] + results['FNR'])) results['lambda1 / lambda2'] = results['lambda1'].astype('str') + "/" + results['lambda2'].astype('str') violin = sns.violinplot(data=results, x='group', y='f1-score', hue='lambda1 / lambda2') violin.set_title('GLOD4 expected accuracy', fontsize=14) violin.set_xlabel('Cluster ID', fontsize=14) violin.set_ylabel('F1-score', fontsize=14) sns.move_legend(violin, "upper left", bbox_to_anchor=(1, 1)) plt.tight_layout() plt.show() ``` ![](img/GLOD4_exp_acc.png) The results of this analysis is plotted below. As we can see, the F1-score is overall very good (i.e. the mean is close to 1.0) for most of the lambdas values, expected the most extreme. These results were expected since GLOD4 has overall a high accessibility ratio amongst cells (38% of cells have on average its promoter accessible). Promoters with low accessibilities will produce greater variance and less accurate links. ## Conclusion Several factors influence the accuracy of link-inference estimation for a given target region and enhancer features, including the target accessibility ratio, the number of surrounding and linked enhancers, the number of cells, and the intrinsic correlations within the promoter and the enhancers, modeled by the lambdas values. Based on experimental measurements and real snATAC-seq data, we estimated the lambdas values for several promoters and enhancers, and speculate that other links may exhibit similar patterns. Except for extreme values of lambdas, the inferred score remains fairly robust to fluctuations in the lambdas values.