# Tutorial 2: Use of simulated variables to estimate expected accuracy

In this tutorial, we will leverage the data from the previous tutorial and demonstrate how to generate simulated enhancers and promoters to estimate the expected precision for each target.

## create expected accuracy for each target region and each cluster

We will use the variable _-nb_sim_features_ to introduce a number of simulated enhancers and one simulated target region (i.e. simulated promoter) for each target region. Enhlink will then use the number of recovered within the simulated enhancers and promoter to estimate the expected accuracy of the method with regards to the parameters.

In the experiment below, we set _-nb_sim_features_ to 4 which is the  number of simulated enhancers generated per target region.

```bash
enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_2_1/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -nb_sim_features 4 \
    -rmPeaksInTargets
```

As a results of the command above, a new file: _simulated.tsv_ was generated in the output folder and contains the True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR), and the False Negative Rate (FNR) estimated from the simulated enhancers and promoter for each target promoter and for each cluster. The accuracy of Enhlink is correlated with the number of cells and the mean, which is the average accessibility for a given scATAC-seq promoter, of the target region. So, more accessible promoters and larger groups will produce more accurate results.

```bash
head ./output_tutorial_2_1/simulated.tsv

# output:

#SIMULATION results for -depth: 2 -n_boot: 100 -nb_sim_features: 4 -lambda1 :1.800000 -lambda2 :0.050000 -max_features: 4 -min_matsize: 100 -threshold: 0.050000 -downsample: 0 -maxFeatType: all
TPR     TNR     FPR     FNR     chr     start   stop    promoter        peakMean        nbCells group
0.8000  1.0000  0.0000  0.0000  chr6    33287247        33291247        wdr46   0.5188  3009    2
1.0000  1.0000  0.0000  0.0000  chr6    33287247        33291247        wdr46   0.5154  4697    1
0.5000  1.0000  0.0000  0.0000  chr6    33287247        33291247        wdr46   0.5154  2540    3

```

## Mechanics of the simulation

The simulation's mechanics involve computing expected accuracy scores based on two hyperparameters: _lambda1_ and _lambda2_. These parameters correspond to Poisson distributions that control the levels of similarity between a simulated enhancer and its corresponding simulated promoter. Specifically, _lambda1_ determines the amount of dropouts - which is the number of cells with a zero when the corresponding simulated promoter is accessible to a given simulated enhancer. Meanwhile, _lambda2_ controls the amount of false positives - or the number of cells with a one when the corresponding simulated promoter is not accessible.

To create a simulated promoter, the original promoter is shuffled. A simulated enhancer is then instantiated as a copy of the simulated promoter. Next, a random number is generated for each cell using two Poisson distributions. If the generated value is greater than 0, the corresponding cell value for the simulated enhancer is updated accordingly. Additional information can be found in the paper's method section.

Real enhancer-promoter links from snATAC-seq data were used to estimate _lambda1_ and _lambda2_ values in the paper. Usually, promoters are more accessible than enhancers, so we would expect _lambda1_ to be greater than _lambda2_. However, testing additional values can help to broaden the expected accuracy range.

## Computing a wider range of expected accuracy

To perform this task, we can use the _enhgrid_ software which allows to provide a range of values for various hyperparameters and performs an analysis for each combination of hyperparameters. Since this can represent a lot of computation, we will perform this analysis for only one gene from the previous analysis: _GLOD4_. We will try 3 values for _lambda1_, _lambda2_, and for _nb_sim_features_ for a total of 27 combinations and, since we have 3 clusters, we will get a value for each of them at each combination. We also use the _-onlySim_ option to only perform the simulation and not infer the links. Also, since the procedure is stochastic, we will perform 5 repetitions of each combination. Finally, the grid of analysis can optionally be parallelized using the _processes_ option. The effective number of CPUs required will then be _-processes_ x _threads_

To expand the range of expected accuracy, we can use the _enhgrid_ software allowing to provide a range of values for different hyperparameters, with the software analyzing each hyperparameter combination. However, given the computational resources required, we will apply this analysis to just one gene from the previous analysis, _GLOD4_, as a proof of concept. Specifically, we will test three values each for _lambda1_, _lambda2_, and _nb_sim_features_, yielding 27 combinations. With three clusters, we will obtain one value for each cluster at each combination. We will use the _-onlySim_ option to avoid inferring links and perform only simulations. Since the procedure is stochastic, we will repeat each combination five times. Additionally, we can optionally parallelize the analysis grid using the _processes_ option. This will require an effective number of CPUs equal to _-processes_ x _threads_.

```bash
enhgrid -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_2_2/ \
    -target GLOD4 \
    -threads 3 \
    -format cellRanger \
    -lambda1 0.04,0.08,1.2 \
    -lambda2 1.2,1.8,2.4 \
    -nb_sim_features 2,4,8 \
    -rmPeaksInTargets \
    -processes 3 \
    -repetition 5 \
    -onlySim
```

This analysis should be performed in less than a minute and should produce the following 5 * 27 output files:

```bash
ls ./output_tutorial_2_2/

# output
grid.nb0.rep0.simulated.tsv
grid.nb0.rep1.simulated.tsv
grid.nb0.rep2.simulated.tsv
grid.nb0.rep3.simulated.tsv
grid.nb0.rep4.simulated.tsv
grid.nb10.rep0.simulated.tsv
grid.nb10.rep1.simulated.tsv
grid.nb10.rep2.simulated.tsv
grid.nb10.rep3.simulated.tsv
grid.nb10.rep4.simulated.tsv
...

```

## Example of postprocessing

The post processing of these results depends of what one wants to do and can take many forms. Below, we provide an example of how to extract these values and create an informative plot using Python with the Pandas, and Seaborn libraries.

```python
import seaborn as sns
import pandas as pd
from glob import glob
import re
import pylab as plt

results = pd.DataFrame()

files = glob("./output_tutorial_2_2/*simulated.tsv")
regex = " -nb_sim_features: (?P<nbsim>.+) -lambda1 :(?P<lambda1>.+) -lambda2 :(?P<lambda2>.+) -max_features"

for fil in files:
	header = open(fil).readline()
	match = re.search(regex, header).groupdict()
	tsv = pd.read_csv(fil, sep="\t", comment="#")
	tsv['lambda1'] = match['lambda1']
	tsv['lambda2'] = match['lambda2']
	tsv['NbSim'] = match['nbsim']
	results = pd.concat([results, tsv])

results.fillna(0, inplace=True)
results['f1-score'] = results['TPR'] / (results['TPR'] + 0.5 * (results['FPR'] + results['FNR']))

results['lambda1 / lambda2'] = results['lambda1'].astype('str') + "/" + results['lambda2'].astype('str')

violin = sns.violinplot(data=results, x='group', y='f1-score', hue='lambda1 / lambda2')
violin.set_title('GLOD4 expected accuracy', fontsize=14)
violin.set_xlabel('Cluster ID', fontsize=14)
violin.set_ylabel('F1-score', fontsize=14)
sns.move_legend(violin, "upper left", bbox_to_anchor=(1, 1))
plt.tight_layout()
plt.show()
```


![](img/GLOD4_exp_acc.png)

The results of this analysis is plotted below. As we can see, the F1-score is overall very good (i.e. the mean is close to 1.0) for most of the lambdas values, expected the most extreme. These results were expected since GLOD4 has overall a high accessibility ratio amongst cells (38% of cells have on average its promoter accessible). Promoters with low accessibilities will produce greater variance and less accurate links.

## Conclusion

Several factors influence the accuracy of link-inference estimation for a given target region and enhancer features, including the target accessibility ratio, the number of surrounding and linked enhancers, the number of cells, and the intrinsic correlations within the promoter and the enhancers, modeled by the lambdas values. Based on experimental measurements and real snATAC-seq data, we estimated the lambdas values for several promoters and enhancers, and speculate that other links may exhibit similar patterns. Except for extreme values of lambdas, the inferred score remains fairly robust to fluctuations in the lambdas values.