# Tutorial 6: Generating simulated matrices for method comparison ## Overview In this tutorial, we will generate a simulated matrix that compares different methods for identifying target regions and linked enhancers. We will begin by creating a cells x peaks matrix using an existing input matrix for background data. We will then simulate additional data for the target regions and linked enhancers that will be included alongside the background data. The simulated matrix can then be tested with Enhlink to try to retrieve the enhancers of the simulated regions. ## Steps ### loading the matrix First, we need to load an existing cells x peaks sparse matrix with the Python language. The two required files are the sparse matrix `npz` python file and the columns index, defining the peak regions as a bed file. ```python from scipy.sparse import load_npz import pandas as pd npz = "tmp.repl1_CUBE_ISLET_V1.delta.v2_peak.npz" mat = load_npz(npz) ygi = "peaks.index.bed" dfy = pd.read_csv(ygi, sep="\t", header=None) dfy.columns = ['chr', 'start', 'stop'] ``` ### intersecting the peak index with regerence GTF file Secondly, we need a promoter GTF file defining the promoter regions of each gene. We then need to find the peaks intersecting these promoter regions and annotate them with the corresponding gene symbols. We can do this using `ATACAnnotateRegions` from the `ATACdemultiplex` package. It can be installed with the following command: ```bash go install gitlab.com/Grouumf/ATACdemultiplex/ATACAnnotateRegions@latest ``` Now, let's create a file containing all the peaks from the index file interesting the GTF promoter file: ```bash YGI="peaks.index.bed" PROMOTER="promoters.gtf" OUT="intersecting_peaks.promoter.bed" ATACAnnotateRegions -bed ${YGI} -ref ${PROMOTER} -ignore -stdout|sort|uniq > -out ${OUT} head ${OUT} ## output of the first lines: chr1 10006926 10007925 Ppp1r42 chr1 10008609 10009608 Ppp1r42 chr1 10011702 10012701 Ppp1r42 chr1 10013940 10014939 Ppp1r42 chr1 10037419 10038418 Cops5 chr1 10037419 10038418 Cspp1 ``` We can now load this file in python and define the set of peaks that ```python ygi_inter = "/home/poirio/data2/ENHLINK/SIMMAT/intersecting_peaks.promoter.bed" dfyi = pd.read_csv(ygi_inter, sep="\t", header=None) dfyi.columns = ['chr', 'start', 'stop', 'geneID'] dfyi['peak'] = dfyi['chr'] + "_" + dfyi['start'].astype('str') + "_" + dfyi['stop'].astype('str') dfy_series = dfy['chr'] + "_" + dfy['start'].astype('str') + "_" + dfy['stop'].astype('str') dfy_index = {a:i for i,a in enumerate(dfy_series)} gene_index = {} used_peaks = set() for i in dfyi.index: peak_str = dfyi.loc[i, 'peak'] peak_id = dfy_index[peak_str] if peak_str in used_peaks: continue gene_index[dfyi.loc[i, 'geneID']] = peak_id used_peaks.add(peak_str) ``` ### simulation function Now, we need to define our simulation function that a) shuffle a promoter b) copy it and introduces noise for each simulated enhancer associated with it. It is controlled by two hyperparameters defining the ratios of false positives and false negatives in the simulated enhancers. ```python import numpy as np def simulate_ygi_enh(ygi, nb_enh, noise1=1.8, noise2=0.05): ygir = ygi.copy() np.random.shuffle(ygir) enhr_array = [] for _ in range (nb_enh): enhr = ygir.copy() for i, v in enumerate(enhr): p1 = np.random.poisson(lam=noise1) p2 = np.random.poisson(lam=noise2) if p1 > 0 and v > 0: enhr[i] = 0 elif p2 > 0 and v == 0: enhr[i] = 1 enhr_array.append(enhr) return ygir, np.asarray(enhr_array) ``` ### looping through genes and create simulated promoter and enhancers We can select random genes from the list and create simulated enhancers for each selected gene. Then, we add these enhancers and promoters to the original sparse matrix as output, along with the column index of the new matrix, the list of simulated enhancers and their associated simulated promoter reference (bedpe), and the list of simulated promoters with their genomic coordinates ```bash from scipy.sparse import hstack from scipy.sparse import csr_matrix from sys import stdout import numpy as np from scipy.sparse import save_npz ### Output files ### fout = "simulated_matrix.npz" fout_promoter = "simulated_matrix.promoter.tsv" fout_feat = "simulated_matrix.enhancer.bedpe" fout_ygi = "simulated_matrix.index.bed" # Number of random genes to draw nb_sim_genes = 400 # Number of simulated enhancers to draw for each genes (between 2 and 7) range_nb_sim_features = (2, 8) # Initialisation of object sim_promoters = pd.DataFrame() sim_features = pd.DataFrame() dfy2 = dfy.copy() current_index = dfy.shape[1] rand_genes = np.random.choice(list(gene_index.keys()), size=nb_sim_genes) rand_ids = np.array([gene_index[a] for a in rand_genes]) print("loading matrices...") mat = mat.tocsr() mat2 = mat.tolil() print("Creating simulated variables...") feat_array = None used_ygis = set() ``` looping for each gene selected, simulate a promoter for a given peak within a promoter region, replace the value the promoter with the simulated promoter, and simulate enhancers linked to that promoter. Each peaks and promoter can only be used once. ```python for i, rand_id in enumerate(rand_ids): stdout.write("\r#### feat nb. {0}".format(i)) stdout.flush() nb_sim = np.random.randint(range_nb_sim_features[0], range_nb_sim_features[1]) vect_y = np.asarray(mat[:, rand_id].todense()).T[0] chosen_prom = dfy.loc[rand_id].copy() chosen_prom["geneID"] = i sim_promoters = sim_promoters.append(chosen_prom) vect_y, array_r = simulate_ygi_enh(vect_y, nb_sim, noise1=1.25, noise2=0.15) mat2[:, rand_id] = vect_y if feat_array is None: feat_array = array_r else: feat_array = np.vstack([feat_array, array_r]) for j in range(nb_sim): sim_y = chosen_prom.copy() sim_y['start'] += 10000 + j sim_y['stop'] = sim_y['start'] + 10 + j while tuple(sim_y[['chr', 'start', 'stop']]) in used_ygis: sim_y['start'] += 1 used_ygis.add(tuple(sim_y[['chr', 'start', 'stop']])) dfy2 = dfy2.append(sim_y[['chr', 'start', 'stop']]) sim_y.index = ['chr2', 'start2', 'stop2', 'geneID'] sim_features = sim_features.append(pd.concat([chosen_prom, sim_y])) ``` ### Saving the matrix and indexes Finally, we stack the simulated promoters and enhancers with the original matrices, compute the accessibility mean of each simulated promoter, and save the indexes ```python feat_array = csr_matrix(feat_array) mat2 = hstack([mat2, feat_array.T]) prom_mean = mat[:, rand_ids].mean(axis=0) enh_mean = mat2.tocsr()[:, -feat_array.shape[0]:].mean(axis=0) print("\nPromoters mean: {0} std:{1}".format(prom_mean.mean(), prom_mean.std())) print("Simulated enhancers mean: {0} std:{1}".format(enh_mean.mean(), enh_mean.std())) for key in ['start', 'stop']: sim_promoters[key] = sim_promoters[key].astype('int') dfy2[key] = dfy2[key].astype('int') for key in ['start', 'stop', 'start2', 'stop2', 'geneID']: sim_features[key] = sim_features[key].astype('int') sim_features['geneID'] = 'simGeneID' + sim_features['geneID'].astype('str') sim_promoters['geneID'] = 'simGeneID' + sim_promoters['geneID'].astype('int').astype('str') print("\n\nsaving files...") sim_promoters.to_csv(fout_promoter, sep="\t", header=None, index=None) print("file written: {0}".format(fout_promoter)) sim_features = sim_features.loc[:, ['chr', 'start', 'stop', 'chr2', 'start2', 'stop2', 'geneID']] sim_features.to_csv(fout_feat, sep="\t", index=None) print("file written: {0}".format(fout_feat)) dfy2.to_csv(fout_ygi, sep="\t", header=None, index=None) print("file written: {0}".format(fout_ygi)) print("saving mat...") save_npz(fout, mat2) print("file written: {0}".format(fout)) ``` ## Using Enhlink to process the simulated matrix To process the simulated matrix with Enhlink, we have to first convert the matrix to a COO (COOrdinate formate) matrix, since Enhlink cannot read a numpy sparse matrix. A COO matrix is a three-columns file indicating the x and y coordinates and the values of each entry of the sparse matrix. To do so, we can use the following script: [npz_to_COO_mat](https://gitlab.com/Grouumf/enhlinktools/-/blob/master/scripts/npz_to_COO_mat) ```bash python ./npz_to_COO_mat --npz simulated_matrix.npz --out simulated_matrix.coo.gz ``` We can then call Enhlink to infer enhancers on the specific simulated promoter regions: ```bash COO="simulated_matrix.coo.gz" XGI="simulated_matrix.xgi" # these are the row indexes YGI="simulated_matrix.index.bed" OUT="./simulated_matrix_enhlink/" PROMOTER="simulated_matrix.promoter.tsv" mkdir -p ${OUT} enhlink \ -mat ${COO} \ -xgi ${XGI} \ -ygi ${YGI} \ -out ${OUT} \ -rmPeaksInTargets \ -gtf ${PROMOTER} \ -threads 8 ```