Tutorial 6: Generating simulated matrices for method comparison
Overview
In this tutorial, we will generate a simulated matrix that compares different methods for identifying target regions and linked enhancers. We will begin by creating a cells x peaks matrix using an existing input matrix for background data. We will then simulate additional data for the target regions and linked enhancers that will be included alongside the background data. The simulated matrix can then be tested with Enhlink to try to retrieve the enhancers of the simulated regions.
Steps
loading the matrix
First, we need to load an existing cells x peaks sparse matrix with the Python language.
The two required files are the sparse matrix npz
python file and the columns index, defining the peak regions as a bed file.
from scipy.sparse import load_npz
import pandas as pd
npz = "tmp.repl1_CUBE_ISLET_V1.delta.v2_peak.npz"
mat = load_npz(npz)
ygi = "peaks.index.bed"
dfy = pd.read_csv(ygi, sep="\t", header=None)
dfy.columns = ['chr', 'start', 'stop']
intersecting the peak index with regerence GTF file
Secondly, we need a promoter GTF file defining the promoter regions of each gene.
We then need to find the peaks intersecting these promoter regions and annotate them with the corresponding gene symbols. We can do this using ATACAnnotateRegions
from the ATACdemultiplex
package.
It can be installed with the following command:
go install gitlab.com/Grouumf/ATACdemultiplex/ATACAnnotateRegions@latest
Now, let’s create a file containing all the peaks from the index file interesting the GTF promoter file:
YGI="peaks.index.bed"
PROMOTER="promoters.gtf"
OUT="intersecting_peaks.promoter.bed"
ATACAnnotateRegions -bed ${YGI} -ref ${PROMOTER} -ignore -stdout|sort|uniq > -out ${OUT}
head ${OUT}
## output of the first lines:
chr1 10006926 10007925 Ppp1r42
chr1 10008609 10009608 Ppp1r42
chr1 10011702 10012701 Ppp1r42
chr1 10013940 10014939 Ppp1r42
chr1 10037419 10038418 Cops5
chr1 10037419 10038418 Cspp1
We can now load this file in python and define the set of peaks that
ygi_inter = "/home/poirio/data2/ENHLINK/SIMMAT/intersecting_peaks.promoter.bed"
dfyi = pd.read_csv(ygi_inter, sep="\t", header=None)
dfyi.columns = ['chr', 'start', 'stop', 'geneID']
dfyi['peak'] = dfyi['chr'] + "_" + dfyi['start'].astype('str') + "_" + dfyi['stop'].astype('str')
dfy_series = dfy['chr'] + "_" + dfy['start'].astype('str') + "_" + dfy['stop'].astype('str')
dfy_index = {a:i for i,a in enumerate(dfy_series)}
gene_index = {}
used_peaks = set()
for i in dfyi.index:
peak_str = dfyi.loc[i, 'peak']
peak_id = dfy_index[peak_str]
if peak_str in used_peaks:
continue
gene_index[dfyi.loc[i, 'geneID']] = peak_id
used_peaks.add(peak_str)
simulation function
Now, we need to define our simulation function that a) shuffle a promoter b) copy it and introduces noise for each simulated enhancer associated with it. It is controlled by two hyperparameters defining the ratios of false positives and false negatives in the simulated enhancers.
import numpy as np
def simulate_ygi_enh(ygi, nb_enh, noise1=1.8, noise2=0.05):
ygir = ygi.copy()
np.random.shuffle(ygir)
enhr_array = []
for _ in range (nb_enh):
enhr = ygir.copy()
for i, v in enumerate(enhr):
p1 = np.random.poisson(lam=noise1)
p2 = np.random.poisson(lam=noise2)
if p1 > 0 and v > 0:
enhr[i] = 0
elif p2 > 0 and v == 0:
enhr[i] = 1
enhr_array.append(enhr)
return ygir, np.asarray(enhr_array)
looping through genes and create simulated promoter and enhancers
We can select random genes from the list and create simulated enhancers for each selected gene. Then, we add these enhancers and promoters to the original sparse matrix as output, along with the column index of the new matrix, the list of simulated enhancers and their associated simulated promoter reference (bedpe), and the list of simulated promoters with their genomic coordinates
from scipy.sparse import hstack
from scipy.sparse import csr_matrix
from sys import stdout
import numpy as np
from scipy.sparse import save_npz
### Output files ###
fout = "simulated_matrix.npz"
fout_promoter = "simulated_matrix.promoter.tsv"
fout_feat = "simulated_matrix.enhancer.bedpe"
fout_ygi = "simulated_matrix.index.bed"
# Number of random genes to draw
nb_sim_genes = 400
# Number of simulated enhancers to draw for each genes (between 2 and 7)
range_nb_sim_features = (2, 8)
# Initialisation of object
sim_promoters = pd.DataFrame()
sim_features = pd.DataFrame()
dfy2 = dfy.copy()
current_index = dfy.shape[1]
rand_genes = np.random.choice(list(gene_index.keys()), size=nb_sim_genes)
rand_ids = np.array([gene_index[a] for a in rand_genes])
print("loading matrices...")
mat = mat.tocsr()
mat2 = mat.tolil()
print("Creating simulated variables...")
feat_array = None
used_ygis = set()
looping for each gene selected, simulate a promoter for a given peak within a promoter region, replace the value the promoter with the simulated promoter, and simulate enhancers linked to that promoter. Each peaks and promoter can only be used once.
for i, rand_id in enumerate(rand_ids):
stdout.write("\r#### feat nb. {0}".format(i))
stdout.flush()
nb_sim = np.random.randint(range_nb_sim_features[0], range_nb_sim_features[1])
vect_y = np.asarray(mat[:, rand_id].todense()).T[0]
chosen_prom = dfy.loc[rand_id].copy()
chosen_prom["geneID"] = i
sim_promoters = sim_promoters.append(chosen_prom)
vect_y, array_r = simulate_ygi_enh(vect_y, nb_sim, noise1=1.25, noise2=0.15)
mat2[:, rand_id] = vect_y
if feat_array is None:
feat_array = array_r
else:
feat_array = np.vstack([feat_array, array_r])
for j in range(nb_sim):
sim_y = chosen_prom.copy()
sim_y['start'] += 10000 + j
sim_y['stop'] = sim_y['start'] + 10 + j
while tuple(sim_y[['chr', 'start', 'stop']]) in used_ygis:
sim_y['start'] += 1
used_ygis.add(tuple(sim_y[['chr', 'start', 'stop']]))
dfy2 = dfy2.append(sim_y[['chr', 'start', 'stop']])
sim_y.index = ['chr2', 'start2', 'stop2', 'geneID']
sim_features = sim_features.append(pd.concat([chosen_prom, sim_y]))
Saving the matrix and indexes
Finally, we stack the simulated promoters and enhancers with the original matrices, compute the accessibility mean of each simulated promoter, and save the indexes
feat_array = csr_matrix(feat_array)
mat2 = hstack([mat2, feat_array.T])
prom_mean = mat[:, rand_ids].mean(axis=0)
enh_mean = mat2.tocsr()[:, -feat_array.shape[0]:].mean(axis=0)
print("\nPromoters mean: {0} std:{1}".format(prom_mean.mean(), prom_mean.std()))
print("Simulated enhancers mean: {0} std:{1}".format(enh_mean.mean(), enh_mean.std()))
for key in ['start', 'stop']:
sim_promoters[key] = sim_promoters[key].astype('int')
dfy2[key] = dfy2[key].astype('int')
for key in ['start', 'stop', 'start2', 'stop2', 'geneID']:
sim_features[key] = sim_features[key].astype('int')
sim_features['geneID'] = 'simGeneID' + sim_features['geneID'].astype('str')
sim_promoters['geneID'] = 'simGeneID' + sim_promoters['geneID'].astype('int').astype('str')
print("\n\nsaving files...")
sim_promoters.to_csv(fout_promoter, sep="\t", header=None, index=None)
print("file written: {0}".format(fout_promoter))
sim_features = sim_features.loc[:, ['chr', 'start', 'stop', 'chr2', 'start2', 'stop2', 'geneID']]
sim_features.to_csv(fout_feat, sep="\t", index=None)
print("file written: {0}".format(fout_feat))
dfy2.to_csv(fout_ygi, sep="\t", header=None, index=None)
print("file written: {0}".format(fout_ygi))
print("saving mat...")
save_npz(fout, mat2)
print("file written: {0}".format(fout))
Using Enhlink to process the simulated matrix
To process the simulated matrix with Enhlink, we have to first convert the matrix to a COO (COOrdinate formate) matrix, since Enhlink cannot read a numpy sparse matrix. A COO matrix is a three-columns file indicating the x and y coordinates and the values of each entry of the sparse matrix. To do so, we can use the following script: npz_to_COO_mat
python ./npz_to_COO_mat --npz simulated_matrix.npz --out simulated_matrix.coo.gz
We can then call Enhlink to infer enhancers on the specific simulated promoter regions:
COO="simulated_matrix.coo.gz"
XGI="simulated_matrix.xgi" # these are the row indexes
YGI="simulated_matrix.index.bed"
OUT="./simulated_matrix_enhlink/"
PROMOTER="simulated_matrix.promoter.tsv"
mkdir -p ${OUT}
enhlink \
-mat ${COO} \
-xgi ${XGI} \
-ygi ${YGI} \
-out ${OUT} \
-rmPeaksInPromoter \
-promoter ${PROMOTER} \
-threads 8