# Tutorial 3: Create sparse matrices compatible with Enhlink In this tutorial, we will describe how to create matrices that can be used with Enhlink. ## recognized matrices format Enhink currently accepts three types of matrix format: 1. the COOrdinate format (COO) which is similar to the scipy [COO format](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html) consists of a three-columns tab separated .tsv file with the two first columns being the cell and the peak index and the last columns the values. The matrix file needs to have two index files describing the cell barcodes and the features, which can be either a bed file with genomic regions or just feature names such as gene IDs. The COO matrix has the indexes starting at 0. ```bash # These are the first lines of dummy matrices used as illustration head -3 example_atac_coo_mat.tsv # output: 0 10 1 2 0 1 3 30 1 head -3 example_barcodes.txt # output: TCGAAAAA-1 TCGCCCCC-1 TCGGGGGG-1 head -3 example_peaks.bed # output: chr1 100000 101000 chr2 100000 101000 chr3 100000 101000 ``` In the example above, the second line of the matrix `2 1 1` refers to `TCGGGGGG-1` and `chr1 100000 101000` having a value of 1. 2. the Matrix Market File (MTX) is described [here](https://networkrepository.com/mtx-matrix-market-format.html) and is very similar to the COO format.the MTX matrix comes with a header and the first line describes the dimensions of the matrix (i.e. the number of cells and features). The MTX matrix also has the indexes starting at 0. ```bash head -4 example_atac_mtx_mat.tsv # output: %%MatrixMarket matrix coordinate real general % 100 500 2022 0 10 1 2 0 1 3 30 1 ``` The matrix above is similar to the COO except the header and the first line: `100 500 2022`, indicating the number of cells (100), the number of features (500), and the sum of all the values (2022). These numbers are made up for the example. 3. The cellRanger format is a MTX matrix with the feature indexes in the first column, the barcode index in the second columns and has the indexes starting at 1 instead of 0. Please notice the how the matrix above differs from the first MTX matrix. ```bash head -4 example_atac_cellRanger.tsv # output: %%MatrixMarket matrix coordinate real general % 500 100 2022 11 1 1 1 2 1 31 4 1 ``` ## Create a cell x peak sparse matrix from a fragment file Fragments files are BED-formatted files where the fourth column corresponds to a cell barcode. These files are one of the standard output files generated by the Cell Ranger single-cell ATAC-seq data processing pipeline. Each line in the file represents a sequencing read that has been assigned to a specific cell barcode. First, let's download a fragments file from one of the example available datasets of 10XGenomics: we will use the multi-omic RNA/ATAC [PBMC from a Healthy Donor - No Cell Sorting (3k)](https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-3-k-1-standard-2-0-0) dataset. ``` # Downloading fragment file wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_fragments.tsv.gz # preview the fragments file without the header lines starting with # # Please use gzcat instead of zcat (frome coreutils) on OSX zcat pbmc_unsorted_3k_atac_fragments.tsv.gz|grep -v "#"|head # output chr1 10073 10431 CAAGGGAGTAAGCTCA-1 1 chr1 10085 10279 ACAGGTAAGCTTTGGG-1 1 chr1 10085 10335 GGTTAATGTACTTCAC-1 1 chr1 10091 10339 TGTTGGCCAAGGAATC-1 1 chr1 10150 10192 CCCTTAATCATCCTCA-1 2 # The last column corresponds to the read orientation and has no importance here ``` Then, we need to download (or to infer) peaks or genomic regions from this fragments file. If we want to compute ourself the peaks, we can use for example the [MACS2](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html) software. In the case of ATAC-seq reads, the peaks are which regions that corresponds to open chromatin regions. Alternatively, Cell-Ranger already computed these peaks for us, so let's just use their file: ``` # Downloading the peaks file wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_peaks.bed ``` Finally we need the barcodes file. We will use the one from the filtered feature matrix for the gene expression because we know that these barcodes are considered as valid by Cell-Ranger. Also, we will be able to use both matrices in the next tutorial: ```bash wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz # decompressing the file tar -xvf pbmc_unsorted_3k/pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz ``` To create the matrix we can use the [_ATACMatutils_](https://gitlab.com/Grouumf/ATACdemultiplex#atacmatutils-efficiently-construct-matrix-for-scatac-seq-datase) software from a 2021 [snATAC-seq study](https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_peaks.bed). To install it, we can directly download the executable if we have a linux x86_64 architecture or install it with go: ```bash # option 1: install it with go go install gitlab.com/Grouumf/ATACdemultiplex/ATACMatUtils@latest # option 2: download the executable wget https://gitlab.com/Grouumf/ATACdemultiplex/-/raw/master/bin/ATACMatUtils chmod +x ./ATACMatUtils # give exec permission cp ./ATACMatUtils /usr/local/bin # optionally put the exec in the global path ``` More info can be found [here](https://gitlab.com/Grouumf/ATACdemultiplex). We can now easily create a cell x peak sparse matrix: ```bash ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \ -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \ -ygi ./pbmc_unsorted_3k_atac_peaks.bed \ -out ./filtered_feature_atac_matrix \ -format cellRanger ``` this should only take a minute and creates a new folder with the mtx matrix and the two index files: ```bash ls ./filtered_feature_atac_matrix # output: filtered_feature_atac_matrix ├── barcodes.tsv.gz ├── features.tsv.gz └── matrix.mtx.gz ``` The matrix above is boolean, e.g. the value is alwasy 1. in other words, a peak is either accessible or not for a given cell. It is also possible to create a numerical matrices with the `-use_count` and the `-norm_type` options of `ATACMatUtils` (see ATACMatUtils -h). For example, we will bellow create a numerical matrix that can be used as `Gene expression` matrix with Enhlink (See next tutorial). ```bash # the following command creates a matrix having the read count as numerical value for each promoter ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \ -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \ -ygi ./Homo_sapiens.GRCh38.99.TSS.2K.bed \ -out ./promoter_atac_matrix \ -use_symbol \ -use_count \ -norm_type count \ -format cellRanger # display the first lines of the matrix # Please use gzcat instead of zcat (frome coreutils) on OSX zcat promoter_atac_matrix/matrix.mtx.gz|head # output %%MatrixMarket matrix coordinate real general % 60642 3009 15321203 59922 1 2 18545 1 3 29792 1 1 17524 1 2 # display the first lines of the feature files # Please use gzcat instead of zcat (frome coreutils) on OSX zcat promoter_atac_matrix/features.tsv.gz|head A1BG A1BG count A1BG-AS1 A1BG-AS1 count A1CF A1CF count A2M A2M count A2M-AS1 A2M-AS1 count ``` Other norms can be used such as simple, rpm, logrpm, fpkm, and logfpkm (see ATACMatUtils -h). ## Python frameworks ### Transform a Python sparse matrix to a COO/MTX file. Sparse matrices are manipulated using the scipy.sparse library. To save a python sparse matrix into a MTX format, simply use the `scipy.io` package. ```python from scipy.io import mmwrite import numpy as np from scipy.sparse import coo_matrix # dummy matrix M = np.random.randint(0, 2, (10, 40)) Ms = coo_matrix(M) mmwrite('dummy_mtx_matrix.mtx', Ms) ``` ### Transform a scanpy object to MTX a scanpy object stores the data in structure linked to a sparse matrix and a dataframe with the barcode and feature indexes ```python # Here is an example for converting an AnnData object to a sparse matrix with the index files. import scanpy as sc # Example dataset adata = sc.datasets.ebi_expression_atlas("E-MTAB-4888") # To be conform with the cellRanger format, we have to use the transpose of the matrix mmwrite('E-MTAB-4888.sparse_matrix.mtx', adata.X.T) adata.obs.index.to_frame().to_csv('barcodes.tsv.gz', header=None, sep="\t", index=None) adata.var.index.to_frame().to_csv('features.tsv.gz', header=None, sep="\t", index=None) ``` ### read MTX/COO matrices to Python Reading mtx matrices in Python is straightforward with the `mmread` function of the `scipy.io` library. `scanpy` is also able to directly read a MTX matrix from cellRanger. Finally, [here](https://gitlab.com/Grouumf/ATACdemultiplex/-/blob/master/scripts/COO_to_npz_mat) is a code to convert COO matrices to Python. ### Concert NPZ sparse matrix and index into a dataframe Python sparse matrix are `.npz` files can be read with the scipy sparse library. Since the row and column indexes of this matrix are numerical, a npz matrix requires two index files (xgi and ygi), describing the row and column features. ```python from scipy.sparse import load_npz import pandas as pd import numpy as np fmat = "python.sparse_matrix.npz" fxindex = "python.sparse_matrix.xindex.xgi" fyindex = "python.sparse_matrix.xindex.xgi" # Load the matrix M = load_npz(fmat) # Load the index and convert them into an array xgi = np.array([x.strip() for x in open(fxindex)]) ygi = np.array([x.strip() for y in open(fyindex)]) # Create a pandas dataframe df = pd.DataFrame.sparse.from_spmatrix(M, index=xgi, columns=ygi) ``` ## R frameworks ### Saving matrices The [`R-devel`](https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/externalFormats.html) library contains the equivalent of the `scipy.io` functions. Furthermore, the library [write10xCounts](https://rdrr.io/github/MarioniLab/DropletUtils/man/write10xCounts.html) seems to be suited for saving and reading genomic matrices and data into the cellRanger format, but we didn't test it at this point. ### Reading matrices reading COO matrices and converting them into R sparse matrices is straightforward using the R `sparseMatrix` object from the `Matrix` library, since it takes the same 3 columns as input. [here](https://gitlab.com/Grouumf/ATACdemultiplex/-/blob/master/scripts/coo_to_R_sparse_matrix.R) is an example code that can be used for this task.