dataset.data_prep.Brain_Large

class dataset.data_prep.Brain_Large(file_dir=('/home/longlab/Data/Thesis/Data/', '1M_neurons_filtered_gene_bc_matrices_h5.h5'), n_sub_samples=100000, n_select_genes=720, low_memory=True)

Bases: Dataset

Loads BRAIN data set.

A class providing the necessary pre-processing steps for the large brain dataset. Gene variances are computed on a sub-sample of cells (10^5 by default), and the most highly variable genes (720 by default) are selected.
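The selection step described above can be sketched as follows. This is a minimal illustration on a dense toy matrix, not the code used by Brain_Large: the function name select_high_variance_genes, the seed argument, and the use of plain variance as the ranking criterion are assumptions made for the example.

```python
import numpy as np


def select_high_variance_genes(counts, n_sub_samples, n_select_genes, seed=0):
    """Return sorted column indices of the genes with the largest variance.

    The variance is estimated on a random sub-sample of rows (cells).
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    # Never ask for more cells than the matrix contains.
    n_sub = min(n_sub_samples, n_cells)
    rows = rng.choice(n_cells, size=n_sub, replace=False)
    # Per-gene variance over the sub-sampled cells.
    variances = counts[rows].var(axis=0)
    # argsort is ascending, so the last entries are the most variable genes.
    top = np.argsort(variances)[-n_select_genes:]
    return np.sort(top)


# Toy data: 50 cells x 6 genes; genes 1 and 4 are scaled to be far more variable.
toy_rng = np.random.default_rng(1)
toy = toy_rng.poisson(1.0, size=(50, 6)).astype(float)
toy[:, 1] *= 20
toy[:, 4] *= 30

selected = select_high_variance_genes(toy, n_sub_samples=40, n_select_genes=2)
print(selected)
```

Estimating the variance on a sub-sample keeps the cost of the selection step independent of the total number of cells, which is presumably why the class exposes n_sub_samples.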

The large brain dataset can be downloaded from the following URL: http://cf.10xgenomics.com/samples/cell-exp/1.3.0/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5

The data is in HDF5 format, which can be read and processed with the h5py library.
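As a rough sketch of how an HDF5 file can be inspected with h5py, the snippet below lists every dataset in a file together with its shape and dtype. It writes a tiny toy file first so that it runs on its own; the helper name list_datasets, the group name toy_group, and the toy arrays are assumptions for illustration and do not describe the layout of the real file.

```python
import os
import tempfile

import h5py
import numpy as np


def list_datasets(path):
    """Return {dataset name: (shape, dtype)} for every dataset in an HDF5 file."""
    found = {}

    def visit(name, obj):
        # visititems walks groups and datasets; keep only the datasets.
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return found


# Build a small stand-in file (hypothetical layout, not the real data set).
toy_path = os.path.join(tempfile.mkdtemp(), "toy.h5")
with h5py.File(toy_path, "w") as f:
    grp = f.create_group("toy_group")
    grp.create_dataset("data", data=np.array([1, 2, 3], dtype=np.int32))
    grp.create_dataset("indptr", data=np.array([0, 2, 3], dtype=np.int64))

datasets = list_datasets(toy_path)
for name, (shape, dtype) in sorted(datasets.items()):
    print(name, shape, dtype)
```

Running the same helper on the downloaded file would show which of the fields listed in this page are present and how large they are, without loading any array into memory.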

The data contains:

barcodes: Contains the batch number information, which can be used for batch correction.
gene_names: The gene names.
genes: The Ensembl gene IDs, such as 'ENSMUSG00000089699'.
data: An array containing all the non-zero elements of the sparse matrix.
indices: An array mapping each element in data to its column in the sparse matrix.
indptr: Maps the elements of data and indices to the rows of the sparse matrix.
shape: The dimensions of the sparse matrix.

For more information, see https://stackoverflow.com/questions/52299420/scipy-csr-matrix-understand-indptr and https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
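To make the roles of data, indices, and indptr concrete, here is a small sketch that rebuilds a sparse matrix from the three arrays using scipy.sparse.csr_matrix, following the row-oriented description above. The array values are toy numbers chosen for the example and are not taken from the data set.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy arrays in the layout described above (values are illustrative).
data = np.array([1, 2, 3, 4, 5])      # non-zero values
indices = np.array([0, 2, 2, 0, 1])   # column of each value in data
indptr = np.array([0, 2, 3, 5])       # row i owns data[indptr[i]:indptr[i + 1]]
shape = (3, 3)

matrix = csr_matrix((data, indices, indptr), shape=shape)
dense = matrix.toarray()
print(dense)

# The same row slicing done by hand: values and columns stored for row 2.
row = 2
start, stop = indptr[row], indptr[row + 1]
row_values = data[start:stop]
row_columns = indices[start:stop]
print(row_values, row_columns)
```

Because indptr gives the slice boundaries of each row, a single row can be read from disk by slicing data and indices, without materializing the whole matrix.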

Parameters

file_dir: str: The location of the HDF5 file, which should be provided by the user (the default in the signature is a directory and file name pair).
n_sub_samples: int: Number of samples (cells) in the downsampled matrix. Default: 10^5
n_select_genes: int: Number of highly variable genes to be selected in the pre-processing step. Default: 720
low_memory: bool: If False, the whole data set is loaded into memory (this requires a large amount of memory). Default: True

Attributes

file_dir: str: The directory of the HDF5 file.
n_select_genes: int: Number of highly variable genes to be selected in the pre-processing step.
n_genes: int: Total number of genes in the data set.
n_cells: int: Total number of cells.
selected_genes: ndarray: The indices of the selected highly variable genes.
low_memory: bool: Whether to load the data in a memory-efficient way.
matrix: scipy csc_matrix: The whole data set if low_memory is False; otherwise None.

Examples

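>>> from torch.utils.data import DataLoader  # assumes the PyTorch DataLoader
>>> import data_prep
>>> batch_size = 128  # illustrative value
>>> brain = data_prep.Brain_Large()
>>> dl = DataLoader(brain, batch_size=batch_size, shuffle=True)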

Methods