dataset.data_prep.Brain_Large

class dataset.data_prep.Brain_Large(file_dir=('/home/longlab/Data/Thesis/Data/', '1M_neurons_filtered_gene_bc_matrices_h5.h5'), n_sub_samples=100000, n_select_genes=720, low_memory=True)

Bases: Dataset

Loads BRAIN data set.

A class that performs the necessary pre-processing steps for the Large brain dataset. The variance of each gene is calculated over a sub-sample of cells (10^5 by default), and the most highly variable genes (720 by default) are selected.
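The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the class's actual implementation: the function name `select_high_variance_genes`, the dense NumPy input, and the toy data are all assumptions made for the example, and the real class may work on sparse data instead.

```python
import numpy as np

def select_high_variance_genes(counts, n_sub_samples, n_select_genes, seed=0):
    """Return indices of the n_select_genes genes with the highest variance.

    counts: dense (n_cells, n_genes) array; a hypothetical stand-in for the real data.
    """
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    # Sub-sample cells without replacement (capped at the number of cells available).
    n_sub = min(n_sub_samples, n_cells)
    rows = rng.choice(n_cells, size=n_sub, replace=False)
    # Per-gene variance over the sub-sampled cells.
    gene_var = counts[rows].var(axis=0)
    # Indices of the top-variance genes, returned in ascending index order.
    top = np.argsort(gene_var)[-n_select_genes:]
    return np.sort(top)

# Toy data: 200 cells x 50 genes; genes 3 and 7 are made far more variable than the rest.
toy_rng = np.random.default_rng(42)
toy = toy_rng.poisson(1.0, size=(200, 50)).astype(float)
toy[:, 3] *= 20
toy[:, 7] *= 30
selected = select_high_variance_genes(toy, n_sub_samples=100, n_select_genes=2)
print(selected)
```

Computing the variance on a sub-sample keeps the cost of this step independent of the total number of cells, which is presumably why the class exposes n_sub_samples.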

The Large brain dataset can be downloaded from the following URL: http://cf.10xgenomics.com/samples/cell-exp/1.3.0/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5

The data is in HDF5 format which can be easily accessed and processed with the h5py library.

The data contains:

barcodes: Contains the batch number information, which can be used for batch correction.

gene_names: Contains the gene names.

genes: Contains the Ensembl gene IDs, such as ‘ENSMUSG00000089699’.

data: An array containing all the non-zero elements of the sparse matrix.

indices: An array mapping each element in data to its column in the sparse matrix.

indptr: Maps the elements of data and indices to the rows of the sparse matrix.

shape: The dimensions of the sparse matrix.
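As a hedged illustration of how batch information might be read from the barcodes, the sketch below assumes that each barcode ends in a numeric suffix after a hyphen (for example `-1`) and that this suffix identifies the batch. The helper name `batch_from_barcode` and the sample barcodes are hypothetical; check the suffix convention against the actual file before relying on it.

```python
def batch_from_barcode(barcode):
    """Parse the integer suffix after the last '-' in a barcode (str or bytes)."""
    # HDF5 string datasets are often returned as bytes, so decode if needed.
    if isinstance(barcode, bytes):
        barcode = barcode.decode("ascii")
    # rpartition splits on the last '-'; the suffix is assumed to be the batch number.
    _, _, suffix = barcode.rpartition("-")
    return int(suffix)

# Hypothetical barcodes, for illustration only.
barcodes = [b"AAACCTGAGATAGGAG-1", b"AAACCTGAGCGGCTTC-2", "TTTGTCATCTGAAAGA-133"]
batches = [batch_from_barcode(b) for b in barcodes]
print(batches)
```

A barcode without a numeric suffix raises ValueError in this sketch, which makes a violated assumption visible rather than silently assigning a batch.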

For more information, see https://stackoverflow.com/questions/52299420/scipy-csr-matrix-understand-indptr and https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
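To make the roles of data, indices, and indptr concrete, the following sketch expands a tiny compressed-row matrix into a dense array by hand. The toy values and the helper name `csr_to_dense` are assumptions for illustration only; the real dataset is far too large to densify this way.

```python
import numpy as np

# Toy 3 x 4 matrix (hypothetical values):
# [[1, 0, 2, 0],
#  [0, 0, 0, 0],
#  [0, 3, 0, 4]]
data = np.array([1, 2, 3, 4])      # non-zero values, stored row by row
indices = np.array([0, 2, 1, 3])   # column of each value in data
indptr = np.array([0, 2, 2, 4])    # row i owns data[indptr[i]:indptr[i + 1]]
shape = (3, 4)

def csr_to_dense(data, indices, indptr, shape):
    """Rebuild a dense array from the three compressed-row arrays."""
    dense = np.zeros(shape, dtype=data.dtype)
    for row in range(shape[0]):
        start, end = indptr[row], indptr[row + 1]
        # An empty row has start == end, so nothing is written for it.
        dense[row, indices[start:end]] = data[start:end]
    return dense

dense = csr_to_dense(data, indices, indptr, shape)
print(dense)
```

In practice the same three arrays can be passed directly to SciPy as `scipy.sparse.csr_matrix((data, indices, indptr), shape=shape)`, which avoids building a dense array at all.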

Parameters

file_dir: str

The location of the HDF5 file, which should be provided by the user. The default shown in the signature is a (directory, file name) pair.

n_sub_samples: int

Number of samples (cells) in the downsampled matrix. Default: 10^5

n_select_genes: int

Number of highly variable genes to be selected in the pre-processing step. Default: 720

low_memory: Boolean

If False, the whole dataset is loaded into memory, which requires a large amount of memory. Default: True

Attributes

file_dir: str

The directory of the HDF5 file.

n_select_genes: int

Number of highly variable genes to be selected in the pre-processing step.

n_genes: int

Total number of genes in the data set.

n_cells: int

Total number of cells.

selected_genes: ndarray

The indices of the selected highly variable genes.

low_memory: Boolean

Whether the data is loaded in a memory-efficient way.

matrix: scipy csc_matrix

The whole data if low_memory is False; otherwise None.

Examples

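>>> from torch.utils.data import DataLoader  # assumption: PyTorch's DataLoader is intended
>>> from dataset import data_prep
>>> brain = data_prep.Brain_Large()
>>> batch_size = 128  # example value; choose to fit your memory
>>> dl = DataLoader(brain, batch_size=batch_size, shuffle=True)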

Methods