dataset.data_prep.Brain_Large
- class dataset.data_prep.Brain_Large(file_dir=('/home/longlab/Data/Thesis/Data/', '1M_neurons_filtered_gene_bc_matrices_h5.h5'), n_sub_samples=100000, n_select_genes=720, low_memory=True)
Bases: Dataset
Loads BRAIN data set.
A class with the necessary pre-processing steps for the large brain dataset. The variance of each gene is calculated on a sub-sample of cells (n_sub_samples, 10^5 by default), and the most highly variable genes (n_select_genes, 720 by default) are selected.
The large brain dataset can be downloaded from the following URL: http://cf.10xgenomics.com/samples/cell-exp/1.3.0/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
The data is in HDF5 format, which can be read and processed with the h5py library.
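The selection step described above can be illustrated with a small, dependency-free sketch. The function name select_high_variance_genes, the toy expression values, and the use of the population variance are assumptions made for illustration only; the class itself may normalize the counts or estimate the variance differently.

```python
import random
import statistics


def select_high_variance_genes(cells, n_sub_samples, n_select_genes, seed=0):
    """Return sorted indices of the genes with the highest variance.

    `cells` is a list of rows, one row of gene counts per cell.
    Hypothetical helper, not part of data_prep.
    """
    rng = random.Random(seed)
    # Draw the sub-sample of cells on which the variance is estimated.
    n_sub = min(n_sub_samples, len(cells))
    sub_sample = rng.sample(cells, n_sub)

    n_genes = len(cells[0])
    # Variance of each gene (column) across the sub-sampled cells.
    variances = [
        statistics.pvariance([cell[g] for cell in sub_sample])
        for g in range(n_genes)
    ]
    # Rank genes by variance, highest first, and keep the top ones.
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_select_genes])


# Toy data: 4 cells x 4 genes (made-up counts).
toy_cells = [
    [0, 1, 5, 2],
    [0, 9, 5, 2],
    [0, 1, 5, 3],
    [0, 9, 5, 2],
]
selected = select_high_variance_genes(toy_cells, n_sub_samples=4, n_select_genes=2)
```

Genes 0 and 2 are constant in the toy data, so only genes 1 and 3 carry any variance and are the ones kept.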
The data contains:
- barcodes: contains the batch number information, which can be used for batch correction.
- gene_names: contains the gene names.
- genes: contains the Ensembl gene IDs, such as ‘ENSMUSG00000089699’.
- data: an array containing all the non-zero elements of the sparse matrix.
- indices: an array mapping each element in data to its column in the sparse matrix.
- indptr: maps the elements of data and indices to the rows of the sparse matrix.
- shape: the dimensions of the sparse matrix.
For more information, see https://stackoverflow.com/questions/52299420/scipy-csr-matrix-understand-indptr and https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
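As a rough illustration of how data, indices and indptr fit together under the row-compressed reading described above, the following pure-Python sketch expands a toy matrix into dense form. The helper csr_to_dense and the toy values are hypothetical and are not part of data_prep or of the dataset.

```python
def csr_to_dense(data, indices, indptr, shape):
    """Expand CSR-style arrays into a dense list-of-lists matrix."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        # indptr[row]:indptr[row + 1] is the slice of data/indices
        # that belongs to this row.
        for k in range(indptr[row], indptr[row + 1]):
            dense[row][indices[k]] = data[k]
    return dense


# Toy 3 x 4 matrix with four non-zero entries (made-up values).
data = [5, 8, 3, 6]      # non-zero values
indices = [0, 2, 3, 1]   # column of each value
indptr = [0, 2, 2, 4]    # row boundaries; row 1 is empty
shape = (3, 4)

dense = csr_to_dense(data, indices, indptr, shape)
```

With SciPy installed, the same arrays can be passed directly as scipy.sparse.csr_matrix((data, indices, indptr), shape=shape), which avoids building the dense matrix at all.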
Parameters
- file_dir: str
The location of the HDF5 file, which should be provided by the user. The default shown in the signature is a (directory, file name) pair.
- n_sub_samples: int
Number of samples (cells) in the downsampled matrix. Default: 10^5
- n_select_genes: int
Number of highly variable genes to be selected in the pre-processing step. Default: 720
- low_memory: Boolean
If False, the whole dataset is loaded into memory, which requires a large amount of memory. Default: True
Attributes
- file_dir: str
The directory of the HDF5 file.
- n_select_genes: int
Number of highly variable genes to be selected in the pre-processing step.
- n_genes: int
Total number of genes in the data set.
- n_cells: int
Total number of cells.
- selected_genes: ndarray
The indices of the selected highly variable genes.
- low_memory: boolean
Whether the data is loaded in a memory-efficient way.
- matrix: scipy csc_matrix
The whole data if low_memory is False; otherwise None.
Examples
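>>> import data_prep
>>> from torch.utils.data import DataLoader  # assumed: PyTorch's DataLoader
>>> batch_size = 128  # example value
>>> brain = data_prep.Brain_Large()
>>> dl = DataLoader(brain, batch_size=batch_size, shuffle=True)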
Methods