Creating an Epitome Dataset
This section explains how to load an Epitome Dataset. If you are interested in pre-processing your own dataset from ENCODE or ChIP-Atlas, see Configuring data.
First, import EpitomeDataset:
from epitome.dataset import *
Create an Epitome Dataset
Next, create an Epitome Dataset. In the dataset, you define the ChIP-seq targets you want to predict, the cell types you want to train on, and the assays you want to use to compute cell type similarity.
targets = ['CTCF','RAD21','SMC3']
celltypes = ['K562', 'A549', 'GM12878']
dataset = EpitomeDataset(targets=targets, cells=celltypes)
Note that you do not have to specify cell types. If you leave cells
unset, the Epitome dataset will choose cell types that have coverage for the
chosen ChIP-seq targets. The parameters min_cells_per_target and min_targets_per_cell
specify the minimum number of cell types required to include a ChIP-seq target, and the
minimum number of ChIP-seq targets required to include a cell type. By default,
min_cells_per_target = 3 and min_targets_per_cell = 2.
targets = ['CTCF','RAD21','SMC3']
dataset = EpitomeDataset(targets=targets,
                         min_cells_per_target=4,  # requires that each ChIP-seq target has data from at least 4 cell types
                         min_targets_per_cell=3)  # requires that each cell type has data for all three ChIP-seq targets
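To make the two thresholds concrete, here is a minimal, standalone sketch of how such a filter could work. It does not call Epitome: the function filter_coverage, the single-pass order (targets first, then cell types), and the toy coverage table are all assumptions made for illustration, and the real selection logic may differ.

```python
# Hypothetical illustration only; this is not Epitome's implementation.
def filter_coverage(coverage, min_cells_per_target=3, min_targets_per_cell=2):
    """Apply both thresholds to a mapping of target -> set of cell types with data."""
    # Keep targets that have data in enough cell types.
    kept_targets = {
        target: cells
        for target, cells in coverage.items()
        if len(cells) >= min_cells_per_target
    }
    # Count how many of the kept targets each cell type covers.
    counts = {}
    for cells in kept_targets.values():
        for cell in cells:
            counts[cell] = counts.get(cell, 0) + 1
    # Keep cell types that cover enough of the kept targets.
    kept_cells = sorted(c for c, n in counts.items() if n >= min_targets_per_cell)
    return sorted(kept_targets), kept_cells


# Made-up coverage table, not real ENCODE or ChIP-Atlas coverage.
toy_coverage = {
    "CTCF": {"K562", "A549", "GM12878", "HepG2"},
    "RAD21": {"K562", "A549", "GM12878", "HepG2"},
    "SMC3": {"K562", "A549", "GM12878", "HepG2", "IMR-90"},
}

kept_targets, kept_cells = filter_coverage(
    toy_coverage, min_cells_per_target=4, min_targets_per_cell=3
)
print(kept_targets)  # ['CTCF', 'RAD21', 'SMC3']
print(kept_cells)    # ['A549', 'GM12878', 'HepG2', 'K562']
```

In this toy table, IMR-90 is dropped because it has data for only one of the three targets, while every target is kept because each has data in at least four cell types.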
Note that by default, EpitomeDataset uses DNase-seq (DNase) to compute similarity between cell types. To use different assays for cell type similarity, pass them as similarity_targets:
dataset = EpitomeDataset(targets=targets,
                         cells=celltypes,
                         similarity_targets=['DNase', 'H3K27ac'])
You can then visualize the ChIP-seq targets and cell types in your dataset by
using the view() function:
dataset.view()
To list all of the ChIP-seq targets that an Epitome dataset has available data for,
you can define an Epitome Dataset without specifying targets or cells.
You can then use the list_targets() function to print all available ChIP-seq targets
in the dataset:
dataset = EpitomeDataset()
dataset.list_targets() # prints > 200 ChIP-seq targets
You can now use your dataset in an Epitome model.
Load your processed dataset
You can specify the data path and/or genome assembly that you would like to use
in the Epitome dataset. You just need to define the data_dir and/or
assembly variables:
dataset = EpitomeDataset(data_dir="path/to/configured/data",
                         assembly="hg19")
Note that if both data_dir and assembly are set, the dataset will
append the specified assembly to the data_dir path, such as
~/$USERNAME/epitome/data/hg19/data.h5, and return the dataset that is stored
at that path if it exists. If there is no data stored at that path, Epitome will
try to download the specified assembly from the S3 bucket at
https://epitome-data.s3-us-west-1.amazonaws.com.
You do not need to define both variables though. If you leave data_dir empty,
the Epitome dataset will append the assembly to the default data path located
in ~/$USER_NAME/.epitome/data/ and return the dataset if it exists at that path.
If there is no existing dataset located at the data path, Epitome will download
the dataset for the specified assembly from S3 to that path:
dataset = EpitomeDataset(assembly="hg19")
If the assembly is not specified but the data_dir is, the dataset will assume
that the specified data directory data_dir is the absolute data path and it
will append the default assembly to the configured data path. Like above, if the
dataset exists at the configured data path, Epitome will load the configured data
into the EpitomeDataset. If there is no existing dataset, Epitome will download
the dataset for the default assembly from S3 and store it at the default data path:
dataset = EpitomeDataset(data_dir="path/to/configured/data")
If neither data_dir nor assembly is set, the dataset will just try to
fetch the data.zip file in the default data directory. If no data exists in
the default directory, Epitome will download the dataset for the default assembly
from S3 and store it at the default data path:
dataset = EpitomeDataset()
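The four cases above can be summarized in a small, standalone sketch of the path resolution. It does not call Epitome. The helper resolve_data_path, the constants, the default directory under the home folder, and the choice of hg19 as the default assembly are all assumptions made for illustration (the text above does not name the default assembly); the file name data.h5 is taken from the example path above.

```python
from pathlib import Path

# All three constants are assumptions for illustration, not Epitome's actual values.
DEFAULT_DATA_DIR = Path.home() / ".epitome" / "data"
DEFAULT_ASSEMBLY = "hg19"
DATA_FILE = "data.h5"


def resolve_data_path(data_dir=None, assembly=None):
    """Hypothetical helper: build <data_dir>/<assembly>/data.h5 with fallbacks."""
    # Fall back to the default directory when data_dir is not given.
    base = Path(data_dir).expanduser() if data_dir is not None else DEFAULT_DATA_DIR
    # Fall back to the default assembly when assembly is not given.
    return base / (assembly or DEFAULT_ASSEMBLY) / DATA_FILE


def needs_download(path):
    """A download would only be attempted when nothing exists at the resolved path."""
    return not Path(path).exists()


# Both set, only assembly set, only data_dir set, neither set.
print(resolve_data_path("path/to/configured/data", "hg19"))
print(resolve_data_path(assembly="hg19"))
print(resolve_data_path(data_dir="path/to/configured/data"))
print(resolve_data_path())
```

Keeping path resolution separate from the existence check makes it easy to see which location a given combination of arguments would read from, or download to, before any network access happens.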