API documentation
CluBCpG was built primarily as a set of command-line tools for the analysis of WGBS data. However, the package also includes a few APIs that may be useful for users who want to interact with their data programmatically or extend the functionality of CluBCpG.
CluBCpG APIs
class clubcpg.ParseBam.BamFileReadParser(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)
Used to simplify opening and reading BAM files. BAMs must be coordinate sorted and indexed.
Example:
>>> from clubcpg.ParseBam import BamFileReadParser
>>> parser = BamFileReadParser("/path/to/data.BAM", quality_score=20, read1_5=3, read1_3=4, read2_5=7, read2_3=1)
>>> reads = parser.parse_reads("chr7", 10000, 101000)
>>> reads = parser.correct_cpg_positions(reads)  # This step is optional
>>> matrix = parser.create_matrix(reads)
__init__(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)
Class used to read WGBS reads from a BAM file, extract methylation, and convert it into a data frame.
Parameters:
- bamfile – Path to the BAM file
- quality_score – Only include reads with fastq quality >= this value
- read1_5 – M-bias: bases to ignore at the 5’ end of read 1
- read1_3 – M-bias: bases to ignore at the 3’ end of read 1
- read2_5 – M-bias: bases to ignore at the 5’ end of read 2
- read2_3 – M-bias: bases to ignore at the 3’ end of read 2
- no_overlap – bool. If overlap exists between two reads, ignore that region from read 2.
static correct_cpg_positions(output: list)
Bismark alignment sometimes reports a CpG site location that is off by 1 bp, even after accounting for DNA strand alignment. This function fixes that. If two CpGs have positions such as 4 and 5 (which is impossible, because there needs to be a G between them), this function converts all 5s to 4s. This only needs to be applied to matrices which are empty after dropna() is called.
Parameters: output – a list of lists of tuples; the output of self.parse_reads()
Returns: a list of the same style, except the first position in each tuple holds a corrected CpG position.
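The correction described above can be illustrated with a minimal plain-Python sketch. This is not the library's implementation: the function name `collapse_adjacent_positions` is hypothetical, and the `(position, state)` tuple layout is an assumption based on the parameter description.

```python
def collapse_adjacent_positions(reads):
    """Map any CpG position that sits 1 bp after another onto the lower one.

    `reads` is assumed to be a list of reads, each a list of
    (position, methylation_state) tuples.
    """
    positions = sorted({pos for read in reads for pos, _ in read})
    remap = {}
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev == 1:
            # Follow an earlier remapping so runs such as 4, 5, 6 end up at 4.
            remap[cur] = remap.get(prev, prev)
    return [[(remap.get(pos, pos), state) for pos, state in read] for read in reads]


reads = [[(4, 1), (10, 0)], [(5, 1), (10, 1)]]
corrected = collapse_adjacent_positions(reads)
```

Here position 5 is folded into position 4, so both reads now report the same CpG.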
create_matrix(read_cpgs)
Convert parsed reads into a pandas DataFrame.
Parameters: read_cpgs (iterable) – read CpGs generated by self.parse_reads
Returns: matrix of methylated (1) and unmethylated (0) states
Return type: pd.DataFrame
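To make the shape of the result concrete, the following plain-Python sketch lays parsed reads out as a reads-by-CpG grid. It only illustrates the idea: the real method returns a pandas DataFrame, and the helper name `build_read_matrix`, the `(position, state)` tuple layout, and the rows-are-reads orientation are assumptions.

```python
def build_read_matrix(read_cpgs):
    """Return (sorted CpG positions, rows), one row per read.

    Each row holds 1 (methylated), 0 (unmethylated), or None where the
    read does not cover that CpG.
    """
    positions = sorted({pos for read in read_cpgs for pos, _ in read})
    rows = []
    for read in read_cpgs:
        states = dict(read)
        rows.append([states.get(pos) for pos in positions])
    return positions, rows


read_cpgs = [
    [(100, 1), (120, 0)],
    [(100, 0), (120, 0), (150, 1)],
    [(120, 1), (150, 1)],
]
positions, rows = build_read_matrix(read_cpgs)
# Keep only reads that cover every CpG, analogous to DataFrame.dropna().
complete_rows = [row for row in rows if None not in row]
```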
fix_read_overlap(full_reads, read_cpgs)
Takes pysam reads and the read_cpgs generated during parse_reads and removes any overlap between read 1 and read 2. If possible, it also stitches read 1 and read 2 together to create a super read.
Parameters:
- full_reads – set of reads generated by self.parse_reads()
- read_cpgs – read CpGs generated by self.parse_reads()
Returns: A list in the same format as the read_cpgs input, but corrected for paired read overlap
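A rough sketch of the overlap handling, assuming each mate is already reduced to a list of `(position, state)` tuples: CpGs that read 1 already reports are dropped from read 2, and the remainder is joined into one super read. The helper name `merge_mate_cpgs` is hypothetical, and the actual method works on pysam reads rather than on these simplified lists.

```python
def merge_mate_cpgs(read1_cpgs, read2_cpgs):
    """Stitch two mates into one super read, preferring read 1 in the overlap."""
    covered = {pos for pos, _ in read1_cpgs}
    # Drop CpGs from read 2 that read 1 already reports.
    unique_to_read2 = [(pos, state) for pos, state in read2_cpgs if pos not in covered]
    return sorted(read1_cpgs + unique_to_read2)


read1 = [(100, 1), (120, 0)]
read2 = [(120, 1), (150, 1)]
super_read = merge_mate_cpgs(read1, read2)
```

The conflicting call at position 120 is resolved in favour of read 1, matching the no_overlap behaviour of ignoring the overlapping region from read 2.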
class clubcpg.CalculateBinCoverage.CalculateCompleteBins(bam_file: str, bin_size: int, output_directory: str, number_of_processors=1, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, no_overlap=True)
Class to calculate the number of reads covering all CpGs.
__init__(bam_file: str, bin_size: int, output_directory: str, number_of_processors=1, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, no_overlap=True)
This class is initialized with a path to a BAM file and a bin size.
Parameters:
- bam_file – One of the BAM files for analysis to be performed
- bin_size – Size of the bins for the analysis, integer
- number_of_processors – How many CPUs to use for parallel computation, default=1
analyze_bins(individual_chrom=None)
Main function in the class. Runs the complete analysis on the data.
Parameters: individual_chrom – Chromosome to analyze, e.g. “chr7”
Returns: filename of the generated report
calculate_bin_coverage(bin)
Takes a single bin and returns a matrix. This is passed to a multiprocessing Pool.
Parameters: bin – Bin should be passed as “Chr19_4343343”
Returns: pd.DataFrame with rows containing NaNs dropped
generate_bins_list(chromosome_len_dict: dict)
Get a dict of lists of all bins, according to the desired bin size, for all chromosomes in the passed dict.
Parameters: chromosome_len_dict – A dict of chromosome lengths from get_chromosome_lenghts, cleaned up by remove_scaffolds() if desired
Returns: dict with each key being a chromosome, e.g. chr1
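As an illustration of the binning described above, this sketch builds a dict of bin labels per chromosome. The function name `make_bins` is hypothetical. Labelling each bin by its end coordinate follows the "chr_position" format used elsewhere on this page; dropping a trailing partial bin is an assumption and may differ from the library's behaviour.

```python
def make_bins(chromosome_lengths, bin_size):
    """Return {chromosome: ["<chromosome>_<bin end>", ...]} for whole bins only."""
    bins = {}
    for chrom, length in chromosome_lengths.items():
        bins[chrom] = [f"{chrom}_{end}" for end in range(bin_size, length + 1, bin_size)]
    return bins


bins = make_bins({"chr1": 350, "chr2": 100}, bin_size=100)
```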
class clubcpg.ClusterReads.ClusterReads(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True)
This class takes a dataframe or matrix of reads and clusters them.
Example:
>>> from clubcpg.ClusterReads import ClusterReads
>>> cluster = ClusterReads(bam_a="/path/to/file.bam", bam_b="/path/to/file.bam", bins_file="/path/to/file.csv", suffix="chr19")
>>> cluster.execute()
__init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True)
Initialize self. See help(type(self)) for accurate signature.
static attempt_cpg_position_correction(reads, parser: clubcpg.ParseBam.BamFileReadParser)
Takes the reads and a parser object, attempts CpG position correction, and returns the corrected reads.
Parameters:
- reads – parsed reads from BamFileReadParser
- parser – an instance of the BamFileReadParser object
Returns: reads with CpG positions corrected
execute(return_only=False)
This method starts multiprocessing execution of this class.
Parameters: return_only (bool) – Whether to return the results as a variable (True) or write them to file (False)
Returns: list of lists if return_only is True, otherwise None
Return type: list or None
filter_data_frame(matrix: pandas.core.frame.DataFrame)
Takes a dataframe of clusters and removes any groups with fewer than self.cluster_member_min members.
Parameters: matrix (pd.DataFrame) – dataframe of clustered reads
Returns: input matrix with some clusters removed
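The size filter can be sketched without pandas as follows. This is only an illustration of the rule, not the method itself: `drop_small_clusters` is a hypothetical helper, and it assumes each read carries a single cluster label.

```python
from collections import Counter


def drop_small_clusters(cluster_labels, cluster_member_min=4):
    """Return indices of reads whose cluster has at least `cluster_member_min` reads."""
    sizes = Counter(cluster_labels)
    return [i for i, label in enumerate(cluster_labels) if sizes[label] >= cluster_member_min]


labels = [0, 0, 0, 0, 1, 1, 2, 0]
kept = drop_small_clusters(labels, cluster_member_min=4)
```

Only cluster 0 has at least four members here, so the reads in clusters 1 and 2 are removed.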
generate_individual_matrix_data(filtered_matrix, chromosome, bin_loc)
Takes the output of process_bins() and converts it into a list of lines of text data for output.
Parameters:
- filtered_matrix (pd.DataFrame) – dataframe returned by ClusterReads.filter_data_frame()
- chromosome (string) – chromosome as “Chr5”
- bin_loc (string) – location representing the bin given as the end coordinate, e.g. 590000
Returns: comma separated lines extracted from the filtered matrix, containing chromosome and bin info
Return type: list
process_bins(bin)
This is the main method and should be called using Pool.map. It takes one bin location and uses the other helper functions to get the reads, form the matrix, cluster it with DBSCAN, and output the cluster data as text lines ready to be written to a file.
Parameters: bin – string in this format: “chr19_55555”
Returns: a list of lines representing the cluster data from that bin
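For readers building their own bin labels, here is a small sketch of how such a string can be split back into coordinates. The helper `parse_bin` is hypothetical. It assumes the number is the bin's end coordinate, as stated for bin_loc above, and that the start lies one bin size before it.

```python
def parse_bin(bin_label, bin_size=100):
    """Split a label such as 'chr19_55555' into (chromosome, start, end)."""
    # rsplit keeps chromosome names that themselves contain underscores intact.
    chromosome, end_text = bin_label.rsplit("_", 1)
    end = int(end_text)
    return chromosome, end - bin_size, end


chromosome, start, end = parse_bin("chr19_55555", bin_size=100)
```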
class clubcpg.ClusterReads.ClusterReadsWithImputation(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)
This class performs the same clustering but can also perform imputation during clustering. It inherits from ClusterReads.
Example:
>>> from clubcpg.ClusterReads import ClusterReadsWithImputation
>>> cluster = ClusterReadsWithImputation(...)
>>> cluster.execute()
__init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)
Initialize self. See help(type(self)) for accurate signature.
execute(return_only=False)
This method starts multiprocessing execution of this class.
Parameters: return_only (bool) – Whether to return the results as a variable (True) or write them to file (False)
Returns: list of lists if return_only is True, otherwise None
Return type: list or None
class clubcpg.ConnectToCpGNet.TrainWithPReLIM(cpg_density=None, save_path=None)
Used to train models using CpGNet.
__init__(cpg_density=None, save_path=None)
Class to train a CpGNet model from input data.
Parameters:
- cpg_density (int) – Number of CpGs
- save_path – Location of the folder to save the resulting model files. One per CpG density
save_net(model)
Save the network to a file.
Parameters: model (clubcpg_prelim.PReLIM) – The trained PReLIM model, located at PReLIM.model
Returns: Path to the saved model
class clubcpg.Imputation.Imputation(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)
The class providing convenient APIs to train models and impute from models using PReLIM.
__init__(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)
Parameters:
- cpg_density (int) – Number of CpGs this class instance will be used for
- bam_file (str) – path to the BAM file
Keyword Arguments:
- mbias_read1_5 – default None
- mbias_read1_3 – default None
- mbias_read2_5 – default None
- mbias_read2_3 – default None
- processes (int) – number of CPUs to use when parallelization can be utilized (default: -1, all available)
extract_matrices(coverage_data_frame: pandas.core.frame.DataFrame, sample_limit: int = None, return_bins=False)
Extract CpG matrices from a BAM file.
Parameters: coverage_data_frame (pd.DataFrame) – Output of clubcpg-coverage read in as a csv file
Keyword Arguments: return_bins (bool) – Return the bin location along with the matrix (default: False)
Returns: tuple of (bin, np.array) if return_bins is True, otherwise only the np.array
impute_from_model(models_folder: str, matrices: iter, postprocess=True)
Generator to provide imputed matrices on the fly.
Parameters:
- models_folder (str) – Path to the directory containing trained CpGNet models
- matrices (iter) – An iterable containing n x m matrices with n=cpgs and m=reads
Keyword Arguments: postprocess (bool) – Round imputed values to 1s and 0s (default: True)
static postprocess_predictions(predicted_matrix)
Takes an array of predicted values and rounds them to 0 or 1 if a threshold is exceeded.
Parameters: predicted_matrix – matrix generated by imputation
Returns: predicted matrix with predictions as 1, 0, or NaN
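The rounding step might look roughly like the sketch below. The thresholds of 0.2 and 0.8 are placeholders chosen for illustration, since the documentation does not state the actual cut-offs, and `round_confident` is a hypothetical name operating on nested lists rather than numpy arrays.

```python
import math


def round_confident(predicted, low=0.2, high=0.8):
    """Round values >= high to 1 and <= low to 0; leave the rest as NaN."""
    rounded = []
    for row in predicted:
        new_row = []
        for value in row:
            if value >= high:
                new_row.append(1.0)
            elif value <= low:
                new_row.append(0.0)
            else:
                # Not confident either way, so no call is made.
                new_row.append(math.nan)
        rounded.append(new_row)
    return rounded


result = round_confident([[0.95, 0.05, 0.5]])
```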
train_model(output_folder: str, matrices: iter)
Train a CpGNet model using TrainWithCpGNet.
Parameters:
- output_folder (str) – Folder to save trained models
- matrices (iter) – An iterable of CpG matrices, ideally obtained through Imputation.extract_matrices()
Returns: keras model – the trained CpGNet model
PReLIM APIs
class clubcpg_prelim.PReLIM.CpGBin(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)
Constructor for a bin.
__init__(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)
Parameters:
- matrix – numpy array, the bin’s CpG matrix.
- binStartInc – integer, the starting, inclusive, chromosomal index of the bin.
- binEndInc – integer, the ending, inclusive, chromosomal index of the bin.
- cpgPositions – array of integers, the chromosomal positions of the CpGs in the bin.
- sequence – string, nucleotide sequence (A,C,G,T)
- encoding – array, a reduced representation of the bin’s CpG matrix
- missingToken – integer, the token that represents missing data in the matrix.
- chromosome – string, the chromosome this bin resides in.
- binSize – integer, the number of base pairs this bin covers
- species – string, the species this bin belongs to.
- verbose – boolean, print warnings, set to “false” for no error checking and faster speed
- tag1 – anything, for custom use.
- tag2 – anything, for custom use.
class clubcpg_prelim.PReLIM.PReLIM(cpgDensity=2)
PReLIM imputation class to handle training and predicting from models.
fit(X_train, y_train, n_estimators=[10, 50, 100, 500, 1000], cores=-1, max_depths=[1, 5, 10, 20, 30], model_file=None, verbose=False)
Inputs:
1. X_train, numpy array, contains feature vectors.
2. y_train, numpy array, contains labels for training data.
3. n_estimators, list, the numbers of estimators to try during a grid search.
4. max_depths, list, the maximum depths of trees to try during a grid search.
5. cores, the number of cores to use during training, helpful for grid search.
6. model_file, string, the name of the file to save the model to. If None, a file name that includes a timestamp is created. If you don’t want to save a file, set this to “no”.
5-fold validation is built into the grid search.
Outputs: The trained model
Usage: model.fit(X_train, y_train)
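To give a sense of the cost of the default search, the sketch below enumerates the default grid from the signature above. It assumes an exhaustive grid search in which every combination is fitted once per fold; any final refit on the full training set is not counted.

```python
from itertools import product

n_estimators = [10, 50, 100, 500, 1000]
max_depths = [1, 5, 10, 20, 30]
folds = 5

# Every (n_estimators, max_depth) pair the default grid would visit.
grid = list(product(n_estimators, max_depths))
total_fits = len(grid) * folds
```

Under these assumptions the defaults amount to 25 parameter combinations and 125 model fits, so passing shorter lists is a simple way to reduce training time.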
impute(matrix)
Inputs:
1. matrix, a 2d np array, dtype=float, representing a CpG matrix, 1=methylated, 0=unmethylated, -1=unknown
Outputs:
1. A 2d numpy array with predicted probabilities of methylation
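The input encoding can be illustrated with plain nested lists. Both helpers below are hypothetical and stand in for numpy operations. The probabilities are made-up stand-ins for model output, and keeping known values unchanged while only filling the -1 entries is an assumption based on the description of impute_many.

```python
MISSING = -1.0  # token for an unknown methylation state


def encode_matrix(rows):
    """Convert rows of 1, 0, or None into floats, using -1 for unknown."""
    return [[MISSING if state is None else float(state) for state in row] for row in rows]


def fill_unknowns(matrix, probabilities):
    """Replace each -1 entry with the matching predicted probability."""
    return [
        [prob if value == MISSING else value for value, prob in zip(row, prob_row)]
        for row, prob_row in zip(matrix, probabilities)
    ]


matrix = encode_matrix([[1, None], [0, 0]])
filled = fill_unknowns(matrix, [[0.9, 0.7], [0.1, 0.2]])
```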
impute_many(matrices)
Imputes many matrices at the same time to help speed up imputation.
Inputs:
1. matrices: array-like (e.g. a list), where each element is a 2d np array, dtype=float, representing a CpG matrix, 1=methylated, 0=unmethylated, -1=unknown
Outputs:
1. A list of 2d numpy arrays with predicted probabilities of methylation for unknown values.
loadWeights(model_file)
Inputs:
1. model_file, string, name of file with a saved model
Outputs: None
Effects: self.model is loaded with the provided weights
predict(X)
Inputs:
1. X, numpy array, contains feature vectors
Outputs:
1. 1-d numpy array of predicted class labels
Usage: y_pred = CpGNet.predict(X)
predict_classes(X)
Inputs:
1. X, numpy array, contains feature vectors
Outputs:
1. 1-d numpy array of prediction values
Usage: y_pred = CpGNet.predict_classes(X)