API documentation
CluBCpG is primarily a set of command-line tools for analyzing WGBS data. However, the package also includes a few APIs that may be useful for users who want to interact with their data programmatically or extend CluBCpG's functionality.
CluBCpG APIs
- class clubcpg.ParseBam.BamFileReadParser(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)[source]
Used to simplify the opening and reading from BAM files. BAMs must be coordinate sorted and indexed.
- Example:
>>> from clubcpg.ParseBam import BamFileReadParser
>>> parser = BamFileReadParser("/path/to/data.BAM", quality_score=20, read1_5=3, read1_3=4, read2_5=7, read2_3=1)
>>> reads = parser.parse_reads("chr7", 10000, 101000)
>>> reads = parser.correct_cpg_positions(reads)  # This step is optional, but highly recommended
>>> matrix = parser.create_matrix(reads)
- __init__(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)[source]
Class used to read WGBSeq reads from a BAM file, extract methylation, and convert it into a data frame
- Parameters:
bamfile – Path to bam file location
quality_score – Only include reads >= this fastq quality
read1_5 – mbias ignore read1 5’
read1_3 – mbias ignore read1 3’
read2_5 – mbias ignore read2 5’
read2_3 – mbias ignore read2 3’
no_overlap – bool. If overlap exists between two reads, ignore that region from read 2.
- static correct_cpg_positions(output: list)[source]
For some reason, Bismark alignment produces instances where a CpG site location is incorrect by 1 bp, even after accounting for DNA strand alignment. This function fixes this. If two CpGs have positions such as 4 and 5 (which is impossible because there needs to be a G between them), this function will convert all 5s to 4s. This only needs to be applied to matrices which are empty after dropna() is called.
- Parameters:
output – a list of lists of tuples. The output of self.parse_reads()
- Returns:
list of the same style, except the first position in the tuple will have a corrected CpG position.
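The correction described above can be illustrated with a minimal pure-Python sketch. This is not CluBCpG's implementation: the function name `shift_adjacent_positions` is hypothetical, and the sketch assumes each tuple is `(position, methylation_state)`. It only shows the idea of collapsing a position that sits exactly 1 bp after another onto its neighbor.

```python
def shift_adjacent_positions(reads):
    """Collapse CpG positions that sit exactly 1 bp after a kept position.

    `reads` is assumed to be a list of lists of (position, state) tuples.
    """
    # Every distinct CpG position seen across all reads, in order
    positions = sorted({pos for read in reads for pos, _ in read})
    mapping = {}
    last_kept = None
    for pos in positions:
        if last_kept is not None and pos - last_kept == 1:
            # Impossible neighbor (no room for the G): reuse the earlier position
            mapping[pos] = last_kept
        else:
            mapping[pos] = pos
            last_kept = pos
    return [[(mapping[pos], state) for pos, state in read] for read in reads]


# Hypothetical input: the second read reports the first CpG at 5 instead of 4
reads = [[(4, 1), (10, 0)], [(5, 1), (10, 1)]]
corrected = shift_adjacent_positions(reads)
```

After the call, both reads report the first CpG at position 4, so they can share a column when a matrix is built.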
- create_matrix(read_cpgs)[source]
Convert parsed reads into a pandas dataframe.
- Parameters:
read_cpgs (iterable) – read CpGs generated by self.parse_reads
- Returns:
matrix of methylated (1) and unmethylated (0) states
- Return type:
pd.DataFrame
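As a rough illustration of the reads-to-matrix step, the sketch below uses plain Python lists in place of a pandas DataFrame. It assumes one row per read and one column per CpG position, with `None` standing in for NaN where a read does not cover a CpG. The helper name `reads_to_matrix` and the input layout are assumptions, not part of CluBCpG.

```python
def reads_to_matrix(read_cpgs):
    """Return (sorted_positions, rows); each row holds 1, 0, or None."""
    # Columns: every CpG position observed in any read
    positions = sorted({pos for read in read_cpgs for pos, _ in read})
    rows = []
    for read in read_cpgs:
        states = dict(read)  # position -> methylation state for this read
        rows.append([states.get(pos) for pos in positions])
    return positions, rows


# Hypothetical parsed reads as lists of (position, state) tuples
read_cpgs = [[(100, 1), (120, 0)], [(120, 1), (150, 1)]]
positions, rows = reads_to_matrix(read_cpgs)

# Equivalent in spirit to dropna(): keep only reads covering every CpG
complete_rows = [row for row in rows if None not in row]
```

Here neither read covers all three CpGs, so `complete_rows` is empty, which is the situation in which position correction is worth attempting.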
- fix_read_overlap(full_reads, read_cpgs)[source]
Takes pysam reads and read_cpgs generated during parse reads and removes any overlap between read1 and read2. If possible it also stitches read1 and read2 together to create a super read.
- Parameters:
full_reads – set of reads generated by self.parse_reads()
read_cpgs – read CpGs generated during self.parse_reads()
- Returns:
A list in the same format as read_cpgs input, but corrected for paired read overlap
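The overlap handling can be sketched as follows, assuming each mate of a pair has already been reduced to a list of `(position, state)` tuples. The function `stitch_read_pair` is hypothetical and works on plain tuples rather than pysam reads; it drops CpGs from read 2 that read 1 already covers, then joins the two mates into one "super read".

```python
def stitch_read_pair(read1_cpgs, read2_cpgs):
    """Merge two mates, ignoring read 2 wherever it overlaps read 1."""
    read1_positions = {pos for pos, _ in read1_cpgs}
    # Keep only the part of read 2 that read 1 does not already cover
    read2_unique = [(pos, state) for pos, state in read2_cpgs
                    if pos not in read1_positions]
    return sorted(list(read1_cpgs) + read2_unique)


# Hypothetical pair: both mates cover the CpG at 110 but disagree on its state
read1 = [(100, 1), (110, 0)]
read2 = [(110, 1), (125, 1)]
super_read = stitch_read_pair(read1, read2)
```

The call from read 1 wins at position 110, which mirrors the `no_overlap` behavior of ignoring the overlapping region from read 2.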
- class clubcpg.CalculateBinCoverage.CalculateCompleteBins(bam_file, bin_size, output_directory, number_of_processors=1, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, no_overlap=True)[source]
Class to calculate the number of reads covering all CpGs
- __init__(bam_file, bin_size, output_directory, number_of_processors=1, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, no_overlap=True)[source]
This class is initialized with a path to a bam file and a bin size
- Parameters:
bam_file – One of the BAM files for analysis to be performed
bin_size – Size of the bins for the analysis, integer
number_of_processors – How many CPUs to use for parallel computation, default=1
- analyze_bins(individual_chrom=None)[source]
Main function of the class. Runs the complete analysis on the data.
- Parameters:
individual_chrom – Chromosome to analyze, e.g. “chr7”
- Returns:
filename of the generated report
- calculate_bin_coverage(bin)[source]
Take a single bin, return a matrix. This is passed to a multiprocessing Pool.
- Parameters:
bin – Bin should be passed as “Chr19_4343343”
- Returns:
pd.DataFrame with rows containing NaNs dropped
- generate_bins_list(chromosome_len_dict)[source]
Get a dict of lists of all bins according to desired bin size for all chromosomes in the passed dict
- Parameters:
chromosome_len_dict – A dict of chromosome length sizes from get_chromosome_lenghts, cleaned up by remove_scaffolds() if desired
- Returns:
dict with each key being a chromosome. ex: chr1
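A minimal sketch of how such a dict of bins could be built is shown below. It assumes bins are labeled `"<chromosome>_<end coordinate>"`, matching the bin strings used elsewhere on this page, and that a trailing partial bin is dropped. Both choices, and the name `bins_for_chromosomes`, are assumptions rather than documented behavior.

```python
def bins_for_chromosomes(chromosome_lengths, bin_size):
    """Map each chromosome to a list of bin labels of the form 'chr_end'."""
    bins = {}
    for chrom, length in chromosome_lengths.items():
        # One label per full bin, named by the bin's end coordinate
        bins[chrom] = [f"{chrom}_{end}"
                       for end in range(bin_size, length + 1, bin_size)]
    return bins


# Hypothetical chromosome lengths, kept tiny for readability
bins = bins_for_chromosomes({"chr1": 350, "chrM": 90}, bin_size=100)
```

With these toy lengths, `chr1` yields three bins and `chrM` yields none because it is shorter than one bin.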
- class clubcpg.ClusterReads.ClusterReads(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, permute_labels=False)[source]
This class is used to take a dataframe or matrix of reads and cluster them
- Example:
>>> from clubcpg.ClusterReads import ClusterReads
>>> cluster = ClusterReads(bam_a="/path/to/file.bam", bam_b="/path/to/file.bam", bins_file="/path/to/file.csv", suffix="chr19")
>>> cluster.execute()
- __init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, permute_labels=False)[source]
- static attempt_cpg_position_correction(reads, parser: BamFileReadParser)[source]
Takes the reads and a parser object, attempts CpG position correction, and returns the corrected reads
- Parameters:
reads – parsed reads from BamFileReadParser
parser – an instance of the BamFileReadParser object
- Returns:
reads with CpG positions corrected
- execute(return_only=False)[source]
This method will start multiprocessing execution of this class.
- Parameters:
return_only (bool) – Whether to return the results as a variable (True) or write to file (False)
- Returns:
list of lists if return_only is True, otherwise None
- Return type:
list or None
- filter_data_frame(matrix: DataFrame)[source]
Takes a dataframe of clusters and removes any groups with fewer than self.cluster_member_min members
- Parameters:
matrix (pd.DataFrame) – dataframe of clustered reads
- Returns:
input matrix with some clusters removed
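The filtering rule can be sketched without pandas by working on a list of cluster labels, one per read. The helper `drop_small_clusters` is hypothetical and returns the indices of reads whose cluster is large enough; the real method operates on a DataFrame.

```python
from collections import Counter


def drop_small_clusters(labels, cluster_member_min=4):
    """Return indices of reads whose cluster has enough members."""
    counts = Counter(labels)  # cluster label -> number of reads
    return [i for i, label in enumerate(labels)
            if counts[label] >= cluster_member_min]


# Hypothetical cluster assignments for eight reads
labels = [0, 0, 0, 0, 1, 1, 0, 2]
kept = drop_small_clusters(labels, cluster_member_min=4)
```

Only cluster 0 has at least four members here, so the reads in clusters 1 and 2 are removed.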
- generate_individual_matrix_data(filtered_matrix, chromosome, bin_loc)[source]
Takes the output of process_bins() and converts it into a list of lines of text data for output
- Parameters:
filtered_matrix (pd.DataFrame) – dataframe returned by ClusterReads.filter_data_frame()
chromosome (string) – chromosome as “Chr5”
bin_loc (string) – location representing the bin given as the end coordinate, e.g. 590000
- Returns:
comma separated lines extracted from the filtered matrix, containing chromosome and bin info
- Return type:
list
- process_bins(bin)[source]
This is the main method and should be called using Pool.map. It takes one bin location and uses the other helper functions to get the reads, form the matrix, cluster it with DBSCAN, and output the cluster data as text lines ready for writing to a file.
- Parameters:
bin – string in this format: “chr19_55555”
- Returns:
a list of lines representing the cluster data from that bin
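A bin string such as “chr19_55555” has to be split into a chromosome and a coordinate before reads can be fetched. The sketch below shows one way to do that, assuming the number is the bin's end coordinate and the bin spans `bin_size` bases before it. The helper `parse_bin_id` is hypothetical and is not taken from the CluBCpG source.

```python
def parse_bin_id(bin_id, bin_size=100):
    """Split 'chr19_55555' into (chromosome, start, end)."""
    # rsplit keeps chromosome names that themselves contain underscores intact
    chromosome, end = bin_id.rsplit("_", 1)
    end = int(end)
    return chromosome, end - bin_size, end


chromosome, start, end = parse_bin_id("chr19_55555", bin_size=100)
```

Splitting from the right matters for scaffold-style names, where the chromosome part may contain underscores of its own.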
- class clubcpg.ClusterReads.ClusterReadsWithImputation(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)[source]
This class is used to perform the same clustering, but also enables imputation during clustering. This inherits from ClusterReads
- Example:
>>> from clubcpg.ClusterReads import ClusterReadsWithImputation
>>> cluster = ClusterReadsWithImputation(...)
>>> cluster.execute()
- __init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)[source]
- execute(return_only=False)[source]
This method will start multiprocessing execution of this class.
- Parameters:
return_only (bool) – Whether to return the results as a variable (True) or write to file (False)
- Returns:
list of lists if return_only is True, otherwise None
- Return type:
list or None
- class clubcpg.ConnectToCpGNet.TrainWithPReLIM(cpg_density=None, save_path=None)[source]
Used to train models using CpGNet
- __init__(cpg_density=None, save_path=None)[source]
Class to train a CpGNet model from input data
- Parameters:
cpg_density (int) – Number of CpGs
save_path – Location of folder to save the resulting model files. One per cpg density
- save_net(model)[source]
Save the network to a file
- Parameters:
model (clubcpg_prelim.PReLIM) – The trained PReLIM model. Located at PReLIM.model
- Returns:
Path to the saved model
- class clubcpg.Imputation.Imputation(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)[source]
The class providing convenient APIs to train models and impute from models using PReLIM
- __init__(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)[source]
- Parameters:
cpg_density (int) – Number of CpGs this class instance will be used for
bam_file (str) – path to the bam file
- Keyword Arguments:
mbias_read1_5 – (default: None)
mbias_read1_3 – (default: None)
mbias_read2_5 – (default: None)
mbias_read2_3 – (default: None)
processes (int) – number of CPUs to use when parallelization can be utilized, default = all available (default: -1)
- extract_matrices(coverage_data_frame: DataFrame, sample_limit: int | None = None, return_bins=False)[source]
Extract CpG matrices from bam file.
- Parameters:
coverage_data_frame (pd.DataFrame) – Output of clubcpg-coverage read in as a csv file
- Keyword Arguments:
return_bins (bool) – Return the bin location along with the matrix (default: False)
- Returns:
tuple – Returns tuple of (bin, np.array) if return_bins = True, else returns only np.array
- impute_from_model(models_folder: str, matrices: iter, postprocess=True)[source]
Generator to provide imputed matrices on-the-fly
- Parameters:
models_folder (str) – Path to directory containing trained CpGNet models
matrices (iter) – An iterable containing n x m matrices with n=cpgs and m=reads
- Keyword Arguments:
postprocess (bool) – Round imputed values to 1s and 0s (default: True)
- static postprocess_predictions(predicted_matrix)[source]
Takes array with predicted values and rounds them to 0 or 1 if threshold is exceeded
- Parameters:
predicted_matrix – matrix generated by imputation
- Returns:
predicted matrix with predictions given as 1, 0, or NaN
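The rounding step can be sketched as below. The documentation does not state the thresholds, so the `low=0.2` and `high=0.8` cut-offs, the function name `round_predictions`, and the use of nested lists instead of a numpy array are all assumptions made for illustration.

```python
import math


def round_predictions(predicted, low=0.2, high=0.8):
    """Round confident probabilities to 0 or 1; mark the rest as NaN."""
    def decide(p):
        if p >= high:
            return 1          # confidently methylated
        if p <= low:
            return 0          # confidently unmethylated
        return math.nan       # too uncertain to call

    return [[decide(p) for p in row] for row in predicted]


# Hypothetical predicted probabilities for a 2 x 2 matrix
result = round_predictions([[0.95, 0.5], [0.05, 0.81]])
```

Values between the two hypothetical thresholds stay as NaN, so downstream code can still drop or ignore uncertain calls.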
- train_model(output_folder: str, matrices: iter)[source]
Train a CpGNet model using TrainWithCpGNet
- Parameters:
output_folder (str) – Folder to save trained models
matrices (iter) – An iterable of CpGMatrices, ideally obtained through Imputation.extract_matrices()
- Returns:
keras model – Returns the trained CpGNet model
PReLIM APIs
- class clubcpg_prelim.PReLIM.CpGBin(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)[source]
Constructor for a bin
- __init__(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)[source]
- Parameters:
matrix – numpy array, the bin’s CpG matrix.
binStartInc – integer, the starting, inclusive, chromosomal index of the bin.
binEndInc – integer, the ending, inclusive, chromosomal index of the bin.
cpgPositions – array of integers, the chromosomal positions of the CpGs in the bin.
sequence – string, nucleotide sequence (A,C,G,T)
encoding – array, a reduced representation of the bin’s CpG matrix
missingToken – integer, the token that represents missing data in the matrix.
chromosome – string, the chromosome this bin resides in.
binSize – integer, the number of base pairs this bin covers
species – string, the species this bin belongs to.
verbose – boolean, print warnings; set to False to skip error checking and run faster
tag1 – anything, for custom use.
tag2 – anything, for custom use.
- class clubcpg_prelim.PReLIM.PReLIM(cpgDensity=2)[source]
PReLIM imputation class to handle training and predicting from models.
- fit(X_train, y_train, n_estimators=[10, 50, 100, 500, 1000], cores=-1, max_depths=[1, 5, 10, 20, 30], model_file=None, verbose=False)[source]
Inputs:
1. X_train, numpy array, contains feature vectors.
2. y_train, numpy array, contains labels for training data.
3. n_estimators, list, the numbers of estimators to try during a grid search.
4. max_depths, list, the maximum depths of trees to try during a grid search.
5. cores, the number of cores to use during training, helpful for grid search.
6. model_file, string, the name of the file to save the model to. If None, then create a file name that includes a timestamp. If you don’t want to save a file, set this to “no”.
5-fold validation is built into the grid search
Outputs: The trained model
Usage: model.fit(X_train, y_train)
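To get a feel for the cost of the default search, the sketch below enumerates the default grid and counts model fits. It assumes an exhaustive grid search in which every combination is fitted once per fold; any final refit on the full training set would come on top of this count.

```python
from itertools import product

# Default candidate values from the fit() signature
n_estimators = [10, 50, 100, 500, 1000]
max_depths = [1, 5, 10, 20, 30]

# Every (n_estimators, max_depth) pair an exhaustive search would try
grid = list(product(n_estimators, max_depths))

folds = 5  # 5-fold validation is built into the grid search
total_fits = len(grid) * folds
```

Under these assumptions the defaults give 25 combinations and 125 fits, so passing shorter lists is an easy way to shorten training.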
- impute(matrix)[source]
Inputs: 1. matrix, a 2d np array, dtype=float, representing a CpG matrix, 1=methylated, 0=unmethylated, -1=unknown
Outputs: 1. A 2d numpy array with predicted probabilities of methylation
- impute_many(matrices)[source]
Imputes many matrices at once to speed up imputation.
Inputs:
1. matrices: array-like (i.e. list), where each element is a 2d np array, dtype=float, representing a CpG matrix, 1=methylated, 0=unmethylated, -1=unknown
Outputs:
A List of 2d numpy arrays with predicted probabilities of methylation for unknown values.
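The relationship between an input matrix and its imputed output can be sketched as follows, assuming that observed entries are kept and only entries equal to the missing token (-1) are replaced by predicted probabilities. The helper `fill_unknown` and the hand-written probabilities are illustrative only and do not call PReLIM.

```python
MISSING = -1  # token for unknown methylation state, as described above


def fill_unknown(matrix, probabilities):
    """Replace unknown entries with the matching predicted probability."""
    return [
        [prob if value == MISSING else value
         for value, prob in zip(row, prob_row)]
        for row, prob_row in zip(matrix, probabilities)
    ]


# Hypothetical CpG matrix and made-up probabilities of the same shape
matrix = [[1.0, -1.0], [-1.0, 0.0]]
probabilities = [[0.9, 0.7], [0.2, 0.1]]
filled = fill_unknown(matrix, probabilities)
```

Only the two unknown cells change; the observed 1.0 and 0.0 values pass through untouched.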
- loadWeights(model_file)[source]
Inputs: 1. model_file, string, name of file with a saved model
Outputs: None
Effects: self.model is loaded with the provided weights
- predict(X)[source]
Inputs: 1. X, numpy array, contains feature vectors
Outputs: 1. 1-d numpy array of predicted class labels
Usage: y_pred = CpGNet.predict(X)
- predict_classes(X)[source]
Inputs: 1. X, numpy array, contains feature vectors
Outputs: 1. 1-d numpy array of prediction values
Usage: y_pred = CpGNet.predict_classes(X)