API documentation

CluBCpG was built to be, primarily, a command-line based set of tools for the analysis of WGBS data. However, the package does include a few APIs which may be useful for users wanting to interact with their data more programmatically or extend the functions of CluBCpG.

CluBCpG APIs

class clubcpg.ParseBam.BamFileReadParser(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)[source]

Used to simplify the opening and reading from BAM files. BAMs must be coordinate sorted and indexed.

Example:

>>> from clubcpg.ParseBam import BamFileReadParser
>>> parser = BamFileReadParser("/path/to/data.BAM", quality_score=20, read1_5=3, read1_3=4, read2_5=7, read2_3=1)
>>> reads = parser.parse_reads("chr7", 10000, 101000)
>>> reads = parser.correct_cpg_positions(reads) # This step is optional, but highly recommended
>>> matrix = parser.create_matrix(reads)

__init__(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)[source]

Class used to read WGBS reads from a BAM file, extract methylation calls, and convert them into a data frame.

Parameters:

bamfile – Path to bam file location
quality_score – Only include reads >= this fastq quality
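read1_5 – M-bias: number of bases to ignore at the 5’ end of read 1
read1_3 – M-bias: number of bases to ignore at the 3’ end of read 1
read2_5 – M-bias: number of bases to ignore at the 5’ end of read 2
read2_3 – M-bias: number of bases to ignore at the 3’ end of read 2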
no_overlap – bool. If overlap exists between two reads, ignore that region from read 2.
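The four M-bias parameters can be pictured as trimming windows at the two ends of each read. The sketch below is only an illustration of that idea, not CluBCpG's implementation: the helper name `keep_call` and the use of 0-based offsets within the read are assumptions.

```python
def keep_call(offset, read_length, ignore_5=None, ignore_3=None):
    """Return True if a methylation call at this 0-based read offset survives trimming.

    ignore_5 / ignore_3 are the number of bases skipped at the 5' and 3' ends
    (None means no trimming), playing the role of read1_5 / read1_3 for read 1.
    """
    ignore_5 = ignore_5 or 0
    ignore_3 = ignore_3 or 0
    return ignore_5 <= offset < read_length - ignore_3


# A hypothetical 10 bp read with 3 bases ignored at the 5' end and 4 at the 3' end
kept = [i for i in range(10) if keep_call(i, 10, ignore_5=3, ignore_3=4)]
```

Only the offsets in the middle of the read remain in `kept`; calls near either end are discarded before the matrix is built.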

static correct_cpg_positions(output: list)[source]

For some reason, Bismark alignment produces instances where a CpG site location is incorrect by 1 bp, even after accounting for DNA strand alignment. This function fixes this. If two CpGs have positions such as 4 and 5 (which is impossible because there needs to be a G between them), this function will convert all 5s to 4s. This only needs to be applied to matrices which are empty after dropna() is called.

Parameters:: output – a list of lists of tuples. The output of self.parse_reads()
Returns:: list of the same style, except the first position in the tuple will have a corrected CpG position.
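The correction described above can be sketched in a few lines. This is a hypothetical re-implementation for illustration, not the library's code: the function name `merge_adjacent_positions` and the assumption that each tuple is `(position, call)` are not taken from the source.

```python
def merge_adjacent_positions(reads):
    """Shift CpG positions that sit 1 bp after another CpG onto the lower position."""
    positions = sorted({pos for read in reads for pos, _ in read})
    kept = set()
    remap = {}
    for pos in positions:
        if pos - 1 in kept:
            remap[pos] = pos - 1  # e.g. every 5 becomes 4
        else:
            kept.add(pos)
    # Rebuild the list of lists, rewriting only the position in each tuple
    return [[(remap.get(pos, pos), call) for pos, call in read] for read in reads]


# Two hypothetical reads that disagree by 1 bp on the first CpG
reads = [[(4, "Z"), (10, "z")], [(5, "Z"), (10, "Z")]]
corrected = merge_adjacent_positions(reads)
```

After the call, both reads report the first CpG at position 4, so the two reads share a column when they are turned into a matrix.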

create_matrix(read_cpgs)[source]

Convert parsed reads into a pandas DataFrame.

Parameters:: read_cpgs (iterable) – read CpGs generated by self.parse_reads
Returns:: matrix of methylated (1) and unmethylated (0) states
Return type:: pd.DataFrame
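To make the shape of the result concrete, here is a dependency-free sketch of the same idea. It is not the library's implementation. It assumes that each read is a list of `(position, call)` tuples, that rows are reads and columns are CpG positions, and that calls use Bismark's "Z" (methylated CpG) and "z" (unmethylated CpG) characters; `reads_to_rows` is a hypothetical name.

```python
def reads_to_rows(read_cpgs):
    """Build (columns, rows): one row per read, one column per CpG position."""
    columns = sorted({pos for read in read_cpgs for pos, _ in read})
    state = {"Z": 1, "z": 0}
    rows = []
    for read in read_cpgs:
        calls = dict(read)
        # None marks a CpG that this read does not cover
        rows.append([state.get(calls.get(pos)) for pos in columns])
    return columns, rows


columns, rows = reads_to_rows([[(4, "Z"), (10, "z")], [(4, "z")]])
```

A DataFrame built from `rows` with `columns` as its column labels would then lose the second read when dropna() is called, because that read does not cover position 10.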

fix_read_overlap(full_reads, read_cpgs)[source]

Takes pysam reads and read_cpgs generated during parse reads and removes any overlap between read1 and read2. If possible it also stitches read1 and read2 together to create a super read.

Parameters:

full_reads – set of reads generated by self.parse_reads()
read_cpgs – read CpGs generated during self.parse_reads()

Returns:

A list in the same format as read_cpgs input, but corrected for paired read overlap
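As a rough picture of the overlap handling, the sketch below stitches one read pair at the level of CpG calls. It is an assumption-laden simplification: the real method works on pysam reads, whereas this hypothetical `drop_read2_overlap` only sees lists of `(position, call)` tuples and treats a shared CpG position as the overlap.

```python
def drop_read2_overlap(read1_cpgs, read2_cpgs):
    """Stitch a read pair, keeping read 1's call wherever both mates cover a CpG."""
    seen = {pos for pos, _ in read1_cpgs}
    unique_to_read2 = [(pos, call) for pos, call in read2_cpgs if pos not in seen]
    # Sorting by position yields a single "super read" spanning both mates
    return sorted(read1_cpgs + unique_to_read2)


read1 = [(100, "Z"), (120, "z")]
read2 = [(120, "Z"), (150, "z")]
super_read = drop_read2_overlap(read1, read2)
```

Position 120 is covered by both mates, so only read 1's call is kept there, which matches the behaviour described for no_overlap=True.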

parse_reads(chromosome: str, start: int, stop: int)[source]

Parameters:

chromosome – chromosome as “chr6”
start – start coordinate
stop – end coordinate

Returns:

List of reads and their positional tags as assigned by Bismark

class clubcpg.CalculateBinCoverage.CalculateCompleteBins(bam_file, bin_size, output_directory, number_of_processors=1, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, no_overlap=True)[source]

Class to calculate the number of reads covering all CpGs

__init__(bam_file, bin_size, output_directory, number_of_processors=1, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, no_overlap=True)[source]

This class is initialized with a path to a bam file and a bin size

Parameters:

bam_file – One of the BAM files for analysis to be performed
bin_size – Size of the bins for the analysis, integer
number_of_processors – How many CPUs to use for parallel computation, default=1

analyze_bins(individual_chrom=None)[source]

Main function in class. Run the complete analysis on the data.

Parameters:: individual_chrom – Chromosome to analyze, e.g. “chr7”
Returns:: filename of the generated report

calculate_bin_coverage(bin)[source]

Take a single bin, return a matrix. This is passed to a multiprocessing Pool.

Parameters:: bin – Bin should be passed as “Chr19_4343343”
Returns:: pd.DataFrame with rows containing NaNs dropped

generate_bins_list(chromosome_len_dict)[source]

Get a dict of lists of all bins according to desired bin size for all chromosomes in the passed dict

Parameters:: chromosome_len_dict – A dict of chromosome lengths from get_chromosome_lengths(), cleaned up by remove_scaffolds() if desired
Returns:: dict with each key being a chromosome. ex: chr1
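A minimal sketch of the binning step follows. It assumes bins are labelled by chromosome and end coordinate (as in “chr19_55555”) and that a trailing partial bin is dropped; both the helper name `make_bins` and the handling of the last bin are assumptions, not documented behaviour.

```python
def make_bins(chromosome_len_dict, bin_size=100):
    """Return {chromosome: [bin labels]} with one label per full bin."""
    return {
        chrom: [f"{chrom}_{end}" for end in range(bin_size, length + 1, bin_size)]
        for chrom, length in chromosome_len_dict.items()
    }


# A hypothetical 350 bp chromosome split into 100 bp bins
bins = make_bins({"chr1": 350}, bin_size=100)
```

Each key of `bins` is a chromosome name, and each value is the list of bin labels that can be handed to the per-bin methods.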

get_chromosome_lengths()[source]

Get dictionary containing lengths of the chromosomes. Uses bam file for reference

Returns:: Dictionary of chromosome lengths, ex: {“chrX”: 222222}

static remove_scaffolds(chromosome_len_dict)[source]

Return a dict containing only the standard chromosomes starting with “chr”

Parameters:: chromosome_len_dict – A dict generated by get_chromosome_lengths()
Returns:: a dict containing only chromosomes starting with “chr”
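The filtering rule stated above amounts to a prefix check on the dictionary keys. The following sketch uses a hypothetical helper name, `keep_chr_only`, and made-up lengths; the actual method may apply additional rules.

```python
def keep_chr_only(chromosome_len_dict):
    """Keep only entries whose name starts with 'chr'."""
    return {
        name: length
        for name, length in chromosome_len_dict.items()
        if name.startswith("chr")
    }


lengths = {"chr1": 1000, "GL000008.2": 50, "chrX": 222222}
standard = keep_chr_only(lengths)
```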

class clubcpg.ClusterReads.ClusterReads(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, permute_labels=False)[source]

This class is used to take a dataframe or matrix of reads and cluster them

Example:

>>> from clubcpg.ClusterReads import ClusterReads
>>> cluster = ClusterReads(bam_a="/path/to/file.bam", bam_b="/path/to/file.bam", bins_file="/path/to/file.csv", suffix="chr19")
>>> cluster.execute()

__init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, permute_labels=False)[source]

static attempt_cpg_position_correction(reads, parser: BamFileReadParser)[source]

Take the reads and a parser object, attempt CpG position correction, and return the corrected reads.

Parameters:

reads – parsed reads from BamFileReadParser
parser – an instance of the BamFileReadParser object

Returns:

reads with CpG positions corrected

execute(return_only=False)[source]

This method will start multiprocessing execution of this class.

Parameters:: return_only (bool) – Whether to return the results as a variable (True) or write them to file (False)
Returns:: list of lists if return_only is True, otherwise None
Return type:: list or None

filter_data_frame(matrix: DataFrame)[source]

Takes a dataframe of clusters and removes any groups with fewer than self.cluster_member_min members

Parameters:: matrix – dataframe of clustered reads
Type:: pd.DataFrame
Returns:: input matrix with some clusters removed
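The size filter can be illustrated on a plain list of cluster labels. This is a sketch under assumptions: the real method operates on a pd.DataFrame, and the helper name `filter_small_clusters` and the example labels are hypothetical.

```python
from collections import Counter


def filter_small_clusters(labels, cluster_member_min=4):
    """Return the indices of reads whose cluster has at least cluster_member_min members."""
    sizes = Counter(labels)
    return [i for i, label in enumerate(labels) if sizes[label] >= cluster_member_min]


# Seven reads: cluster 0 has four members, cluster 1 has two, one read is labelled -1
labels = [0, 0, 0, 0, 1, 1, -1]
kept = filter_small_clusters(labels)
```

Only the four reads in cluster 0 survive with the default minimum of 4, which mirrors the cluster_member_min=4 default in the class signature.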

generate_individual_matrix_data(filtered_matrix, chromosome, bin_loc)[source]

Takes the output of process_bins() and converts it into a list of lines of text data for output

Parameters:

filtered_matrix (pd.DataFrame) – dataframe returned by ClusterReads.filter_data_frame()
chromosome (string) – chromosome as “Chr5”
bin_loc (string) – location representing the bin given as the end coordinate, e.g. 590000

Returns:

comma separated lines extracted from the filtered matrix, containing chromosome and bin info

Return type:

list

process_bins(bin)[source]

This is the main method and should be called using Pool.map. It takes one bin location and uses the other helper functions to get the reads, form the matrix, cluster it with DBSCAN, and output the cluster data as text lines ready for writing to a file.

Parameters:: bin – string in this format: “chr19_55555”
Returns:: a list of lines representing the cluster data from that bin
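Before any reads can be fetched, a bin string has to be split into a chromosome and a coordinate window. The sketch below shows one way to do that. It assumes the number after the last underscore is the bin's end coordinate and that the start is the end minus the bin size; `parse_bin` is a hypothetical helper, not part of the documented API.

```python
def parse_bin(bin_label, bin_size=100):
    """Split 'chr19_55555' into (chromosome, start, stop)."""
    # rsplit keeps chromosome names that themselves contain underscores intact
    chromosome, end = bin_label.rsplit("_", 1)
    stop = int(end)
    return chromosome, stop - bin_size, stop


chromosome, start, stop = parse_bin("chr19_55555", bin_size=100)
```

The resulting triple has the same shape as the arguments of BamFileReadParser.parse_reads(chromosome, start, stop).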

class clubcpg.ClusterReads.ClusterReadsWithImputation(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)[source]

This class performs the same clustering as ClusterReads, but also enables imputation during clustering. It inherits from ClusterReads.

Example:

>>> from clubcpg.ClusterReads import ClusterReadsWithImputation
>>> cluster = ClusterReadsWithImputation(...)
>>> cluster.execute()

__init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)[source]

execute(return_only=False)[source]

This method will start multiprocessing execution of this class.

Parameters:: return_only (bool) – Whether to return the results as a variable (True) or write them to file (False)
Returns:: list of lists if return_only is True, otherwise None
Return type:: list or None

class clubcpg.ConnectToCpGNet.TrainWithPReLIM(cpg_density=None, save_path=None)[source]

Used to train models using CpGNet

__init__(cpg_density=None, save_path=None)[source]

Class to train a CpGNet model from input data

Parameters:

cpg_density (int) – Number of CpGs
save_path – Location of folder to save the resulting model files. One per cpg density

save_net(model)[source]

Save the network to a file

Parameters:: model (clubcpg_prelim.PReLIM) – The trained PReLIM model. Located at PReLIM.model
Returns:: Path to the saved model

train_model(bins: iter)[source]

Train the CpGNet model on a list of provided bins

Parameters:: bins – iterable containing CpG matrices of 1 (methylated), 0 (unmethylated), and -1 (unknown)
Returns:: Path to the saved model file

class clubcpg.Imputation.Imputation(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)[source]

The class providing convenient APIs to train models and impute from models using PReLIM

__init__(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)[source]

Parameters:

cpg_density (int) – Number of CpGs this class instance will be used for
bam_file (str) – path to the bam file

Keyword Arguments:

mbias_read1_5 – (default: None)
mbias_read1_3 – (default: None)
mbias_read2_5 – (default: None)
mbias_read2_3 – (default: None)
processes (int) – number of CPUs to use when parallelization can be utilized (default: -1, all available)

extract_matrices(coverage_data_frame: DataFrame, sample_limit: int | None = None, return_bins=False)[source]

Extract CpG matrices from bam file.

Parameters:: coverage_data_frame (pd.DataFrame) – Output of clubcpg-coverage read in as a csv file
Keyword Arguments:: return_bins (bool) – Return the bin location along with the matrix (default: False)
Returns:: tuple of (bin, np.array) if return_bins is True, otherwise only the np.array

impute_from_model(models_folder: str, matrices: iter, postprocess=True)[source]

Generator to provide imputed matrices on-the-fly

Parameters:

models_folder (str) – Path to directory containing trained CpGNet models
matrices (iter) – An iterable containing n x m matrices with n=cpgs and m=reads

Keyword Arguments:

postprocess (bool) – Round imputed values to 1s and 0s (default: True)

static postprocess_predictions(predicted_matrix)[source]

Takes array with predicted values and rounds them to 0 or 1 if threshold is exceeded

Parameters:: predicted_matrix – matrix generated by imputation
Returns:: predicted matrix with predictions as 1, 0, or NaN
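The rounding step can be sketched as a two-sided threshold. The cut-offs used here (0.2 and 0.8) and the helper name `round_predictions` are assumptions chosen for illustration; the source does not state the thresholds the library applies.

```python
import math


def round_predictions(predicted_matrix, lower=0.2, upper=0.8):
    """Round confident predictions to 0.0 or 1.0 and mark the rest as NaN."""
    def round_one(p):
        if p >= upper:
            return 1.0
        if p <= lower:
            return 0.0
        return math.nan  # not confident enough either way

    return [[round_one(p) for p in row] for row in predicted_matrix]


rounded = round_predictions([[0.95, 0.5], [0.1, 0.8]])
```

Values between the two cut-offs stay undetermined, which is why the documented return value can contain NaN alongside 1 and 0.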

train_model(output_folder: str, matrices: iter)[source]

Train a CpGNet model using TrainWithPReLIM

Parameters:

output_folder (str) – Folder to save trained models
matrices (iter) – An iterable of CpG matrices, ideally obtained through Imputation.extract_matrices()

Returns:

keras model – The trained CpGNet model

PReLIM APIs

class clubcpg_prelim.PReLIM.CpGBin(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)[source]

Constructor for a bin

__init__(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)[source]

Parameters:

matrix – numpy array, the bin’s CpG matrix.
binStartInc – integer, the starting, inclusive, chromosomal index of the bin.
binEndInc – integer, the ending, inclusive, chromosomal index of the bin.
cpgPositions – array of integers, the chromosomal positions of the CpGs in the bin.
sequence – string, nucleotide sequence (A,C,G,T)
encoding – array, a reduced representation of the bin’s CpG matrix
missingToken – integer, the token that represents missing data in the matrix.
chromosome – string, the chromosome this bin resides in.
binSize – integer, the number of base pairs this bin covers
species – string, the species this bin belongs to.
verbose – boolean, print warnings, set to “false” for no error checking and faster speed
tag1 – anything, for custom use.
tag2 – anything, for custom use.

class clubcpg_prelim.PReLIM.PReLIM(cpgDensity=2)[source]

PReLIM imputation class to handle training and predicting from models.

__init__(cpgDensity=2)[source]

fit(X_train, y_train, n_estimators=[10, 50, 100, 500, 1000], cores=-1, max_depths=[1, 5, 10, 20, 30], model_file=None, verbose=False)[source]

Inputs:

1. X_train, numpy array, contains feature vectors.
2. y_train, numpy array, contains labels for training data.
3. n_estimators, list, the number of estimators to try during a grid search.
4. max_depths, list, the maximum depths of trees to try during a grid search.
5. cores, the number of cores to use during training, helpful for grid search.
6. model_file, string, the name of the file to save the model to. If None, then create a file name that includes a timestamp. If you don’t want to save a file, set this to “no”.

5-fold validation is built into the grid search

Outputs: The trained model

Usage: model.fit(X_train, y_train)
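To get a feel for the cost of the default grid search, the snippet below enumerates the parameter combinations from the signature and multiplies by the five validation folds. It only counts cross-validation fits; whether an additional final refit happens is not stated in the source and is left out.

```python
from itertools import product

# Default search values taken from the fit() signature
n_estimators = [10, 50, 100, 500, 1000]
max_depths = [1, 5, 10, 20, 30]
folds = 5  # 5-fold validation is built into the grid search

grid = list(product(n_estimators, max_depths))
total_fits = len(grid) * folds
```

Passing shorter lists for n_estimators and max_depths shrinks the grid, and with it the training time, proportionally.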

impute(matrix)[source]

Inputs: 1. matrix, a 2d np array, dtype=float, representing a CpG matrix, 1=methylated, 0=unmethylated, -1=unknown

Outputs: 1. A 2d numpy array with predicted probabilities of methylation
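Matrices that mark missing calls with None or NaN need to be re-encoded before they match the 1/0/-1 convention described above. The helper below is a hypothetical sketch of that conversion using nested lists; the name `to_prelim_encoding` is an assumption, and the result would still need to be wrapped in a 2d numpy array of dtype float before being passed to impute().

```python
import math


def to_prelim_encoding(rows, missing_token=-1.0):
    """Replace None/NaN with the missing token and cast known states to float."""
    def encode(value):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return missing_token
        return float(value)

    return [[encode(value) for value in row] for row in rows]


encoded = to_prelim_encoding([[1, None], [0, 1], [math.nan, 0]])
```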

impute_many(matrices)[source]

Imputes a bunch of matrices at the same time to help speed up imputation time.

Inputs:

1. matrices: array-like (i.e. list), where each element is a 2d np array, dtype=float, representing a CpG matrix, 1=methylated, 0=unmethylated, -1=unknown

Outputs:

A list of 2d numpy arrays with predicted probabilities of methylation for unknown values.

loadWeights(model_file)[source]

Inputs: 1. model_file, string, name of file with a saved model

Outputs: None

Effects: self.model is loaded with the provided weights

predict(X)[source]

Inputs: 1. X, numpy array, contains feature vectors

Outputs: 1. 1-d numpy array of predicted class labels

Usage: y_pred = CpGNet.predict(X)

predict_classes(X)[source]

Inputs: 1. X, numpy array, contains feature vectors

Outputs: 1. 1-d numpy array of prediction values

Usage: y_pred = CpGNet.predict_classes(X)

predict_proba(X)[source]

Inputs: 1. X, numpy array, contains feature vectors

Outputs: 1. 1-d numpy array of class predictions

Usage: y_pred = CpGNet.predict_proba(X)

train(bin_matrices, model_file='no', verbose=False)[source]

bin_matrices: list of CpG matrices

model_file: string, the name of the file to save the model to. If None, then create a file name that includes a timestamp. If you don’t want to save a file, set this to “no”.