4. Using CluBCpG with PReLIM

4.1. Introduction

PReLIM exists as its own stand-alone package, but for simplicity and compatibility, a version of PReLIM comes bundled with CluBCpG.

Imputation with PReLIM is handled behind the scenes, and CluBCpG includes three command line scripts for performing analysis with imputation.

Usage is almost identical to the standard CluBCpG pipeline, but includes one extra step and requires a couple of extra command line flags.

The documentation included in this section mostly highlights the differences and additional steps needed to run CluBCpG with PReLIM imputation.

Note

It is highly recommended that you have read Using CluBCpG first.

The command line tools provided are clubcpg-impute-coverage, clubcpg-impute-train, and clubcpg-impute-cluster.

4.2. Typical imputation workflow

4.2.1. Calculate bin coverage

Use clubcpg-coverage to calculate bin coverage as in the standard workflow (see Typical workflow).

4.2.2. Filter outputs

Filter these csv outputs for >= 1 read and >= 2 CpGs. The following one-liner will filter them correctly:

cat CompleteBins.yourfilename.chr19.csv | awk -F "," '$2>=1 && $3>=2' > CompleteBins.yourfilename.chr19.filtered.csv

But as before, you can filter this with any other method you like.

Note

PReLIM requires at least 1 read fully covering all CpGs in a bin.

Note

Unlike the typical workflow, if you intend to analyze two BAM files with imputation, you should perform this process on both BAM files. This is because you will train a separate imputation model on each BAM file, which improves the accuracy of the imputations.

4.2.3. Train a PReLIM model

Now clubcpg-impute-train can be used to automatically train multiple imputation models using PReLIM. For each BAM file, a random set of bins will be selected from the filtered coverage file provided.

The number of randomly sampled bins can be set using the -l flag.

Warning

Increasing the number of bins will drastically increase the run time of this process and can consume a lot of memory, potentially crashing PReLIM. In our testing, we found no improvement in accuracy above 10,000 bins.

Models for bins containing 2-5 CpGs will be trained automatically and saved in the folder designated by -o. Do NOT rename them.
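Putting the flags together, a training invocation might look like the following (the file and folder names are illustrative, not required names):

```
clubcpg-impute-train -a sample1.bam \
    -c CompleteBins.sample1.chr19.filtered.csv \
    -o trained_models/ -n 8 -l 10000
```

Repeat this for your second BAM file, saving its models to a separate folder, so each input has its own set of trained models.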

4.2.4. Compute coverage gains

Now, with the trained models saved, the post-imputation coverage can be calculated using clubcpg-impute-coverage. This functions almost identically to clubcpg-coverage, except:

  • you need to provide the folder containing the saved models using the -m flag.

  • You also need to provide the coverage file filtered for >= 1 read with the -c flag.

    • This accelerates the process by skipping bins which cannot be imputed due to lack of coverage.

This will create an output file identical in format to that of clubcpg-coverage, except the values represent the number of reads post-imputation.
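Continuing the example above, a coverage-gain invocation might look like this (names are illustrative):

```
clubcpg-impute-coverage -a sample1.bam \
    -c CompleteBins.sample1.chr19.filtered.csv \
    -m trained_models/ -o imputed_coverage/ -n 8
```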

4.2.5. Filter imputed coverage output

Filter this csv output however you wish. Again, we recommend >= 10 reads and >= 2 CpGs. See Filter output for more details.
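Following the pattern of the earlier one-liner, a filter for at least 10 reads and at least 2 CpGs could look like this (the input filename and the sample rows are illustrative; the columns are assumed to be bin, reads, CpGs as in the earlier example):

```shell
# Illustrative input: bin,reads,cpgs
printf 'chr19_100,12,3\nchr19_200,5,2\nchr19_300,15,1\n' > CompleteBins.imputed.chr19.csv

# Keep only bins with >= 10 reads ($2) and >= 2 CpGs ($3)
cat CompleteBins.imputed.chr19.csv | awk -F "," '$2>=10 && $3>=2' > CompleteBins.imputed.chr19.filtered.csv
```

Here only the first sample row passes both thresholds.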

4.2.6. Perform clustering with imputation

Using clubcpg-impute-cluster you can now perform read clustering. This also functions almost identically to the standard method. However, you must also point it to the folder containing the models.

Here you use the --models_A and --models_B flags to point the tool to the PReLIM models for inputs -a and -b respectively. Imputation on each file is performed independently of the other.
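A full two-sample clustering invocation might therefore look like this (the BAM and folder names are illustrative, and each --models_* folder is the one produced by clubcpg-impute-train for the corresponding BAM file):

```
clubcpg-impute-cluster -a sample1.bam -b sample2.bam \
    --bins CompleteBins.imputed.chr19.filtered.csv \
    --models_A trained_models_sample1/ \
    --models_B trained_models_sample2/ \
    -o cluster_output/ -n 8
```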

4.3. Command line tools

These options can also be viewed by running --help after each tool on the command line.

4.3.1. clubcpg-impute-train

usage: clubcpg-impute-train [-h] [-a INPUT_BAM_FILE] [-c COVERAGE] [-o OUTPUT]
                            [-n N] [-l LIMIT_SAMPLES] [--read1_5 READ1_5]
                            [--read1_3 READ1_3] [--read2_5 READ2_5]
                            [--read2_3 READ2_3]
-h, --help

show this help message and exit

-a <input_bam_file>, --input_bam_file <input_bam_file>

Input bam file, coordinate sorted with index present

-c <coverage>, --coverage <coverage>

output file from clubcpg-coverage, filtered for at least 1 read and 2 cpgs

-o <output>, --output <output>

folder to save generated model files

-n <n>

number of cpu cores to use

-l <limit_samples>, --limit_samples <limit_samples>

Limit the number of samples used to train the model, this will speed up training. default=10000

--read1_5 <read1_5>

integer, read1 5’ m-bias ignore bp, default=0

--read1_3 <read1_3>

integer, read1 3’ m-bias ignore bp, default=0

--read2_5 <read2_5>

integer, read2 5’ m-bias ignore bp, default=0

--read2_3 <read2_3>

integer, read2 3’ m-bias ignore bp, default=0

4.3.2. clubcpg-impute-coverage

usage: clubcpg-impute-coverage [-h] [-a INPUT_BAM_FILE] [-c COVERAGE]
                               [-m MODELS] [-o OUTPUT] [-n N]
                               [-chr CHROMOSOME] [--read1_5 READ1_5]
                               [--read1_3 READ1_3] [--read2_5 READ2_5]
                               [--read2_3 READ2_3]
-h, --help

show this help message and exit

-a <input_bam_file>, --input_bam_file <input_bam_file>

Input bam file, coordinate sorted with index present

-c <coverage>, --coverage <coverage>

output file from clubcpg-coverage, filtered for at least 1 read and 2 cpgs

-m <models>, --models <models>

Path to folder containing saved models

-o <output>, --output <output>

folder to save imputed coverage data

-n <n>

number of cpu cores to use

-chr <chromosome>, --chromosome <chromosome>

Optional, perform only on one chromosome. Default=all chromosomes provided in -c. Example: ‘chr7’

--read1_5 <read1_5>

integer, read1 5’ m-bias ignore bp, default=0

--read1_3 <read1_3>

integer, read1 3’ m-bias ignore bp, default=0

--read2_5 <read2_5>

integer, read2 5’ m-bias ignore bp, default=0

--read2_3 <read2_3>

integer, read2 3’ m-bias ignore bp, default=0

4.3.3. clubcpg-impute-cluster

usage: clubcpg-impute-cluster [-h] [-a INPUT_BAM_A] [-b INPUT_BAM_B]
                              [--bins BINS] [-o OUTPUT_DIR]
                              [--bin_size BIN_SIZE]
                              [-m CLUSTER_MEMBER_MINIMUM] [-r READ_DEPTH]
                              [-n NUM_PROCESSORS] [--read1_5 READ1_5]
                              [--read1_3 READ1_3] [--read2_5 READ2_5]
                              [--read2_3 READ2_3] [--no_overlap [NO_OVERLAP]]
                              [--remove_noise [REMOVE_NOISE]]
                              [--suffix SUFFIX] [--models_A MODELS_A]
                              [--models_B MODELS_B] [--chunksize CHUNKSIZE]
-h, --help

show this help message and exit

-a <input_bam_a>, --input_bam_A <input_bam_a>

First input bam file, coordinate sorted with index present, REQUIRED

-b <input_bam_b>, --input_bam_B <input_bam_b>

Second input bam file, coordinate sorted with index present, OPTIONAL

--bins <bins>

File with each line being one bin to extract and analyze, generated by clubcpg-coverage, REQUIRED

-o <output_dir>, --output_dir <output_dir>

Output directory to save figures, defaults to bam file location

--bin_size <bin_size>

Size of bins to extract and analyze, default=100

-m <cluster_member_minimum>, --cluster_member_minimum <cluster_member_minimum>

Minimum number of members a cluster should have for it to be considered, default=4

-r <read_depth>, --read_depth <read_depth>

Minimum number of reads covering all CpGs that the bins should have to analyze, default=10

-n <num_processors>, --num_processors <num_processors>

Number of processors to use for analysis, default=1

--read1_5 <read1_5>

integer, read1 5’ m-bias ignore bp, default=0

--read1_3 <read1_3>

integer, read1 3’ m-bias ignore bp, default=0

--read2_5 <read2_5>

integer, read2 5’ m-bias ignore bp, default=0

--read2_3 <read2_3>

integer, read2 3’ m-bias ignore bp, default=0

--no_overlap <no_overlap>

bool, remove any overlap between paired reads and stitch reads together when possible, default=True

--remove_noise <remove_noise>

bool, Discard the cluster containing noise points (-1) after clustering, default=True

--suffix <suffix>

Any additional info to include in the output file name, chromosome for example

--models_A <models_a>

Models to impute for input_bam_A, OPTIONAL

--models_B <models_b>

Models to impute for input_bam_B, OPTIONAL

--chunksize <chunksize>

Size of the chunks that bins are split into during imputation. Larger chunks run faster but use more memory.