2. Using CluBCpG

2.1. Introduction

After installing CluBCpG into your python environment, there should now be a new set of command line tools on your PATH. They all begin with clubcpg-. Each one comes with a help command which displays the command line arguments the tool accepts.

For example, you can run

clubcpg-coverage --help

and it will display the following:

usage: clubcpg-coverage [-h] [-a INPUT_BAM_A] [-o OUTPUT_DIR]
                        [-bin_size BIN_SIZE] [-n NUM_PROCESSORS]
                        [-chr CHROMOSOME] [--read1_5 READ1_5]
                        [--read1_3 READ1_3] [--read2_5 READ2_5]
                        [--read2_3 READ2_3] [--no_overlap [NO_OVERLAP]]

optional arguments:
  -h, --help            show this help message and exit
  -a INPUT_BAM_A, --input_bam_A INPUT_BAM_A
                        First Input bam file, coordinate sorted with index
                        present
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output directory to save figures, defaults to bam file
                        location
  -bin_size BIN_SIZE    Size of bins to extract and analyze, default=100
  -n NUM_PROCESSORS, --num_processors NUM_PROCESSORS
                        Number of processors to use for analysis, default=1
  -chr CHROMOSOME, --chromosome CHROMOSOME
                        Chromosome to analyze, example: 'chr19', required
  --read1_5 READ1_5     integer, read1 5' m-bias ignore bp, default=0
  --read1_3 READ1_3     integer, read1 3' m-bias ignore bp, default=0
  --read2_5 READ2_5     integer, read2 5' m-bias ignore bp, default=0
  --read2_3 READ2_3     integer, read2 3' m-bias ignore bp, default=0
  --no_overlap [NO_OVERLAP]
                        bool, remove any overlap between paired reads and
                        stitch reads together when possible, default=True

Details of each flag are elaborated upon below in Command line tools.

2.2. Sequencing data pre-processing

CluBCpG uses BAM files generated by standard bioinformatics pipelines. For the most part, you can generate this BAM file in any way you see fit. However, CluBCpG does have a few requirements:

  1. The bisulfite read mapping should be performed with Bismark

    Warning

    CluBCpG compatibility has NOT been tested with other bisulfite read mappers. It may work fine. It may initiate the end of the universe. I don’t know, because it hasn’t been tested.

  2. The BAM file should be coordinate sorted and have an index file (.bai) present in its directory. Samtools can be utilized for both of these steps using samtools sort and samtools index.
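
A minimal sketch of this pre-processing, assuming paired-end FASTQ input and a prepared Bismark genome folder (all file paths and names below are placeholders, and the exact Bismark invocation depends on your library type and Bismark version):

# align with Bismark (paired-end reads), producing a BAM file
bismark --genome /path/to/genome_folder -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o /path/to/output
# coordinate sort and index the aligned BAM (Bismark's actual output filename will differ)
samtools sort -o sample.sorted.bam /path/to/output/sample_bismark.bam
samtools index sample.sorted.bam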

2.3. Typical workflow

2.3.1. Prepare your data

Obtain your correctly processed BAM file(s)

  • CluBCpG accepts one or two BAM files for processing

2.3.2. Calculate bin coverage

Use clubcpg-coverage to calculate the number of reads fully covering all bins across the genome.

  1. This process should be performed on individual chromosomes

Hint

This is an excellent step to parallelize if running in an HPC environment. clubcpg-coverage has a flag, -n, to specify the number of CPU cores to use during processing. This also works if running on one machine with multiple cores.

Additionally, each chromosome can be run on an independent compute node; there is no need to split the BAM file. CluBCpG will only operate on the chromosome specified with the -chr flag (see clubcpg-coverage), as sketched below.
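
A minimal sketch of looping over chromosomes on a single multi-core machine (paths, core count, and m-bias trim values are placeholders; on an HPC cluster, each iteration could instead be submitted as its own job):

# run clubcpg-coverage once per chromosome
for CHROM in chr1 chr2 chr19; do
    clubcpg-coverage -a /path/to/file.bam -n 24 -chr ${CHROM} --read1_5 4 --read1_3 2 --read2_5 11 --read2_3 5
done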

Note

If you are running CluBCpG on two BAM files, this step only needs to be performed on the first BAM file.

  2. A typical command may look like this:

# optimal values for the --read flags will need to be determined for each sequencing experiment
clubcpg-coverage -a /path/to/file.bam -n 24 -chr chr19 --read1_5 4 --read1_3 2 --read2_5 11 --read2_3 5

2.3.3. Filter output

Filter the generated CSV file for the desired number of reads and CpG densities.

a. The output is a CSV file without a header; the columns contain the following data: bin id, number of reads, number of CpGs.

b. You can filter this however you like. We recommend >= 10 reads and >= 2 CpGs.

c. bash and awk can be used to filter the output using the following one-liner:

cat CompleteBins.yourfilename.chr19.csv | awk -F "," '$2>=10 && $3>=2' > CompleteBins.yourfilename.chr19.filtered.csv

2.3.4. Perform clustering

Use clubcpg-cluster to perform cluster analysis

a. Here you provide the filtered CSV file from the previous step to this clustering step using the --bins flag. This accelerates the analysis by only reading bins which have already been pre-determined to meet coverage requirements.

b. If running two BAM files: if the coverage requirements were met in the first BAM but not the second, the bin will be ignored and not included in the final report.

Hint

Here is another opportunity for parallelization. clubcpg-cluster can also be run with the -n flag to select the number of CPU cores. If you have a separate CSV file for each chromosome from the previous two steps, you can also run each of these separately on multiple nodes.

Just use the --suffix flag to append the chromosome information to the filename of the final report. A typical command is sketched below.
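
A minimal sketch of a two-sample clustering run on one chromosome (file paths, core count, and m-bias trim values are placeholders):

# cluster reads from two BAM files, restricted to the pre-filtered chr19 bins
clubcpg-cluster -a /path/to/fileA.bam -b /path/to/fileB.bam --bins CompleteBins.yourfilename.chr19.filtered.csv -n 24 --read1_5 4 --read1_3 2 --read2_5 11 --read2_3 5 --suffix chr19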

2.4. Command line tools

These options can also be viewed by running --help after each tool on the command line.

2.4.1. clubcpg-coverage

usage: clubcpg-coverage [-h] [-a INPUT_BAM_A] [-o OUTPUT_DIR]
                        [--bin_size BIN_SIZE] [-n NUM_PROCESSORS]
                        [-chr CHROMOSOME] [--read1_5 READ1_5]
                        [--read1_3 READ1_3] [--read2_5 READ2_5]
                        [--read2_3 READ2_3] [--no_overlap [NO_OVERLAP]]
-h, --help

show this help message and exit

-a <input_bam_a>, --input_bam_A <input_bam_a>

Input bam file, coordinate sorted with index present

-o <output_dir>, --output_dir <output_dir>

Output directory to save results, defaults to bam file location

--bin_size <bin_size>

Size of bins to extract and analyze, default=100

-n <num_processors>, --num_processors <num_processors>

Number of processors to use for analysis, default=1

-chr <chromosome>, --chromosome <chromosome>

Chromosome to analyze, example: ‘chr19’, required

--read1_5 <read1_5>

integer, read1 5’ m-bias ignore bp, default=0

--read1_3 <read1_3>

integer, read1 3’ m-bias ignore bp, default=0

--read2_5 <read2_5>

integer, read2 5’ m-bias ignore bp, default=0

--read2_3 <read2_3>

integer, read2 3’ m-bias ignore bp, default=0

--no_overlap <no_overlap>

bool, remove any overlap between paired reads and stitch reads together when possible, default=True

2.4.2. clubcpg-cluster

usage: clubcpg-cluster [-h] [-a INPUT_BAM_A] [-b INPUT_BAM_B] [--bins BINS]
                       [-o OUTPUT_DIR] [--bin_size BIN_SIZE]
                       [-m CLUSTER_MEMBER_MINIMUM] [-r READ_DEPTH]
                       [-n NUM_PROCESSORS] [--read1_5 READ1_5]
                       [--read1_3 READ1_3] [--read2_5 READ2_5]
                       [--read2_3 READ2_3] [--no_overlap [NO_OVERLAP]]
                       [--remove_noise [REMOVE_NOISE]] [--suffix SUFFIX]
                       [--permute [PERMUTE]]
-h, --help

show this help message and exit

-a <input_bam_a>, --input_bam_A <input_bam_a>

First Input bam file, coordinate sorted with index present, REQUIRED

-b <input_bam_b>, --input_bam_B <input_bam_b>

Second Input bam file, coordinate sorted with index present, OPTIONAL

--bins <bins>

File with each line being one bin to extract and analyze, generated by clubcpg-coverage, REQUIRED

-o <output_dir>, --output_dir <output_dir>

Output directory to save results, defaults to bam file location

--bin_size <bin_size>

Size of bins to extract and analyze, default=100

-m <cluster_member_minimum>, --cluster_member_minimum <cluster_member_minimum>

Minimum number of reads a cluster should have for it to be considered, default=4

-r <read_depth>, --read_depth <read_depth>

Minimum number of reads covering all CpGs that the bins should have to analyze, default=10

-n <num_processors>, --num_processors <num_processors>

Number of processors to use for analysis, default=1

--read1_5 <read1_5>

integer, read1 5’ m-bias ignore bp, default=0

--read1_3 <read1_3>

integer, read1 3’ m-bias ignore bp, default=0

--read2_5 <read2_5>

integer, read2 5’ m-bias ignore bp, default=0

--read2_3 <read2_3>

integer, read2 3’ m-bias ignore bp, default=0

--no_overlap <no_overlap>

bool, remove any overlap between paired reads and stitch reads together when possible, default=True

--remove_noise <remove_noise>

bool, Discard the cluster containing noise points (-1) after clustering, default=True

--suffix <suffix>

Any additional info to include in the output file name, chromosome for example

--permute <permute>

Randomly shuffle the input file label on the reads prior to clustering. Has no effect if only analyzing one file