BreakDown-1.0 is a perl package that estimates the variant allele fraction (VAF) and infers genotypes of structural variants from next generation paired-end sequencing reads. There are two steps in BreakDown. First, bam2cfg.pl is used to generate a configure file of the sequence data. Next, BreakDown.pl usesthis configure file and the putative SV files to generate a VAF/genotype summary file. The followings are the usages and program dependancy. Necessary files ----------------- 1. A configure file describing meta-data of bam files. This is the output of Step 1. 2. An SV file with the following columns 1. Chromosome 1 2. Position 1 3. Orientation 1 4. Chromosome 2 5. Position 2 6. Orientation 2 7. Type of a SV 8. Size of a SV 9. Confidence Score Columns 1-3 and 4-6 are used to specify the coordinates of the two SV breakpoints. The orientation is a string that records the number of reads mapped to the plus (+) or the minus (-) strand in the anchoring regions. Column 7 is the type of SV detected: DEL (deletions), INS (insertion), INV (inversion), ITX (intra-chromosomal translocation), CTX (inter-chromosomal translocation), and Unknown. Column 8 is the size of the SV in bp. It is meaningless for inter-chromosomal translocations. Column 9 is the confidence score associated with the prediction. If BreakDancer is used to generate putative SV calls, the output of BreakDancer is in the qualified format already. 3. Reference file from which the alignment algorithm was used for generating the bam file. Running steps ---------------- Step 1. bam2cfg.pl to generate configure files. Usage: perl bam2cfg.pl bam_files Inputs are all the bam files separated by space. The usage can be found in [1]. Step 2. BreakDown.pl to estimate VAF and infer genotype for structural variations. A few options useful to users are listed below. Usage: BreakDown.pl Options: -g Turn on GC correction, default: off -d DIR Directory where all intermediate files are saved -c STR Select chromosome STR -q INT Only analyze reads with mapping quality >= [0] -r FLOAT Prior of SVs in the genome [0.0001] -x STR Type of evidences included in VAF and genotype estimation: {n,d,v,nd,nv,dv,ndv} (n: normal, d: discordant, v: variant) -F STR Path to the samtools faidx-ed reference sequence (useful for GC estimation) -S Skip sites with size bigger than [1000000] -b Bin size. Should be larger when genome coverage is low. [100] -P The mean mapping quality of a bin below which the bin will be not participated in the genotyping decision and AF estimation. [30] -y The libraries to be ignored, with colon to separate. Most of these options are self-exaplanatory. If the bam file contains only of one chromosome, -c option needs to be specified to avoid accessing other chromosomes not existing in the bam during initialization. If only one or two of the evidences are included for VAF/genotype estimation, -x option is used for the specification. n stands for normal reads, d for discordant and v for clipped reads. -g turns on GC correction, which is possible only when a whole chromosome or a whole genome exists in the bam. Otherwise the algorithm will make a homogeneous assumption and takes the estimates from bam2cfg script. When GC correction is on, the reference sequence is required, and could be specified by -F option. When putative SV has size extremely large, BreakDown has the option to skip it to avoid stucking on one SV by option -S. Bin size can be tuned by option -b in unbalanced SVs. 100 for small SVs (size < 1Kbp) or 200 for large SVs (size >= 1Kbp) is reasonable. When GC correction is off, this option is deactivated. Skipping low quality bins in VAF estimation of genotyping is made through -P option, so that the specified mean mapping quality threshold disquality those highly repetitive bins. If a specific set of libraries is to be ignored, use -y option, with colon separating each of the library. The SV file as the second input argument to BreakDown should be in the following output format. In details, it should consists of the following columns: The output format ------------------- BreakDown's output file consists of the following columns: 1. Chromosome 1 2. Position 1 3. Chromosome 2 4. Position 2 5. Type of an SV 6. Size of an SV 7. Variant Score 8. Inferred Genotype 9. Inferred VAF 10. Genotype Score Columns 1-6 are given in the input SV file, describing the SV coordinates and the type as well as the size. Columns 7-10 are inferred from BreakDown algorithm. Specifically, column 9 is the inferred VAF, ranging from 0 to 1, with the corresponding variant score in column 7. Column 8 and 10 on genotype estimation (categorized into homo-ref, het-variant and homo-variant) are mainly for normal genomes without genetic diseases. Example ---------- Three examples of running BreakDown are listed (bam2cfg step is skipped and please refer to [1]). Example 1 Estimate VAF and genotype on a whole chromosome/genome bam file. Since the bam file contain the whole chromosome/genome, it is possible to do GC correction. Please refer to example 3 if this is not preferred. perl BreakDown.pl -g -F reference_file config_file SV_file If the bam file does not contain all 22 autosomes, please specify the chromosome from which the GC correction is learned by -c. For example, if only chromosome 1 is existing in the bam file, the following is the command. perl BreakDown.pl -c 1 -g -F reference_file config_file SV_file It takes around 15-20 minutes for GC correction initialization on a standard 30-50X NGS file. Once the initialization is done, the process of each SV goes much faster, < 1s for a medium sized SV. A few temporary files from GC correction are written down for potential rerun to remove randomness. They are "gc.tmp.[library_names]", and "cfg.tmp". Please store them in a separate folder in case they are overwritten. To rerun with the stored temporary files, please refer to Example 2. Example 2 Estimate VAF and genotype with GC files. perl BreakDown.pl -g -F reference_file config_file SV_file "gc.tmp.*" If the third input argument is not empty, the program will look for the temporary files in the designated directory with the default names. Once the contents of the files are loaded, it will directly do VAF and genotype estimation, skipping initialization step. Example 3 Estimate VAF and genotype without GC correction. When GC correction is not needed (such as in simple simulation data) or not possible (e.x. The bam file is a small part of the genome surrounding the SV), it is possible to turn GC correction off. The command is the same as that in Example 1 except without "-g" option. Homogeneous insert/bp versus GC content is assumed in this case. Dependence over other Perl Modules ------------------------------------ CPAN module Statistics::Descriptive is used by both bam2cfg and BreakDown in CPAN, which can be downloaded at http://search.cpan.org/~colink/Statistics-Descriptive-2.6/Descriptive.pm Ken Chen MD Anderson Cancer Center May 2nd, 2014 References [1] Fan, X., Abbott, T. E., Larson, D. and Chen, K. 2014. BreakDancer: Identification of Genomic Structural Variation from Paired-End Read Mapping. Current Protocols in Bioinformatics. 15.6.1-15.6.11.