BreakTrans, a bioinformatics tool that maps gene fusions to genomic breakpoints.

Author: Ken Chen
Email: kchen3@mdanderson.org
Date: 2013/02/14

Synopsis
--------

Identify genomic SV breakpoints underlying fusion junctions

    Usage:   BreakTrans.pl
    Options:
        -o STR    Only analyze the specified chromosome
        -g FILE   Use UCSC Human gene table at database/Human.Mar2006.RefSeqGenes.tab
        -l INT    Proximity of breakpoints to gene annotations [200] bp
        -C FILE   Use UCSC self-chain alignment file, default: database/Human.Mar2006.chainSelf.tab
        -h FLOAT  Normalized score cutoff in self-chain alignment, default: 1
        -z INT    Self-chain region cutoff, default: 100000 bp

    Version: BreakTrans-0.0.6

Input
-----

A). The genomic breakpoints are listed in a tab-delimited text file in BreakDancer style, e.g.,

    1    100036615    11+2-    1    165018239    13+9-    DEL    64981668    99    7

The columns are:

    ....................breakpoint 1
    1. Chromosome 1
    2. Genomic Position 1
    3. Read counts on + and - strand
    ....................breakpoint 2
    4. Chromosome 2
    5. Genomic Position 2
    6. Read counts on + and - strand
    7. Type of SV
    8. Size of SV
    ....................below are optional
    9. Confidence score
    10. Number of supporting reads

The read counts and the SV type are used to determine whether the fusion has the correct strand orientation.

This version also supports CREST-style output, although you will need to convert the CREST results into a similar format, e.g.,

    1    100299895    -    6    137576589    +    CTX    47    76    5

All the columns were reformatted from CREST output.

B). The gene fusion breakpoints are also listed in a tab-delimited text file, which has 5 columns, e.g.,

    1    110000570    6     111475203    Tophat-fusion
    1    112035519    14    74019468     Wang_et_al

The columns are:

    ....................breakpoint 1
    1. Chromosome 1
    2. Genomic Position 1
    ....................breakpoint 2
    3. Chromosome 2
    4. Genomic Position 2
    5. Source of prediction (arbitrary string for tracking purposes)

Options
-------

The -g option is required. It specifies a gene annotation file in the tab format as downloaded from the UCSC table browser. In the database folder, we redistribute the standard hg18 and hg19 UCSC gene tables, and an hg19 gaf2.1 table used in the TCGA project. Other gene models can be used if they can be converted to the same format.

-l pads flanking regions onto the transcript start and end, in order to capture breakpoints in these regions.

We found it useful to remove RNA breakpoints that overlap with self-chain regions. You can enable this filter using the -C, -z and -h options.

-C specifies the location of the UCSC self-chain table, which can be downloaded from the UCSC table browser. Because these tables are large, we require you to split them by chromosome. For example, if you use

    -C your_path/selfchain.hg18.tab

the program will expect that you have the following set of files:

    your_path/selfchain.hg18.chr1.tab
    your_path/selfchain.hg18.chr2.tab
    ...
    your_path/selfchain.hg18.chrY.tab

Again, we include these files in the database files that you can download from our site.

The -h option removes predictions that have one of their breakpoints in a self-chain region whose score is above the threshold.

-z specifies the size of the self-chain regions that we screen. Regions shorter than this size are ignored.

In general, you can use the default values.

Output
------

The output file adds 3 columns to the RNA breakpoint file, e.g.,

    8     125620346    17    35315760    Tophat    TATDN1>GSDMB    0    8:125618280-93|17:35321200-
    20    46797753     20    33678749    Defuse    PREX1>CPNE1     0    20:46795673-17|20:33925625-0|20:33923847-18|20:33679982-

The 3 columns are:

    6. Gene fusion, upstream gene > downstream gene
    7. Self-chain alignment scores of the RNA breakpoints, based on the UCSC self-chain track
    8. Genomic alleles that support the fusion (read the Methods in our publications for a detailed description of the break string format)

Example
-------

    perl BreakTrans.pl -g database/Human.Mar2006.RefSeqGenes.tab -C database/Human.Mar2006.chainSelf.tab testdata/SK-BR-3.dna.bd testdata/SK-BR-3.rna.bd > SK-BR-3.drbd

This is the same dataset that is described in our publication. Please try to reproduce the results on your system before applying BreakTrans to your own dataset.
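
If you download your own self-chain table rather than using the files in the database folder, the per-chromosome split that the -C option expects could be produced with a short script. The following is a minimal Python sketch, not part of BreakTrans. The function name split_by_chromosome is hypothetical. It assumes that the chromosome name sits in the third tab-separated column (tName in a UCSC chainSelf dump that includes the leading bin column), that lines starting with "#" are headers that can be dropped, and that chromosome names should carry the "chr" prefix used in the file names listed under Options. Adjust chrom_col if your table is laid out differently.

```python
import os


def split_by_chromosome(table_path, chrom_col=2):
    """Split a tab-delimited table into one file per chromosome.

    For table_path "your_path/selfchain.hg18.tab" the output files are
    named "your_path/selfchain.hg18.chr1.tab", and so on.

    chrom_col is the 0-based index of the chromosome column. The default
    of 2 is an assumption about the table layout, not something BreakTrans
    documents.
    """
    root, ext = os.path.splitext(table_path)
    handles = {}  # chromosome name -> open output file
    try:
        with open(table_path) as table:
            for line in table:
                # Skip header and blank lines (assumed not to be needed).
                if line.startswith("#") or not line.strip():
                    continue
                chrom = line.rstrip("\n").split("\t")[chrom_col]
                # Match the "chrN" naming used in the expected file names.
                if not chrom.startswith("chr"):
                    chrom = "chr" + chrom
                if chrom not in handles:
                    handles[chrom] = open(f"{root}.{chrom}{ext}", "w")
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    # Return the paths that were written, sorted for reproducibility.
    return sorted(f"{root}.{chrom}{ext}" for chrom in handles)
```

Calling split_by_chromosome("your_path/selfchain.hg18.tab") would write the per-chromosome files next to the original table, after which the unsplit path can be passed to -C as described under Options.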