BreakTrans, a bioinformatics tool that maps gene fusions to genomic breakpoints. 
Author:	Ken Chen
Email:	kchen3@mdanderson.org
Date:	2013/02/14 

Synopsis
--------
Identify Genomic SV breakpoints underlying fusion junctions
Usage:   BreakTrans.pl <genome breakpoints> <gene fusion breakpoints>

Options:
         -o STR   Only analyze the specified chromosome
         -g FILE  Use UCSC Human gene table at database/Human.Mar2006.RefSeqGenes.tab
         -l INT   Proximity of breakpoints to gene annotations [200] bp
         -C       Use UCSC selfchain alignment FILE, default: database/Human.Mar2006.chainSelf.tab
         -h FLOAT Normalized score cutoff in selfchain alignment, default: 1
         -z INT   Self-chain region cutoff, default: 100000 bp
Version: BreakTrans-0.0.6


Input
-----
A). The genomic breakpoints are listed in a tab-delimited text file in BreakDancer style

e.g.,
1       100036615       11+2-   1       165018239       13+9-   DEL     64981668        99      7

The columns are:
....................breakpoint 1
1. Chromosome 1
2. Genomic Position 1
3. Read counts on + and - strand
....................breakpoint 2
4. Chromosome 2
5. Genomic Position 2
6. Read counts on + and - strand

7. Type of SV
8. Size of SV
....................below are optional
9. Confidence score
10. Number of supporting reads
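
The column layout above can be read with a few lines of Python. This is only a minimal sketch and not part of BreakTrans itself: the names GenomicBreakpoint, parse_genomic_line, and strand_counts are made up for illustration, and treating the score as a float is an assumption.

```python
import re
from typing import NamedTuple, Optional, Tuple


class GenomicBreakpoint(NamedTuple):
    """One row of the genomic breakpoint file (hypothetical container)."""
    chr1: str
    pos1: int
    orient1: str               # e.g. "11+2-" (BreakDancer style) or "-" (converted CREST)
    chr2: str
    pos2: int
    orient2: str
    sv_type: str
    sv_size: int
    score: Optional[float]     # optional column 9 (float is an assumption)
    num_reads: Optional[int]   # optional column 10


def parse_genomic_line(line: str) -> GenomicBreakpoint:
    """Split one tab-delimited row into the 8 required and 2 optional columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    score = float(fields[8]) if len(fields) > 8 else None
    num_reads = int(fields[9]) if len(fields) > 9 else None
    return GenomicBreakpoint(
        fields[0], int(fields[1]), fields[2],
        fields[3], int(fields[4]), fields[5],
        fields[6], int(fields[7]), score, num_reads,
    )


def strand_counts(orient: str) -> Tuple[int, int]:
    """Return (plus, minus) read counts from a field such as '11+2-'."""
    match = re.fullmatch(r"(\d+)\+(\d+)-", orient)
    if match is None:
        raise ValueError(f"not a BreakDancer-style read count field: {orient!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    row = "1\t100036615\t11+2-\t1\t165018239\t13+9-\tDEL\t64981668\t99\t7"
    bp = parse_genomic_line(row)
    print(bp.sv_type, bp.sv_size, strand_counts(bp.orient1))
```

Note that strand_counts only applies to BreakDancer-style fields; a converted CREST row carries a bare "+" or "-" in the same column.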

The read counts and the SV type are needed to determine whether the fusion has the correct strand orientation.
This version also supports CREST-style output, although you will need to convert the CREST result into a similar format:
e.g.,
1       100299895       -       6       137576589       +       CTX     47      76      5
All the columns were reformatted from CREST output.

B). The gene fusion breakpoints are also listed in a tab-delimited text file that has 5 columns
This version also supports CREST-style output, although you will need to convert the CREST result into a similar format:

e.g.,
1       110000570       6       111475203       Tophat-fusion
1       112035519       14      74019468        Wang_et_al

The columns are:
....................breakpoint 1
1. Chromosome 1
2. Genomic Position 1
....................breakpoint 2
3. Chromosome 2
4. Genomic Position 2
5. Source of prediction (arbitrary string for tracking purpose)
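
As an illustration of the 5-column layout, the following Python sketch reads one row of the fusion file. The names FusionBreakpoint and parse_fusion_line are hypothetical and are not part of BreakTrans.

```python
from typing import NamedTuple


class FusionBreakpoint(NamedTuple):
    """One row of the gene fusion breakpoint file (hypothetical container)."""
    chr1: str
    pos1: int
    chr2: str
    pos2: int
    source: str  # arbitrary tracking string, e.g. "Tophat-fusion"


def parse_fusion_line(line: str) -> FusionBreakpoint:
    """Split one tab-delimited row into its 5 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    return FusionBreakpoint(
        fields[0], int(fields[1]), fields[2], int(fields[3]), fields[4]
    )


if __name__ == "__main__":
    rows = [
        "1\t110000570\t6\t111475203\tTophat-fusion",
        "1\t112035519\t14\t74019468\tWang_et_al",
    ]
    for row in rows:
        print(parse_fusion_line(row))
```

Chromosome names are kept as strings so that values such as X and Y are handled the same way as numbered chromosomes.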


Options
-------
A required option is -g, which specifies a gene annotation file in the tab format as downloaded from the UCSC Table Browser.
In the database folder, we redistribute the standard hg18 and hg19 UCSC gene tables, and an hg19 gaf2.1 tab used in the TCGA project.
Other gene models can be used if they can be converted to the same format.
-l pads the transcript start and end with flanking regions, in order to capture breakpoints that fall in these regions.
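
The effect of -l can be pictured with a small Python sketch. This is an assumed reading of the option, not the actual BreakTrans code: the function name in_padded_transcript is hypothetical, and treating both ends of the padded interval as inclusive is an assumption.

```python
def in_padded_transcript(pos: int, tx_start: int, tx_end: int, flank: int = 200) -> bool:
    """Return True if pos lies within the transcript extended by flank bp on each side.

    flank corresponds to the -l value (default 200 bp). Inclusive bounds
    are an assumption made for this sketch.
    """
    return tx_start - flank <= pos <= tx_end + flank


if __name__ == "__main__":
    # A breakpoint 100 bp upstream of the transcript start is still captured
    # with the default padding, but not when padding is disabled.
    print(in_padded_transcript(900, 1000, 2000))
    print(in_padded_transcript(900, 1000, 2000, flank=0))
```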

We found it useful to remove RNA breakpoints that overlap with self-chain regions.
You can enable this filter using the -C, -z, and -h options.
-C specifies the location of the UCSC self-chain table, which can be downloaded from the UCSC Table Browser.
Due to the large size of these tables, we require you to split them by chromosome.
For example, if you use -C your_path/selfchain.hg18.tab,
the program will expect that you have the following set of files:
your_path/selfchain.hg18.chr1.tab
your_path/selfchain.hg18.chr2.tab
...
your_path/selfchain.hg18.chrY.tab
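
Before running with -C, it may help to check that the per-chromosome files are in place. The Python sketch below derives the expected names from the -C value by inserting chrN before the final extension, which matches the example above. The function names are hypothetical, and the chromosome list (1 to 22, X, Y) is an assumption; adjust it to match how your tables were split.

```python
import os
from typing import List

# Assumed chromosome set; the example above only shows chr1, chr2, ..., chrY.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]


def selfchain_path(base: str, chrom: str) -> str:
    """Insert '.chr<chrom>' before the final extension of the -C value."""
    root, ext = os.path.splitext(base)
    return f"{root}.chr{chrom}{ext}"


def missing_selfchain_files(base: str) -> List[str]:
    """List the expected per-chromosome files that do not exist on disk."""
    expected = [selfchain_path(base, c) for c in CHROMOSOMES]
    return [path for path in expected if not os.path.exists(path)]


if __name__ == "__main__":
    for path in missing_selfchain_files("your_path/selfchain.hg18.tab"):
        print("missing:", path)
```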

Again, we included these files in the database files that you can download from our site.
The -h option removes predictions that have one of their breakpoints in a self-chain region whose score is above the threshold.
-z specifies the size of the self-chain region that we screen.  Regions shorter than this size are ignored.
You could use the default values in general.
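
The way -h and -z interact can be sketched in Python as follows. This is an interpretation of the description above and not the actual BreakTrans implementation: ChainRegion, hits_selfchain, and keep_fusion are hypothetical names, and the region fields, the strict "above the threshold" comparison, and the inclusive coordinates are assumptions.

```python
from typing import Iterable, NamedTuple


class ChainRegion(NamedTuple):
    """A self-chain region with its normalized score (hypothetical layout)."""
    chrom: str
    start: int
    end: int
    norm_score: float


def hits_selfchain(chrom: str, pos: int, regions: Iterable[ChainRegion],
                   min_score: float = 1.0, min_size: int = 100000) -> bool:
    """Return True if pos falls in a screened region scoring above min_score."""
    for region in regions:
        if region.chrom != chrom:
            continue
        if region.end - region.start < min_size:
            continue  # -z: regions shorter than this size are ignored
        if region.norm_score <= min_score:
            continue  # -h: only scores above the threshold count
        if region.start <= pos <= region.end:
            return True
    return False


def keep_fusion(chr1: str, pos1: int, chr2: str, pos2: int,
                regions: Iterable[ChainRegion],
                min_score: float = 1.0, min_size: int = 100000) -> bool:
    """Keep a prediction only if neither breakpoint hits a screened region."""
    regions = list(regions)
    return not (hits_selfchain(chr1, pos1, regions, min_score, min_size)
                or hits_selfchain(chr2, pos2, regions, min_score, min_size))


if __name__ == "__main__":
    # Made-up regions for illustration only.
    regions = [ChainRegion("1", 1000, 201000, 5.0), ChainRegion("2", 0, 500, 50.0)]
    print(keep_fusion("1", 1500, "3", 10, regions))
    print(keep_fusion("2", 100, "3", 10, regions))
```

In this sketch, raising min_score or min_size makes the filter more permissive, since fewer regions qualify for screening.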


Output
------
The output file adds 3 additional columns to the RNA breakpoint file
e.g.,
8       125620346       17      35315760        Tophat  TATDN1>GSDMB    0       8:125618280-93|17:35321200-
20      46797753        20      33678749        Defuse  PREX1>CPNE1     0       20:46795673-17|20:33925625-0|20:33923847-18|20:33679982-

The 3 columns are:
6. Gene fusion, upstream gene > downstream gene
7. Self-chain alignment scores of the RNA breakpoints, based on UCSC self-chain track
8. Genomic alleles that support the fusion (see the Methods section of our publication for a detailed description of the break string format)
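
For downstream scripting, an output row can be read as sketched below in Python. The names OutputRecord and parse_output_line are hypothetical. The sketch assumes column 6 holds a single upstream>downstream pair, keeps the self-chain column as a raw string, and deliberately leaves the break string unparsed because its format is defined in the publication rather than here.

```python
from typing import NamedTuple


class OutputRecord(NamedTuple):
    """One row of BreakTrans output (hypothetical container)."""
    chr1: str
    pos1: int
    chr2: str
    pos2: int
    source: str
    upstream_gene: str
    downstream_gene: str
    selfchain: str      # column 7, kept as a raw string
    break_string: str   # column 8, left unparsed on purpose


def parse_output_line(line: str) -> OutputRecord:
    """Split one tab-delimited output row into its 8 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    # Assumes a single "upstream>downstream" pair in column 6.
    upstream, _, downstream = fields[5].partition(">")
    return OutputRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]),
        fields[4], upstream, downstream, fields[6], fields[7],
    )


if __name__ == "__main__":
    row = "8\t125620346\t17\t35315760\tTophat\tTATDN1>GSDMB\t0\t8:125618280-93|17:35321200-"
    rec = parse_output_line(row)
    print(rec.upstream_gene, rec.downstream_gene, rec.break_string)
```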


Example
-------
perl BreakTrans.pl -g database/Human.Mar2006.RefSeqGenes.tab -C database/Human.Mar2006.chainSelf.tab testdata/SK-BR-3.dna.bd testdata/SK-BR-3.rna.bd > SK-BR-3.drbd

This is the same dataset that is described in our publication.
Please try to reproduce the results on your system before applying the tool to your own dataset.