#Introduction

TIGRA is a computer program that performs targeted local assembly of structural variant (SV) breakpoints from next-generation sequencing short-read data. It takes as input a list of putative SV calls and a set of bam files that contain reads mapped to a reference genome such as NCBI Build 36. For each SV call, it assembles the set of reads that were mapped or partially mapped to the region of interest (ROI) in the corresponding bam files. Instead of outputting a single consensus sequence, TIGRA attempts to construct all the alternative alleles in the ROI, provided they received sufficient sequence coverage (usually >= 2x). It also uses the variant type information in the input files to select reads for assembly. TIGRA is effective at improving SV prediction accuracy and resolution in short-read analysis and can produce accurate breakpoint sequences that are useful for understanding the origin, mechanism, and pathology underlying the SVs.

TIGRA was initially developed at the Genome Institute of Washington University in St. Louis and was further developed at the University of Texas MD Anderson Cancer Center.

This is a beta release, version 0.4.0.

#Install

	1. Download and compile samtools (version 0.1.19 and above) (http://samtools.sourceforge.net/) on your system.
	2. Modify the Makefile to point to the samtools folder on your system.
	3. Type "make" and press Enter.

#Usage:

./tigra <SV file> [<a.bam> <b.bam> ...]

Options: 
	-l INT	Assembly [500] bp from the SV breakpoints
	-a INT	Assembly [50] bp into the SV breakpoints
	-k STR	Comma separated list of kmers [15,25]
	-c STR	Only assemble calls on chromosome [STR]
	-o FILE	Save assembly contigs to [FILE]
	-s INT	Only output contigs longer than [50] bp
	-R FILE	Path to the wildtype reference fasta
	-r FILE	Create pair-wise local reference sequence fastas in [FILE]
	-w INT	Pad local reference by additional [200] bp on both ends
	-q INT	Only assemble reads with mapping quality > [1]
	-N INT	Highlight segments supported by SVReads that differ from reference by at least [5] mismatches
	-p INT	Ignore cases that have average read depth greater than [10000]
	-d	Dump reads by case into fasta files
	-I STR	Save reads fasta into an existing directory
	-b	The input file is in breakdancer format
	-f FILE	Provide a text file containing rows of sample:bam mapping
	-M INT	Skip SVs shorter than [3] bp
	-h INT	Skip complex contig graphs with more than [100] nodes
Version: 0.3.7


#Input:

	The minimum required input is an SV file.
	As shown in the usage, a group of bam files can be specified on the command line or through the -f option.

	TIGRA currently recognizes two types of input:

	1. The 1000 Genomes format
	   The SV calls must be recorded in a tab-delimited format with the following columns: 
		CHR     
		START_OUTER     
		START_INNER     
		END_INNER       
		END_OUTER       
		TYPE_OF_EVENT   
		SIZE_PREDICTION 
		MAPPING_ALGORITHM       
		SEQUENCING_TECHNOLOGY   
		SAMPLEs 
		TYPE_OF_COMPUTATIONAL_APPROACH  
		GROUP   
		OPTIONAL_ID

	   It is critical to have accurate information in CHR, START_INNER, END_INNER, TYPE_OF_EVENT, SIZE_PREDICTION, and SAMPLEs.
	   SAMPLEs should list the sample names, separated by commas.
	   For example:
	   1       829757  829757  829865  829865  DEL       116     MAQ     SLX     NA19238,NA19240    RP      WashU
   
	   To tell the program which bam file belongs to which sample, use the -f option to specify a bam_list_file that contains one key:value pair per row:
	 	sample_name:bam_file_location 
	   with no spaces in between. 
	   For example:
	   	NA19238:1000genomes/ftp/data/NA19238/alignment/NA19238.chrom1.SLX.SRP000032.2009_07.bam
	   	NA19240:1000genomes/ftp/data/NA19240/alignment/NA19240.chrom1.SLX.SRP000032.2009_07.bam
	   Each row can declare only one sample.
	   Only the samples and the bams that contain the SV will be assembled.

	2. BreakDancer format
	   Please use option -b to declare the BreakDancer format.
	   You can use either the long format, e.g.,

10      89690279        +       10      89702321        +       DEL     12042   99      16      example1|16     0.01    2.37
10      85512695        +       10      85513886        +       DEL     1191    99      18      example1|11:example2|7  0.02    0.35

	   or the short format, e.g.,

10      89690279        +       10      89702321        +       DEL     12042
10      85512695        +       10      85513886        +       DEL     1191

	   In the long format, column 11 must list the samples and the number of SV-supporting reads in each sample. Entries for different samples are separated by ":", and within each entry the sample name and the read count are separated by "|". The read counts can be placeholders when they are not available, but the sample names (example1, example2) must be meaningful and must exactly match the sample names in the bam list file, e.g.,
		example1:example1.bam
		example2:example2.bam
	   Column 11 is used in conjunction with the bam list to selectively assemble the subset of bams that may contain the predicted SV. For example, the first deletion above will be assembled using reads from example1 and the second using reads from both example1 and example2. 
	   When the short format is used, all the bams specified in the bam list and on the command line will be used. The bams specified in the bam list take precedence over those specified on the command line.
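
The sample selection described above can be sketched in Python. This only illustrates the documented file formats and is not TIGRA's own code: the function names are hypothetical, and raising an error when a sample is missing from the bam list is an assumption of this sketch rather than documented behavior.

```python
def parse_bam_list(lines):
    """Map sample name -> bam path from rows of the form 'sample:bam_path'."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the first ':' only, so the path itself may contain ':'.
        sample, sep, path = line.partition(":")
        if not sep or not sample or not path:
            raise ValueError(f"malformed bam list row: {raw!r}")
        mapping[sample] = path
    return mapping


def parse_column_11(field):
    """Split 'example1|11:example2|7' into [('example1', '11'), ('example2', '7')].

    Counts stay as strings because they may be placeholders.
    """
    entries = []
    for entry in field.split(":"):
        sample, _, count = entry.partition("|")
        entries.append((sample, count))
    return entries


def bams_for_call(row, bam_map):
    """Return the bam paths to assemble for one BreakDancer row."""
    columns = row.split()
    if len(columns) < 11:
        # Short format: no per-sample information, so every bam is used.
        return list(bam_map.values())
    samples = [sample for sample, _ in parse_column_11(columns[10])]
    missing = [sample for sample in samples if sample not in bam_map]
    if missing:
        # Assumption of this sketch: treat an unknown sample as an error.
        raise KeyError(f"samples not in bam list: {missing}")
    return [bam_map[sample] for sample in samples]
```

With the bam list and the two long-format rows shown above, `bams_for_call` would return only example1.bam for the first deletion and both bams for the second.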

	
#SV types

	TIGRA can currently assemble the following types of SVs:
	
	DEL: deletion;
	INS: insertion;
	ITX: tandem duplication;
	CTX: transchromosomal translocation.
	
	Note that in the BreakDancer file, the SVs are already recorded using these 3-letter abbreviations. For the 1000 Genomes format, please ensure that the TYPE_OF_EVENT column uses the same abbreviations.
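
Before running an assembly, it may help to check that a 1000 Genomes format file uses these abbreviations. The following Python sketch parses one tab-delimited row into the columns listed in the Input section and rejects unsupported event types. The function name is hypothetical, and padding missing trailing columns (such as OPTIONAL_ID) with empty strings is an assumption of this sketch.

```python
# Column names as listed for the 1000 Genomes format.
COLUMNS = [
    "CHR", "START_OUTER", "START_INNER", "END_INNER", "END_OUTER",
    "TYPE_OF_EVENT", "SIZE_PREDICTION", "MAPPING_ALGORITHM",
    "SEQUENCING_TECHNOLOGY", "SAMPLEs", "TYPE_OF_COMPUTATIONAL_APPROACH",
    "GROUP", "OPTIONAL_ID",
]

# SV type abbreviations listed in the SV types section.
SUPPORTED_TYPES = {"DEL", "INS", "ITX", "CTX"}


def parse_sv_row(line):
    """Parse one tab-delimited SV call into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) > len(COLUMNS):
        raise ValueError(f"too many columns: {len(values)}")
    # Assumption: trailing columns that are absent are treated as empty.
    values += [""] * (len(COLUMNS) - len(values))
    record = dict(zip(COLUMNS, values))
    if record["TYPE_OF_EVENT"] not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported SV type: {record['TYPE_OF_EVENT']!r}")
    # Sample names are separated by commas.
    record["SAMPLEs"] = record["SAMPLEs"].split(",")
    return record


if __name__ == "__main__":
    row = "\t".join([
        "1", "829757", "829757", "829865", "829865", "DEL", "116",
        "MAQ", "SLX", "NA19238,NA19240", "RP", "WashU",
    ])
    print(parse_sv_row(row)["SAMPLEs"])
```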


#Additional comments on usage:
   
	-R
		If you'd like to see whether parts of the contigs are novel relative to the reference, i.e., supported by unmapped or poorly mapped reads, please provide the reference fasta file, indexed with samtools faidx, using the -R option followed by the full path. The novel parts of the contigs will be in CAPITAL letters, while the parts identical to the reference will be in lower case. This feature facilitates consistency analysis with split-read algorithms (such as Pindel) that directly examine unmapped or poorly mapped reads. It could also help genotyping algorithms that examine reads spanning the breakpoints.

	-N
		Use in conjunction with -R to define the set of poorly mapped reads.

	-c
		If you'd like to parallelize the jobs by chromosome, please use option -c followed by the chromosome id, so that the program will skip the other chromosomes for this job.  Please make sure that the bams in bam_list_file contain the chromosome of interest.

	-r
		Use in conjunction with -R. This is useful when you want to obtain a fasta file that contains a matched set of local wild-type sequences for breakpoint annotation. 
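
Because the -R option marks novel sequence in CAPITAL letters, the novel parts of a contig can be located with a simple scan. The Python sketch below works on a bare sequence string; the function name and the example sequence are made up for illustration, and reading the contig fasta file itself is left out.

```python
import re


def novel_segments(contig):
    """Return (start, end, sequence) for each run of upper-case bases.

    Coordinates are 0-based and half-open, relative to the contig.
    """
    return [
        (match.start(), match.end(), match.group())
        for match in re.finditer(r"[A-Z]+", contig)
    ]


if __name__ == "__main__":
    # Made-up contig: lower case matches the reference, upper case is novel.
    print(novel_segments("acgtTTGAcca"))  # [(4, 8, 'TTGA')]
```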


#Example commands:

	1. Assemble SVs using example1.bam
        tigra -b -R NCBI36.example.fa -o output1.fa example.breakdancer.sv example1.bam 2> output1.log

	2. Assemble SVs from a list of bam files
        tigra -b -R NCBI36.example.fa -o output2.fa -f example.bamlist example.breakdancer.sv  2> output2.log
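
To parallelize by chromosome with the -c option, one command per chromosome can be generated and handed to a job scheduler. The Python sketch below only builds the command lines from the options documented above. The function name, the per-chromosome output file names, and the chromosome list are hypothetical, and the input is assumed to be in BreakDancer format (-b), as in the examples above.

```python
import shlex


def per_chromosome_commands(chromosomes, sv_file, bam_list, reference,
                            tigra="./tigra"):
    """Build one tigra argument list per chromosome id."""
    commands = []
    for chrom in chromosomes:
        commands.append([
            tigra, "-b",
            "-c", chrom,                      # restrict this job to one chromosome
            "-R", reference,
            "-o", f"output.{chrom}.fa",       # hypothetical output naming scheme
            "-f", bam_list,
            sv_file,
        ])
    return commands


if __name__ == "__main__":
    jobs = per_chromosome_commands(
        ["1", "2", "10"],
        "example.breakdancer.sv", "example.bamlist", "NCBI36.example.fa",
    )
    for args in jobs:
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(shlex.join(args))
```

Each printed line can be run as a separate job; redirecting stderr to a per-chromosome log, as in the examples above, is left to the caller.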


#Contact
Ken Chen (kchen3@mdanderson.org)

#Acknowledgement
MD Anderson Cancer Center
Washington University in St. Louis
1000 Genomes project