#Introduction

TIGRA is a computer program that performs targeted local assembly of structural variant (SV) breakpoints from next-generation sequencing short-read data. It takes as input a list of putative SV calls and a set of bam files that contain reads mapped to a reference genome such as NCBI build36. For each SV call, it assembles the set of reads that were mapped or partially mapped to the region of interest (ROI) in the corresponding bam files. Instead of outputting a single consensus sequence, TIGRA attempts to construct all the alternative alleles in the ROI as long as they received sufficient sequence coverage (usually >= 2x). It also uses the variant type information in the input files to select reads for assembly.

TIGRA is effective at improving SV prediction accuracy and resolution in short-read analysis, and can produce accurate breakpoint sequences that are useful for understanding the origin, mechanism and pathology underlying the SVs.

TIGRA was initially developed at the Genome Institute of Washington University in St. Louis and was further developed at the University of Texas MD Anderson Cancer Center.

This is a beta release, version 0.4.0.

#Install

1. Download and compile samtools (version 0.1.19 and above) (http://samtools.sourceforge.net/) on your system.
2. Modify the Makefile to point to the samtools folder on your system.
3. Type "make" and press Enter.

#Usage:

    ./tigra [ ...]
Options:

    -l INT    Assembly [500] bp from the SV breakpoints
    -a INT    Assembly [50] bp into the SV breakpoints
    -k STR    Comma separated list of kmers [15,25]
    -c STR    Only assemble calls on chromosome [STR]
    -o FILE   Save assembly contigs to [FILE]
    -s INT    Only output contigs longer than [50] bp
    -R FILE   Path to the wildtype reference fasta
    -r FILE   Create pair-wise local reference sequence fastas in [FILE]
    -w INT    Pad local reference by additional [200] bp on both ends
    -q INT    Only assemble reads with mapping quality > [1]
    -N INT    Highlight segments supported by SVReads that differ from reference by at least [5] mismatches
    -p INT    Ignore cases that have average read depth greater than [10000]
    -d        Dump reads by case into fasta files
    -I STR    Save reads fasta into an existing directory
    -b        The input file is in breakdancer format
    -f        Provide a text file containing rows of sample:bam mapping
    -M INT    Skip SVs shorter than [3] bp
    -h INT    Skip complex contig graphs with more than [100] nodes

    Version: 0.3.7

#Input:

The minimally required input is an SV file. As shown in the usage, a group of bam files can be specified on the command line or with the -f option. TIGRA currently recognizes two types of input:

1. The 1000 Genomes format

The SV calls must be recorded in a tab-delimited format with the following columns:

    CHR START_OUTER START_INNER END_INNER END_OUTER TYPE_OF_EVENT SIZE_PREDICTION MAPPING_ALGORITHM SEQUENCING_TECHNOLOGY SAMPLEs TYPE_OF_COMPUTATIONAL_APPROACH GROUP OPTIONAL_ID

It is critical to have accurate information in CHR, START_INNER, END_INNER, TYPE_OF_EVENT, SIZE_PREDICTION, and SAMPLEs. SAMPLEs should be the sample names separated by commas. For example:

    1 829757 829757 829865 829865 DEL 116 MAQ SLX NA19238,NA19240 RP WashU

To tell the program which bam file belongs to which sample, use the -f option to specify a bam_list_file containing key:value pairs of the form sample_name:bam_file_location, with no space in between.
For example:

    NA19238:1000genomes/ftp/data/NA19238/alignment/NA19238.chrom1.SLX.SRP000032.2009_07.bam
    NA19240:1000genomes/ftp/data/NA19238/alignment/NA19240.chrom1.SLX.SRP000032.2009_07.bam

Each row can declare only one sample. Only the samples and the bams that contain the SV will be assembled.

2. BreakDancer format

Please use option -b to declare the BreakDancer format. You can use either the long format, e.g.,

    10 89690279 + 10 89702321 + DEL 12042 99 16 example1|16 0.01 2.37
    10 85512695 + 10 85513886 + DEL 1191 99 18 example1|11:example2|7 0.02 0.35

or the short format, e.g.,

    10 89690279 + 10 89702321 + DEL 12042
    10 85512695 + 10 85513886 + DEL 1191

In the long format, column 11 must contain a list of samples and the number of SV-supporting reads in each sample. Samples are separated by ":", and each sample name is separated from its read count by "|". The numbers of supporting reads can be placeholders when they are not available, but the sample names (example1, example2) must be meaningful and must match exactly the sample names in the bam list file, e.g.,

    example1:example1.bam
    example2:example2.bam

Column 11 is used in conjunction with the bam list to selectively assemble the subset of bams that may contain the predicted SV. For example, the first deletion above will be assembled using reads from example1, and the second using reads from both example1 and example2.

When the short format is used, all the bams specified in the bam list and on the command line will be used. The bams specified in the bam list take precedence over those specified as command-line arguments.

#SV types

TIGRA can currently assemble the following types of SVs:

    DEL: deletion
    INS: insertion
    ITX: tandem duplication
    CTX: transchromosomal translocation

Note that in the BreakDancer file, the SVs are already recorded with these 3-letter abbreviations. For the 1000 Genomes format, please ensure that the TYPE_OF_EVENT column uses the same abbreviations.
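As a rough illustration of how column 11 of the BreakDancer long format interacts with the bam list described in the Input section, the selection can be sketched in Python as follows. This is not TIGRA's actual code: the function names (parse_bam_list, samples_in_column11, bams_for_call) are hypothetical, and the sketch assumes one bam per sample name and whitespace-separated SV columns.

```python
def parse_bam_list(rows):
    """Parse rows of the form sample_name:bam_file_location into a dict.

    Assumption (not from TIGRA itself): each sample name appears once.
    """
    mapping = {}
    for raw in rows:
        row = raw.strip()
        if not row:
            continue  # skip blank rows
        # Split on the first ":" only; the sample name comes before it.
        sample, _, bam = row.partition(":")
        mapping[sample] = bam
    return mapping


def samples_in_column11(field):
    """Return sample names from a field such as 'example1|11:example2|7'."""
    return [entry.split("|", 1)[0] for entry in field.split(":") if entry]


def bams_for_call(sv_row, bam_map):
    """Pick the bams to assemble for one BreakDancer-format SV row."""
    cols = sv_row.split()
    if len(cols) < 11:
        # Short format: no per-sample information, so use every listed bam.
        return list(bam_map.values())
    # Long format: keep only samples named in column 11 and in the bam list.
    return [bam_map[s] for s in samples_in_column11(cols[10]) if s in bam_map]


if __name__ == "__main__":
    bam_map = parse_bam_list(["example1:example1.bam", "example2:example2.bam"])
    long_row = "10 85512695 + 10 85513886 + DEL 1191 99 18 example1|11:example2|7 0.02 0.35"
    print(bams_for_call(long_row, bam_map))
```

With the two-row bam list shown in the Input section, the first long-format deletion would select only example1.bam, while a short-format row would select both bams.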
#Additional comments on the usages:

-R
If you'd like to see whether part of a contig is novel relative to the reference, i.e., supported by unmapped or poorly mapped reads, please provide the program with a reference fasta file indexed with samtools faidx, using the -R option followed by the full path. The novel parts of the contigs will be in CAPITAL letters, while the parts identical to the reference will be in lower case. This feature facilitates consistency analysis with split-read algorithms (such as Pindel) that directly examine unmapped or poorly mapped reads. It could also help genotyping algorithms that look for reads spanning the breakpoints.

-N
Use in conjunction with -R to define the set of poorly mapped reads.

-c
If you'd like to parallelize the jobs by chromosome, please use option -c followed by the chromosome id, so that the program will skip the other chromosomes for this job. Please make sure that the bams in bam_list_file contain the chromosome of interest.

-r
Use in conjunction with -R. This is useful when you want to obtain a fasta file that contains a matched set of local wild-type sequences for breakpoint annotation.

#Example commands:

1. Assemble SVs using example1.bam

    tigra -b -R NCBI36.example.fa -o output1.fa example.breakdancer.sv example1.bam 2> output1.log

2. Assemble SVs from a list of bam files

    tigra -b -R NCBI36.example.fa -o output2.fa -f example.bamlist example.breakdancer.sv 2> output2.log

#Contact

Ken Chen (kchen3@mdanderson.org)

#Acknowledgement

MD Anderson Cancer Center
Washington University in St. Louis
1000 Genomes project