OVERLAP


Table of Contents

FUNCTION

DESCRIPTION

FUNCTION

Overlap compares two sets of DNA sequences to each other in both orientations using a WordSearch-style comparison.

DESCRIPTION

Overlap accepts two sets of sequences as input and uses the algorithm of Wilbur and Lipman (Proc. Natl. Acad. Sci. USA 80; 726-730 (1983)) to compare each sequence of the first set with each sequence of the second set, in both orientations. In effect, Overlap runs WordSearch repeatedly, using each sequence of the first set as a query. Unlike WordSearch, Overlap looks for overlaps between sequences rather than simply regions of similarity. An overlap is a highly similar region between two sequences that runs the entire length of a register of comparison. Overlap lists the position, length, and stringency of each discovered overlap in an output file.
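
The word comparison in both orientations can be illustrated with a minimal Python sketch. It is not Overlap's implementation: the function names, the exact-match word table, and the diagonal convention (target start minus query start) are assumptions made for illustration. The sketch counts the words shared on each diagonal for the target as given and for its reverse complement, then reports the strongest diagonal.

```python
from collections import Counter, defaultdict

# Complement table for unambiguous DNA bases (assumption: no IUPAC ambiguity codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def word_positions(seq, word_size):
    """Map each word of length word_size to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - word_size + 1):
        table[seq[i:i + word_size]].append(i)
    return table


def diagonal_scores(query, target, word_size=5):
    """Count shared words on each diagonal (target start minus query start)."""
    table = word_positions(target, word_size)
    scores = Counter()
    for i in range(len(query) - word_size + 1):
        for j in table.get(query[i:i + word_size], ()):
            scores[j - i] += 1
    return scores


def best_diagonal_both_strands(query, target, word_size=5):
    """Return (strand, diagonal, word_matches) for the strongest diagonal, or None."""
    best = None
    for strand, oriented in (("+", target), ("-", reverse_complement(target))):
        for diag, count in diagonal_scores(query, oriented, word_size).items():
            if best is None or count > best[2]:
                best = (strand, diag, count)
    return best


if __name__ == "__main__":
    a = "TTGACCATGCAAGT"
    b = "ATGCAAGTCCCCCC"  # starts with the last eight bases of a
    print(best_diagonal_both_strands(a, b))
```

In this sketch a negative diagonal means the shared region begins further into the query than into the target, which is the situation when the end of one fragment overlaps the start of another.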

EXAMPLE

Here is a session using Overlap:


% overlap

 OVERLAP what query sequences ?  mu*.seq

 To what other sequences (* mu*.seq *) ?

 What word size (* 5 *) ?

 What fraction of the words in an overlap must match (* 0.80 *) ?

 Integrate how many adjacent diagonals (* 3 *) ?

 What is the minimum overlap length (* 10 *) ?

 What should I call the output file (* overlap.dat *) ?

   Reading ............
 Comparing ............

%

OUTPUT

Here is the output file:


 OVERLAP of: mu*.seq
         to: mu*.seq

 Min overlap fraction: 0.80  Min overlap length: 10  Integral width: 3

                       September 24, 1998 17:00

Sequence1 Strand  Pos Sequence2 Strand  Pos Length Matches Ratio  Len1  Len2 ..

mu10.seq       +    2 mu5.seq        -    1    230     205  0.89   361   230
mu6.seq        +  183 mu5.seq        +    1    114     109  0.96   296   230

mu2.seq        +    2 mu23.seq       +    1    256     254  0.99   328   256

mu27.seq       +   23 mu18b.seq      -    1    203     199  0.98   290   203
mu27.seq       +  260 mu26b.seq      +    1     31      34  1.10   290   173
mu27.seq       +  222 mu26.Seq       +    1     44      44  1.00   290    44
mu27.Seq       +  224 mu18.Seq       -    1     42      42  1.00   290    42
mu27.Seq       +  226 mu32.Seq       -    1     40      40  1.00   290    40
mu27.Seq       +  222 mu9.Seq        +    1     39      39  1.00   290    39
mu26b.Seq      +  163 mu27.Seq       +    1     11      29  2.64   173   290
mu26.Seq       +    3 mu18.Seq       -    1     42      42  1.00    44    42
mu26.Seq       +    5 mu32.Seq       -    1     40      40  1.00    44    40
mu26.Seq       +    1 mu9.Seq        +    1     39      39  1.00    44    39
mu18.Seq       +    1 mu32.Seq       +    1     40      40  1.00    42    40
mu18.Seq       +    6 mu9.Seq        -    1     37      37  1.00    42    39
mu32.Seq       +    6 mu9.Seq        -    1     35      35  1.00    40    39

In this example, the overlap pairs are divided into three groups, or overlap clusters, separated by blank lines. Each cluster consists of overlapping fragments that could be chained together into a single, continuous assembly.
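
One way to picture an overlap cluster is as a connected component: two sequences belong to the same cluster if a chain of reported overlaps links them. The manual does not say how Overlap builds its clusters, so the Python sketch below is an assumption-laden illustration that uses a small union-find structure. The function name and the shortened sequence names are hypothetical.

```python
def overlap_clusters(pairs):
    """Group sequence names into clusters linked by at least one overlap."""
    parent = {}

    def find(name):
        # Register unseen names as their own root, then walk up to the root.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    clusters = {}
    for name in parent:
        clusters.setdefault(find(name), set()).add(name)
    # Sort names within each cluster, then clusters by their first name.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


if __name__ == "__main__":
    pairs = [("mu10", "mu5"), ("mu6", "mu5"), ("mu2", "mu23"),
             ("mu27", "mu18b"), ("mu27", "mu26b")]
    for cluster in overlap_clusters(pairs):
        print(cluster)
```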

The output file lists the length, position, and percent similarity (ratio) of each overlap in descending order of sequence and overlap length. It also gives the orientation of each sequence.
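
For downstream scripting, a row of the table can be split on whitespace into its eleven columns. The Python sketch below is illustrative only: the record type and field names are hypothetical and simply mirror the column headings. In the sample rows shown above, the Ratio column agrees with Matches divided by Length rounded to two decimals, and the example recomputes it on that assumption.

```python
from typing import NamedTuple


class OverlapRow(NamedTuple):
    """One data row of the output table (field names follow the column headings)."""
    seq1: str
    strand1: str
    pos1: int
    seq2: str
    strand2: str
    pos2: int
    length: int
    matches: int
    ratio: float
    len1: int
    len2: int


def parse_row(line):
    """Parse one whitespace-separated data row into an OverlapRow."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, found {len(fields)}")
    return OverlapRow(
        fields[0], fields[1], int(fields[2]),
        fields[3], fields[4], int(fields[5]),
        int(fields[6]), int(fields[7]), float(fields[8]),
        int(fields[9]), int(fields[10]),
    )


if __name__ == "__main__":
    row = parse_row("mu10.seq       +    2 mu5.seq        -    1    230     205  0.89   361   230")
    print(row.seq1, row.seq2, round(row.matches / row.length, 2))
```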

INPUT FILES

Overlap accepts two separate groups of multiple (one or more) nucleotide sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If Overlap rejects your nucleotide sequence, see Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

WordSearch uses the same comparison algorithm as Overlap. WordSearch, however, accepts a single query sequence as input and finds regions of similarity rather than overlaps. LineUp is a screen editor for editing and displaying overlapping sequences.

RESTRICTIONS

The total length of bases in any sequence set may not exceed 350,000. The word size must be between 1 and 30. The minimum overlap length must be between 1 and 1,000. The program cannot store more than 10,000 overlaps. If this number is exceeded, the program stops after suggesting that you increase the stringency to reduce the number of overlaps.

ALGORITHM

Overlap, like WordSearch, identifies sequence similarities using a Wilbur and Lipman-style word comparison (see the WordSearch entry in the Program Manual for information regarding the details of this algorithm and considerations about using this search). Overlap differs from WordSearch in that it accepts a set of query sequences as input and reports overlaps rather than regions of similarity.

Overlap removes gap characters (. and ~) from the input sequences before comparing them.
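
As a minimal illustration of this preprocessing step, the Python helper below drops the two gap characters from a sequence string. The function name is hypothetical, and the sketch makes no claim about how Overlap itself performs the removal.

```python
# Deletion table for the two gap characters named in the text.
GAP_TABLE = str.maketrans("", "", ".~")


def strip_gaps(seq):
    """Return seq with the gap characters '.' and '~' removed."""
    return seq.translate(GAP_TABLE)


if __name__ == "__main__":
    print(strip_gaps("AC..GT~~A"))
```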

CONSIDERATIONS

For considerations in using a word comparison, see the CONSIDERATIONS topic in the WordSearch entry of the Program Manual.

Overlap recognizes a region of similarity as an overlap based on the strength of similarity across the entire register of comparison and on the length of the register itself. These requirements correspond to the stringency and minimum overlap values that you set in response to Overlap's prompts. A stringency of 0.95 means that 95 percent of the bases in a given register of comparison must match for that similarity to be recognized as an overlap. A minimum overlap of 10 means that a given register of comparison must contain at least 10 bases to qualify as an overlap. The figure at the end of this entry illustrates these requirements. Examples four through six of this figure are not overlaps for the following reasons:

4. -- Although the two sequences are highly similar from B to C, the similarity over the length of the entire register, A to D, is not particularly strong. Highly similar segments that are not positioned at the end of each sequence are not reported as overlaps. The exception to this is the third example, in which a short sequence is completely similar to an internal segment of a larger sequence.

5. -- These sequences are not similar enough to contain an overlap. The minimum stringency requirement for overlaps is not met.

6. -- The register of comparison containing the similarity is not long enough; an overlap must be at least as long as the minimum overlap length.
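
The two requirements described above can be written as a short Python predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the similarity is taken to be the number of matching bases divided by the length of the register of comparison.

```python
def is_overlap(matches, register_length, stringency=0.80, min_overlap=10):
    """Return True if a register of comparison qualifies as an overlap.

    matches          number of matching bases in the register
    register_length  number of bases in the register
    stringency       minimum fraction of bases that must match
    min_overlap      minimum number of bases in the register
    """
    if register_length <= 0 or register_length < min_overlap:
        return False  # too short (compare example 6)
    return matches / register_length >= stringency  # too weak otherwise (compare example 5)


if __name__ == "__main__":
    print(is_overlap(95, 100, stringency=0.95))  # meets both requirements
    print(is_overlap(94, 100, stringency=0.95))  # similarity too low
    print(is_overlap(9, 9))                      # register too short
```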

To make the search for overlaps more tolerant of gaps between sequences, Overlap combines the scores of a user-defined number of adjacent diagonals, or registers of comparison (see the ALGORITHM topic in the WordSearch entry of the Program Manual). Thus, the reported percent similarity or ratio may be larger than the actual ratio and may even be greater than 100%. Combining the scores of adjacent diagonals in Overlap may cause the listed overlap position to be a few bases removed from the actual overlap position.
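
The effect of combining diagonals can be seen in a small Python sketch. It assumes a window of the given width placed around each diagonal; the manual does not state how Overlap positions the window, and the function name is hypothetical. When a gap splits the matches of one overlap across two neighboring diagonals, the summed score can exceed the length of a single register, which is how a ratio above 1.00 arises.

```python
def integrate_diagonals(scores, width=3):
    """Sum each diagonal's score with its neighbors inside a window of `width`.

    scores maps a diagonal number to its match count. The window runs from
    (width - 1) // 2 diagonals below to width // 2 diagonals above (assumption).
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    low = -((width - 1) // 2)
    high = width // 2
    return {
        diag: sum(scores.get(diag + offset, 0) for offset in range(low, high + 1))
        for diag in scores
    }


if __name__ == "__main__":
    # A gap has split one overlap between diagonals 0 and 1.
    scores = {0: 6, 1: 5, 7: 2}
    combined = integrate_diagonals(scores, width=3)
    register_length = 10
    print(combined[0], combined[0] / register_length)
```

Here diagonal 0 alone holds 6 matches over a register of 10 bases, but the integrated score is 11, giving a ratio of 1.1.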

SUGGESTIONS

If you are looking only for weak overlaps, you can use -UPPERlimit to specify a maximum stringency. Overlaps containing more than this maximum fraction of matching bases are not reported in the output file. For example, if you run % overlap -STRIngency=0.6 -UPPERlimit=0.7, the output contains only overlaps in which 60 to 70 percent of the bases match.
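
The resulting stringency range behaves like a simple band-pass filter. The Python sketch below applies it to a list of (name, ratio) pairs; the function name and the sample values are made up for illustration, and both ends of the range are assumed to be inclusive.

```python
def weak_overlaps(rows, stringency=0.6, upper_limit=0.7):
    """Keep (name, ratio) pairs whose ratio lies within [stringency, upper_limit]."""
    return [(name, ratio) for name, ratio in rows
            if stringency <= ratio <= upper_limit]


if __name__ == "__main__":
    rows = [("pair1", 0.55), ("pair2", 0.65), ("pair3", 0.70), ("pair4", 0.96)]
    print(weak_overlaps(rows))
```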

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % overlap [-INfile1=]mu*.seq -Default

Prompted Parameters:

[-INfile2=]mu*.seq        specifies second search set
-WORdsize=5               sets the length of word for a match
-STRIngency=.80           sets the minimum fraction of required word matches
-MINOverlap=10            sets the minimum overlap length
-INTegrate=3              sets the number of diagonals to integrate
[-OUTfile=]overlap.dat    names the output file

Local Data Files:  None

Optional Parameters:

-UPPERlimit=.90           sets the upper limit on stringency

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-WORdsize=5

sets the word size for comparison between sequences. The range of word sizes is from 1 to 10.

-STRIngency=.80

sets the minimum fraction of bases that must match for a given register of comparison to qualify as an overlap.

-MINOverlap=10

sets the minimum number of bases that must be present for a given register of comparison to qualify as an overlap. The minimum overlap length must be between 1 and 1,000.

-INTegrate=3

sets the number of adjacent diagonals used to calculate scores.

-UPPERlimit=.90

sets an upper limit on the stringency of the overlaps that Overlap reports. Together with the minimum stringency, this defines the range within which an overlap's stringency must fall for it to be reported.

Printed: December 9, 1998 16:23 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.