XNU

[ Program Manual | User's Guide | Data Files | Databases ]

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
CONSIDERATIONS
COMMAND-LINE SUMMARY
ACKNOWLEDGEMENT
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

Xnu replaces statistically significant tandem repeats in protein sequences with X characters. If a resulting protein sequence is used as a query for a BLAST search, the regions with X characters are ignored.

DESCRIPTION

The Karlin-Altschul statistics that underlie BLAST assume that the probability of finding a given residue at any particular position in a sequence depends only on that residue's overall frequency in the sequence. Tandem repeats, which occur frequently in proteins, may violate this assumption. Query sequences containing such repeats may give significant similarity scores when compared to unrelated proteins containing similar repeats.

Xnu is a program described by Claverie and States in Computers and Chemistry, 17; 191-201 (1993) that is used to mask off tandem repeats in protein sequences. The output sequence is just like the input sequence except that if tandem repeats are found, the amino acid characters comprising such repeats are replaced by X's. Regions containing X's are ignored in a BLAST search.

EXAMPLE

Here is a session using Xnu to mask off the repeats in a human major prion protein precursor.


% xnu

 XNU of what input sequence(s) ?  PIR:Ujhu

                  Begin (* 1 *) ?
                End (*   253 *) ?

 What should I call the output file (* ujhu.xnu *) ?

        PIR1:UJHU   Len:     253
%

OUTPUT

Each output file contains the input sequence with the amino acid characters that comprise statistically significant tandem repeats changed into X's. Here is the output file from the session above.


!!AA_SEQUENCE 1.0
  XNU of: ujhu  check: 8781  from: 1  to: 253

P1;UJHU - major prion protein precursor - human
N;Alternate names: 11K amyloid protein; 27-30K sialoglycoprotein; PrP 27-30;
 PrP 33-35C; scrapie prion protein
C;Species: Homo sapiens (man)
C;Date: 25-Oct-1987 #sequence_revision 12-Apr-1996 #text_change 05-Sep-1997
C;Accession: A24173; A40372; A05017; S14078; I54322; I68597; I58135; I59184;
 I79633; I79634
R;Kretzschmar, H.A.; Stowring, L.E.; Westaway, D.; Stubblebine, W.H.; Prusiner,
 S.B.; Dearmond, S.J

ujhu.xnu  Length: 253  October 13, 1998 13:56  Type: P  Check: 1796  ..

       1  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP

      51  PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWX

     101  XXXXXXXXXX XXXXXXXXXX XXXXXXXYML GSAMSRPIIH FGSDYEDRYY

     151  RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV NITIKQHTVT TTTKGENFTE

     201  TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPX XXXXXXXXXX

     251  XVG
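
If you want to know which positions were masked, you can scan the output for runs of X. The sketch below is not part of the Wisconsin Package; it assumes that the sequence begins after the line ending in ".." (as in the file shown above) and that every X in the output comes from masking rather than from an X already present in the input. The function names are hypothetical.

```python
import re

def gcg_sequence(text):
    """Return the residues that follow the '..' divider line."""
    lines = text.splitlines()
    # Assumption: the sequence starts after the first line ending in "..".
    start = next(i for i, line in enumerate(lines)
                 if line.rstrip().endswith("..")) + 1
    # Drop position numbers and spacing; keep only the letters.
    return "".join(re.findall(r"[A-Za-z]", "\n".join(lines[start:])))

def masked_regions(sequence):
    """Return 1-based, inclusive (start, end) pairs for each run of X."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"X+", sequence)]

# The sequence portion of the output file shown above.
example = """\
ujhu.xnu  Length: 253  October 13, 1998 13:56  Type: P  Check: 1796  ..

       1  MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP

      51  PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWX

     101  XXXXXXXXXX XXXXXXXXXX XXXXXXXYML GSAMSRPIIH FGSDYEDRYY

     151  RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV NITIKQHTVT TTTKGENFTE

     201  TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPX XXXXXXXXXX

     251  XVG
"""

print(masked_regions(gcg_sequence(example)))  # [(100, 127), (240, 251)]
```

For the session above, this reports two masked regions, residues 100 through 127 and 240 through 251.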

INPUT FILES

You can specify either a single protein sequence or multiple protein sequences as input to Xnu. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If Xnu rejects your protein sequence, see Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

Seg replaces low complexity regions in protein sequences with X characters. If a resulting protein sequence is used as a query for a BLAST search, the regions with X characters are ignored.

Repeat finds direct repeats in sequences. You must set the size, stringency, and range within which the repeat must occur; all the repeats of that size or greater are displayed as short alignments.

RESTRICTIONS

Xnu does not recognize repeats if the width is set much longer than the length of either the repeat or the sequence. Its behavior is not characterized for sequence symbols that are not among the standard unambiguous IUPAC-IUB amino acid single-letter symbols (ACDEFGHIKLMNPQRSTVWY).

CONSIDERATIONS

Repeat sequences are scored as segment pairs (short gapless alignments). All of the residues in both of the segments of a significant pair are replaced with X's.
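
The idea of scoring segment pairs and masking both segments can be sketched as follows. This is a simplified illustration, not Xnu's implementation: it assumes identity scoring (hypothetical match and mismatch values) in place of a PAM120 matrix, and a fixed score cutoff in place of a statistical one. The sequence is compared with itself at each offset up to max_width, and every gapless run whose score reaches the cutoff has both of its segments replaced with X's.

```python
def mask_repeats(seq, max_width=4, match=2, mismatch=-3, cutoff=10):
    """Mask both segments of each high-scoring gapless self-alignment.

    Assumptions (not taken from Xnu): identity scoring and a fixed cutoff.
    """
    masked = list(seq)
    for offset in range(1, max_width + 1):
        diag_len = len(seq) - offset          # aligned pairs at this offset
        run_start, score, peak, peak_end = 0, 0, 0, -1
        for i in range(diag_len + 1):
            if i < diag_len:
                score += match if seq[i] == seq[i + offset] else mismatch
                if score > peak:
                    peak, peak_end = score, i
            # Close the current run at the end of the diagonal or when the
            # running score falls to zero or below.
            if i == diag_len or score <= 0:
                if peak >= cutoff:
                    for j in range(run_start, peak_end + 1):
                        masked[j] = "X"            # first segment
                        masked[j + offset] = "X"   # second segment
                run_start, score, peak, peak_end = i + 1, 0, 0, -1
    return "".join(masked)

print(mask_repeats("MKLVSTASTASTASTAQRW"))  # MKLVXXXXXXXXXXXXQRW
print(mask_repeats("MKLVSTASTAQRW"))        # MKLVSTASTAQRW (score too low)
```

Scores are always computed on the original sequence, so a residue masked at one offset can still contribute to a segment pair at another.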

Xnu uses a PAM120 scoring matrix for scoring similarities. You cannot select any other scoring matrix. By default, only repeats less than five residues long are masked; to mask longer repeats, set a different maximum repeat length with -WIDth.

A repeat unit that occurs only twice in tandem will often not be masked, while three tandem copies of the same unit will be. For example, STUSTU would not be found, whereas STUSTUSTU would be.
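
One plausible way to see the difference is to count how many residues line up when a sequence is compared with itself shifted by one repeat unit: n tandem copies of a unit of length p give (n - 1) * p aligned identical residues, so the third copy doubles the evidence for the repeat. The short sketch below only counts identities; it is an illustration under that assumption and does not reproduce Xnu's PAM120 scoring or its cutoff.

```python
def self_match_count(seq, period):
    """Count identical residue pairs when seq is aligned with itself
    shifted by `period` positions (illustrative helper, not part of Xnu)."""
    return sum(a == b for a, b in zip(seq, seq[period:]))

print(self_match_count("STUSTU", 3))     # 3
print(self_match_count("STUSTUSTU", 3))  # 6
```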

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % xnu [-INfile=]PIR:Ujhu -Default

Prompted Parameters: (for single sequences)

-BEGin=1  -END=253      sets the range of interest
[-OUTfile=]ujhu.xnu     names the output file

Local Data Files:       None

Optional Parameters:

-BEGin=1  -END=100      sets the range of interest (for multiple sequences)
-PRObability=.01        sets the expectation level for a repeat
-WIDth=4                sets the maximum size of a repeat
-EXTension=.xnu         sets the default output file name extension
-LIStfile[=xnu.list]    writes a list file of output sequence names
-NOMONitor              suppresses screen monitor of input sequence names
-NOSUMmary              suppresses the screen summary
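
If you run Xnu from a script, you may find it convenient to assemble the command line from the parameters summarized above. The helper below is a hypothetical sketch that only builds the argument list; it does not run the program. It uses parameter names from this summary and the value limits given in the PARAMETER REFERENCE section.

```python
def xnu_command(infile, probability=None, width=None, outfile=None,
                default=True):
    """Build an argument list for xnu (hypothetical helper)."""
    args = ["xnu", f"-INfile={infile}"]
    if outfile is not None:
        args.append(f"-OUTfile={outfile}")
    if probability is not None:
        # Documented limits: 0.0001 to 0.1.
        if not 0.0001 <= probability <= 0.1:
            raise ValueError("probability must be between 0.0001 and 0.1")
        args.append(f"-PRObability={probability}")
    if width is not None:
        # Documented maximum is 100; zero searches for repeats of any size.
        if not 0 <= width <= 100:
            raise ValueError("width must be between 0 and 100")
        args.append(f"-WIDth={width}")
    if default:
        args.append("-Default")  # suppress program interaction
    return args

print(" ".join(xnu_command("PIR:Ujhu", probability=0.001, width=8)))
# xnu -INfile=PIR:Ujhu -PRObability=0.001 -WIDth=8 -Default
```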

ACKNOWLEDGEMENT

Xnu was written by Jean Michel Claverie and David States while they were working at the National Center for Biotechnology Information (NCBI). Their public-domain program was modified by Scott Rose for distribution with Version 9 of the Wisconsin Package. The document you are now reading was written by John Devereux. We are very grateful to Claverie and States for their work on Xnu and to NCBI for making this program available to the scientific community.

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-BEGin=1

sets the beginning position for all input sequences. When the beginning position is set from the command line, Xnu ignores beginning positions specified for individual sequences in a list file.

-END=100

sets the ending position for all input sequences. When the ending position is set from the command line, Xnu ignores ending positions specified for sequences in a list file.

-PRObability=.01

For a repeat to be recognized, it must score high enough that you would not expect to see a higher score more than once in 100 searches of random sequences of average length and composition. Use this parameter to change that expectation cutoff. Setting the cutoff lower than its default of 0.01 makes the search more stringent, so fewer repeats are masked. The minimum and maximum values of this parameter are 0.0001 and 0.1.
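
To see how an expectation cutoff translates into a score cutoff, you can use the general Karlin-Altschul relation E = K * m * n * exp(-lambda * S) and solve it for S. The sketch below does this; it is not taken from the Xnu source, and the values of lam, k, and the sequence lengths are illustrative placeholders, not the parameters Xnu uses with PAM120.

```python
import math

def score_cutoff(expectation, m, n, lam, k):
    """Smallest score S for which K*m*n*exp(-lam*S) <= expectation."""
    return math.log(k * m * n / expectation) / lam

# Placeholder statistical parameters and lengths (assumptions).
lam, k, length = 0.3, 0.1, 300

default_cut = score_cutoff(0.01, length, length, lam, k)
strict_cut = score_cutoff(0.001, length, length, lam, k)

# A lower expectation always requires a higher score, so fewer
# repeats qualify for masking.
print(strict_cut > default_cut)  # True
```

Each tenfold reduction in the expectation raises the required score by ln(10) / lambda, whatever values K and the sequence lengths take.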

-WIDth=4

sets the maximum size of a repeat. If a repeat were of length five, even if it were significant, it would not be found if this parameter were set to four. When this value is set to zero, Xnu will search for repeats of any size. Very short repeats may not score above the default probability cutoff (see -PRObability above). The maximum value of this parameter is 100. The larger it is, the longer the search will take.

-EXTension=.xnu

This program normally creates output file names by using the original input file name for the base name and the program name for the name extension. Use this parameter to specify some other file name extension.

-LIStfile=xnu.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases in the User's Guide). If you don't specify a file name, then Xnu makes one up using xnu for the file name and .list for the file name extension.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: December 9, 1998 16:29 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com