Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.
Motifs looks for protein motifs by searching protein sequences for regular-expression patterns described in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Motifs can only be used to search for patterns in protein sequences.
There is a very informative abstract on every motif in the PROSITE Dictionary. These abstracts are included in the output if any motif is found in your sequence.
The PROSITE Dictionary was compiled and is maintained by Dr. Amos Bairoch of the University of Geneva.
Here is a session using Motifs to look for sequence motifs in PIR:Kihua:
  % motifs

   MOTIFS from what protein sequence(s) ?  PIR:Kihua

   What should I call the output file (* kihua.motifs *) ?

   KIHUA  len: 194 ................................

       Total finds:        1
      Total length:      194
   Total sequences:        1
    CPU time (sec):     3.60
       Output file: "kihua.motifs"

  %
Here is some of the output file:
  MOTIFS from: PIR:Kihua

  Mismatches: 0                              September 25, 1998 11:39  ..

  KIHUA  Check: 1665  Length: 194  ! adenylate kinase (EC 2.7.4.3) 1 - human
  ______________________________________________________________________________

  Adenylate_Kinase    (L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)
                      (L,I,F){3}DG(Y)PRx{3}(Q)
             90: NTSKG  FLIDGYPREVQQ  GEEFE

  ******************************
  * Adenylate kinase signature *
  ******************************

  Adenylate kinase (EC 2.7.4.3) (AK) [1] is a small monomeric enzyme that
  catalyzes the reversible transfer of MgATP to AMP (MgATP + AMP = MgADP + ADP).
  In mammals there are three different isozymes:

   - AK1 (or myokinase), which is cytosolic.
   - AK2, which is located in the outer compartment of mitochondria.
   - AK3 (or GTP:AMP phosphotransferase), which is located in the mitochondrial
     matrix and which uses MgGTP instead of MgATP.

  The sequence of AK has also been obtained from different bacterial species
  and from plants and fungi.

  Two other enzymes have been found to be evolutionary related to AK. These
  are:

   - Yeast uridylate kinase (EC 2.7.4.-) (UK) (gene URA6) [2] which catalyzes
     the transfer of a phosphate group from ATP to UMP to form UDP and ADP.
   - Slime mold UMP-CMP kinase (EC 2.7.4.14) [3] which catalyzes the transfer
     of a phosphate group from ATP to either CMP or UMP to form CDP or UDP and
     ADP.

  Several regions of AK family enzymes are well conserved, including the
  ATP-binding domains. We have selected the most conserved of all regions as a
  signature for this type of enzyme. This region includes an aspartic acid
  residue that is part of the catalytic cleft of the enzyme and that is
  involved in a salt bridge. It also includes an arginine residue whose
  modification leads to inactivation of the enzyme.

  -Consensus pattern: [LIVMFYW](3)-D-G-[FYI]-P-R-x(3)-[NQ]
  -Sequences known to belong to this class detected by the pattern: ALL, except
   for Schistosoma mansoni (blood fluke) and Yersinia enterocolitica AK.
  -Other sequence(s) detected in SWISS-PROT: NONE.
  -Note: archaebacterial AK do not belong to this family [4].
  -Last update: November 1997 / Pattern and text revised.

  [ 1] Schulz G.E.
       Cold Spring Harbor Symp. Quant. Biol. 52:429-439(1987).
  [ 2] Liljelund P., Sanni A., Friesen J.D., Lacroute F.
       Biochem. Biophys. Res. Commun. 165:464-473(1989).
  [ 3] Wiesmueller L., Noegel A.A., Barzu O., Gerisch G., Schleicher M.
       J. Biol. Chem. 265:6339-6345(1990).
  [ 4] Kath T.H., Schmid R., Schaefer G.
       Arch. Biochem. Biophys. 307:405-410(1993).

  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Above each find, the regular expression found by the program is displayed ((L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)). Below this is a simplification of the expression showing only the amino acids and ranges that were actually matched ((L,I,F){3}DG(Y)PRx{3}(Q)), so that you can better see what was found. The find is displayed between the five residues that flank it on the N-terminal side and the five that flank it on the C-terminal side. The number to the left of the find is the coordinate of the first symbol of the motif (not of the flanking symbols). In the example above, 90 is the coordinate of the first F in FLIDGYPREVQQ, not of the first N in NTSKG.
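The display convention described above can be sketched in Python. This is only an illustration, not the Motifs implementation: the regular expression is a hand-written Python equivalent of the adenylate kinase pattern, show_finds is a hypothetical helper name, and the input is just the 22-residue fragment visible in the sample output, so the coordinate it reports is relative to that fragment rather than to the full KIHUA sequence.

```python
import re

# Hand-written Python equivalent of (L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q).
ADENYLATE_KINASE = re.compile(r"[LIVMFYW]{3}DG[FYI]PR.{3}[NQ]")


def show_finds(sequence, regex, flank=5):
    """Return (coordinate, left flank, find, right flank) for each find.

    The coordinate is 1-based and refers to the first symbol of the find,
    not to the first flanking symbol.
    """
    results = []
    for match in regex.finditer(sequence):
        left = sequence[max(0, match.start() - flank):match.start()]
        right = sequence[match.end():match.end() + flank]
        results.append((match.start() + 1, left, match.group(), right))
    return results


# Only the fragment shown in the sample output, not the whole protein.
fragment = "NTSKGFLIDGYPREVQQGEEFE"
finds = show_finds(fragment, ADENYLATE_KINASE)
for coordinate, left, find, right in finds:
    print(f"{coordinate:>5}: {left}  {find}  {right}")
```

Within the fragment the find starts at coordinate 6; in the full sequence the same find is reported at 90, as in the output file above.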
The PROSITE Dictionary contains an extensive abstract summarizing current information for a motif. Motifs displays the abstract below each pattern that is found. If the same pattern is found in more than one sequence, the abstract is only shown below the pattern in the first sequence in which the pattern is found. Several different patterns may share the same abstract. If you want to reduce the size of your output you can suppress these abstracts with -NOREFerence. When abstracts are being suppressed there will be a filename, such as 0179.pdoc, that appears in parentheses below each pattern found. You can use the Fetch program to make a copy of this file in order to look at the abstract.
Motifs takes as input one or more protein sequence files. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If Motifs rejects your protein sequence, see Appendix VI for information on how to change or set the type of a sequence.
FindPatterns and all of the Wisconsin Package(TM) mapping programs use the same search algorithm and pattern file format as Motifs. ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.
Patterns may not be more than 350 characters long.
Motifs will not introduce gaps, but it can tolerate mismatches when you run it with -MISmatch=n. Mismatched symbols in a find are shown in the output in lowercase. Mismatches cannot occur within NOT expressions (see the DEFINING PATTERNS topic below).
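A minimal Python sketch of ungapped matching with a mismatch allowance follows. It is not the algorithm Motifs uses; it assumes a pattern in which every position has a fixed width (no variable repeat ranges and no NOT expressions), and the names POSITIONS and find_with_mismatches are made up for the illustration. The test sequence is the fragment from the sample output with its arginine changed to lysine by hand.

```python
# Each position is a set of allowed residues; None stands for x (any residue).
ANY = None
HYDROPHOBIC = set("LIVMFYW")

# (L,I,V,M,F,Y,W)3 D G (F,Y,I) P R x3 (N,Q), written out position by position.
POSITIONS = (
    [HYDROPHOBIC] * 3
    + [{"D"}, {"G"}, set("FYI"), {"P"}, {"R"}]
    + [ANY] * 3
    + [set("NQ")]
)


def find_with_mismatches(sequence, positions, max_mismatches=0):
    """Slide the pattern along the sequence without introducing gaps.

    Returns (1-based coordinate, find) pairs. Mismatched symbols are
    lowercased in the returned find.
    """
    finds = []
    width = len(positions)
    for start in range(len(sequence) - width + 1):
        shown = []
        mismatches = 0
        for symbol, allowed in zip(sequence[start:start + width], positions):
            if allowed is None or symbol in allowed:
                shown.append(symbol)
            else:
                mismatches += 1
                shown.append(symbol.lower())
        if mismatches <= max_mismatches:
            finds.append((start + 1, "".join(shown)))
    return finds


# Hand-altered fragment: the R of ...YPREVQQ... replaced by K.
altered = "NTSKGFLIDGYPKEVQQGEEFE"
print(find_with_mismatches(altered, POSITIONS, 0))
print(find_with_mismatches(altered, POSITIONS, 1))
```

With no mismatches allowed the altered fragment gives no find; with one mismatch allowed the find is reported with the substituted residue in lowercase.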
In addition to your input protein sequence files, Motifs reads a local data file like the one below to find the search patterns. This file is modeled on the enzyme data files for the mapping programs described in Appendix VII. The offset field is not used by Motifs, but the field must have a number in it to make the file compatible with the mapping files.
The exact column used for each field does not matter, only the order of the fields in the line. You may give several patterns the same name, but put all of the entries for that name on adjacent lines of this file. The patterns may not be more than 350 characters long. Blank lines and lines that start with an exclamation point (!) are ignored.
Here is part of the default data file used by Motifs:
  PROSITETOGCG of: prosite.doc and prosite.dat  August 20, 1998 15:57

  Release 15.0 (7/1998)

  Name               Offset  Pattern                              ..  PDoc_Name

  11s_Seed_Storage   1  NGx(D,E)2x(L,I,V,M,F)C(S,T)x{11,12}(P,A,G)D   0284.pdoc
  1433_1             1  RNL(L,I)SV(G,A)YKN(I,V)                       0633.pdoc
  1433_2             1  YK(D,E)STLIMQLL(R,H)DNLTLW(T,A)(S,A)          0633.pdoc
  25a_Synth_1        1  GGSx(A,G)(K,R)xTxL(K,R)(G,S,T)xSD(A,G)        0653.pdoc
  25a_Synth_2        1  RPVILDPx(D,E)PT                               0653.pdoc

  ////////////////////////////////////////////////////////////////////////////

  Zinc_Finger_C2h2   1  Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H         0028.pdoc
  Zinc_Finger_C3hc4  1  CxHx(L,I,V,M,F,Y)Cx2C(L,I,V,M,Y,A)            0449.pdoc
  Zinc_Protease      1  (G,S,T,A,L,I,V,N)x2HE(L,I,V,M,F,Y,W)~(D,E,H,R,K,P ...
  Zn2_Cy6_Fungal     1  (G,A,S,T,P,V)Cx2C(R,K,H,S,T,A,C,W)x2(R,K,H)x2Cx{5 ...
  Zp_Domain          1  (L,I,V,M,F,Y,W)x7(S,T,A,P,D,N)x3(L,I,V,M,F,Y,W)x( ...
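The layout rules above (fields identified by order rather than column, blank lines and lines starting with ! ignored) can be illustrated with a small Python reader. This is a sketch under assumptions, not the reader Motifs uses: it assumes the heading ends at the first line containing two periods (..), and it treats a leading semicolon on a name, as seen in the list of frequent patterns, as a marker for frequently occurring patterns. PatternEntry, parse_pattern_file, and SAMPLE are hypothetical names.

```python
from typing import NamedTuple


class PatternEntry(NamedTuple):
    name: str
    offset: int
    pattern: str
    pdoc_name: str
    frequent: bool


def parse_pattern_file(text):
    """Read entries in the order Name, Offset, Pattern, PDoc_Name."""
    entries = []
    in_data = False
    for line in text.splitlines():
        if not in_data:
            # Assumption: the heading ends at the first line containing "..".
            in_data = ".." in line
            continue
        stripped = line.strip()
        if not stripped or stripped.startswith("!"):
            continue  # blank lines and comment lines are ignored
        fields = stripped.split()  # column positions do not matter
        if len(fields) != 4 or not fields[1].isdigit():
            continue  # not a complete four-field entry
        name, offset, pattern, pdoc_name = fields
        # Assumption: a leading ";" marks a frequently occurring pattern.
        frequent = name.startswith(";")
        entries.append(
            PatternEntry(name.lstrip(";"), int(offset), pattern, pdoc_name, frequent)
        )
    return entries


SAMPLE = """\
PROSITETOGCG of: prosite.doc and prosite.dat  August 20, 1998 15:57

Name  Offset  Pattern  ..  PDoc_Name

! a comment line
1433_1       1  RNL(L,I)SV(G,A)YKN(I,V)   0633.pdoc
25a_Synth_2  1  RPVILDPx(D,E)PT           0653.pdoc
;Rgd         1  RGD                       0016.pdoc
"""

entries = parse_pattern_file(SAMPLE)
for entry in entries:
    print(entry)
```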
The PROSITE Dictionary contains a number of short sequence patterns that occur frequently in protein sequences. Most of these frequently found patterns are post-translational modifications, but more specific patterns such as leucine zippers also fall into this category. Such frequently found patterns are not normally shown by Motifs, but you can display them with -FREquent. More so than with other patterns in the PROSITE Dictionary, the presence of these frequently occurring patterns does not assure you that the protein actually contains the corresponding function.
Here are some of the patterns that the PROSITE Dictionary classifies as frequently occurring:
  ;Amidation          1  xG(R,K)(R,K)                               0009.pdoc
  ;Asn_Glycosylation  1  N~(P)(S,T)~(P)                             0001.pdoc
  ;Camp_Phospho_Site  1  (R,K)2x(S,T)                               0004.pdoc
  ;Ck2_Phospho_Site   1  (S,T)x2(D,E)                               0006.pdoc
  ;Glycosaminoglycan  1  SGxG                                       0002.pdoc
  ;Leucine_Zipper     1  Lx6Lx6Lx6L                                 0029.pdoc
  ;Microbodies_Cter   1  (S,A,G,C,N)(R,K,H)(L,I,V,M,A,F)>           0299.pdoc
  ;Myristyl           1  G~(E,D,R,K,H,P,F,Y,W)x2(S,T,A,G,C,N)~(P)   0008.pdoc
  ;Pkc_Phospho_Site   1  (S,T)x(R,K)                                0005.pdoc
  ;Rgd                1  RGD                                        0016.pdoc
  ;Tyr_Phospho_Site   1  (R,K)x{2,3}(D,E)x{2,3}Y                    0007.pdoc
The PDoc_Name field in the pattern file prosite.patterns has the name of a PDoc (PROSITE Document) file containing the abstract for each pattern. You can use Fetch to look at any abstracts of interest. If you run Motifs with -NOREFerence, the name of the corresponding PDoc file is shown below each pattern found.
If you specify more than one sequence, Motifs displays each one's name on the screen as it is searched. However, unless you use -SHOw, the output file shows only those sequences in which a motif was actually found.
If you run Motifs with -NAMes, the output file is a list file. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases of the User's Guide for more information about list files.)
With the publication of the PROSITE Dictionary, Amos Bairoch has shown that regular expressions can reliably recognize known protein pattern motifs. When new examples of a known motif are discovered, these expressions can usually be modified to recognize the new example. The process of modifying a regular expression so that it covers all of the members of a newly expanded family of similar sequence patterns could be referred to as "ambiguation."
The problem with regular expressions is that they often fail to recognize sequences that are not yet known to be members of the sequence family. You should consider using Profile technology if your aim is to bring together similar sequences whose association has not yet been recognized.
There are a few patterns in PROSITE that are defined with rules rather than regular expressions. Motifs does not look for these patterns.
FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions that match many different sequences. The expressions can include any legal GCG sequence character (see Appendix III). The expressions can also include several non-sequence characters, which are used to specify OR matching, NOT matching, begin and end constraints, and repeat counts. For instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of any base, followed by ATG. Following is an explanation of the syntax for pattern specification.
Parentheses () enclose one or more symbols that can be repeated some number of times. Braces {} enclose numbers that tell how many times the symbols within the preceding parentheses must be found.
Sometimes, you can leave out part of an expression. If braces appear without preceding parentheses, the numbers in the braces define the number of repeats for the immediately preceding symbol. One or both of the numbers within the braces may be missing. For instance, both the pattern GATG{2,}A and the pattern GATG{2}A mean GAT, followed by G repeated from 2 to 350,000 times, followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0 to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0 to 2 times, followed by A; the pattern GAT(TG){2,2}A means GAT, followed by TG repeated exactly 2 times, followed by A. (If the pattern in the parentheses is an OR expression (see below), it cannot be repeated more than 2,000 times.)
If you are searching nucleic acids, the ambiguity symbols defined in Appendix III let you define any combination of G, A, T, or C. If you are searching proteins, you can specify any of several symbol choices by enclosing the different choices in parentheses and separating the choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed by S. The length of each choice need not be the same, and there can be up to 31 different choices within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches this pattern. There can be several parentheses in a pattern, but parentheses cannot be nested.
The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.
The pattern <GACCAT can only be found if it occurs at the beginning of the sequence range being searched. Likewise, the pattern GACCAT> would only be found if it occurs at the end of the sequence range.
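To make the syntax above concrete, here is a rough Python sketch that translates a pattern written in this notation into a Python regular expression. It is not how Motifs or FindPatterns work internally. It follows the text literally for braces (so {2} is read as "two or more"), treats a bare number after a symbol or parenthesis, as in x3 or (R,K)2 in the data file, as an exact count, maps x to any symbol, does not expand nucleic acid ambiguity codes, and does not enforce the documented limits. gcg_to_regex and symbols are hypothetical names.

```python
import re


def symbols(text):
    """Translate literal sequence symbols; x stands for any symbol."""
    return "".join("." if c == "x" else re.escape(c) for c in text)


def gcg_to_regex(pattern):
    """Translate a pattern in the notation above into a Python regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "<":                  # must begin the searched range
            out.append("^")
            i += 1
        elif ch == ">":                # must end the searched range
            out.append("$")
            i += 1
        elif ch == "~":                # NOT: one symbol or a list of symbols
            if pattern[i + 1] == "(":
                close = pattern.index(")", i)
                excluded = pattern[i + 2:close].replace(",", "")
                i = close + 1
            else:
                excluded = pattern[i + 1]
                i += 2
            out.append("[^" + re.escape(excluded) + "]")
        elif ch == "(":                # OR: choices may differ in length
            close = pattern.index(")", i)
            choices = pattern[i + 1:close].split(",")
            out.append("(?:" + "|".join(symbols(c) for c in choices) + ")")
            i = close + 1
        elif ch == "{":                # repeat count for the preceding item
            close = pattern.index("}", i)
            low, comma, high = pattern[i + 1:close].partition(",")
            if comma and high:
                out.append("{%s,%s}" % (low or "0", high))
            else:                      # {2}, {2,} and {} have no upper bound
                out.append("{%s,}" % (low or "0"))
            i = close + 1
        elif ch.isdigit():             # bare count, as in x3 or (R,K)2
            j = i
            while j < len(pattern) and pattern[j].isdigit():
                j += 1
            out.append("{" + pattern[i:j] + "}")
            i = j
        else:
            out.append(symbols(ch))
            i += 1
    return "".join(out)


for example in ("GAT(TG,T,G){1,4}A", "GC~(A,T)CC", "<GACCAT", "N~(P)(S,T)~(P)"):
    print(example, "->", gcg_to_regex(example))
```

For instance, the translated form of GAT(TG,T,G){1,4}A matches the sequence GATTGGA, in agreement with the description above.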
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
  Minimal Syntax: % motifs [-INfile=]PIR:Kihua -Default

  Prompted Parameters:

  [-OUTfile=]kihua.motifs   names the output file

  Local Data Files:

  -DATa=prosite.patterns    names the file of protein sequence patterns

  Optional Parameters:

  -NOREFerence        suppresses the PROSITE abstract for each pattern found
  -FREquent           shows motifs that are frequently found in proteins
  -MISmatch=1         allows one mismatch
  -NAMes              writes the output as a list file
  -APPend             appends the pattern data file to your output file
  -SHOw               shows every file searched, even if no pattern was found
  -MINCuts=2          limits finds to patterns found a minimum of 2 times
  -MAXCuts=2          limits finds to patterns found a maximum of 2 times
  -ONCe               limits finds to patterns found only once
  -EXCLude=n1,n2      excludes patterns found between positions n1 and n2
  -RSF[=motifs.rsf]   saves motifs as features in an RSF file
  -NOMONitor          suppresses the screen trace showing each file
  -NOSUMmary          suppresses the screen summary at the end of the program
The publication of the PROSITE Dictionary of Protein Sites and Patterns by Dr. Amos Bairoch of the University of Geneva is one of the great achievements of sequence analysis. Dr. Bairoch's prodigious efforts can be seen in every abstract of this extraordinary collection. His generosity in distributing it, and his patience in compiling it so carefully, puts all of us in his debt.
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information, see Chapter 4, Using Data Files in the User's Guide.
Motifs reads the regular expressions for the motifs of interest from the file prosite.patterns.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
-NOREFerence suppresses the PROSITE abstract that normally appears below each pattern that is found.
-FREquent displays frequently found patterns, such as post-translational modifications.
-MISmatch=1 causes Motifs to recognize places where patterns are found with one or fewer mismatches. The display uses case to distinguish between matches and mismatches.
-NAMes writes the output file as a list file suitable for input to other Wisconsin Package programs that support indirect file specification (see "Using List Files" in Chapter 2, Using Sequence Files and Databases of the User's Guide). The output showing the location of the patterns found is suppressed when you choose the list file format.
-APPend appends the pattern data file to your output file. (See the PATTERN FILE topic above.)
-SHOw: Usually, Motifs shows that a motif was searched only if there were one or more matches in the sequence. With -SHOw, Motifs shows every motif searched whether or not a pattern was actually found in the sequence. (-SHOw is equivalent to setting -MINCuts=0.)
The descriptions of the exclusionary parameters below were written for the Wisconsin Package mapping programs. A find in these applications is referred to as a cut while a pattern is referred to as a restriction enzyme recognition site.
The -MINCuts, -MAXCuts, -ONCe, and -EXCLude parameters suppress the display of selected enzymes. The list of excluded enzymes in the program output includes both selected enzymes that cut within excluded ranges and selected enzymes that did not cut the right number of times.
-MINCuts=2 excludes enzymes that do not cut at least two times.
-MAXCuts=2 excludes enzymes that cut more than two times.
-ONCe excludes, from the set of enzymes displayed, those enzymes that cut your sequence more than once (equivalent to setting both -MINCuts and -MAXCuts to one).
-EXCLude=n1,n2 excludes enzymes that cut anywhere within one or more ranges of the sequence. If an enzyme is found within an excluded range, then the enzyme is not displayed. The list of excluded enzymes includes enzymes that cut within excluded ranges. The ranges are defined with sets of two numbers. The numbers are separated by commas. Spaces between numbers are not allowed. The numbers must be integers that fall within the sequence beginning and ending points you have chosen. The range may be circular if circular mapping is being done. Exclusion is not done if there are any non-numeric characters in the numbers or numbers out of range or if there is an odd number of integers following the parameter.
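The exclusionary parameters can be pictured as a filter over the finds for each pattern. The Python sketch below is only an illustration of the rules as stated above, not the program's code: parse_exclude and select_patterns are hypothetical names, the pattern names and coordinates are invented, circular ranges are not handled, and a find is judged by its first coordinate only, which is an assumption.

```python
def parse_exclude(value, begin, end):
    """Parse 'n1,n2[,n3,n4...]' into a list of (n1, n2) ranges.

    Returns None when exclusion would not be done: non-numeric characters
    (including spaces), numbers outside begin..end, or an odd number of
    integers.
    """
    parts = value.split(",")
    if len(parts) % 2 != 0 or not all(p.isdigit() for p in parts):
        return None
    numbers = [int(p) for p in parts]
    if any(n < begin or n > end for n in numbers):
        return None
    return list(zip(numbers[0::2], numbers[1::2]))


def select_patterns(finds, min_cuts=1, max_cuts=None, exclude=()):
    """Keep patterns whose find count and positions pass the filters.

    finds maps a pattern name to a list of 1-based find coordinates.
    """
    kept = {}
    for name, positions in finds.items():
        count = len(positions)
        if count < min_cuts:
            continue
        if max_cuts is not None and count > max_cuts:
            continue
        # Assumption: only the first coordinate of a find is tested.
        if any(lo <= p <= hi for p in positions for lo, hi in exclude):
            continue
        kept[name] = positions
    return kept


# Invented finds for a 194-residue sequence, for illustration only.
finds = {"Pattern_A": [90], "Pattern_B": [12, 40, 133], "Pattern_C": [55, 150]}

print(select_patterns(finds, min_cuts=2))               # like -MINCuts=2
print(select_patterns(finds, min_cuts=1, max_cuts=1))   # like -ONCe
ranges = parse_exclude("80,100", 1, 194)
print(select_patterns(finds, exclude=ranges))           # like -EXCLude=80,100
```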
-RSF[=motifs.rsf] writes an RSF (rich sequence format) file containing the input sequences annotated with features generated from the results of Motifs. This RSF file is suitable for input to other Wisconsin Package programs that support RSF files. In particular, you can use SeqLab to view this features annotation graphically. If you don't specify a file name with this parameter, then the program creates one using motifs for the file basename and .rsf for the extension. For more information on RSF files, see "Using Rich Sequence Format (RSF) Files" in Chapter 2 of the User's Guide, or "Rich Sequence Format (RSF) Files" in Appendix C of the SeqLab Guide.
-MONitor: This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.
-SUMmary writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary. You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.
Licenses and Trademarks
Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.