MEME finds conserved motifs in a group of unaligned sequences. MEME saves these motifs as a set of profiles. You can search a database of sequences with these profiles using the MotifSearch program.
MEME uses the method of Bailey and Elkan (see ACKNOWLEDGEMENTS) to identify likely motifs within the input set of sequences. You may specify a range of motif widths to target, as well as the number of unique motifs to search for. MEME uses Bayesian probability to incorporate prior knowledge of the similarities among amino acids into its predictions of likely motifs. The resulting motifs are output as profiles. A profile is a log-odds matrix used to judge how well an unknown sequence segment matches the motif.
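To make the idea of a log-odds profile concrete, here is a minimal Python sketch. The three-letter alphabet, the background frequencies, and the motif probabilities are all hypothetical values chosen for illustration; they are not taken from MEME or from any GCG data file, and the sketch does not reproduce the scaling used in GCG profile files.

```python
import math

# Hypothetical 3-letter alphabet and background frequencies (illustration only).
BACKGROUND = {"A": 0.5, "C": 0.25, "G": 0.25}

# Hypothetical letter probabilities for a motif of width 2, one dict per position.
MOTIF_PROBS = [
    {"A": 0.125, "C": 0.75, "G": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.5},
]


def log_odds_matrix(motif_probs, background):
    """Convert per-position letter probabilities into log2 odds against the background."""
    return [
        {letter: math.log2(p / background[letter]) for letter, p in column.items()}
        for column in motif_probs
    ]


def score_segment(segment, matrix):
    """Sum the log-odds entries selected by each residue of the segment."""
    if len(segment) != len(matrix):
        raise ValueError("segment length must equal motif width")
    return sum(column[residue] for residue, column in zip(segment, matrix))


PROFILE = log_odds_matrix(MOTIF_PROBS, BACKGROUND)
print(score_segment("CG", PROFILE))  # motif-like segment, positive score
print(score_segment("AA", PROFILE))  # background-like segment, negative score
```

A segment that resembles the motif accumulates positive entries, while a segment that looks like background accumulates negative ones, which is what lets a profile separate likely matches from the rest of a sequence.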
Here is a session with MEME that was used to find motifs in a group of calcium-transporting membrane proteins listed in the file pircat.list.
% meme

 Find motifs in what sequences?  @pircat.list

 How many motifs should I search for (* 6 *) ?

 What should I call the profile file (* meme.prf *) ?

 What should I call the report file (* meme.meme *) ?

 Reading sequences ...

   PIR2:S39163   (   89 aa)
   PIR2:A42764   (  919 aa)
   PIR2:S71168   (  946 aa)
   PIR1:PWBYR1   (  950 aa)
   PIR2:S24359   (  994 aa)
   PIR2:A32792   (  994 aa)
   PIR2:A48849   (  994 aa)
   PIR2:B31981   (  997 aa)

 Identifying motifs in: 8 sequences
 Shortest sequence (aa): 89
 Longest sequence (aa): 997
 Total aa: 6883

 Finding 1st motif
   Testing starts of width 8 ... done
   Testing starts of width 11 ... done
   Testing starts of width 15 ... done
   Testing starts of width 21 ... done
   Testing starts of width 29 ... done
   Testing starts of width 41 ... done
   Testing starts of width 57 ... done
   Running EM from 21 starting motifs ......... done

 Finding 2nd motif
   Testing starts of width 8 ... done

 ///////////////////////////////////////////////////////////

 Search completed after finding the 6 motifs requested.

 Sequences searched: 8
 Number of motifs identified: 6
 Output profile file: meme.prf
 Output report: meme.meme

%
MEME generates a report and a file containing one or more ungapped GCG profiles. (See RELATED PROGRAMS for notes on how this "multiple profile file" differs from earlier versions of profile files.)
MEME's report file gives details about the motifs that help you analyze the validity and usefulness of the results. The file first lists the training set, or input sequences. ("Training set" is a common term for a set of examples from which an intelligent program learns a general concept.) After echoing the parameters you specified, the file gives a detailed description of each motif found. This report includes three different representations of the motif: two versions of a letter-probability matrix, and a consensus sequence showing all likely letters for each position. (A fourth representation is the ungapped profile that is written to the other output file.) There are six different types of information presented:
For more details about the output, consult Tim Bailey's MEME website at http://www.sdsc.edu/MEME. (Note that the log-odds matrices referred to at the website correspond to the profiles that appear in a separate output file from GCG's MEME.) Here is some of the output from the EXAMPLE:
********************************************************************************
TRAINING SET
********************************************************************************
DATAFILE= @pircat.list
ALPHABET= ACDEFGHIKLMNPQRSTVWY
Sequence name            Weight Length  Sequence name            Weight Length
-------------            ------ ------  -------------            ------ ------
PIR2:S39163              1.0000     89  PIR2:A42764              1.0000    919
PIR2:S71168              1.0000    946  PIR1:PWBYR1              1.0000    950
PIR2:S24359              1.0000    994  PIR2:A32792              1.0000    994
PIR2:A48849              1.0000    994  PIR2:B31981              1.0000    997
********************************************************************************
meme
********************************************************************************
MOTIF  1        width = 14      sites = 8.0
********************************************************************************
Simplified        A  ::::::::::::::
motif letter-     C  :a::::::::::::
probability       D  :::9::::::::::
matrix            E  ::::::::::::::
                  F  ::::::::::::::
                  G  ::::::9:::::::
                  H  ::::::::::::1:
                  I  8:::::::::::::
                  K  ::::9:::::1:::
                  L  ::::::::9:::::
                  M  :::::::::::::9
                  N  :::::::::::9::
                  P  ::::::::::::::
                  Q  ::::::::::::8:
                  R  ::::::::::::::
                  S  ::9:::::::1:::
                  T  :::::9:9:97:::
                  V  1:::::::::::::
                  W  ::::::::::::::
                  Y  ::::::::::::::

         bits    6.2
                 5.6
                 5.0 *
                 4.4 * *
Information      3.7 * *
content          3.1 * ***** * * *
(47.3 bits)      2.5 ********** ***
                 1.9 **************
                 1.2 **************
                 0.6 **************
                 0.0 --------------

Multilevel           ICSDKTGTLTTNQM
consensus
sequence

--------------------------------------------------------------------------------
        Motif 1 in BLOCKS format
--------------------------------------------------------------------------------
BL   MOTIF 1 width=14 seqs=8
PIR2:S39163  (  14) ICSDKTGTLTTNQM  1
PIR2:A42764  ( 347) ICSDKTGTLTKNEM  1
PIR2:S71168  ( 453) ICSDKTGTLTTNHM  1
PIR1:PWBYR1  ( 368) ICSDKTGTLTSNHM  1
PIR2:S24359  ( 348) ICSDKTGTLTTNQM  1
PIR2:A32792  ( 348) ICSDKTGTLTTNQM  1
PIR2:A48849  ( 348) ICSDKTGTLTTNQM  1
PIR2:B31981  ( 348) ICSDKTGTLTTNQM  1
//

---------------------------------------------------------------------------
        Possible examples of motif 1 in the training set
---------------------------------------------------------------------------
Sequence name   Start  Score                 Site
-------------   -----  -----            --------------
PIR2:S39163        14  57.51 SVETLGCTSV ICSDKTGTLTTNQM SVCKANACNS
PIR2:A42764       347  49.26 IVETLGCCNV ICSDKTGTLTKNEM TVTHILTSDG
PIR2:S71168       453  54.51 ACETMGSATT ICSDKTGTLTTNHM TVVKACICEQ
PIR1:PWBYR1       368  51.59 SVETLGSVNV ICSDKTGTLTSNHM TVSKLWCLDS
PIR2:S24359       348  57.51 SVETLGCTSV ICSDKTGTLTTNQM SVCRMFVIDK
PIR2:A32792       348  57.51 SVETLGCTSV ICSDKTGTLTTNQM SVCKMFIVDK
PIR2:A48849       348  57.51 SVETLGCTSV ICSDKTGTLTTNQM SVCKMFIIDK
PIR2:B31981       348  57.51 SVETLGCTSV ICSDKTGTLTTNQM SVCRMFILDR
---------------------------------------------------------------------------

letter-probability matrix: alength= 20 w= 14 n= 6779
 0.007049  0.002044  0.002037  0.002293  0.005429  0.002439  . . .  0.002778
 0.005443  0.961227  0.001396  0.002037  0.001528  0.001679  . . .  0.000830
 0.015012  0.002823  0.003197  0.002442  0.001933  0.006176  . . .  0.001759
 0.007101  0.001366  0.889213  0.027866  0.002153  0.005074  . . .  0.002383
 0.006191  0.001517  0.002092  0.003289  0.001078  0.002866  . . .  0.001355
 0.009921  0.002392  0.002885  0.002475  0.002042  0.004017  . . .  0.001481
 0.014315  0.001548  0.006802  0.004851  0.001597  0.919485  . . .  0.001744
 0.009921  0.002392  0.002885  0.002475  0.002042  0.004017  . . .  0.001481
 0.005699  0.001894  0.001479  0.002568  0.011028  0.002253  . . .  0.003273
 0.009921  0.002392  0.002885  0.002475  0.002042  0.004017  . . .  0.001481
 0.016505  0.004503  0.006077  0.005728  0.003797  0.005281  . . .  0.002597
 0.005424  0.001835  0.011030  0.003643  0.002797  0.005789  . . .  0.002762
 0.013339  0.002372  0.006233  0.046657  0.002630  0.004246  . . .  0.002541
 0.003377  0.001543  0.001401  0.001498  0.002749  0.001885  . . .  0.003170

********************************************************************************
MOTIF  2        width = 21      sites = 7.0
********************************************************************************

////////////////////////////////

Search completed after finding the 6 motifs requested.
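The numbers in the sample report can be cross-checked with a short Python sketch. It uses only the eight sites from the BLOCKS section and the eight sequence lengths from the training set. Two caveats: the simple majority vote below is a stand-in for MEME's multilevel consensus, not a reimplementation of it, and reading `n= 6779` as the number of width-14 windows in the training set is an inference from the sample figures rather than something this manual states.

```python
from collections import Counter

# Sites of motif 1, copied from the BLOCKS section of the sample report.
SITES = [
    "ICSDKTGTLTTNQM",
    "ICSDKTGTLTKNEM",
    "ICSDKTGTLTTNHM",
    "ICSDKTGTLTSNHM",
    "ICSDKTGTLTTNQM",
    "ICSDKTGTLTTNQM",
    "ICSDKTGTLTTNQM",
    "ICSDKTGTLTTNQM",
]

# Sequence lengths from the TRAINING SET section of the sample report.
LENGTHS = [89, 919, 946, 950, 994, 994, 994, 997]


def majority_consensus(sites):
    """Return the most frequent letter in each column of equal-length sites."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sites))


def window_count(lengths, width):
    """Count the start positions at which a window of the given width fits."""
    return sum(max(length - width + 1, 0) for length in lengths)


print(majority_consensus(SITES))  # ICSDKTGTLTTNQM
print(window_count(LENGTHS, 14))  # 6779
```

The majority vote reproduces the consensus line of the report, and the window count matches the `n=` value printed with the letter-probability matrix.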
And here is an excerpt from the profile file:
!!AA_PROFILE 2.0
(Peptide) ..

{
MEME v2.2 of: @pircat.list  Length: 14
! Sequences: 8  MaxScore: 1.00  October 30, 1998 12:55
!PIR2:S39163  From: 14   To: 27   Weight: 1.000000
!PIR2:A42764  From: 347  To: 360  Weight: 1.000000
!PIR2:S71168  From: 453  To: 466  Weight: 1.000000
!PIR1:PWBYR1  From: 368  To: 381  Weight: 1.000000
!PIR2:S24359  From: 348  To: 361  Weight: 1.000000
!PIR2:A32792  From: 348  To: 361  Weight: 1.000000
!PIR2:A48849  From: 348  To: 361  Weight: 1.000000
!PIR2:B31981  From: 348  To: 361  Weight: 1.000000
   Gap: 1.00       Len: 1.00
   GapRatio: 0.0   LenRatio: 0.0
Cons    A     C     D     E     F     G     H   . . .    W     Y   Gap   Len
}
 I    -337  -315  -466  -476  -289  -482  -444  . . .  -404  -355  100   100  ! 1
 C    -374   572  -521  -493  -472  -536  -455  . . .  -537  -530  100   100
 S    -228  -268  -401  -467  -438  -348  -385  . . .  -437  -421  100   100
 D    -336  -373   410  -116  -422  -377  -258  . . .  -397  -377  100   100
 K    -356  -358  -462  -424  -522  -459  -349  . . .  -403  -459  100   100
 T    -288  -292  -416  -465  -430  -410  -377  . . .  -426  -446  100   100
 G    -235  -355  -292  -368  -465   372  -337  . . .  -389  -422  100   100
 T    -288  -292  -416  -465  -430  -410  -377  . . .  -426  -446  100   100
 L    -368  -326  -512  -460  -186  -494  -377  . . .  -328  -331  100   100
 T    -288  -292  -416  -465  -430  -410  -377  . . .  -426  -446  100   100
 T    -214  -201  -308  -344  -340  -371  -272  . . .  -343  -365  100   100  ! 11
 N    -375  -330  -222  -409  -384  -358  -105  . . .  -339  -356  100   100
 Q    -245  -293  -305   -41  -393  -402   124  . . .  -286  -368  100   100
 M    -443  -355  -520  -537  -387  -520  -454  . . .  -320  -336  100   100
 *       0     8     8     1     0     8     2  . . .     0     0

{
MEME v2.2 of: @pircat.list  Length: 21
! Sequences: 8  MaxScore: 1.00  October 30, 1998 12:57

///////////////////////////////////////////////////////////////////////////////

 *       7     9     6     7     6    12     1  . . .     0     2
The input to MEME is a set of either nucleotide or protein sequences (not both). The function of MEME depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, see Appendix VI for information on how to change or set the type of a sequence.
MEME respects the begin and end attributes for controlling the range of interest for sequences in list files (but see RESTRICTIONS, below). MEME also respects the strand list file attribute for nucleotide sequences.
PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.
ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).
ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake.
ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.
ProfileGap makes an optimal alignment between a profile and one or more sequences.
MEME's output can best be appreciated by running the output profiles through MotifSearch, another program in the Wisconsin Package. You will probably want to run MotifSearch at least twice. First, you should use the profiles to search the original training set of sequences. Second, you may wish to search a larger database to identify similar sequences. See the documentation for MotifSearch for details.
You can analyze at most 1,000,000 residues at one time.
If you wish to use both strands of nucleotide sequences, you must specify the one-per model (described in ALGORITHM, below) via the -ONEEXactly parameter.
MEME cannot process multiple sequences with the same name. If MEME encounters a second sequence with the identical name as a previous one, it will ignore the second. Thus, you cannot analyze several segments of a single sequence by creating several list file entries of that sequence and specifying different begin and end attributes for each entry.
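The three restrictions above can be pictured as a pre-flight check. The Python sketch below is purely illustrative: the function names, the `(name, sequence)` tuples, and the model labels are hypothetical and do not correspond to anything inside MEME or the Wisconsin Package.

```python
MAX_RESIDUES = 1_000_000  # limit stated in the restrictions above


def drop_duplicate_names(entries):
    """Keep the first entry for each name; later entries with the same name are ignored."""
    seen = set()
    kept = []
    for name, sequence in entries:
        if name in seen:
            continue
        seen.add(name)
        kept.append((name, sequence))
    return kept


def check_run(entries, two_strands=False, model="zero-or-one-per"):
    """Return a list of problems that the restrictions above would raise."""
    problems = []
    total = sum(len(sequence) for _, sequence in drop_duplicate_names(entries))
    if total > MAX_RESIDUES:
        problems.append(f"{total} residues exceeds the limit of {MAX_RESIDUES}")
    if two_strands and model != "one-per":
        problems.append("both strands require the one-per model")
    return problems


# Hypothetical entries: the second "seqA" is dropped, as a repeated name would be.
ENTRIES = [("seqA", "ACGT"), ("seqB", "GGCC"), ("seqA", "TTTT")]
print(drop_duplicate_names(ENTRIES))
print(check_run(ENTRIES, two_strands=True))
print(check_run(ENTRIES, two_strands=True, model="one-per"))
```

The duplicate-name rule is the one most likely to surprise: two list-file entries that name the same sequence with different ranges collapse to the first entry only.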
MEME implements the method of Bailey and Elkan (see ACKNOWLEDGEMENTS) to find one or more motifs that characterize a family of sequences. The core of MEME is Expectation Maximization (EM), an unsupervised learning algorithm that is guaranteed to converge to a local maximum. That is, any motif found by MEME will be "better" (according to MEME's statistical criteria) than any other motif that differs infinitesimally from the first.
One of the criteria applied by MEME depends on your choice of a model. MEME can either a) favor motifs that appear exactly once in each sequence in the training set (the one-per model); b) favor motifs that appear zero or one time in each sequence in the training set (the zero-or-one-per model); or c) give no preference to the number of occurrences (the zero-or-more-per model).
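The practical difference between the three models is which occurrences are allowed to count in a motif's favor. The toy Python function below illustrates only that bookkeeping. It is not MEME's statistical criterion; the hit scores and the `threshold` parameter are hypothetical.

```python
def motif_support(hits_by_sequence, model, threshold=0.0):
    """Combine per-sequence hit scores under one of the three occurrence models.

    hits_by_sequence holds one non-empty list of hit scores per sequence.
    """
    total = 0.0
    for hits in hits_by_sequence:
        if model == "one-per":
            # Exactly one occurrence is assumed, so the best position always counts.
            total += max(hits)
        elif model == "zero-or-one-per":
            # At most one occurrence counts, and only if it passes the threshold.
            best = max(hits)
            if best > threshold:
                total += best
        elif model == "zero-or-more-per":
            # Every occurrence that passes the threshold counts.
            total += sum(h for h in hits if h > threshold)
        else:
            raise ValueError(f"unknown model: {model}")
    return total


# Hypothetical hit scores for three sequences; the second has no convincing hit.
HITS = [[5.0, 4.0], [-2.0, -3.0], [6.0]]
for name in ("one-per", "zero-or-one-per", "zero-or-more-per"):
    print(name, motif_support(HITS, name))
```

With these made-up scores, the one-per model is penalized by the sequence that lacks the motif, the zero-or-one-per model skips that sequence, and the zero-or-more-per model also credits the second hit in the first sequence.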
MEME makes use of Dirichlet priors in its EM calculations for protein sequences. These are empirical statistical measures of the interchangeability of amino acids within subsequences of similar function. Suppose there are two amino acid sequences, S1 and S2, having the same length. If the first residue in S1 is I, and the first residue in S2 is V, then there is some likelihood that S1 and S2 have the same function, given their similarity in the first position. We can estimate that likelihood by analyzing the set of subsequences whose functionality is established.
A drawback to EM is that the maximum it finds is only local. There may be better solutions that were overlooked due to an unlucky choice of the starting point -- EM's initial guess at the solution. This is a nontrivial and heavily studied problem. One approach is to run the algorithm from a large subset of the possible starting points. You may choose the subset to be evenly distributed across the solution space, or to be randomly selected. In any case, this may take a daunting amount of time.
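The dependence on the starting point is easy to reproduce with any local search. In the Python sketch below, simple hill climbing on a made-up one-dimensional score stands in for EM; neither the score function nor the search is taken from MEME.

```python
def toy_score(x):
    """Hypothetical landscape with a local peak at 2 and the global peak at 8."""
    return -(x - 2) ** 2 + 4 if x < 5 else -(x - 8) ** 2 + 9


def hill_climb(score, start, lo, hi):
    """Move to the better neighbour until neither neighbour improves the score."""
    x = start
    while True:
        neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbours, key=score)
        if score(best) <= score(x):
            return x
        x = best


def multi_start(score, starts, lo, hi):
    """Run the local search from several starts and keep the best result."""
    return max((hill_climb(score, s, lo, hi) for s in starts), key=score)


print(hill_climb(toy_score, 0, 0, 10))             # 2, stuck on the local peak
print(hill_climb(toy_score, 5, 0, 10))             # 8, the global peak
print(multi_start(toy_score, [0, 5, 10], 0, 10))   # 8
```

A single run from an unlucky start settles on the lower peak; trying several starts recovers the higher one, at the cost of running the search several times.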
MEME refines this approach by taking a carefully chosen subset of possible solutions and running a single iteration of EM on each. It then chooses one from among these as its best candidate, and runs EM to convergence from there. When searching for a starting point, MEME does not consider all possible starting points within the range of widths it is given; rather, it surveys starting points at particular steps within the range given. Thus, if using the default range of 8 to 57, MEME will only consider initial motifs whose widths are in the set {8, 11, 15, 21, 29, 41, 57}.
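The widths tested in the sample session (8, 11, 15, 21, 29, 41, 57) are consistent with multiplying each width by the square root of 2 and truncating. That rule is an observation about the sample output, not a documented property of MEME, so the Python sketch below should be read as a guess that happens to reproduce the session.

```python
import math


def starting_widths(min_width=8, max_width=57):
    """Generate widths by repeated multiplication by sqrt(2), truncating each time.

    The growth rule is an assumption inferred from the sample session.
    """
    if min_width < 1:
        raise ValueError("min_width must be at least 1")
    widths = []
    width = min_width
    while width <= max_width:
        widths.append(width)
        # Always advance by at least one so that very small widths still progress.
        width = max(width + 1, int(width * math.sqrt(2)))
    return widths


print(starting_widths())  # [8, 11, 15, 21, 29, 41, 57]
```

Under this assumed rule the number of starting widths grows with the logarithm of the width range, which keeps the survey cheap even for a wide range.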
Despite limiting the initial set of widths under consideration, MEME can find a motif of any width in the given range. This is due to a shortening technique that trims low-information columns from the ends of the motif. However, the motif will never be shortened below the minimum width specified for the search.
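A rough picture of the shortening step, in Python. The per-column information values and the `threshold` are hypothetical; this manual does not say how MEME decides that a column carries too little information, so only the overall behavior (trim from the ends, never below the minimum width) follows the text.

```python
def trim_motif(column_bits, min_width, threshold):
    """Drop low-information columns from both ends without going below min_width.

    Returns (start, end) as a half-open range into column_bits.
    """
    start, end = 0, len(column_bits)
    while end - start > min_width:
        left_low = column_bits[start] < threshold
        right_low = column_bits[end - 1] < threshold
        if not (left_low or right_low):
            break
        # Trim whichever low end carries less information.
        if left_low and (not right_low or column_bits[start] <= column_bits[end - 1]):
            start += 1
        else:
            end -= 1
    return start, end


# Hypothetical information content (bits) for a motif of width 6.
BITS = [0.1, 0.2, 3.0, 4.0, 2.5, 0.3]
print(trim_motif(BITS, min_width=3, threshold=1.0))  # (2, 5)
print(trim_motif(BITS, min_width=5, threshold=1.0))  # (1, 6)
```

In the second call the minimum width stops the trimming early, so a weak column survives at each end, mirroring the rule that a motif is never shortened below the minimum width.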
Version 2.0 profile files
When reading version 2.0 profile files generated by MEME, most GCG programs (e.g. ProfileSearch, ProfileGap) will read only the first profile found. At this time, the only exception is MotifSearch, which reads and processes all of the profiles.
Also note that MEME's profiles always have Gap and Len values of 100 -- MEME's profiles should always be thought of as ungapped. This is a characteristic of MEME, not of the version 2.0 profile file format.
For more details about version 2.0 profiles, see Appendix VII.
Time-complexity of the algorithm
In any event, running on a training set of more than 20 or 30 typical proteins will require a lot of processor cycles.
Effects of the choice of model
Multiple motifs
Choosing the minimum and maximum search widths
If the training set may include proteins that are not related to the family of interest, you might first run with -MINWidth and -MAXWidth both set to the same small number (perhaps 10 for proteins), and -NMOTifs set to 1 or 2. (Be sure to use the default zero-or-one-per model!) This may find a motif (possibly part of a larger motif) that discriminates between family and non-family members, allowing you to remove the unrelated proteins before running a more exhaustive MEME over a larger range of widths.
Finding repeats in a sequence
MEME was written by Dr. Timothy L. Bailey of the San Diego Supercomputing Center. (Bailey, T.L., and Elkan, C., (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36, AAAI Press, Menlo Park, California.)
MEME was adapted for the Wisconsin Package by Scott Swanson with the assistance of Dr. Bailey.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
Minimal Syntax: % meme [-INfile=]@pircat.list -Default Prompted Parameters: -BEGin=1 -END=100 sets the range of interest for all sequences -REVerse uses the reverse strand of all sequences [-OUTfile1=]meme.prf specifies the output file of profiles [-OUTfile2=]meme.meme specifies the output report file -NMOTifs=6 sets the maximum number of motifs to search for Local Data Files: -DATa=prior30.plib specifies Dirichlet priors for proteins Optional Parameters: -ONEEXactly requires each motif to occur exactly once in each sequence -ONEORZero allows each motif to occur up to once in each sequence -ZEROORMore allows motifs to occur any number of times in any sequence -TWOStrands searches both strands of nucleotide sequence -MINWidth=8 requires motifs to be at least this wide -MAXWidth=57 limits motifs to a maximum of this width -EMTHReshold=.001 sets the convergence criterion for EM -MAXEMiterations=50 stops EM after this many iterations without convergence -NOSUMmary suppresses report of run information to screen at exit -NOMONitor suppresses screen trace during processing -NOREPort suppresses creation of report file -BATch submits program to the batch queue
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide. When processing proteins, MEME uses a data file of Dirichlet priors for its Bayesian statistics. By default, the file is GenRunData:prior30.plib. Although it is possible to specify your own priors, it is not advised unless you have a very strong understanding of MEME's inner workings.
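The lookup order described above (command line first, then the current working directory, then the public data directory) can be sketched in Python as follows. The function and its arguments are hypothetical and only model the precedence; they are not part of the Wisconsin Package.

```python
import tempfile
from pathlib import Path


def resolve_data_file(default_name, public_dir, working_dir, command_line_name=None):
    """Pick the data file using the precedence described in the text."""
    if command_line_name is not None:
        # A file named on the command line always wins.
        return Path(command_line_name)
    local = Path(working_dir) / default_name
    if local.is_file():
        # A same-named file in the working directory shadows the public copy.
        return local
    return Path(public_dir) / default_name


# Demonstration with temporary directories standing in for the real locations.
with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as public:
    before = resolve_data_file("prior30.plib", public, work)
    (Path(work) / "prior30.plib").write_text("placeholder")
    after = resolve_data_file("prior30.plib", public, work)
    explicit = resolve_data_file("prior30.plib", public, work, "myfile.dat")

print(before.parent == Path(public))  # public copy used when nothing shadows it
print(after.parent == Path(work))     # local copy shadows the public one
print(explicit.name)                  # the command-line file wins
```

Note that a forgotten copy of prior30.plib in the working directory would silently replace the public file, which is one reason to be careful with local copies of the priors.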
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
-BEGin=1
sets the beginning position for all input sequences. When the beginning position is set from the command line, MEME ignores beginning positions specified for individual sequences in a list file.
-END=100
sets the ending position for all input sequences. When the ending position is set from the command line, MEME ignores ending positions specified for sequences in a list file.
-REVerse
sets the program to use the reverse strand for each input sequence. When -REVerse or -NOREVerse is on the command line, MEME ignores any strand designation for individual sequences in a list file.
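The override rule for ranges and strands can be sketched in Python. The function names are hypothetical, positions are taken to be 1-based and inclusive, and "reverse strand" is interpreted here as the reverse complement; all three are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def effective_range(length, list_begin=None, list_end=None, cmd_begin=None, cmd_end=None):
    """Command-line values override list-file attributes, which override the full length."""
    begin = cmd_begin if cmd_begin is not None else (list_begin or 1)
    end = cmd_end if cmd_end is not None else (list_end or length)
    return begin, end


def select_region(sequence, begin, end, reverse=False):
    """Cut out positions begin..end (1-based, inclusive), optionally reverse-complemented."""
    region = sequence[begin - 1:end]
    if reverse:
        region = region.translate(COMPLEMENT)[::-1]
    return region


# The list file asks for 10..50, but a command-line begin of 1 takes precedence.
print(effective_range(100, list_begin=10, list_end=50, cmd_begin=1))  # (1, 50)
print(select_region("AAACCCGGGT", 2, 5))                              # AACC
print(select_region("AAACCCGGGT", 2, 5, reverse=True))                # GGTT
```

Because a command-line range applies to every input sequence, per-sequence ranges in a list file are only useful when the command line leaves them alone.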
-NMOTifs=6
gives the number of unique motifs for which to search.
-ONEORZero
specifies a model in which each motif should occur zero or one times in any sequence in the training set. If a given motif scores well at more than one position in a sequence, the motif might still be chosen, but the additional "hits" will not contribute to its score. This is the default model. This model is about two times slower than the -ONEEXactly model.
-ZEROORMore
specifies a model in which each motif may occur any number of times in any sequence in the training set. In this case, additional "hits" after the first within a sequence will contribute to the motif's score. This model is about ten times slower than the -ONEEXactly model.
-TWOStrands
searches forward and reverse strands of nucleotide sequences. This parameter may be used only with the -ONEEXactly parameter!
-MINWidth=8
specifies the smallest acceptable motif for the search. When shortening the chosen motif, MEME will NOT shorten below this value.
-MAXWidth=57
specifies the largest acceptable motif for the search. If -MINWidth is equal to -MAXWidth, MEME will either find a motif of that width, or find nothing at all.
-EMTHReshold=.001
gives a convergence criterion for the EM phase of the algorithm. Raising this criterion will make MEME run faster, but give inferior results.
-MAXEMiterations=50
overrules the convergence criterion given by -EMTHReshold. That is, if EM has failed to converge to the -EMTHReshold after -MAXEMiterations, the program will cut off the calculation and settle for its result to that point.
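The interplay between a convergence threshold and an iteration cap can be sketched generically in Python. The defaults mirror the parameter values listed in the command-line summary (.001 and 50), but the quantity being compared (the change in a single number between iterations) and the update rule are assumptions for illustration; this manual does not say what MEME measures.

```python
import math


def iterate_until_converged(step, start, threshold=0.001, max_iterations=50):
    """Apply step repeatedly; stop when the change is below threshold or the cap is hit.

    Returns (value, iterations_used, converged).
    """
    value = start
    for iteration in range(1, max_iterations + 1):
        new_value = step(value)
        if abs(new_value - value) < threshold:
            return new_value, iteration, True
        value = new_value
    return value, max_iterations, False


def newton_sqrt2(x):
    """Toy update rule (Newton's method for the square root of 2), standing in for EM."""
    return (x + 2.0 / x) / 2.0


print(iterate_until_converged(newton_sqrt2, 1.0))                    # converges within the cap
print(iterate_until_converged(newton_sqrt2, 1.0, max_iterations=2))  # cut off before converging
```

A looser threshold ends the loop sooner with a rougher answer, and a small iteration cap ends it regardless of how far the result still is from convergence.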
-SUMmary
writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.
You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.
-MONitor
This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.
-NOREPort
tells the program not to generate a report file.
-BATch
submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.