MotifSearch*

[ Program Manual | User's Guide | Data Files | Databases ]

Table of Contents

FUNCTION

DESCRIPTION

FUNCTION [ Top | Next ]

MotifSearch uses a set of profiles (representing similarities within a family of sequences) as a query to either a) search a database for new sequences similar to the original family, or b) annotate the members of the the original family with details of the matches between the profiles and each of the members. Normally, the profiles are created with the program MEME.

DESCRIPTION [ Previous | Top | Next ]

MotifSearch searches a set of sequences for one or more ungapped profiles, using the method of Bailey and Gribskov (see the ACKNOWLEDGMENT topic). The profiles and the sequences should all be of the same type, protein or nucleotide.

The algorithm calculates position scores for each profile at each possible position within a sequence. These scores are translated into p-values, which represent the likelihood of the given profile scoring that well against a randomly generated sequence. The best (i.e. lowest) position p-values for each profile are then adjusted to take into account the length of the sequence. These adjusted p-values are then used to calculate a combined p-value, which is the p-value of the product of the adjusted p-values. Motifsearch never tries to introduce gaps in the profiles or in the search sequence. Any gapping information in the profiles is ignored.

MotifSearch reports its results as a sorted list file of the best-scoring sequences, or alternatively the subsequences that correspond to individual profile hits. It can also generate an RSF file in which profile hits appear as annotated features. This is your best choice if you use SeqLab to view your results.

EXAMPLE [ Previous | Top | Next ]

Here is a session with MotifSearch that was used to search PIR with the output file from the MEME program.


% motifsearch meme.prf -MONitor=1000

  Search for motifs in what sequence(s) (* PIR:* *) ?

  What is the threshold combined p-value
  for displaying sequences (* 0.000100 *) ?

  What should I call the output file (* motifsearch.ms *) ?

  Processing motif profiles .........
    Found 6 motif profiles in meme.prf
  Searching for motifs in sequences ...

          1 Sequences         105 aa searched    PIR1:CCHU
////////////////////////////////////////////////////////////

    109,001 Sequences  34,799,349 aa searched    PIR4:I54264
     Sequences searched: 109075
         CPU time (sec): 383.52
            Output file: motifsearch.ms
%

OUTPUT [ Previous | Top | Next ]

By default, MotifSearch generates a list file as output. This file contains all the sequences whose combined p-value was below the threshold specified during the run. The sequences are sorted, with the best score (i.e. lowest p-value) at the top. Each entry shows the combined p-value for that sequence, as well as an E-Value, which is the number of sequences expected to score this well in a search set of this size. (The E-Value is the p-value multiplied by the number of sequences in the set.) Each entry includes the number of different motifs that hit against the sequence as well as the total number of hits. If 2 of 3 profiles scored hits (i.e. had position p-values above the -HITTHReshold), but one of those profiles scored 2 hits, the list would report 2 motifs with hits, and a total of 3 hits overall.

If you use the -FRAGments parameter, the list file instead contains entries for each profile hit against the high-scoring sequences.

After printing the sorted list, MotifSearch appends a section showing the individual profile hits within each sequence. The hits are presented in two formats.

The hits are first displayed in a diagram, with each hit shown in the form (32 <2> 53), where 32 would be the start of the hit, 53 would be the end, and 2 would be the position of the profile in the input file of profiles (think of it as an ID number for the profile). The hits are displayed in the order in which they appear in the sequence, with as many hits to a line as will fit, except when two hits overlap. In this case, the second hit is displayed on the following line, indented from the first hit.

After displaying this position diagram for a sequence, MotifSearch displays the hits again, this time including the p-values as well as the sequence segment that scored the hit. In this case, the hits are sorted first by profile ID and then by order in the sequence. Thus, all of the hits for profile 1 in a sequence are displayed before any of the hits for profile 2, etc.

Here is some of the output from the EXAMPLE session:


MotifSearch of PIR:*  October 1, 1998 12:56

  Family profiles: meme.prf
  Maximum pVal for match (threshold): 1.00E-04
                                             combined         # motifs  total
Here are the matching files:                 p-Value  E-Value  matched matches
#..
PIR2:A32792
! Ca2+-transporting ATPase (EC 3.6.1.38), f 2.4E-166 2.6E-161      6      8
PIR2:S24359
! Ca2+-transporting ATPase (EC 3.6.1.38), f 1.8E-165 2.0E-160      6      6
PIR2:A48849
! Ca2+-transporting ATPase (EC 3.6.1.38) SE 1.8E-165 2.0E-160      6      7
PIR1:PWRBFC
////////////////////////////////////////////////////////////
PIR2:S73562
! phosphopyruvate hydratase (EC 4.2.1.11) -  9.4E-05  1.0E+01      2      2
PIR2:JC4519
! heat-shock protein GroEL - Pasteurella mu  9.5E-05  1.0E+01      2      4
PIR2:D69305
! conserved hypothetical protein AF0444 - A  9.7E-05  1.1E+01      2      2
\\End of list
-------------------------------------------------------
PIR2:A32792          Combined PVal:2.38E-166 EVal: 2.6E-161
P1;A32792 - Ca2+-transporting ATPase (EC 3.6.1.38), fast twitch skeletal muscle
 - chicken
C;Species: Gallus gallus (chicken)
C;Date: 08-Dec-1989 #sequence_revision 08-Dec-1989 #text_change 20-Mar-1998
C;Accession: A32792
R;Karin, N.J.; Kaprielian, Z.; Fambrough, D.M.
Mol. Cell. Biol. 9, 1978-1986, 1989 . . .

  Position diagram of Motif hits in PIR2:A32792:
#.(165 <1> 178).(295 <4> 343).(348 <1> 361).(421 <4>
 469).(513 <6> 525)
#.(700 <2> 720).(737 <3> 763).(791 <5> 822)

                     Position  Sequence
                     p-Value   p-Value
  ** Hits for Motif 1 **
 (165 <1> 178)       1.8E-05   1.7E-02
    IISIKSTTLRVDQS
 (348 <1> 361)       6.3E-19   6.1E-16
    ICSDKTGTLTTNQM
  ** Hits for Motif 2 **
 (700 <2> 720)       9.7E-27   9.5E-24
    MTGDGVNDAPALKKAEIGIAM
  ** Hits for Motif 3 **
 (737 <3> 763)       2.6E-35   2.5E-32
    DDNFSTIVAAVEEGRAIYNNMKQFIRY
  ** Hits for Motif 4 **
 (295 <4> 343)       4.4E-60   4.2E-57
    YFKIAVALAVAAIPEGLPAVITTCLALGTRRMAKKNAIVRSLPSVETLG
 (421 <4> 469)       5.3E-05   4.9E-02
    NDSSLDYNEAKGIYEKVGEATETALTCLVEKMNVFNTDVRSLSKVERAN
  ** Hits for Motif 5 **
 (791 <5> 822)       1.4E-40   1.4E-37
    QLLWVNLVTDGLPATALGFNPPDLDIMDKPPR
  ** Hits for Motif 6 **
 (513 <6> 525)       3.1E-17   3.0E-14
    FVKGAPEGVIDRC
-------------------------------------------------------
////////////////////////////////////////////////////////////
-------------------------------------------------------
PIR2:D69305          Combined PVal:9.66E-05 EVal: 10.5
P1;D69305 - conserved hypothetical protein AF0444 - Archaeoglobus fulgidus
C;Species: Archaeoglobus fulgidus
C;Date: 05-Dec-1997 #sequence_revision 05-Dec-1997 #text_change 05-Jun-1998
C;Accession: D69305
R;Klenk, H.P.; Clayton, R.A.; Tomb, J.F.; White, O.; Nelson, K.E.; Ketchum,
 K.A.; Dodson, R.J.; Gwinn, M.; Hickey, E.K.; Peterson, J.D.; Richardson, D.L.;
 Kerlavage, A.R.; Graham, D.E.; Kyrpides, N.C.; Fleischmann, R.D.; Quackenbush,
 J.; Lee, N.H.; Sutton, G.G.; Gill, S.; Kirkness, E.F.; Dougherty, B.A.;
 McKenny, K.; Adams, M.D.; Loftus, B.; Peterson, S.; Reich, C.I.; McNeil, L.K.;
 Badger, J.H.; Glodek, A.; Zhou, L.; Overbeek, R.; Gocayne, J.D.; Weidman, J.F.;
 McDonald, L.; Utterback, T.
Nature 390, 364-370, 1997 . . .

  Position diagram of Motif hits in PIR2:D69305:
#.(7 <1> 20).(169 <2> 189)

                     Position  Sequence
                     p-Value   p-Value
  ** Hits for Motif 1 **
 (7 <1> 20)          7.1E-05   1.5E-02
    IAVDIDGTLTDRKR
  ** Hits for Motif 2 **
 (169 <2> 189)       1.2E-06   2.5E-04
    VIGDSENDIDMFRVAGFGIAV
  ** Hits for Motif 3 **
  ** Hits for Motif 4 **
  ** Hits for Motif 5 **
  ** Hits for Motif 6 **

USING RSF OUTPUT [ Previous | Top | Next ]

MotifSearch gives you the option of writing RSF output files. See the -RSF entry in the PARAMETER REFERENCE section of this document for more information about how to use this option on the command line.

One of the advantages of RSF output is the ability to graphically view the results of MotifSearch as features using SeqLab. MotifSearch annotates the input sequences with features as described below.

In the RSF file, hits for the first four profiles in the input receive unique feature types. All other hits are annotated as generic MotifSearch motifs.

INPUT FILES [ Previous | Top | Next ]

The inputs to MotifSearch are a set of profiles used as a query, and a set of either nucleotide or protein sequences (not both) to be searched. The former set will usually have been generated by MEME; however, it is possible to cut and paste profiles into a single file and use this as the input to MotifSearch, provided they meet the format criteria for version 2.0 profile files (Appendix VII). The function of MotifSearch depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N orType: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, see Appendix VI for information on how to change or set the type of a sequence.

When searching nucleotide sequences, MotifSearch always searches both strands.

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap). ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake. ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences. ProfileGap makes an optimal alignment between a profile and one or more sequences.

MEME finds conserved motifs in a group of unaligned sequences. MEME saves these motifs as a set of profiles. You can search a database of sequences with these profiles using the MotifSearch program. MEME and MotifSearch are intended to work hand-in-hand.

ALGORITHM [ Previous | Top | Next ]

MotifSearch uses a method similar to that of Bailey and Gribskov (see ACKNOWLEDGMENT) to score each sequence. First, each profile is scored against every possible position in the sequence, using a straightforward scoring matrix, and NOT ALLOWING GAPS. These raw scores are then assigned "position p-values" giving the likelihood of achieving such a score with a randomly generated sequence segment. Each position p-value is next made into a "sequence p-value", reflecting the likelihood of achieving such a score in randomly generated sequence of the same length as the search sequence. This calculation makes the simplifying assumption that scores at different position within the sequence are independent of each other. If P(S_(p)) is the position p-value, then the sequence p-value, P(S_(s)) is given by:



P(S_(s)) = 1 - (1 
- P(S_(p)))^(k)

where k is the number of possible starting positions for the motif within the sequence.

The final step is to calculate a combined p-value for the sequence. MotifSearch creates a set of best p-values by selecting each profile's best sequence p-value within the current sequence. The program then applies the QFAST algorithm (given by Bailey and Gribskov) to find the p-value of the product of the p-values in the set.

CONSIDERATIONS [ Previous | Top | Next ]

MotifSearch may report inappropriately good combined p-values if two or more of the input profiles are highly correlated. One indication that this has happened is an overlap between the hits for different profiles.

The difficulty is caused by the algorithm's use of simplifying "assumptions of independence". The calculation of the combined p-value assumes that the scores of different profiles against the search sequence are independent of one another. To see how this affects the final result, consider the case when profile A and profile B are exactly the same, and have sequence p-values of 1E-6. In this case, the QFAST algorithm will give as its result 2.76E-11, when the correct result is clearly 1E-6.

Again, the best way to guard against this is to look for significant overlap between the hits of different profiles. If this is happening, you should treat the combined p-values with some skepticism; you might even want to remove the unwanted profile from the input file and try the run again.

A related problem can occur when the sequences or the profiles contain repeated elements. A dramatic example is scoring a nucleotide profile against a segment with a long string of A's. If the profile contains even an average number of A's, it may score quite well, resulting in a lot of hits that are not very informative.

ACKNOWLEDGEMENTS [ Previous | Top | Next ]

MotifSearch owes much to the work of Timothy Bailey and Michael Gribskov, who developed the idea of using the p-value of the product of the p-value and published the corresponding QFAST algorithm (Bailey, T.L., and Gribskov, M. (1998). Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14(1), 48-54.)

MotifSearch was written by Scott Swanson.

COMMAND-LINE SUMMARY [ Previous | Top | Next ]

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % motifsearch [-INfile1=]meme.prf -Default

Prompted Parameters:

[-INfile2=]PIR:*          specifies the search set
[-OUTfile=]motifsearch.ms names the output file
-BEGin=1 -END=513         sets the range of interest
-THRESHold=0.0001         sets max combined p-value for displaying sequences

Local Data Files: None

Optional Parameters:

-LIStsize=1000          limits the size of the output list
-SEQLimit=1000000000    sets maximum number of sequences to search
-HITTHReshold=0.0001    sets max position pValue allowed for profile hits
-FRAGments              generates list file as individual profile hits
-RSF=[MotifSearch.RSF]  generates RSF file of high-scoring sequences
-NOSUMmary              suppresses report of run information to screen at exit
-NOMONitor              suppresses screen trace for the search set sequences
-BATch                  submits program to the batch queue

LOCAL DATA FILES [ Previous | Top | Next ]

None.

PARAMETER REFERENCE [ Previous | Top | Next ]

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-THRESHold=0.0001

gives the maximum combined p-value for a sequence to be displayed in the output list.

-LIStsize=1000

sets a limit to the number of entries in the output list.

-SEQLimit=1000000000

sets a limit to the number of sequences to search.

-HITTHReshold=0.0001

specifies the position p-value that defines a profile "hit." This does not affect any of the program's calculations -- it only controls which hits are displayed.

-FRAGments

tells the program to output the list file as individual profile hits. That is, each sequence will have one entry in the list for each profile hit; each such entry will includeBegin andEnd attributes corresponding to the segment where the hit occurred.

-RSF=motifsearch.rsf

writes an RSF (rich sequence format) file containing the input sequences annotated with features generated from the results of MotifSearch. This RSF file is suitable for input to other Wisconsin Package programs that support RSF files. In particular, you can use SeqLab to view this features annotation graphically. If you don't specify a file name with this parameter, then the program creates one using motifsearch for the file basename and .rsf for the extension. For more information on RSF files, see "Using Rich Sequence Format (RSF) Files" in Chapter 2 of the User's Guide. Or, see "Rich Sequence Format (RSF) Files" in Appendix C of the SeqLab Guide.

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

-MONitor=100

monitors this program's progress on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with-NOMONitor.

The monitor is updated every time the program processes 100 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: December 9, 1998 16:24 (1162)

[ Program Manual | User's Guide | Data Files | Databases ]

Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.