LookUp identifies sequence database entries by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list of sequences.
LookUp uses the Sequence Retrieval System (SRS) created by Dr. Thure Etzold to identify sequences in sequence databases (CABIOS 9(1); 49-57 (1993)). For example, you can find all of the protein sequences published by a particular author or all of the sequences whose annotation contains a particular word.
The expressions you use to find sequences in a database are known as queries. LookUp presents a form on your screen that lets you enter the elements of your query. LookUp then finds all the sequences that contain those elements. The output of LookUp is a list file that can be used as input to any GCG program that accepts multiple sequence input.
Here is a session with LookUp that finds the sequences in the PIR database that were published by any author whose last name starts with Smithies.
% lookup

 LOOKUP in what sequence libraries:

   a) sptrembl
   b) pir
   c) embl
   d) genbank
   e) em_tags
   f) gb_tags
   g) All libraries
   q) quit

 Please choose one or more (* h *): b

 Complete the query form below:

   All text:
   Definition:
   Author: smithies<Ctrl>D
   Keyword:
   Sequence name:
   Accession number:
   Organism:
   Reference:
   Title:
   Feature:
   On or after (dd-mmm-yyyy):
   On or before (dd-mmm-yyyy):
   Shortest sequence length:
   Longest sequence length:
   Inter-field operator: AND
   Form of output list: Whole Entries

 Press <Ctrl>D to continue.

 Searching pir

 16 entries were found.

 Do you wish to:

   1) write out this list to a file
   2) preview the results
   3) refine the query
   4) choose different libraries
   q) quit

 Please choose one (* 1 *):

 What should I call the output file (* lookup.list *) ?

 16 entries were written to "lookup.list"

%
LookUp writes a list file naming the sequences which conform to your query. Associated with each sequence in the list file is an ID number. If you use this list file to specify the search set for another session with LookUp (for example with -INfile=@lookup.list), the ID numbers help LookUp quickly find the entries in the database.
!!SEQUENCE_LIST 1.0

LOOKUP in: pir of: "[SQ-AUT: smithies*]"

16 entries October 22, 1998 15:03 ..

PIR1:HPHUR ! ID: 310d0003
! haptoglobin-related protein precursor - human

PIR1:UDHUP2 ! ID: 0c130003
! cystatin SN precursor - human

///////////////////////////////////////////////

PIR4:I78580 ! ID: 27a90103
! hemoglobin gamma-G - human (fragment)

PIR4:I58221 ! ID: 28a90103
! hemoglobin gamma-A chain - human (fragment)
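As an illustration of how the list file above is organized, here is a minimal Python sketch that reads entries out of such a file. It assumes that each entry line has the form `LIBRARY:NAME ! ID: xxxxxxxx`, that the description follows on a line starting with `!`, and that the list proper begins after the line ending in `..`. The line layout and the function name `parse_list` are assumptions made for this example; they are not part of LookUp.

```python
import re

# Sample text copied from the list file shown above, one item per line.
# The exact line layout is an assumption for this sketch.
SAMPLE = """\
!!SEQUENCE_LIST 1.0
LOOKUP in: pir of: "[SQ-AUT: smithies*]"
16 entries October 22, 1998 15:03 ..
PIR1:HPHUR ! ID: 310d0003
! haptoglobin-related protein precursor - human
PIR1:UDHUP2 ! ID: 0c130003
! cystatin SN precursor - human
"""

# LIBRARY:NAME ! ID: hexadecimal-id
ENTRY = re.compile(r"^(\w+):(\w+)\s*!\s*ID:\s*([0-9a-fA-F]+)\s*$")


def parse_list(text):
    """Return (library, name, id, description) tuples found after the '..' divider."""
    entries = []
    in_body = False
    for line in text.splitlines():
        line = line.strip()
        if not in_body:
            # Everything up to and including the '..' line is heading text.
            in_body = line.endswith("..")
            continue
        match = ENTRY.match(line)
        if match:
            entries.append([match.group(1), match.group(2), match.group(3), ""])
        elif line.startswith("!") and entries:
            # A comment line following an entry carries its description.
            entries[-1][3] = line.lstrip("! ").strip()
    return [tuple(entry) for entry in entries]


for library, name, ident, description in parse_list(SAMPLE):
    print(library, name, ident, description)
```

Lines that match neither pattern, such as the row of slashes in the example, are simply skipped.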
In most cases the search set for LookUp is an entire database. On the command line, this is specified like -LIBrary=pir. Note that this usage is different from that used by other GCG programs, which specify databases with a wildcard expression such as PIR:*. Alternatively, the search set can be specified by a list file created by a previous LookUp session. This is done by placing a parameter such as -INfile=@lookup.list on the command line. Any sequences in the list that are not indexed for use with LookUp are ignored.
StringSearch identifies sequences by searching for character patterns such as "globin" or "human" in the sequence documentation. Names identifies GCG data files and sequence entries by name. It can show you what set of sequences is implied by any sequence specification. FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.
BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds. NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA. The programs FastA, TFastA, FastX, TFastX, and SSearch can also be used to search databases or sequence sets local to your installation for sequences that are similar to a query sequence.
You can never be certain that the list of sequences in an output list contains every sequence of interest. Usually this is because of inconsistent annotation within the databases. See the CONSIDERATIONS topic for more information about this problem.
LookUp is still experimental, so please check your results carefully!
The Wisconsin Package(TM) cuts sequences in GenBank that are longer than 350,000 bases into fragments of 110,000 bases each. Queries that make finds in such fragmented sequences return only the first fragment in the series of fragments. See the VERY LONG SEQUENCES topic for more information.
If you search both protein and nucleotide databases in the same session of LookUp, your output list will usually contain sequences of both types. Most GCG programs that analyze multiple sequences do not support lists of mixed sequence type, and so these lists are not suitable for input to programs such as PileUp, WordSearch, FastA, FrameSearch, etc.
If you use a list file that is the output from another GCG program to specify the search set for LookUp (for example with -INfile=@lookup.list), the program ignores the sequences that are not indexed for use with LookUp.
Most of the advanced features of GCG list files are not supported by LookUp. In particular, you cannot include a reference to another list within a list. You cannot include sequences specified by accession number. You cannot include sequences that are specified ambiguously. For instance, the specification GenBank:Pp* has no meaning to LookUp. Note that the specification Viral:Ppv is also ambiguous, as this could refer either to EM_Vi:Ppv or GB_Vi:Ppv. The specification GB_Vi:Ppv is allowed, since this plum pox virus sequence is indexed for LookUp.
Very ambiguous queries do not always work. For example, if you search for all sequences whose names start with hum, LookUp loops endlessly.
Not all fields are present in every database. For example, PIR does not have a Feature or Date index, and SWISS-PROT does not have a Title index.
A database is a structured way to represent a group of things that have common attributes. Most sequence databases consist of different fields such as accession number, definition, author, etc., that are filled with appropriate values like U01317, Human beta globin region on chromosome 11, Smithies, etc. Fields are grouped together into larger units referred to as entries. LookUp identifies sequences in GenBank, PIR, SWISS-PROT, and EMBL based on the values found in the different fields associated with each sequence entry.
An indexed database has one or more of its fields organized into a data structure that allows rapid searching. An indexed field is like the index of a book. The subjects in the book are organized alphabetically into an index at the back of the book. When you look up a subject in the index, you will find the page numbers where that subject is mentioned in the body of the book. Likewise in an indexed database, if there were an author index, you could find all of the entries in the database where Smithies is one of the authors just by looking up the name in the index.
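The book-index analogy can be sketched in a few lines of Python. This is only an illustration of the idea of an indexed field, not a description of how SRS stores its indices; the entry names and the function `build_author_index` are hypothetical.

```python
from collections import defaultdict


def build_author_index(entries):
    """Map each author name (uppercased) to the set of entries citing that author."""
    index = defaultdict(set)
    for entry_name, authors in entries.items():
        for author in authors:
            index[author.upper()].add(entry_name)
    return index


# Hypothetical entries, each with the authors cited in its annotation.
ENTRIES = {
    "ENTRY1": ["Smithies", "Slightom"],
    "ENTRY2": ["Blechl"],
    "ENTRY3": ["Smithies"],
}

index = build_author_index(ENTRIES)

# Looking up a name is a single dictionary access, with no scan of the entries.
print(sorted(index["SMITHIES"]))
```

Building the index costs one pass over the database; after that, each lookup goes straight to the matching entries, just as a book index sends you straight to the right pages.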
The Sequence Retrieval System (SRS) on which LookUp is based has indices for each of several fields that usually occur in the annotation of a sequence database. These indexed fields are: accession number, author, date, definition, feature, keyword, length, entry name, organism, reference, and title. Each of these indices is described in detail under the INDEXED FIELDS topic below.
The query form has a line for each field that has been indexed for retrieval with LookUp. You can search for values in one or more fields, and LookUp finds all the sequences containing those values. To move the cursor from field to field, use the <Up-Arrow>, <Down-Arrow>, or <Return> keys.
The field on the query form that is labeled "Form of output list" toggles between the values Whole entries (the default) and Fragments when the cursor is positioned in the field and you press the <Space Bar>. You may want to use the Fragments value if you are searching the Features index for a particular feature that occurs in the sequences. LookUp can represent that feature more precisely by showing its beginning and ending positions and, for nucleic acid sequences, the strand. See the FRAGMENT OUTPUT topic below.
Keep the following guidelines in mind as you write LookUp queries.
You can type one or more values on each line of the query form. There are three logical operators that let you combine values in different ways:
- AND. Use & to specify AND. A & B means find all entries that contain both A and B.
- OR. Use | to specify OR. A | B means find all entries that contain either A or B.
- BUT-NOT. Use ! to specify BUT-NOT. A ! B means find all entries that contain A but do not contain B. Notice that BUT-NOT is order dependent. A but not B would find a completely different set of entries from B but not A.
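The three operators behave like set operations on the groups of entries that contain each value. The following Python sketch uses hypothetical entry names to show the correspondence, including the order dependence of BUT-NOT.

```python
# Hypothetical sets of entries containing the values A and B.
a = {"E1", "E2", "E3"}
b = {"E2", "E3", "E4"}

a_and_b = a & b        # A & B : entries containing both A and B
a_or_b = a | b         # A | B : entries containing either A or B
a_but_not_b = a - b    # A ! B : entries containing A but not B
b_but_not_a = b - a    # B ! A : a different set, because BUT-NOT is order dependent

print(sorted(a_and_b), sorted(a_or_b), sorted(a_but_not_b), sorted(b_but_not_a))
```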
Special Case: (C shell only) If you are specifying a query on the command line, and the query expression contains the ! (BUT-NOT) logical operator, you must preface the ! with a backslash (\), for example -AUThor=McDonald\!Strand. (This does not apply to Korn shell users.) If you are using an init file to specify a query, you must enclose any query expression that contains a ! in double quotation marks. In this case, do not include a backslash before the !, for example -AUThor="McDonald!Strand".
If you type values for more than one index, LookUp finds entries where each field conforms to the values you have typed. This is equivalent to saying that LookUp joins the values on the different lines with the logical operator AND. You can change this to OR by moving the cursor to the field labeled "Inter-field operator" and pressing the <Space Bar>.
All queries are case insensitive. Regardless of whether you type uppercase or lowercase letters, LookUp converts all queries to uppercase.
If more than one logical operator appears in an expression without parentheses, LookUp evaluates the expression from left to right. However, you can group expressions within a query to define the order in which they are performed. Use parentheses to group expressions you want LookUp to evaluate first. For example, when you type Smithies & (Slightom | Blechl) as the value for the Author field, LookUp first searches for sequence entries containing references with Slightom or Blechl as authors. Then, out of those entries it searches for those which also contain Smithies as an author.
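The evaluation order described above can be modeled with a small evaluator. This Python sketch is an illustration only: it treats all three operators with equal precedence, works from left to right, and evaluates parenthesized groups first. The index contents are hypothetical, the function names are invented for this example, and values containing spaces are not handled.

```python
import re

# A token is an operator, a parenthesis, or a run of other non-space characters.
TOKEN = re.compile(r"\s*([&|!()]|[^&|!()\s]+)")
OPERATORS = {"&", "|", "!"}


def evaluate(query, index):
    """Evaluate a query left to right; parentheses group sub-expressions."""
    tokens = TOKEN.findall(query)
    pos = 0

    def term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = expression()
            pos += 1  # skip the closing ")"
            return result
        # Copy the set so that in-place operations never alter the index.
        return set(index.get(token.upper(), set()))

    def expression():
        nonlocal pos
        result = term()
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            operator = tokens[pos]
            pos += 1
            right = term()
            if operator == "&":
                result &= right
            elif operator == "|":
                result |= right
            else:
                result -= right
        return result

    return expression()


# Hypothetical author index: name -> set of entry numbers.
AUTHORS = {"SMITHIES": {1, 2, 3}, "SLIGHTOM": {2, 5}, "BLECHL": {3, 6}}

print(evaluate("Smithies & (Slightom | Blechl)", AUTHORS))
print(evaluate("Smithies & Slightom | Blechl", AUTHORS))
```

Without the parentheses, the second query is read as (Smithies & Slightom) | Blechl, so it also returns entries by Blechl that do not cite Smithies.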
LookUp accepts question marks (?) or asterisks (*) as wildcards anywhere within a value. A question mark represents any single character. If you type s?ith, you will retrieve entries containing authors named Smith, Slith, Sjith, etc., but not named Sith.
An asterisk represents zero or more characters. Typing *smith* will retrieve entries with authors named Smith, Hocsmith, Smithies, Hocsmithels, etc. Values with leading wildcards, such as *Smith, significantly reduce the speed of LookUp. Trailing wildcards usually have little effect on performance.
By default, LookUp treats every value in your query as if it ended with an asterisk wildcard. This automatic wildcard extension means that when you type pseudo, LookUp treats it as pseudo* and will retrieve entries containing the patterns pseudo, pseudo-, pseudogene, pseudoknot, etc.
You can turn off automatic wildcard extension for a single value by appending a pound sign (#) to the value, for instance pseudo#. LookUp then will find only those entries where the word pseudo occurs by itself. You can turn off automatic wildcard extension for all values with -NOWILdcardextension. With automatic wildcard extension turned off, you can still use * to tell LookUp to extend a particular value, and the # character is treated as a literal part of your query.
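The wildcard rules can be summarized by translating a query value into a regular expression. This Python sketch models the rules as they are described above; it is not LookUp's own matching code, and the function names are invented for the example.

```python
import re


def value_to_regex(value, auto_extend=True):
    """Translate a query value into a compiled regular expression.

    ? matches any single character and * matches zero or more characters.
    With automatic extension on, a trailing # suppresses the implied *.
    With automatic extension off, # is an ordinary character.
    """
    value = value.upper()  # queries are case insensitive
    if auto_extend:
        if value.endswith("#"):
            value = value[:-1]
        elif not value.endswith("*"):
            value += "*"
    parts = []
    for char in value:
        if char == "*":
            parts.append(".*")
        elif char == "?":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts))


def matches(value, word, auto_extend=True):
    """Return True if the indexed word satisfies the query value."""
    return value_to_regex(value, auto_extend).fullmatch(word.upper()) is not None


print(matches("pseudo", "pseudogene"))                   # extended to pseudo*
print(matches("pseudo#", "pseudogene"))                  # extension suppressed
print(matches("s?ith", "Sith", auto_extend=False))       # ? needs one character
```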
If you are specifying a query on the command line, and the value contains a shell special character, such as a space, #, or &, you must enclose the value in double quotation marks ("), for example -KEYword="ribosomal proteins" or -DEFInition="transport#".
Special Case: If you are specifying a query on the command line, and the value has a comma in it, you must enclose the value in single (') AND double quotation marks ("), for example -AUThor='"Slightom,J.L."'. (Within init files however, use only the double quotation marks.) If you specify a query where the value has a comma or dash in it, you must enclose the value in double quotation marks ("), for example -AUThor="Slightom,J.L.".
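To see what the quotation marks do before LookUp ever receives the value, you can model the shell's word splitting with Python's standard `shlex` module. This is a sketch under the assumption of POSIX-style quoting; it does not model the C shell's history expansion of `!`, and it says nothing about how LookUp itself parses the value it receives.

```python
import shlex


def shell_words(command_line):
    """Split a command line the way a POSIX-style shell would."""
    return shlex.split(command_line)


# Double quotes keep the space inside one argument and are then removed.
print(shell_words('lookup -KEYword="ribosomal proteins"'))

# Single quotes around double quotes let the double quotes survive the shell,
# so the program receives them as part of the value.
print(shell_words("""lookup -AUThor='"Slightom,J.L."'"""))
```

The second case suggests why the comma example needs both kinds of quotation marks on the command line: the outer single quotes are consumed by the shell, and the inner double quotes are passed along with the value.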
Below is a description of each indexed field on the LookUp query form. Following the example for each index is the parameter you would use to set a value for this index from the command line.
The All text index combines most of the other indices: Author, Definition, Feature, Keyword, Organism, Reference, and Title. If you think a word like globin or duplication might occur in a title, a definition, or a feature, this index can search all three indices at once without making you type the value for each index separately.
The Definition index contains each word in the definition of every database entry. The words are each indexed separately, without any regard for the order in which they appear. A definition like Human beta globin region on chromosome 11 generates completely independent index entries for the words human, beta, globin, region, on, chromosome, and 11. A query value that would likely find this definition would be human & beta & globin. Hyphenated terms, such as beta-globin, are indexed as two separate words.
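The word-by-word indexing of definitions can be sketched as follows. The splitting rule used here (any run of letters or digits is a word) is an assumption chosen to reproduce the two examples above; the actual rule used when the indices are built may differ.

```python
import re


def definition_words(definition):
    """Split a definition line into independently indexed words."""
    # Assumption: words are runs of letters or digits, so hyphens and
    # spaces both act as separators.
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", definition)]


print(definition_words("Human beta globin region on chromosome 11"))
print(definition_words("beta-globin"))
```

Because word order is discarded, the query human & beta & globin matches this definition no matter how the three words are arranged in it.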
The Author index contains all of the authors cited in each database entry. Most databases do not use "et al.," so second, third, and fourth authors will usually be present. The index includes the author's surname followed by first and middle initials. No spaces separate the surname and initials. For example, Dr. J. L. Slightom would be indexed as Slightom,J.L. If you do not include the initials, LookUp will find all entries with an author whose surname starts with slightom.
The Keyword index contains every keyword in each database entry. Unlike most of the fields in LookUp, keyword values may contain spaces, as in ribosomal proteins. The discipline of assigning keywords differs greatly from database to database; for example, you cannot be sure that both the organism name and enzyme superfamily name will appear in every entry's keyword list. (See the All text index and also the CONSIDERATIONS topic.)
The Sequence name index contains all of the sequence names in each database entry. These names (referred to as locus names in GenBank) should be unique, so you should not find more than one entry in a database for any name.
The Accession number index contains all of the accession numbers in each database entry. While primary accession numbers are supposed to be unique in each database, secondary accession numbers can appear in more than one sequence.
Most sequence databases name the organism from which the sequence is derived. Organism values may contain spaces. In recent years both EMBL and GenBank have used systematic nomenclature whenever possible. If you want to specify a species name, the genus must precede the species (for example Homo sapiens). Typing just sapiens will find nothing.
The higher-order systematic names like Eukaryota, Animalia, Metazoa, Chordata, Vertebrata, Mammalia, Theria, Eutheria, Primates, Haplorhini, Catarrhini, Hominidae are indexed independently. If your query is not a species name, use only one higher-order systematic name.
*** The Reference index does not work correctly in this release! ***
Journal names are indexed exactly as they appear in each database. If the curators of one database call a journal Nucleic Acids Res., the curators of another call the same journal Nucl. Acids Res., and a third database uses NAR, these differences will be reflected in the indices used by LookUp. Notice, however, that the expression (NAR | NUCL) & 1989 would probably find all of the sequences published in 1989 in Nucleic Acids Research.
The volume is a number less than 1950, the date is a number greater than 1950, and the beginning page is a number followed by a hyphen. If you specify values for more than one subfield, you must join the subfields with logical operators (see Logical Operators in the WRITING QUERIES topic).
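The rules for telling the reference subfields apart can be written as a small classifier. This Python sketch follows the description above literally; the function name is invented, the treatment of any non-numeric value as a journal name is an assumption, and the source does not say how the number 1950 itself is treated, so the sketch reports it as unknown.

```python
import re


def classify_reference_value(value):
    """Guess which reference subfield a single query value refers to."""
    if re.fullmatch(r"[0-9]+-", value):
        return "page"        # a number followed by a hyphen
    if re.fullmatch(r"[0-9]+", value):
        number = int(value)
        if number < 1950:
            return "volume"
        if number > 1950:
            return "date"
        return "unknown"     # 1950 itself is not covered by the description
    return "journal"         # assumption: anything else is a journal name


for value in ["CABIOS", "9", "1993", "49-"]:
    print(value, classify_reference_value(value))
```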
For most references, specifying both the volume number and starting page number is definitive.
The Title index contains all words in the titles of each citation in the databases. Some databases do not include the title for each citation, so failure to find a word that you think occurs does not imply that the reference of interest is not cited in one of the databases (see the Reference index). The words are indexed without regard for the order in which they first appeared. A title like A history of the human fetal globin gene duplication generates independent index entries for each separate word: a, history, of, the, human, fetal, globin, gene, and duplication. An expression likely to find this title, if it were present, would be: globin & duplication & history.
This index is not available for SWISS-PROT.
A feature is a region of a sequence that is identified in the feature table of a sequence database. Associated with each feature is a set of words that may include a gene name, a function, an EC number, etc. Every word associated with each feature is indexed independently without regard for order. If you type cds as the value, every coding sequence that is documented by a CDS feature will be found.
You can have the output show where each feature occurs within a sequence by selecting Fragments instead of Whole entries on the query form (see the FRAGMENT OUTPUT topic).
The Feature index is not available for PIR.
The Date index contains the date on which each sequence was entered or last updated in its database. With the two date fields on the query form (On or after and On or before), you can identify sequences that were updated between any two dates. The format for a date value is DD-MMM-YYYY where D, M, and Y stand for day, month, and year respectively. The English abbreviations for the months of the year are: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec.
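If you generate date values in a script, a small helper can check them against the DD-MMM-YYYY format before you pass them to LookUp. The Python sketch below uses an explicit table of the English month abbreviations so that it does not depend on the system locale; the function names are invented for this example.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def parse_lookup_date(text):
    """Convert a DD-MMM-YYYY value such as 01-apr-1992 into a date object."""
    day, month, year = text.strip().split("-")
    return date(int(year), MONTHS.index(month.lower()) + 1, int(day))


def in_range(when, earliest, latest):
    """Return True if the date falls on or between the two date values."""
    return parse_lookup_date(earliest) <= when <= parse_lookup_date(latest)


print(parse_lookup_date("01-apr-1992"))
print(in_range(date(1992, 4, 15), "01-apr-1992", "30-apr-1992"))
```

A malformed value, such as an unknown month abbreviation, raises a ValueError instead of being silently accepted.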
The Date index is not available for PIR.
The length of every sequence in each database is indexed. You can use the Shortest sequence length and Longest sequence length fields to restrict the sequences found to those with certain lengths, for example between 10 and 400 characters.
Normally LookUp's search set consists of all of the sequences in one or more of the sequence databases. If you have a list file created by an earlier session with LookUp, you can use this smaller set of sequences as your search set by adding a parameter like -INfile=@lookup.list to the command line. (An at sign (@) must precede the name of the list file.)
You should use only list files that were created by earlier sessions with LookUp, as the input list file can contain only sequences that have been indexed for searching with LookUp. You cannot add sequences to the list that were not in the libraries when the LookUp indices were created.
Most of the advanced features of GCG list files are not supported by LookUp. In particular, you cannot include a reference to another list within a list. You cannot include sequences specified by accession number. You cannot include sequences that are specified ambiguously. For instance, the specification GenBank:Pp* has no meaning to LookUp. Note that the specification Viral:Ppv is also ambiguous, as this could refer either to EM_Vi:Ppv or GB_Vi:Ppv. The specification GB_Vi:Ppv is allowed, since this plum pox virus sequence is indexed for LookUp.
LookUp normally writes a list of sequences defined only by their database and entry names. Each element of such a list refers to the whole entry. The field on the query form that is labeled "Form of output list" toggles between the values Whole entries (the default) and Fragments when the cursor is positioned in the field and you press the <Space Bar>. You may want to use the Fragments value if you are searching the Features index for a particular feature that occurs in the sequences. LookUp can represent that feature more precisely by showing its beginning and ending positions and, for nucleic acid sequences, the strand. Features consisting of separate fragments are listed contiguously in the output list file and share the same Join name.
Here is some fragment output from a query designed to find complete coding regions for genes encoding xanthine dehydrogenase.
LOOKUP in: embl,genbank of: "[SQ-ALL: complete* & xanthine* & dehydrogenase*] > [SQ-FTS: cds*]"

6 features July 3, 1995 14:34 ..

GB_IN:DROXDHA Begin: 1086 End: 1139 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA Begin: 2164 End: 4776 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA Begin: 4839 End: 5981 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA Begin: 6049 End: 6216 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA Begin: 6284 End: 6334 Strand: + Join: DROXDHA-8
GB_PR:HSU06117 Begin: 64 End: 4065 Strand: + Join: HSU06117-3
GB_PR:HUMXDH Begin: 131 End: 4147 Strand: + Join: HUMXDH-2
GB_PR:HUMXDHA Begin: 58 End: 4059 Strand: + Join: HUMXDHA-2
GB_RO:RATXDHA Begin: 27 End: 4022 Strand: + Join: RATXDHA-2
If you are querying LookUp from the command line, you can get this form of output with the -FRAgments parameter.
PIR does not support fragment output.
Extracting Features
Each fragment in the fragment output list file is accompanied by Begin, End, Strand, and Join sequence attributes. You can use the Assemble program to extract the features into separate GCG sequence files. All sequences listed contiguously in the list file that share the same Join name are concatenated into a single sequence, and the resulting sequence file is given the same name as the Join name. Fragments listed individually are extracted into separate sequence files, each with the same name as the corresponding Join name.
Using the features list file example above, Assemble writes five new sequence files. The first file, called droxdha-8.seg, contains the assembly from the first five sequence segments in the list file. The second file, hsu06117-3.seg, contains the assembly from the sixth sequence segment in the list file. Similarly, the remaining three sequence segments in the list file are extracted into separate sequence files. See the entry for Assemble in the Program Manual for more information about extracting features according to sequence attributes in a list file.
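The grouping that Assemble performs can be previewed with a short script. This Python sketch is not Assemble itself: it only groups the fragment lines from the example by Join name and reports the file name and joined length each group would produce. For simplicity it groups on the Join name alone, without checking that the lines are contiguous; the line layout and function names are assumptions of this example.

```python
import re

# Fragment lines copied from the example output, one fragment per line.
FRAGMENTS = """\
GB_IN:DROXDHA Begin: 1086 End: 1139 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA Begin: 2164 End: 4776 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA Begin: 4839 End: 5981 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA Begin: 6049 End: 6216 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA Begin: 6284 End: 6334 Strand: + Join: DROXDHA-8
GB_PR:HSU06117 Begin: 64 End: 4065 Strand: + Join: HSU06117-3
GB_PR:HUMXDH Begin: 131 End: 4147 Strand: + Join: HUMXDH-2
GB_PR:HUMXDHA Begin: 58 End: 4059 Strand: + Join: HUMXDHA-2
GB_RO:RATXDHA Begin: 27 End: 4022 Strand: + Join: RATXDHA-2
"""

LINE = re.compile(
    r"^(\S+)\s+Begin:\s*(\d+)\s+End:\s*(\d+)\s+Strand:\s*([+-])\s+Join:\s*(\S+)$"
)


def group_by_join(text):
    """Return {join name: [(specification, begin, end, strand), ...]} in file order."""
    groups = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        spec, begin, end, strand, join = match.groups()
        groups.setdefault(join, []).append((spec, int(begin), int(end), strand))
    return groups


def output_name(join):
    """File name used for the joined sequence, as in droxdha-8.seg."""
    return join.lower() + ".seg"


def joined_length(segments):
    """Total number of bases covered by the segments of one Join group."""
    return sum(end - begin + 1 for _, begin, end, _ in segments)


for join, segments in group_by_join(FRAGMENTS).items():
    print(output_name(join), len(segments), "segment(s),", joined_length(segments), "bases")
```

The script reports five groups, matching the five sequence files described above.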
You can use the Translate program to translate features according to the sequence attributes in a features list file and write each translated sequence into its own GCG sequence file. See the entry for Translate in the Program Manual for more information about translating features according to sequence attributes in a list file.
Use -COMplete to ignore features whose start or end positions are not accurately identified. See the PARAMETER REFERENCE topic for more information.
Note that the databases are inconsistent in their annotation. You can find misspellings as well as differences in hyphenation and the type of information entered in fields. As a result, you can never be certain that the list of entries in an output list contains every sequence of interest.
In addition, be aware that all databases contain spelling errors; the misspelling psuedo occurred 10 times in the definitions of GenBank Release 95.0.
Hyphens in particular are used inconsistently. For example, to find as many entries as possible that are pseudogenes, you should search for pseudo-gene as well as pseudogene. Another example where inconsistent use of hyphens can cause problems is the globin family. GenBank definition lines may contain the terms beta-globin, beta globin, and beta-hemoglobin. One way to deal with this is to specify just one of the words, since LookUp indexes the words on either side of a hyphen separately. You can also use a leading wildcard. For example, if you type *globin, LookUp will retrieve the following members of the globin family: haptoglobin, hemoglobin, haemoglobin, myoglobin, cyanoglobin, plakoglobin, alphaglobin, alpha-globin, alpha-globin-3, alpha-1 globin, beta-min-globin, beta-3-globin, beta-2-globin, beta-H1-globin, beta-B globin, uteroglobin, y2-globin, beta-major globin, zeta-globin, and so on. Note that using leading wildcards significantly reduces the speed of LookUp.
Another consideration is that a value such as pseudo may occur in words other than pseudogene. In addition to pseudogene sequences, your output list may also contain RNA sequences known to form pseudoknots or sequences from the organism Pseudomonas.
Future releases of GenBank are expected not to have any sequences longer than 350,000 bases. However, as release 10.0 of the Wisconsin Package was being prepared, two sequences longer than 350,000 bases were still present in GenBank proper (GB_Ba:Ecouw67 and GB_Pl:Scchrix) and several dozen such sequences were present in the High Throughput Genome (HTG) division. These sequences are broken into overlapping fragments in the Wisconsin Package. The 372 kilobase sequence Ecouw67, for instance, is divided into four fragments: Ecouw67_0, Ecouw67_1, Ecouw67_2, and Ecouw67_3. Each fragment is 110,000 bases long and overlaps the one following it by 10,000 bases.
All of the annotation appears with the first fragment, so LookUp normally returns only the first fragment if your query makes a hit on one of these long sequences. If you are searching for features and you are asking for fragment output, LookUp tries to infer which fragment contains the feature of interest. If a feature you find spans two of the fragments, it will not be represented correctly.
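The fragment arithmetic can be checked with a short calculation. This Python sketch assumes that each fragment starts 100,000 bases after the previous one (110,000 minus the 10,000-base overlap), that the last fragment simply stops at the end of the sequence, and that Ecouw67 is exactly 372,000 bases long. All three are assumptions made for illustration; the function is not part of the Wisconsin Package.

```python
def split_long_sequence(name, length, fragment=110_000, overlap=10_000, limit=350_000):
    """Return (fragment name, first base, last base) tuples, 1-based and inclusive."""
    if length <= limit:
        return [(name, 1, length)]   # short enough to be kept whole
    step = fragment - overlap        # distance between fragment starts
    pieces = []
    start = 0
    while True:
        end = min(start + fragment, length)
        pieces.append((f"{name}_{len(pieces)}", start + 1, end))
        if end == length:
            break
        start += step
    return pieces


for piece in split_long_sequence("Ecouw67", 372_000):
    print(piece)
```

Under these assumptions a 372,000-base sequence yields four fragments, in agreement with the Ecouw67 example. A feature lying entirely between bases 100,001 and 110,000 falls inside both of the first two fragments, while a feature longer than the 10,000-base overlap that crosses a fragment boundary cannot be contained in a single fragment.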
Note that the output list can contain any number of entries and may result in an extremely large output list file.
Become familiar with the format of each database by doing a number of simple queries and looking at the output carefully. The topic ANNOTATING LISTS tells you how to display the original records from each database.
Use the # symbol to turn off automatic wildcard extension, thereby reducing the number of entries in your output (see Wildcard Extension in the WRITING QUERIES topic).
If you search both protein and nucleotide databases in a single session, your output list will probably contain sequences of both types. Most GCG programs that do multiple sequence analysis do not support lists of mixed sequence type. For example, mixed lists are not suitable for input to programs such as PileUp, WordSearch, FastA, and FrameSearch. Therefore, if you want to use the output list as input to other Wisconsin Package programs, you should search protein and nucleotide databases separately.
LookUp normally writes a simple list of sequences identified by entry name and definition. The -ANNotate parameter lets you add other annotation from the original sequence record to each sequence in the list to help you identify the sequence and understand how LookUp processed your query.
The values you can use with this parameter correspond to fields that are indexed for LookUp: ACCession, AUThor, DATe, DEFInition, FEAture, NAMe, KEYword, ORGanism, REFerence, and TITle. For example, -ANNotate=AUThor annotates each sequence in an output list with author names.
If you have chosen Whole entries for the "Form of output list" field of the query form when -ANNotate=FEAture, LookUp includes the whole feature table next to each sequence. This can create large output files. If you have chosen Fragments for this field when -ANNotate=FEAture, LookUp includes only the feature of interest.
The date does not appear on a separate line in GenBank, so if you want to see the date for GenBank entries, use -ANNotate=NAMe instead of -ANNotate=DATe. Note that LookUp does not support date or reference annotation for PIR.
Annotated lists, like other lists, are compatible with GCG programs that support multiple sequence specifications.
You can turn off annotation altogether with -NOANNotate. LookUp is much faster with annotation turned off.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
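The capitalization convention can be applied mechanically. This Python sketch extracts the letters you must type from a parameter name written in the manual's style; the function name is invented for this example, and the sketch only reads the leading run of capital letters as described above.

```python
def minimal_abbreviation(parameter):
    """Return the shortest form of a parameter: its leading capital letters."""
    body = parameter.lstrip("-")
    required = ""
    for char in body:
        if not char.isupper():
            break
        required += char
    return "-" + required


for name in ["-ACCessionnumber", "-NOWILdcardextension", "-INfile", "-DEFInition"]:
    print(name, "->", minimal_abbreviation(name))
```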
Minimal Syntax: % lookup [-ALLtext=]globin -Default

Prompted Parameters:

-LIBrary=pir[,...]            specifies one or more data libraries
-ALLtext=globin               searches all text indices
-DEFInition=globin            searches definition index for one or more words
                              indexed independently, e.g. "Globin & Region"
-AUThor=smithies              searches author index for one or more authors,
                              e.g. "Smithies, O. & Slightom, J.L."
-KEYword=globin               see document before using keywords
-NAMe=hsggl3                  searches entry name index
-ACCessionnumber=s12345       searches accession number index
-ORGanism="Homo Sapiens"      searches genus and species index
-REFerence=cell&1981          searches complete reference index
-TITle=history                searches title of citation index
-FEAture=gamma                searches for any word in a feature table
-SHOrtest=100                 finds only sequences of length 100 or more
-LONgest=400                  finds only sequences of length 400 or less
-EARliest=01-apr-1992         searches for sequences modified on or after
                              specified date
-LATest=30-apr-1992           searches for sequences modified on or before
                              specified date
-MATch=or                     specifies inter-field logic (AND is default)
-OUTfile=lookup.list          names output file for list of sequences

Local Data Files: None

Optional Parameters:

-NOWILdcardextension          turns off automatic wildcard extension
-INfile=@lookup.list          searches in lookup.list instead of libraries
-ANNotate=feature[,...]       shows fields from original annotation in output
                              acceptable values include: ACCession, AUThor,
                              DATe, DEFinition, FEAture, NAMe, KEYword,
                              ORGanism, REFerence, and TITle
-FRAgments                    shows features as fragments instead of whole entries
-COMplete                     shows only features with unambiguous coordinates
-MONitor                      shows databases searched and how many hits found
LookUp is a database application that makes use of the Sequence Retrieval System (SRS) written by Dr. Thure Etzold of the European Molecular Biology Laboratory (EMBL). SRS is described in CABIOS 9(1); 49-57 (1993) and Appl. Biosci 9; 59-64 (1993). We are grateful to Thure Etzold, Patrick Argos, and EMBL for making the internals of SRS available for use with LookUp. Scott Rose wrote the LookUp application for GCG and John Devereux wrote the documentation.
If you know how to browse the World Wide Web, you can look at SRS under the URLs http://srs.ebi.ac.uk:5000/ and http://srs.ebi.ac.uk.
None.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
searches the SwissProt and GenBank data libraries.
searches all text indices for the word globin. The text indices are Author, Definition, Feature, Keyword, Organism, Reference, and Title. (Note that the Name and Accession Number indices are not included.)
searches for entries whose definition line contains the word globin.
searches for entries derived from publications containing an author whose surname is Smithies.
searches for entries that contain the word globin in their KEYWORDS field.
searches for the sequence entry whose name is HSGGL3. Depending on the database, the name may correspond to the LOCUS name, the ID name, the ENTRY name, etc.
searches for the sequence entry whose accession number is S12345.
searches for any sequence entries deriving from the organism Homo sapiens. The genus and species names are indexed as a unit. If you want to search on the species name alone, you must preface it with a wild card: -ORGanism=*Sapiens.
-REFerence=cell&1981

searches for entries reported in the journal Cell in 1981. (This index does not work correctly in this release.)

-TITle=history

searches for sequences reported in articles whose title contains the word history.

-FEAture=gamma

searches for sequence entries whose feature table contains the word gamma.

-SHOrtest=100

searches for sequences containing 100 or more residues.

-LONgest=400

searches for sequences containing 400 or fewer residues.

-EARliest=01-apr-1992

searches for sequence entries that were entered or last modified on or after April 1, 1992.

-LATest=30-apr-1992

searches for sequence entries that were entered or last modified on or before April 30, 1992.
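The length and date limits are inclusive at both ends. The Python sketch below shows the comparisons this implies for a single entry; it is a model of the documented behavior rather than LookUp's own code, and the names parse_dd_mmm_yyyy and in_range are hypothetical.

```python
from datetime import date

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}


def parse_dd_mmm_yyyy(text):
    """Convert a date such as '01-apr-1992' into a datetime.date."""
    day, month, year = text.lower().split("-")
    return date(int(year), MONTHS[month], int(day))


def in_range(length, modified, shortest=None, longest=None,
             earliest=None, latest=None):
    """Return True if an entry passes every limit that was given."""
    if shortest is not None and length < shortest:
        return False
    if longest is not None and length > longest:
        return False
    if earliest is not None and modified < parse_dd_mmm_yyyy(earliest):
        return False
    if latest is not None and modified > parse_dd_mmm_yyyy(latest):
        return False
    return True


limits = dict(shortest=100, longest=400,
              earliest="01-apr-1992", latest="30-apr-1992")
print(in_range(100, date(1992, 4, 1), **limits))  # True: both limits inclusive
print(in_range(99, date(1992, 4, 15), **limits))  # False: too short
print(in_range(400, date(1992, 5, 1), **limits))  # False: modified too late
```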
-MATch=or

specifies the logic to be used to combine index fields (the default is AND).
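Inter-field logic can be pictured as set operations on the entries found for each field. The Python sketch below assumes that AND keeps only the entries found by every field and OR keeps the entries found by any field; the function name combine_hits and the entry names are made up for illustration.

```python
def combine_hits(hits_by_field, match="and"):
    """Combine per-field hit sets with AND (intersection) or OR (union)."""
    sets = [set(hits) for hits in hits_by_field.values()]
    if not sets:
        return set()
    if match.lower() == "or":
        return set().union(*sets)
    return set.intersection(*sets)


# Hypothetical entry names found by two fields of one query.
hits = {
    "AUThor": {"ENTRY_A", "ENTRY_B", "ENTRY_C"},
    "KEYword": {"ENTRY_B", "ENTRY_C", "ENTRY_D"},
}
print(sorted(combine_hits(hits, "and")))  # ['ENTRY_B', 'ENTRY_C']
print(sorted(combine_hits(hits, "or")))   # ['ENTRY_A', 'ENTRY_B', 'ENTRY_C', 'ENTRY_D']
```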
-NOWILdcardextension

LookUp normally treats all values in your query as if they ended with an asterisk wildcard (see Wildcard Extension in the WRITING QUERIES topic). You can suppress this automatic wildcard extension either by adding a # to the end of any value that you do not want extended or by using this parameter to suppress it for all values. With automatic wildcard extension turned off, you must explicitly append an asterisk to make any particular field value wild.
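The rules for automatic wildcard extension can be summarized in a few lines of Python. This is a sketch of the behavior described above, not LookUp's implementation; the function name effective_pattern is hypothetical, and stripping a trailing # when extension is already turned off is an assumption.

```python
def effective_pattern(value, auto_extend=True):
    """Return the pattern that a query value stands for.

    auto_extend=False models the -NOWILdcardextension parameter.
    """
    if value.endswith("#"):
        # A trailing '#' suppresses extension for this value only.
        return value[:-1]
    if auto_extend and not value.endswith("*"):
        return value + "*"
    return value


print(effective_pattern("globin"))                      # globin*
print(effective_pattern("globin#"))                     # globin
print(effective_pattern("globin", auto_extend=False))   # globin
print(effective_pattern("globin*", auto_extend=False))  # globin*
```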
-INfile=@lookup.list

LookUp can use a list file created during a previous session with LookUp as the search set. A parameter like the one in this example can be used in place of -LIBrary=swissprot. If both -LIBrary and -INfile are used, the program uses the list file and ignores the library parameter.
-ANNotate=feature[,...]

LookUp normally writes a simple list of sequences identified by entry name and definition. The -ANNotate parameter lets you add other annotation from the original sequence record to each sequence in the list to help you identify the sequence and understand how LookUp processed your query.
The values you can use with this parameter correspond to fields that are indexed for LookUp: ACCession, AUThor, DATe, DEFInition, FEAture, NAMe, KEYword, ORGanism, REFerence, and TITle. For example, -ANNotate=AUThor annotates each sequence in an output list with authors.
If you have chosen Whole entries for the "Form of output list" field of the form and -ANNotate=FEAture, LookUp includes the whole feature table next to each sequence. This can create large output files. If you have chosen Fragments for this field and -ANNotate=FEAture, LookUp includes only the feature of interest.
The date does not appear on a separate line in GenBank, so if you want the date for GenBank entries in your output list, use -ANNotate=NAMe instead of -ANNotate=DATe. Note that PIR entries are not indexed for date.
Annotated lists, like other lists, are compatible with GCG programs that support multiple sequence specifications.
You can turn off annotation altogether with -NOANNotate. LookUp is much faster with annotation turned off.
-FRAgments

LookUp normally writes a list file even if you search for sequences in the Feature index. In addition, it can show the exact locations of most features. If you select Fragments from the "Form of output list" field in the form or use -FRAgments, LookUp will represent features with their beginning and ending coordinates, the strand on which they are found, and whether they are joined to other features appearing below them in the list. These multi-fragment features can be joined together into a new composite sequence with Assemble or Translate.
-COMplete

Some features have starting and ending positions that are beyond the bounds of the sequence data archived in a particular database entry. In the feature tables of GenBank and EMBL, these features are represented with ranges that have a < before the beginning coordinate and/or a > before the ending coordinate. Here is a feature whose beginning lies before the first base stored in the sequence entry:

CDS <1..81
/note="gamma globin; NCBI gi: 386767"

GBPR:HUMHBG3E Begin: 1 End: 81 Strand: + Join: HUMHBG3E-2 !Id: 0200...
! CDS <1..81
! /note="gamma globin; NCBI gi: 386767"

! GBPR:HUMHBG3E Begin: 1 End: 81 Strand: + Join: HUMHBG3E-2 !Id: 0200...
! CDS <1..81
! /note="gamma globin; NCBI gi: 386767"
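Whether a feature range has unambiguous coordinates can be checked from the range text itself. The Python sketch below handles only simple ranges such as <1..81 and treats a < before the beginning coordinate or a > before the ending coordinate as marking an incomplete feature; joined or otherwise compound locations are not handled, and the function name parse_range is hypothetical.

```python
import re

# Simple ranges only: optional '<', begin, '..', optional '>', end.
RANGE = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_range(text):
    """Return (begin, end, complete) for a simple feature range."""
    found = RANGE.match(text.strip())
    if found is None:
        raise ValueError(f"not a simple feature range: {text!r}")
    begin_open, begin, end_open, end = found.groups()
    complete = not (begin_open or end_open)
    return int(begin), int(end), complete


print(parse_range("<1..81"))  # (1, 81, False)
print(parse_range("1..81"))   # (1, 81, True)
```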
-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.
LookUp prints a period on your screen every time it writes 100 lines into your output file.
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.