FROMEMBL

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
RELATED PROGRAMS
INPUT FILES
RESTRICTIONS
COMMAND-LINE SUMMARY
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

FromEMBL reformats sequences from the distribution (flat file) format of the EMBL database into individual sequence files in GCG format.

DESCRIPTION

Use FromEMBL when you want to use sequences in EMBL's distribution format with the Wisconsin Package(TM). Because an EMBL distribution file holds many sequences, FromEMBL writes many output files, one for each sequence entry in the EMBL file. Each output file is named for the identifier word on the ID line at the beginning of the entry. All documentation from the EMBL input file is preserved in the GCG output files. The nucleic acid ambiguity codes are preserved, except that the hyphen (-) symbol in EMBL sequences is changed to an N in the GCG files.
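
The splitting described above can be pictured with a short Python sketch. This is only an illustration under stated assumptions, not FromEMBL's actual code: the function name split_embl_entries is hypothetical, and the output name (the ID word in lowercase plus the extension .embl) is inferred from the example session rather than documented.

```python
def split_embl_entries(text):
    """Yield (output_name, heading_lines, sequence) for each complete entry.

    Assumption: the output name is the lowercased ID word plus ".embl",
    as the example session suggests. Entries with no closing "//" are dropped.
    """
    name, heading, seq_parts, in_seq = None, [], [], False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of entry: emit what was collected and reset the state.
            if name is not None:
                yield name, heading, "".join(seq_parts)
            name, heading, seq_parts, in_seq = None, [], [], False
        elif in_seq:
            # Keep letters and hyphens only, then change each hyphen to N.
            residues = "".join(ch for ch in line if ch.isalpha() or ch == "-")
            seq_parts.append(residues.replace("-", "N"))
        elif line.startswith("ID"):
            parts = line.split()
            if len(parts) >= 2:
                name = parts[1].lower() + ".embl"
                heading.append(line)
        elif name is not None:
            heading.append(line)
            if line.startswith("SQ"):
                in_seq = True  # sequence lines follow the SQ line
```

Each tuple the sketch yields corresponds to one output file: the heading lines carry the documentation, and the sequence string is what would be written below it.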

EXAMPLE

Here is a session using FromEMBL to convert the EMBL distribution file embl.dat into separate files in GCG format:


% fromembl

  FromEMBL of what EMBL flat sequence data file ?  embl.dat

         a1mvrna2.embl 2593  bp
           a7nifh.embl 1271  bp
           a7nifx.embl 3169  bp
            a7xag.embl 1395  bp
          aagigii.embl 3411  bp

  Finished FROMEMBL

  Sequences: 5
      Bases: 11,839

%

OUTPUT

Here is part of the first output file, a1mvrna2.embl, from the example above:


ID   A1MVRNA2   standard; RNA; 2593 BP.

AC   X01572;

DT   03-AUG-1987  (an correction)
DT   30-JAN-1986  (author review)
DT   17-JUL-1985  (first entry)

DE   Alfalfa mosaic virus (A1M4) RNA 2

KW   unidentified reading frame.

OS   Alfalfa mosaic virus
OC   Viridae; ss-RNA nonenveloped viruses; Alfamovirus.

RN   [1]   (bases 1-2593; enum. 1 to 2593)
RA   Cornelissen B.J.C., Brederode F.T., Veeneman G.H., van Boom J.H.,
RA   Bol J.F.;
RT   "Complete nucleotide sequence of alfalfa mosaic virus RNA 2";
RL   Nucl. Acids Res. 11:3019-3025(1983).

CC   Data kindly reviewed (30-JAN-1986) by J.F. Bol

FH   Key        From     To       Description
FH
FT   TRANSCR       1   2593       A1MV RNA 2
FT   CDS          55   2424       unidentified reading frame (aa 1-790)

SQ   Sequence  2593 BP;  736 A;  533 C;  547 G;  777 U;

a1mvrna2.embl  Length: 2593 September 30, 1998 Type: N  Check: 1871  ..

       1  GUUUUUAUCU UUUCGCGAUU GAAAAGAUAA GUUUUUCAGU UUAAUCUUUU

      51  CAAUAUGUUC ACUCUUUUGA GAUGUCUCGG AUUCGGUGUU AAUGAACCUA

    ////////////////////////////////////////////////////////////

    2501  UCCUGAUAGG AGAAAUUCUA UAUUGCUUAU AUAUGUGCUU ACGCACAUAU

    2551  AUAAAUGCUC AUGCAAAACU GCAUGAAUGC CCCUAAGGGA UGC
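
The layout of the sequence lines shown above (a right-aligned starting position, then five blocks of ten symbols, with a blank line between rows) can be reproduced with a small Python sketch. The function name format_gcg_body and the field width of eight characters are assumptions read off this one example, not a specification of GCG format.

```python
def format_gcg_body(seq, per_line=50, block=10):
    """Lay out a sequence the way the sample output above appears.

    Assumption: the position field is 8 characters wide and is followed
    by two spaces, as measured from the example output.
    """
    rows = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        # Break the row into space-separated blocks of ten symbols.
        blocks = " ".join(chunk[i:i + block] for i in range(0, len(chunk), block))
        rows.append(f"{start + 1:>8}  {blocks}")
    # Rows are separated by one blank line.
    return "\n\n".join(rows)
```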

RELATED PROGRAMS

The following programs convert sequences between other formats and GCG format: FromEMBL, FromGenBank, FromIG, FromPIR, FromStaden, FromFastA, ToIG, ToPIR, ToStaden and ToFastA.

DataSet creates a GCG data library from any set of sequences in GCG format. GCGToBLAST creates a database that can be searched by the BLAST program from any set of sequences in GCG format.

INPUT FILES

FromEMBL accepts a single file in EMBL's distribution format as input. The input file may contain one or more sequences. Here is part of the input file used for the example above:


ID   A1MVRNA2   standard; RNA; 2593 BP.
XX
AC   X01572;
XX
DT   03-AUG-1987  (an correction)
DT   30-JAN-1986  (author review)
DT   17-JUL-1985  (first entry)
XX
DE   Alfalfa mosaic virus (A1M4) RNA 2
XX
KW   unidentified reading frame.
XX
OS   Alfalfa mosaic virus
OC   Viridae; ss-RNA nonenveloped viruses; Alfamovirus.
XX
RN   [1]   (bases 1-2593; enum. 1 to 2593)
RA   Cornelissen B.J.C., Brederode F.T., Veeneman G.H., van Boom J.H.,
RA   Bol J.F.;
RT   "Complete nucleotide sequence of alfalfa mosaic virus RNA 2";
RL   Nucl. Acids Res. 11:3019-3025(1983).
XX
CC   Data kindly reviewed (30-JAN-1986) by J.F. Bol
XX
FH   Key        From     To       Description
FH
FT   TRANSCR       1   2593       A1MV RNA 2
FT   CDS          55   2424       unidentified reading frame (aa 1-790)
XX
SQ   Sequence  2593 BP;  736 A;  533 C;  547 G;  777 U;
     GUUUUUAUCU UUUCGCGAUU GAAAAGAUAA GUUUUUCAGU UUAAUCUUUU CAAUAUGUUC
     ACUCUUUUGA GAUGUCUCGG AUUCGGUGUU AAUGAACCUA CUAACACUUC CUCAUCAGAG

     /////////////////////////////////////////////////////////////////

     ATGCCCTCCT GTGCAGCAGC AGGTACTGCT GGATGAGGAG CCATCGGTCT CTGCACGCAA
     ACCCAACTTC CTCTTCATTC TCACGGATGA TCAGGATCTC CGGATGAATT C
//
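
As a way of reading the SQ line in the sample above, the following Python sketch pulls out the declared length and the per-base counts so that the two can be compared. It is an illustration only: parse_sq_line is a hypothetical helper, and it assumes the SQ line has the layout shown in this example.

```python
import re

def parse_sq_line(line):
    """Return (declared_length, {symbol: count}) from an SQ line.

    Assumes the layout shown in the sample: "SQ   Sequence  2593 BP;  736 A; ..."
    """
    if not line.startswith("SQ"):
        raise ValueError("not an SQ line")
    match = re.search(r"(\d+)\s+BP", line)
    if match is None:
        raise ValueError("no length found on SQ line")
    length = int(match.group(1))
    # Every "<number> <letters>;" pair is a count, except the length itself.
    counts = {sym: int(n) for n, sym in re.findall(r"(\d+)\s+([A-Za-z]+);", line)}
    counts.pop("BP", None)
    return length, counts

length, counts = parse_sq_line(
    "SQ   Sequence  2593 BP;  736 A;  533 C;  547 G;  777 U;"
)
# For this entry the four counts add up to the declared length.
consistent = sum(counts.values()) == length
```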

When FromEMBL writes GCG sequence files, it assigns the sequence type based on the composition of the sequence characters. This method is not fool-proof, so to ensure that the output files are written with the correct sequence type, use -PROtein or -NUCleotide on the command line when running FromEMBL.

If FromEMBL is run interactively, you can watch the program monitor to see if the sequences are assigned the correct type. As each new file is written, its name and the number of bases (bp) or amino acids (aa) appear on the screen. If the wrong abbreviation appears (for example, bp appears for a protein sequence), the sequence file was assigned the wrong type. The sequence type also appears in the sequence file. Look on the last line of the text heading just above the sequence itself for Type: N or Type: P.

If the sequence type was incorrectly assigned, see Appendix VI for information on how to change or set the type of a sequence.
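
If you have many output files to check, the type can be read from the last heading line with a few lines of Python. This is a minimal sketch: read_gcg_header is a hypothetical helper, and the pattern assumes the heading line looks like the one in the OUTPUT example (Length:, Type:, and Check: fields followed by two periods).

```python
import re

# Pattern assumed from the sample line:
#   a1mvrna2.embl  Length: 2593 September 30, 1998 Type: N  Check: 1871  ..
HEADER = re.compile(
    r"Length:\s*(\d+).*?Type:\s*([NP]).*?Check:\s*(\d+)\s+\.\.\s*$"
)

def read_gcg_header(line):
    """Return the length, type, and checksum fields, or None if absent."""
    match = HEADER.search(line)
    if match is None:
        return None
    return {
        "length": int(match.group(1)),
        "type": match.group(2),  # "N" for nucleic acid, "P" for protein
        "check": int(match.group(3)),
    }
```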

RESTRICTIONS

The Wisconsin Package does not accept sequences longer than 350,000 characters. If an EMBL flat file contains a sequence longer than 350,000 characters, FromEMBL truncates the sequence after the 350,000th character.

Each sequence entry in the input flat file must be in EMBL format. In particular, FromEMBL must find these three lines:
1. Each entry starts with "ID SeqName"
2. Each heading ends with "SQ "
3. Each sequence ends with "//"
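
Before running FromEMBL on a file from an unfamiliar source, you may want to check these restrictions yourself. The Python sketch below does so for one entry at a time. It is only an illustration: check_entry and truncate are hypothetical names, and the tests are a literal reading of the three requirements and the length limit stated above.

```python
MAX_LENGTH = 350_000  # longest sequence the Wisconsin Package accepts

def check_entry(lines):
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    # 1. The entry starts with "ID SeqName".
    if not lines or not (lines[0].startswith("ID") and len(lines[0].split()) >= 2):
        problems.append("entry does not start with an ID line and a name")
    # 2. The heading ends with an "SQ " line.
    if not any(line.startswith("SQ ") for line in lines):
        problems.append("heading has no SQ line")
    # 3. The sequence ends with "//".
    if not lines or not lines[-1].startswith("//"):
        problems.append("entry does not end with //")
    return problems

def truncate(seq, limit=MAX_LENGTH):
    """Cut a sequence off after the limit, as described for FromEMBL."""
    return seq[:limit]
```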

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % fromembl [-INfile=]embl.dat -Default

Prompted Parameters: None

Local Data Files: None

Optional Parameters:

-PROtein                   insists that the input sequences are proteins
-NUCleotide                insists that the input sequences are nucleic acids
-LIStfile[=fromembl.list]  writes a list file of output sequence names
-DIRectory=DirName         writes output to another directory
-NOMONitor                 suppresses the screen trace for each output sequence
-NOSUMmary                 suppresses the summary at the end of the program
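
The capitalization rule described above can be expressed as a short Python sketch. It is a hedged illustration, not the package's own parser: matches_parameter is a hypothetical name, and the sketch assumes that matching ignores case and that any longer prefix of the full parameter name is also accepted.

```python
from itertools import takewhile

def matches_parameter(typed, spec):
    """Return True if what the user typed selects the parameter in spec.

    spec is written as in the summary, for example "-PROtein".
    Assumptions: case is ignored, and anything between the required
    capital letters and the full name is accepted.
    """
    name = spec.lstrip("-")
    # The leading capital letters are the part that must be typed.
    required = "".join(takewhile(str.isupper, name)).lower()
    # Ignore any value given after "=".
    word = typed.lstrip("-").split("=", 1)[0].lower()
    return word.startswith(required) and name.lower().startswith(word)
```

Under these assumptions, -PRO, -prot, and -PROtein all select -PROtein, while -PR is too short to do so.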

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-PROtein and -NUCleotide

set the program to expect protein or nucleic acid sequences, respectively. Normally, FromEMBL determines whether an input sequence is protein or nucleic acid by looking at its composition. If the first 300 alphabetic characters in a sequence are composed entirely of IUB-IUPAC nucleotide codes (see Appendix III), it is reformatted as a nucleic acid sequence in GCG format; otherwise it is reformatted as a protein sequence. Using these command-line parameters, you can insist that your sequences are proteins (-PROtein) or nucleic acids (-NUCleotide).
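
The composition test described above can be sketched in Python as follows. This is an approximation for illustration: guess_type is a hypothetical name, and the set of nucleotide codes is the standard IUB-IUPAC set, which is assumed to match the table in Appendix III.

```python
# Standard IUB-IUPAC nucleotide codes (assumed to match Appendix III).
IUPAC_NUCLEOTIDES = set("ACGTURYMKSWBDHVN")

def guess_type(seq, window=300):
    """Return "N" or "P" from the first `window` alphabetic characters."""
    letters = [ch.upper() for ch in seq if ch.isalpha()][:window]
    # Nucleic acid only if every examined letter is a nucleotide code.
    if all(ch in IUPAC_NUCLEOTIDES for ch in letters):
        return "N"
    return "P"
```

The sketch also shows why the method can fail: a protein whose first 300 residues happen to use only letters that are also nucleotide codes (A, C, G, and so on) would be typed as a nucleic acid, which is the case -PROtein is meant to handle.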

-LIStfile=fromembl.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases in the User's Guide). If you don't specify a file name, FromEMBL names the list file fromembl.list.

-DIRectory=DirName

writes the output files into a directory other than your current working directory.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: December 9, 1998 16:27 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks

Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com