FROMGENBANK

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
COMMAND-LINE SUMMARY
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

FromGenBank reformats one or more sequences in the flat file format of the GenBank database into individual sequence files in GCG format.

DESCRIPTION

Use FromGenBank to convert sequences from GenBank flat file distribution format to GCG format for use with the Wisconsin Package(TM). Since GenBank maintains many sequences in one file, FromGenBank must write many output files, one for each sequence in the GenBank file. Each output file is named according to the identifier word on the LOCUS line at the beginning of each sequence entry. All the documentation from the GenBank input files is preserved in the GCG output files.
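
The naming rule can be illustrated with a short Python sketch. This is not FromGenBank's implementation: the function name split_entries is hypothetical, and the lowercase file name with a .seq extension is inferred from the example session in this document rather than from a stated rule.

```python
def split_entries(text):
    """Yield (filename, entry_lines) for each entry in GenBank flat-file text."""
    entry = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # A LOCUS line opens a new entry.
            entry = [line]
        elif entry:
            entry.append(line)
            if line.strip() == "//":
                # The identifier is the second word on the LOCUS line.
                # Lowercasing and ".seq" are assumptions based on the example.
                name = entry[0].split()[1].lower() + ".seq"
                yield name, entry
                entry = []
```

Any text that precedes the first LOCUS line is skipped, so each yielded entry corresponds to exactly one output file.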

EXAMPLE

Here is a session using FromGenBank to convert the GenBank distribution file genbank.seq into separate sequence files in GCG format:


% fromgenbank

    Reformat what GenBank data file?  genbank.seq

         ecoogt.seq   719 bp.
       ecoomega.seq   764 bp.
        ecoompa.seq  2270 bp.

    reformatted: genbank.seq
    total files: 3
    total bases: 3753

%

OUTPUT

Here is part of the first output file, ecoogt.seq, from the example above:


LOCUS       ECOOGT        719 bp ds-DNA             BCT       15-SEP-1989
DEFINITION  E. coli ogt gene for O-6-alkylguanine-DNA-alkyltransferase.
ACCESSION   Y00495
KEYWORDS    DNA repair; O-6-alkylguanine-DNA-alkyltransferase; ogt gene.
SOURCE      Escherichia coli.
  ORGANISM  Escherichia coli
            Prokaryota; Bacteria; Gracilicutes; Scotobacteria; Facultatively
            anaerobic rods; Enterobacteriaceae.
REFERENCE   1  (bases 1 to 719)
  AUTHORS   Potter,P.M., Wilkinson,M.C., Fitton,J., Carr,F.J., Brennand,J.,
            Cooper,D.P. and Margison,G.P.
  TITLE     Characterisation and nucleotide sequence of ogt, the
            O6-alkylguanine-DNA-alkyltransferase gene of E.coli
  JOURNAL   Nucleic Acids Res. 15, 9177-9193 (1987)
  STANDARD  simple automatic
REFERENCE   2  (bases 1 to 719)
  AUTHORS   Potter,P.
  JOURNAL   Unpublished (1988) see COMMENT for author address
  STANDARD  simple automatic
COMMENT        EMBL features not translated to GenBank features:
               key        from     to       description

               PRM         101    106       -35 region
               PRM         124    132       -10 region
               SITE        137    141       region of transcription initiation
               RBS         174    183       pot. ribosome binding site

               SITE        601    603       pot. alkylgroup acceptor

               [Unpublished (1988) see COMMENT for author address] Author
            address Potter P., Paterson Institute  for Cancer
            Research, Christie Hospital, Wilmslow Road, Manchester M20 9BX,
            UK.

            Submitted (19-JAN-1988) to the EMBL data library.
FEATURES             Location/Qualifiers
     CDS             187. .702
                     /note="O-6-alkylguanine-DNA-alkyltransferase"
BASE COUNT      163 a    172 c    196 g    188 t
ORIGIN

ecoogt.seq  Length: 719  September 30, 1998 09:21  Type: N  Check: 3921  ..

       1  TTCCACTGTT TCTTGGATTC CTGCAACGCT ACAAACCAGA CGCGAAACTG

      51  GGTACTTACT ATTCGTTAGT CTTGCCCTAT CCGCTTATCT TTTTGGTGGT

     ///////////////////////////////////////////////////////////

     651  AGTTCAGCGA AAAGAGTGGT TATTGCGCCA TGAAGGTTAT CTTTTGCTGT

     701  AAACATTAAA CAATTTGTG

INPUT FILES

FromGenBank accepts a single file in GenBank's flat file distribution format as input. The input file may contain one or more sequences. Here is part of the input file used for the example above:


LOCUS       ECOOGT        719 bp ds-DNA             BCT       15-SEP-1989
DEFINITION  E. coli ogt gene for O-6-alkylguanine-DNA-alkyltransferase.
ACCESSION   Y00495
KEYWORDS    DNA repair; O-6-alkylguanine-DNA-alkyltransferase; ogt gene.
SOURCE      Escherichia coli.
  ORGANISM  Escherichia coli
            Prokaryota; Bacteria; Gracilicutes; Scotobacteria;
            Facultatively anaerobic rods; Enterobacteriaceae.
REFERENCE   1  (bases 1 to 719)
  AUTHORS   Potter,P.M., Wilkinson,M.C., Fitton,J., Carr,F.J.,
            Brennand,J., Cooper,D.P. and Margison,G.P.
  TITLE     Characterisation and nucleotide sequence of ogt, the
            O6-alkylguanine-DNA-alkyltransferase gene of E.coli
  JOURNAL   Nucleic Acids Res. 15, 9177-9193 (1987)
  STANDARD  simple automatic
REFERENCE   2  (bases 1 to 719)
  AUTHORS   Potter,P.
  JOURNAL   Unpublished (1988) see COMMENT for author address
  STANDARD  simple automatic
COMMENT        EMBL features not translated to GenBank features:
               key      from     to       description

               PRM       101    106       -35 region
               PRM       124    132       -10 region
               SITE      137    141       region of transcription init.
               RBS       174    183       pot. ribosome binding site

               SITE      601    603       pot. alkylgroup acceptor

               [Unpublished (1988) see COMMENT for author address] Author
            address Potter P., Paterson Institute  for Cancer
            Research, Christie Hospital, Wilmslow Road, Manchester M20 9BX,
            UK.

            Submitted (19-JAN-1988) to the EMBL data library.
FEATURES             Location/Qualifiers
     CDS             187..702
                     /note="O-6-alkylguanine-DNA-alkyltransferase"
BASE COUNT      163 a    172 c    196 g    188 t
ORIGIN
   1 ttccactgtt tcttggattc ctgcaacgct acaaaccaga cgcgaaactg ggtacttact
  61 attcgttagt cttgccctat ccgcttatct ttttggtggt atggctgctg atgttgctgg

 /////////////////////////////////////////////////////////////////////

 601 tgccatcggg ttattggccg aaacggcacc atgaccggat atgcaggcgg agttcagcga
 661 aaagagtggt tattgcgcca tgaaggttat cttttgctgt aaacattaaa caatttgtg
//

When FromGenBank writes GCG sequence files, it assigns the sequence type based on the composition of the sequence characters. This method is not foolproof, so to ensure that the output files are written with the correct sequence type, use -PROtein or -NUCleotide on the command line when running FromGenBank.

If FromGenBank is run interactively, you can watch the program monitor to see if the sequences are assigned the correct type. As each new file is written, its name and the number of bases (bp) or amino acids (aa) appear on the screen. If the wrong abbreviation appears (for example, bp appears for a protein sequence), the sequence file was assigned the wrong type. The sequence type also appears in the sequence file. Look on the last line of the text heading just above the sequence itself for Type: N or Type: P.

If the sequence type was incorrectly assigned, see Appendix VI for information on how to change or set the type of a sequence.
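
Checking the type of many output files by eye is tedious, so here is a minimal Python sketch that reads the Type field from the heading line. It assumes the heading line is the first line ending in two dots, as in the example output in this document; the function name gcg_sequence_type is hypothetical and is not part of the Wisconsin Package.

```python
import re


def gcg_sequence_type(text):
    """Return 'N' or 'P' from the heading line of GCG file text, or None."""
    for line in text.splitlines():
        # Assumption: the heading line is the first line that ends in "..".
        if line.rstrip().endswith(".."):
            match = re.search(r"Type:\s*([NP])\b", line)
            return match.group(1) if match else None
    return None
```

A return value of None means no heading line with a recognizable Type field was found, which is worth treating as a file to inspect by hand.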

RELATED PROGRAMS

The following programs convert sequences between other formats and GCG format: FromEMBL, FromGenBank, FromIG, FromPIR, FromStaden, FromFastA, ToIG, ToPIR, ToStaden and ToFastA.

DataSet creates a GCG data library from any set of sequences in GCG format. GCGToBLAST creates a database that can be searched by the BLAST program from any set of sequences in GCG format.

RESTRICTIONS

The Wisconsin Package does not accept sequences longer than 350,000 characters. If a GenBank flat file contains a sequence longer than 350,000 characters, FromGenBank divides it into more than one output entry. Each extra output entry has a number appended to the input entry's name. Because of this, the number of output entries may be greater than the number of input entries.
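
The effect of this limit on the number of output entries can be sketched in Python. The manual does not say how FromGenBank numbers the extra entries or where it cuts the sequence, so the non-overlapping cuts, the numbering from 1, and the name split_long are all assumptions made for illustration.

```python
MAX_LEN = 350_000  # length limit stated in this section


def split_long(name, sequence, max_len=MAX_LEN):
    """Return a list of (entry_name, chunk) pairs, each chunk at most max_len long."""
    chunks = [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]
    if not chunks:
        chunks = [sequence]  # keep a single entry for an empty sequence
    entries = []
    for n, chunk in enumerate(chunks):
        # Assumption: the first entry keeps the input name and each extra
        # entry gets a number appended (name1, name2, ...).
        entry_name = name if n == 0 else f"{name}{n}"
        entries.append((entry_name, chunk))
    return entries
```

Under these assumptions, a 700,001-character sequence yields three entries, which is why the output count can exceed the input count.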

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % fromgenbank [-INfile=]genbank.seq -Default

Prompted Parameters: None

Local Data Files: None

Optional Parameters:

-PROtein                      insists that the input sequences are proteins
-NUCleotide                   insists that the input sequences are nucleic acids
-LIStfile[=fromgenbank.list]  writes a list file of output sequence names
-DIRectory=DirName            writes output to another directory
-NOMONitor                    suppresses the screen trace for each output sequence
-NOSUMmary                    suppresses the summary at the end of the program

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-PROtein and -NUCleotide

set the program to expect protein or nucleic acid sequences, respectively. Normally, FromGenBank determines whether an input sequence is protein or nucleic acid by looking at its composition. If the first 300 alphabetic characters in a sequence are composed entirely of IUB-IUPAC nucleotide codes (see Appendix III), it is reformatted as a nucleic acid sequence in GCG format; otherwise it is reformatted as a protein sequence. Using these command-line parameters, you can insist that your sequences are proteins (-PROtein) or nucleic acids (-NUCleotide).
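
The composition test described above can be sketched in Python as follows. The symbol set used here is the standard IUB-IUPAC nucleotide alphabet; the exact set in Appendix III may differ, and the name guess_type is hypothetical. This is an illustration of the rule, not the program's own code.

```python
# Assumption: standard IUB-IUPAC nucleotide codes; Appendix III may list others.
IUPAC_NUCLEOTIDE = set("ACGTUMRWSYKVHDBN")


def guess_type(sequence, window=300):
    """Return 'N' if the first `window` letters are all nucleotide codes, else 'P'."""
    # Only alphabetic characters count; digits and spaces are ignored.
    letters = [c.upper() for c in sequence if c.isalpha()][:window]
    return "N" if all(c in IUPAC_NUCLEOTIDE for c in letters) else "P"
```

The sketch also shows why the automatic test can fail: a peptide such as GATTACA uses only letters that are also nucleotide codes, so it would be typed as nucleic acid unless -PROtein is given.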

-LIStfile=fromgenbank.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases in the User's Guide). If you don't specify a file name, FromGenBank makes one up using fromgenbank for the file name and .list for the file name extension.

-DIRectory=DirName

writes the output files into a directory other than your current working directory.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: December 9, 1998 16:27 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com