FromGenBank reformats one or more sequences in the flat file format of the GenBank database into individual sequence files in GCG format.
Use FromGenBank to convert sequences from GenBank flat file distribution format to GCG format for use with the Wisconsin Package(TM). Since GenBank maintains many sequences in one file, FromGenBank must write many output files, one for each sequence in the GenBank file. Each output file is named according to the identifier word on the LOCUS line at the beginning of each sequence entry. All the documentation from the GenBank input files is preserved in the GCG output files.
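The one-file-per-entry behavior described above can be sketched in a few lines of Python. This is only an illustration, not FromGenBank's actual implementation: the function names are hypothetical, and the rule of lowercasing the LOCUS identifier and adding .seq is an assumption inferred from the example session (ECOOGT becomes ecoogt.seq).

```python
def split_genbank_entries(text):
    """Yield (locus_name, entry_text) for each entry in a GenBank flat file.

    An entry is taken to start at a line beginning with 'LOCUS' and to end
    at the '//' terminator line. Anything before the first LOCUS line
    (for example, a distribution file header) is skipped.
    """
    current = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            current = [line]
        elif current:
            current.append(line)
            if line.strip() == "//":
                # The identifier is the second word on the LOCUS line.
                locus = current[0].split()[1]
                yield locus, "\n".join(current) + "\n"
                current = []


def output_file_name(locus, extension=".seq"):
    # Assumption: lowercase identifier plus ".seq", as in the example session.
    return locus.lower() + extension
```

Calling output_file_name on each identifier returned by split_genbank_entries gives one output name per entry in the input file.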
Here is a session using FromGenBank to convert the GenBank distribution file genbank.seq into separate sequence files in GCG format:
% fromgenbank

 Reformat what GenBank data file?  genbank.seq

   ecoogt.seq      719 bp.
   ecoomega.seq    764 bp.
   ecoompa.seq    2270 bp.

 reformatted: genbank.seq
 total files: 3
 total bases: 3753

%
Here is part of the first output file, ecoogt.seq, from the example above:
LOCUS       ECOOGT        719 bp ds-DNA             BCT       15-SEP-1989
DEFINITION  E. coli ogt gene for O-6-alkylguanine-DNA-alkyltransferase.
ACCESSION   Y00495
KEYWORDS    DNA repair; O-6-alkylguanine-DNA-alkyltransferase; ogt gene.
SOURCE      Escherichia coli.
  ORGANISM  Escherichia coli
            Prokaryota; Bacteria; Gracilicutes; Scotobacteria;
            Facultatively anaerobic rods; Enterobacteriaceae.
REFERENCE   1  (bases 1 to 719)
  AUTHORS   Potter,P.M., Wilkinson,M.C., Fitton,J., Carr,F.J., Brennand,J.,
            Cooper,D.P. and Margison,G.P.
  TITLE     Characterisation and nucleotide sequence of ogt, the
            O6-alkylguanine-DNA-alkyltransferase gene of E.coli
  JOURNAL   Nucleic Acids Res. 15, 9177-9193 (1987)
  STANDARD  simple automatic
REFERENCE   2  (bases 1 to 719)
  AUTHORS   Potter,P.
  JOURNAL   Unpublished (1988) see COMMENT for author address
  STANDARD  simple automatic
COMMENT     EMBL features not translated to GenBank features:
            key     from    to      description
            PRM     101     106     -35 region
            PRM     124     132     -10 region
            SITE    137     141     region of transcription initiation
            RBS     174     183     pot. ribosome binding site
            SITE    601     603     pot. alkylgroup acceptor

            [Unpublished (1988) see COMMENT for author address] Author
            address Potter P., Paterson Institute for Cancer Research,
            Christie Hospital, Wilmslow Road, Manchester M20 9BX, UK.
            Submitted (19-JAN-1988) to the EMBL data library.
FEATURES             Location/Qualifiers
     CDS             187. .702
                     /note="O-6-alkylguanine-DNA-alkyltransferase"
BASE COUNT      163 a    172 c    196 g    188 t
ORIGIN

 ecoogt.seq  Length: 719  September 30, 1998 09:21  Type: N  Check: 3921  ..

       1  TTCCACTGTT TCTTGGATTC CTGCAACGCT ACAAACCAGA CGCGAAACTG

      51  GGTACTTACT ATTCGTTAGT CTTGCCCTAT CCGCTTATCT TTTTGGTGGT

///////////////////////////////////////////////////////////

     651  AGTTCAGCGA AAAGAGTGGT TATTGCGCCA TGAAGGTTAT CTTTTGCTGT

     701  AAACATTAAA CAATTTGTG
FromGenBank accepts a single file in GenBank's flat file distribution format as input. Each input file may contain one or more sequences. Here is part of the input file used for the example above:

LOCUS       ECOOGT        719 bp ds-DNA             BCT       15-SEP-1989
DEFINITION  E. coli ogt gene for O-6-alkylguanine-DNA-alkyltransferase.
ACCESSION   Y00495
KEYWORDS    DNA repair; O-6-alkylguanine-DNA-alkyltransferase; ogt gene.
SOURCE      Escherichia coli.
  ORGANISM  Escherichia coli
            Prokaryota; Bacteria; Gracilicutes; Scotobacteria;
            Facultatively anaerobic rods; Enterobacteriaceae.
REFERENCE   1  (bases 1 to 719)
  AUTHORS   Potter,P.M., Wilkinson,M.C., Fitton,J., Carr,F.J., Brennand,J.,
            Cooper,D.P. and Margison,G.P.
  TITLE     Characterisation and nucleotide sequence of ogt, the
            O6-alkylguanine-DNA-alkyltransferase gene of E.coli
  JOURNAL   Nucleic Acids Res. 15, 9177-9193 (1987)
  STANDARD  simple automatic
REFERENCE   2  (bases 1 to 719)
  AUTHORS   Potter,P.
  JOURNAL   Unpublished (1988) see COMMENT for author address
  STANDARD  simple automatic
COMMENT     EMBL features not translated to GenBank features:
            key     from    to      description
            PRM     101     106     -35 region
            PRM     124     132     -10 region
            SITE    137     141     region of transcription init.
            RBS     174     183     pot. ribosome binding site
            SITE    601     603     pot. alkylgroup acceptor

            [Unpublished (1988) see COMMENT for author address] Author
            address Potter P., Paterson Institute for Cancer Research,
            Christie Hospital, Wilmslow Road, Manchester M20 9BX, UK.
            Submitted (19-JAN-1988) to the EMBL data library.
FEATURES             Location/Qualifiers
     CDS             187..702
                     /note="O-6-alkylguanine-DNA-alkyltransferase"
BASE COUNT      163 a    172 c    196 g    188 t
ORIGIN
        1 ttccactgtt tcttggattc ctgcaacgct acaaaccaga cgcgaaactg ggtacttact
       61 attcgttagt cttgccctat ccgcttatct ttttggtggt atggctgctg atgttgctgg
/////////////////////////////////////////////////////////////////////
      601 tgccatcggg ttattggccg aaacggcacc atgaccggat atgcaggcgg agttcagcga
      661 aaagagtggt tattgcgcca tgaaggttat cttttgctgt aaacattaaa caatttgtg
//
When FromGenBank writes GCG sequence files, it assigns the sequence type based on the composition of the sequence characters. This method is not foolproof. To ensure that the output files are written with the correct sequence type, use -PROtein or -NUCleotide on the command line when running FromGenBank.
If FromGenBank is run interactively, you can watch the program monitor to see if the sequences are assigned the correct type. As each new file is written, its name and the number of bases (bp) or amino acids (aa) appear on the screen. If the wrong abbreviation appears (for example, bp appears for a protein sequence), the sequence file was assigned the wrong type. The sequence type also appears in the sequence file. Look on the last line of the text heading just above the sequence itself for Type: N or Type: P.
If the sequence type was incorrectly assigned, see Appendix VI for information on how to change or set the type of a sequence.
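When many output files need checking, the Type: field can also be read programmatically. The following Python sketch is an illustration only; the function name is hypothetical, and it assumes the heading line ends with the ".." divider, as in the example output file above.

```python
import re


def gcg_sequence_type(text):
    """Return 'N' or 'P' from the heading of a GCG sequence file.

    Returns None if no heading line, or no Type: field, is found.
    """
    for line in text.splitlines():
        # The last line of the text heading ends with the ".." divider.
        if line.rstrip().endswith(".."):
            match = re.search(r"Type:\s*([NP])\b", line)
            return match.group(1) if match else None
    return None
```

A file whose reported type disagrees with what you expect is a candidate for the correction procedure in Appendix VI, or for rerunning FromGenBank with -PROtein or -NUCleotide.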
The following programs convert sequences between other formats and GCG format: FromEMBL, FromGenBank, FromIG, FromPIR, FromStaden, FromFastA, ToIG, ToPIR, ToStaden and ToFastA.
DataSet creates a GCG data library from any set of sequences in GCG format. GCGToBLAST creates a database that can be searched by the BLAST program from any set of sequences in GCG format.
The Wisconsin Package does not accept sequences longer than 350,000 characters. If a GenBank flat file contains a sequence longer than 350,000 characters, FromGenBank divides it into more than one output entry. Each extra output entry has a number appended to the input entry's name. Because of this, the number of output entries may be greater than the number of input entries.
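The effect of this limit on the number of output entries can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not a description of what FromGenBank actually does: the function name is hypothetical, and both the numbering scheme (first piece keeps the original name, later pieces get 1, 2, ...) and the choice to split at exact 350,000-character boundaries without overlap are assumptions.

```python
MAX_LENGTH = 350_000  # length limit stated in the text


def split_long_entry(name, sequence, max_length=MAX_LENGTH):
    """Return a list of (entry_name, piece) pairs, each at most max_length long.

    Assumption: the first piece keeps the original name and each extra
    piece gets a running number appended; the real scheme may differ.
    """
    pieces = [sequence[i:i + max_length]
              for i in range(0, len(sequence), max_length)] or [""]
    named = []
    for index, piece in enumerate(pieces):
        entry_name = name if index == 0 else f"{name}{index}"
        named.append((entry_name, piece))
    return named
```

Under these assumptions, a 700,001-character input entry yields three output entries, which is why the total files count in the summary can exceed the number of entries in the input file.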
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
Minimal Syntax: % fromgenbank [-INfile=]genbank.seq -Default

Prompted Parameters: None

Local Data Files: None

Optional Parameters:

-PROtein                      insists that the input sequences are proteins
-NUCleotide                   insists that the input sequences are nucleic acids
-LIStfile[=fromgenbank.list]  writes a list file of output sequence names
-DIRectory=DirName            writes output to another directory
-NOMONitor                    suppresses the screen trace for each output sequence
-NOSUMmary                    suppresses the summary at the end of the program
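For scripted or batch use, the command line can be assembled from the parameters in the summary above. The Python helper below is a hypothetical convenience, not part of the Wisconsin Package; it only builds the argument list from parameter names that appear in the summary, and it does not run the program.

```python
def build_fromgenbank_command(infile, seq_type=None, listfile=None, directory=None):
    """Build an argument list for a non-interactive FromGenBank run.

    seq_type:  None, "protein", or "nucleotide"
    listfile:  None, True (use the default list file name), or a file name
    directory: None, or the directory to receive the output files
    """
    args = ["fromgenbank", f"-INfile={infile}", "-Default"]
    if seq_type == "protein":
        args.append("-PROtein")
    elif seq_type == "nucleotide":
        args.append("-NUCleotide")
    if listfile is True:
        args.append("-LIStfile")
    elif listfile:
        args.append(f"-LIStfile={listfile}")
    if directory:
        args.append(f"-DIRectory={directory}")
    return args
```

On a system where the Wisconsin Package is installed, the resulting list could be passed to a process launcher such as subprocess.run; that step is left out here because it depends on the local installation.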
FromGenBank does not use any local data files.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
-PROtein and -NUCleotide set the program to expect protein or nucleic acid sequences, respectively. Normally, FromGenBank determines whether an input sequence is protein or nucleic acid by looking at its composition. If the first 300 alphabetic characters in a sequence are composed entirely of IUB-IUPAC nucleotide codes (see Appendix III), it is reformatted as a nucleic acid sequence in GCG format; otherwise it is reformatted as a protein sequence. Using these command-line parameters, you can insist that your sequences are proteins (-PROtein) or nucleic acids (-NUCleotide).
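The composition test described above can be sketched in Python as follows. This is an approximation for illustration: the function name is hypothetical, and the set of nucleotide codes is the standard IUB-IUPAC alphabet, which is assumed (not confirmed) to match the list in Appendix III.

```python
# Standard IUB-IUPAC nucleotide codes; assumed to match Appendix III.
NUCLEOTIDE_CODES = set("ACGTURYMKSWHBVDN")


def guess_sequence_type(sequence, sample_size=300):
    """Return 'N' if the first sample_size letters are all nucleotide codes, else 'P'."""
    # Only alphabetic characters count; digits, spaces, and gaps are ignored.
    letters = [ch.upper() for ch in sequence if ch.isalpha()][:sample_size]
    if all(ch in NUCLEOTIDE_CODES for ch in letters):
        return "N"
    return "P"
```

The sketch also shows why the automatic assignment can fail: every nucleotide code is also a valid amino acid symbol, so a peptide such as GATTACA is indistinguishable from DNA by composition alone, and any residue outside the nucleotide set that first appears after position 300 is never examined.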
-LIStfile writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases in the User's Guide). If you don't specify a file name, FromGenBank names the file fromgenbank.list.
-DIRectory writes the output files into a directory other than your current working directory.
This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with -MONitor. If you are running the program in batch, the monitor will appear in the log file.
-SUMmary writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary is normally displayed at the end of an interactive run; you can suppress it with -NOSUMmary.
You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.