PEPDATA

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
OUTPUT FILE NAMES
COMMAND-LINE SUMMARY
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

PepData translates DNA sequence(s) in all six frames, concatenates the translations, and writes them to a single protein output file for each input sequence.

DESCRIPTION

It is sometimes necessary to look for protein sequence patterns in nucleic acid sequences where coding regions have not been defined or where a coding sequence is suspected of containing a frame-shift error. PepData translates the entire nucleotide sequence in all six frames, so that every possible protein translation can be examined, and concatenates the six amino acid sequences into one output sequence. Stop codons appear as asterisks (*) in the output sequence.
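
The six-frame translation described above can be sketched in Python. This is a minimal illustration and not PepData's own implementation: the function names are hypothetical, the standard genetic code is hard-coded instead of being read from translate.txt, and writing codons that contain ambiguous bases as X is an assumption. The frame order (three forward frames, then three frames of the reverse complement) follows the header of the example output file shown under OUTPUT.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    usable = len(seq) - len(seq) % 3
    # Assumption: a codon that is not in the table (ambiguous bases) becomes X.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, usable, 3))


def six_frames(seq):
    """Return the six translations: forward 1-3, then reverse 1-3."""
    seq = seq.upper().replace("U", "T")
    rev = reverse_complement(seq)
    return [translate(strand[k:]) for strand in (seq, rev) for k in range(3)]


def pepdata_like(seq):
    """Concatenate the six translations into one protein string."""
    return "".join(six_frames(seq))


print(six_frames("ATGAAATAG"))
# ['MK*', '*N', 'EI', 'LFH', 'YF', 'IS']
```

A sequence of 1,216 bases gives a concatenated translation of 2,428 symbols with this sketch, which matches the length reported for Humhb16aa in the example session.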

If you translate several different nucleotide sequences, the translations are written into separate output files with the same name as the nucleotide sequence but with the file name extension .pdt.

You can find out how to specify a group of sequences that interest you in Chapter 2, Using Sequence Files and Databases in the User's Guide.

EXAMPLE

Here is a session using PepData to translate some of the human globin sequences in GenBank:


% pepdata

 PEPDATA from what sequence(s) ?  GenBank:Humhb*

       Humhb16aa   1,216 bp   2,428 aa
        Humhb1az     483 bp     962 aa
         Humhb24   2,231 bp   4,458 aa

       ///////////////////////////////

 PEPDATA complete with

       Input files:     103
       Amino acids: 431,184
      Output files:     103
 Output file names: *.pdt

%

OUTPUT

Here is part of the first output file, humhb16aa.pdt:


!!AA_SEQUENCE 1.0
 PEPDATA from: humhb16aa  check: 8158  from: 1  to: 1,216

M31630 Human cyclic AMP response element-binding protein (HB16) mRNA, 3' end.
 6/90

 bases    1 to: 1216   translated into:    1 to: 405
 bases    2 to: 1216   translated into:  406 to: 810
 bases    3 to: 1216   translated into:  811 to: 1214
 reverse of bases    1 to: 1216   translated into: 1215 to: 1619
 reverse of bases    1 to: 1215   translated into: 1620 to: 2024
 reverse of bases    1 to: 1214   translated into: 2025 to: 2428

humhb16aa.pdt  Length: 2428  September 30, 1998 15:35 Type: P  Check: 3306  ..

       1  VPGPFPLLLH LPNGQTMPVA IPASITSSNV HVPAAVPLVR PVTMVPSVPG

      51  IPGPSSPQPV QSEAKMRLKA ALTQQHPPVT NGDTVKGHGS GLVRTQSEES

     101  RPQSLQQPAT STTETPASPA HTTPQTQSTS GRRRRAANED PDEKRRKFLE

     ///////////////////////////////////////////////////////////
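
The "translated into" ranges in the header above follow from the input length alone. The sketch below recomputes them; the function name and the tuple layout are hypothetical, and the rule that each frame yields only complete codons is inferred from the example rather than stated in this document.

```python
def frame_ranges(n_bases):
    """Return (strand, frame, first_aa, last_aa) for each of the six frames."""
    ranges = []
    start = 1
    for strand in ("forward", "reverse"):
        for offset in range(3):
            # Only complete codons are counted (inferred from the example).
            n_aa = max(0, (n_bases - offset) // 3)
            ranges.append((strand, offset + 1, start, start + n_aa - 1))
            start += n_aa
    return ranges


for strand, frame, first, last in frame_ranges(1216):
    print(f"{strand} frame {frame}: {first} to {last}")
```

For the 1,216-base Humhb16aa entry this gives 1 to 405, 406 to 810, 811 to 1214, 1215 to 1619, 1620 to 2024, and 2025 to 2428, the same ranges as in the file header.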

INPUT FILES

PepData accepts multiple (one or more) nucleotide sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If PepData rejects your nucleotide sequence, see Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

Translate and Map are other programs that translate nucleotide sequences into protein.

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

FrameAlign creates an optimal alignment of the best segment of similarity (local alignment) between a protein sequence and the codons in all possible reading frames on a single strand of a nucleotide sequence. Optimal alignments may include reading frame shifts.

RESTRICTIONS

Since GCG sequences cannot contain more than 350,000 symbols, and the six concatenated translations contain about two amino acid symbols for every input base, PepData cannot translate sequences longer than 175,000 bp.
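
The arithmetic behind this restriction can be checked with a short sketch. The function name is hypothetical and the complete-codon rule is inferred from the example output; the 350,000-symbol figure is the one quoted above, and the exact cutoff the program enforces is not stated in this document.

```python
GCG_MAX_SYMBOLS = 350_000  # limit quoted in the text above


def pepdata_output_length(n_bases):
    """Amino acid symbols produced by six-frame translation of n_bases."""
    # Each strand contributes three frames of complete codons only.
    per_strand = sum(max(0, (n_bases - k) // 3) for k in range(3))
    return 2 * per_strand


for n in (1216, 483, 2231, 175_000):
    total = pepdata_output_length(n)
    print(n, total, total <= GCG_MAX_SYMBOLS)
```

The first three lengths are those in the example session and give 2,428, 962, and 4,458 symbols, matching the "aa" column there; in this sketch a 175,000 bp input gives 349,996 symbols, just under the limit.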

OUTPUT FILE NAMES

PepData names output files with the base name of the input file (without the directory) and the file name extension .pdt and writes them in your current working directory.
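
The naming rule can be illustrated as follows. This is a sketch under assumptions: the function and its arguments are hypothetical (the two keyword arguments mirror the -EXTension and -DIRectory parameters), the input path is made up, and dropping the input file's own extension before adding .pdt is inferred from the description, not confirmed by it.

```python
from pathlib import PurePosixPath


def output_path(input_file, extension=".pdt", directory="."):
    """Derive an output file path from an input file name."""
    # Assumption: the directory and the existing extension are both dropped.
    base = PurePosixPath(input_file).stem
    return PurePosixPath(directory) / (base + extension)


print(output_path("/data/seqs/myclone.seq"))             # myclone.pdt
print(output_path("myclone.seq", ".pep", "results"))     # results/myclone.pep
```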

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % pepdata [-INfile=]GenBank:Humhb*

Prompted Parameters: None

Local Data Files:

[-TRANSlate=]translate.txt   contains the genetic code

Optional parameters:

-EXTension=.pdt    lets you specify an output file name extension
-DIRectory=DirName writes output to another directory
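
When PepData is driven from a script, the command line can be assembled from the parameters in the summary above. The helper below is a hypothetical sketch that only builds and prints the argument list; it does not run the program. It uses only the parameters listed in this document, spelled out in full, and the example extension and directory values are made up.

```python
import shlex


def build_pepdata_command(infile, extension=None, directory=None,
                          translate=None):
    """Build an argument list for pepdata from the documented parameters."""
    cmd = ["pepdata", f"-INfile={infile}"]
    if translate is not None:
        cmd.append(f"-TRANSlate={translate}")
    if extension is not None:
        cmd.append(f"-EXTension={extension}")
    if directory is not None:
        cmd.append(f"-DIRectory={directory}")
    return cmd


cmd = build_pepdata_command("GenBank:Humhb*", extension=".pep",
                            directory="translations")
# shlex.join quotes the wildcard so that a shell would not expand it.
print(shlex.join(cmd))
```

Passing the list to a process launcher without a shell avoids wildcard expansion altogether, so the asterisk reaches the program unchanged.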

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in Appendix VII.
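
To show what a non-standard genetic code changes, the sketch below translates the same bases with the standard code and with the vertebrate mitochondrial code, which differs in four codons. It does not read or write translate.txt, whose format is not reproduced in this document; the table names and the function are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD = {"".join(c): aa
            for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Vertebrate mitochondrial code: four codons differ from the standard code.
VERT_MITO = {**STANDARD, "TGA": "W", "AGA": "*", "AGG": "*", "ATA": "M"}


def translate(seq, table=STANDARD):
    """Translate complete codons of an upper-case DNA string in frame 1."""
    usable = len(seq) - len(seq) % 3
    return "".join(table[seq[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGTGAAGAATA"))             # M*RI
print(translate("ATGTGAAGAATA", VERT_MITO))  # MW*M
```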

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This parameter allows you to use a translation table in a different file. (See Appendix VII for information about translation tables.)

-EXTension=extensionname

Allows you to choose a file name extension for the output files other than the .pdt that PepData uses by default.

-DIRectory=DirName

Writes the output files into a directory other than your current working directory.

Printed: December 9, 1998 16:29 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com