FROMPIR

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
CONSIDERATIONS
COMMAND-LINE SUMMARY
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

FromPIR reformats sequences from the protein database of the Protein Identification Resource (PIR) into individual files in GCG format.

DESCRIPTION

Use FromPIR when you want to use sequences in PIR format (as written by PSQ's COPY command) with the Wisconsin Package(TM). Because a PIR file can contain many sequences, FromPIR writes one output file for each sequence in the PIR input file. Each output file is named for the identifier word at the beginning of its sequence entry. All the documentation from the PIR input file is preserved in the GCG output files.

EXAMPLE

Here is a session using FromPIR to convert the PIR-format sequences in qqqss.psq into separate sequence files in GCG format:


% frompir

  FromPIR of what PIR sequence file ?  qqqss.psq

         kgrt.gcg   179 aa
        jdvls.gcg   881 aa
       fwsyg3.gcg   516 aa

  FROMPIR complete:

      Files written:     3
      Total length:  1,576

%

OUTPUT

Here is what the first output file (kgrt.gcg) looks like:


P1;KGRT - Gamma casein precursor - Rat
C;Species: Rattus norvegicus (Norway rat)
C;Accession: A03111
R;Hobbs, A.A., and Rosen, J.M.
Nucl. Acids Res. 10, 8079-8098, 1982 (Sequence translated from the mRNA
 sequence)
A;Residues 1-15 are the signal sequence.

 kgrt.gcg  Length: 179  September 29, 1998 18:05  Type: P  Check: 450  ..

       1  MKFFIFTCLV AAALAKHAVK DKPSSEESAS VYLGKYKQGN SVFFQTPQDS

      51  ASSSSSEESS EEISEKIEQS EEQKVNLNQQ KKSKQFSQDS SFPQICTPYQ

     101  QQSSVNQRPQ PNAIYDVPSQ ESTSTSVEEI LKKIIDIVKY FQYQQLTNPH

     151  FPQAVHPQIR VSSWAPSKDY TFPTARYMA
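
The sequence layout in the listing above (five blocks of ten residues per line, each line preceded by the position of its first residue) can be reproduced with a short Python sketch. This is only an illustration inferred from the listing, not FromPIR's own code; the function name format_gcg_sequence and the eight-column width of the position field are assumptions.

```python
def format_gcg_sequence(sequence, per_line=50, block=10):
    """Lay out residues in numbered lines of 10-residue blocks."""
    lines = []
    for start in range(0, len(sequence), per_line):
        chunk = sequence[start:start + per_line]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        # Position of the first residue on the line, right-aligned in
        # 8 columns (width assumed from the sample listing).
        lines.append(f"{start + 1:>8}  " + " ".join(blocks))
    # Sequence lines are separated by one blank line, as in the listing.
    return "\n\n".join(lines)
```

Calling format_gcg_sequence on the 179 residues of KGRT would give four numbered lines starting at positions 1, 51, 101, and 151.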

INPUT FILES

FromPIR accepts a single file in PIR format (as written by PSQ's COPY command) as input. The input file may contain one or more sequences. Here is part of the input file used for the example above:


>P1;KGRT
Gamma casein precursor - Rat
 M K F F I F T C L V A A A L A K H A V K D K P S S E E S A S
 V Y L G K Y K Q G N S V F F Q T P Q D S A S S S S S E E S S
 E E I S E K I E Q S E E Q K V N L N Q Q K K S K Q F S Q D S
 S F P Q I C T P Y Q Q Q S S V N Q R P Q P N A I Y D V P S Q
 E S T S T S V E E I L K K I I D I V K Y F Q Y Q Q L T N P H
 F P Q A V H P Q I R V S S W A P S K D Y T F P T A R Y M A *
C;Species: Rattus norvegicus (Norway rat)
C;Accession: A03111
R;Hobbs, A.A., and Rosen, J.M.
Nucl. Acids Res. 10, 8079-8098, 1982 (Sequence translated from . . .
A;Residues 1-15 are the signal sequence.
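
The entry layout shown above can be read with a few lines of Python. The sketch below is not FromPIR's implementation; it assumes that every entry starts with a ">" line holding a type code and an identifier, that the next line is the title, that the sequence runs until a line ending in "*", and that all remaining lines are documentation. The name parse_pir and the dictionary keys are hypothetical.

```python
def parse_pir(text):
    """Split PIR-format text into a list of entry dictionaries."""
    entries = []
    current = None
    in_sequence = False
    for line in text.splitlines():
        if line.startswith(">"):
            # Header such as ">P1;KGRT": type code, semicolon, identifier.
            code, _, identifier = line[1:].partition(";")
            current = {"code": code, "id": identifier.strip(),
                       "title": None, "sequence": "", "docs": []}
            entries.append(current)
            in_sequence = False
        elif current is None:
            continue  # ignore anything before the first entry
        elif current["title"] is None:
            current["title"] = line.strip()
            in_sequence = True
        elif in_sequence:
            # Residues are separated by spaces; "*" ends the sequence.
            residues = "".join(line.split())
            if residues.endswith("*"):
                residues = residues[:-1]
                in_sequence = False
            current["sequence"] += residues
        elif line.strip():
            current["docs"].append(line)
    return entries
```

Each dictionary then carries what a converter would need for one output file: the identifier for the file name, the sequence, and the documentation lines.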

When FromPIR writes GCG sequence files, it assigns the sequence type based on the composition of the sequence characters. This method is not foolproof, so to ensure that the output files are written with the correct sequence type, use -PROtein or -NUCleotide on the command line when running FromPIR.

If FromPIR is run interactively, you can watch the program monitor to see if the sequences are assigned the correct type. As each new file is written, its name and the number of bases (bp) or amino acids (aa) appear on the screen. If the wrong abbreviation appears (for example, bp appears for a protein sequence), the sequence file was assigned the wrong type. The sequence type also appears in the sequence file. Look on the last line of the text heading just above the sequence itself for Type: N or Type: P.
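
When many files were written, the check described above can be scripted instead of done by eye. The following Python sketch looks for the heading line that ends in ".." and reports the length and type it finds. It assumes the heading line has the same fields as the kgrt.gcg example (Length, Type, Check); the name read_gcg_type is hypothetical.

```python
import re

# Matches the line that ends the text heading, for example:
#  kgrt.gcg  Length: 179  September 29, 1998 18:05  Type: P  Check: 450  ..
HEADER_LINE = re.compile(
    r"Length:\s*(\d+).*?Type:\s*([NP]).*?Check:\s*(\d+)\s*\.\.\s*$"
)

def read_gcg_type(text):
    """Return (length, type_letter) from the first heading line, or None."""
    for line in text.splitlines():
        match = HEADER_LINE.search(line)
        if match:
            return int(match.group(1)), match.group(2)
    return None
```

A result of (179, "P") for kgrt.gcg would confirm that the file was written as a protein sequence; None means no heading line with a Type field was found.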

If the sequence type was incorrectly assigned, see Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

The following programs convert sequences between other formats and GCG format: FromEMBL, FromGenBank, FromIG, FromPIR, FromStaden, FromFastA, ToIG, ToPIR, ToStaden and ToFastA.

DataSet creates a GCG data library from any set of sequences in GCG format. GCGToBLAST creates a database that can be searched by the BLAST program from any set of sequences in GCG format.

RESTRICTIONS

None of the sequences in the PIR file should be longer than 350,000 characters.

CONSIDERATIONS

If you need to convert only a subset of the sequences in a PIR flat file, ask us to modify FromPIR to read just that subset.
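
One possible workaround, which is not part of the Wisconsin Package, is to trim the PIR file to the entries you want before running FromPIR. The Python sketch below keeps only the entries whose identifiers you name. It assumes that every entry begins with a ">" line of the form >P1;IDENTIFIER and that no other line starts with ">"; the name select_pir_entries is hypothetical.

```python
def select_pir_entries(text, wanted_ids):
    """Return PIR text holding only entries whose identifier is wanted."""
    wanted = {identifier.upper() for identifier in wanted_ids}
    kept = []
    keep = False
    for line in text.splitlines(keepends=True):
        if line.startswith(">"):
            # The identifier follows the first semicolon, e.g. KGRT.
            identifier = line[1:].partition(";")[2].strip()
            keep = identifier.upper() in wanted
        if keep:
            kept.append(line)
    return "".join(kept)
```

The returned text can be written to a new file and given to FromPIR in place of the original.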

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % frompir [-INfile=]qqqss.psq -Default

Prompted Parameters: None

Local Data Files: None

Optional Parameters:

-PROtein                  insists that the input sequences are proteins
-NUCleotide               insists that the input sequences are nucleic acids
-LIStfile[=frompir.list]  writes a list file of output sequence names
-DIRectory=DirName        writes output to another directory
-EXTension=.pir           chooses a file name extension (other than .GCG)
-NOMONitor                suppresses the screen monitor
-NOSUMmary                suppresses the screen summary
-CARd                     accepts PIR files in their "card-image" format

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-PROtein and -NUCleotide

set the program to expect protein or nucleic acid sequences, respectively. Normally, FromPIR determines whether an input sequence is protein or nucleic acid by looking at its composition. If the first 300 alphabetic characters in a sequence are composed entirely of IUB-IUPAC nucleotide codes (see Appendix III), it is reformatted as a nucleic acid sequence in GCG format; otherwise it is reformatted as a protein sequence. Using these command-line parameters, you can insist that your sequences are proteins (-PROtein) or nucleic acids (-NUCleotide).
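
The composition test described above can be sketched in Python as follows. This is an illustration of the stated rule, not FromPIR's source code. The exact code set is an assumption (Appendix III is the authority, and may also count X), and so is the treatment of an empty sequence as protein; the name guess_sequence_type is hypothetical.

```python
# IUB-IUPAC nucleotide codes (assumed set; see Appendix III).
IUPAC_NUCLEOTIDE_CODES = set("ACGTURYKMSWBDHVN")

def guess_sequence_type(sequence, window=300):
    """Return 'N' if the first `window` letters are all nucleotide codes."""
    # Only alphabetic characters count; spaces, digits and "*" are skipped.
    letters = [ch.upper() for ch in sequence if ch.isalpha()][:window]
    if letters and all(ch in IUPAC_NUCLEOTIDE_CODES for ch in letters):
        return "N"
    return "P"
```

The sketch also shows why the method can fail: a short peptide such as ACDGTKMRSVWY is made up entirely of letters that are also nucleotide codes, so it would be typed as a nucleic acid unless -PROtein is used.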

-LIStfile=frompir.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases in the User's Guide). If you don't specify a file name, FromPIR makes one up using frompir for the file name and .list for the file name extension.

-DIRectory=DirName

writes the output files into a directory other than your current working directory.

-EXTension=.pir

This program normally creates output file names by using the input sequence name for the file name and .GCG for the file name extension. Use this parameter to specify some other file name extension.
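
The way an output file name is put together from the sequence name, the extension, and an optional directory can be pictured with this Python sketch. It is an assumption based on the example session, where the entry KGRT was written as kgrt.gcg; the lowercasing, the function name output_file_name, and its arguments are not taken from FromPIR itself.

```python
import os

def output_file_name(identifier, directory="", extension=".gcg"):
    """Build an output path from a PIR identifier, e.g. KGRT -> kgrt.gcg."""
    # Lowercasing mirrors the example session (assumed, not documented).
    # The extension is expected to include its leading dot, as in .pir.
    return os.path.join(directory, identifier.lower() + extension)
```

With extension=".pir" the entry FWSYG3 would be named fwsyg3.pir, and a directory argument places the file there instead of in the current working directory.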

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

-CARd

sets FromPIR to expect files in the old card image format in which PIR used to distribute protein data.

Printed: December 9, 1998 16:27 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com