FROMFASTA


Table of Contents

FUNCTION

DESCRIPTION

FUNCTION

FromFastA reformats one or more sequences from FastA format into single sequence files in GCG format.

DESCRIPTION

Use FromFastA when you want to convert sequences that are in FastA format into a format suitable for use with programs in the Wisconsin Package(TM). A FastA file may contain many sequences; in that case FromFastA writes one output file for each sequence in the FastA file. Each output file is named according to the first word (following the > character) on the documentation line just above the sequence data in the FastA file. The documentation line from the FastA input file(s) is preserved in the GCG output file(s).
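The splitting and naming rule described above can be sketched in Python. This is only an illustration of the rule, not FromFastA's actual code. The function names read_fasta and output_name are hypothetical, the lowercase file name and the .pep extension are taken from the example session for protein input, and the second record in the sample uses placeholder text and residues.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (documentation_line, sequence) for each record in FastA text."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new documentation line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def output_name(doc_line: str, extension: str = ".pep") -> str:
    """Build a file name from the first word of the documentation line.

    Assumption: names are lowercased and protein files use .pep,
    as seen in the example session.
    """
    words = doc_line.split()
    if not words:
        raise ValueError("documentation line has no first word")
    return words[0].lower() + extension


# The second record below is placeholder data for illustration only.
sample = """>EGMSMG Epidermal growth factor precursor - Mouse
MPWGRRPTWL
LLAFLLVFLK
>MCHU placeholder description
ACDEFGHIKL
""".splitlines()

records = list(read_fasta(sample))
names = [output_name(doc) for doc, _ in records]
```

Here names holds one file name per record, and each documentation line stays available in records so that it can be carried into the output file.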

The command % seqformat FastA sets a global switch to make Wisconsin Package programs accept sequences in FastA format without running FromFastA. (See "Using Global Switches" in Chapter 3, Using Programs of the User's Guide.) If this switch is set, only the first sequence in a FastA-format file containing multiple sequences is read by Wisconsin Package programs. Use the FromFastA program when you want to access other sequences in a FastA-format file or to convert sequences that you wish to keep in GCG format.

EXAMPLE

Here is a session using FromFastA to convert the FastA sequence file fasta.aa into separate sequence files in GCG format.


% fromfasta

 FROMFASTA of what FastA sequence file(s) ?  fasta.aa

 egmsmg.pep  1217 aa.
 hshua1.pep  129 aa.
 lcbo.pep  230 aa.
 mchu.pep  149 aa.
 musplfm.pep  224 aa.
 mwkw.pep  1966 aa.
 mwrtc1.pep  428 aa.
 gt87.pep  217 aa.
 qrhuld.pep  860 aa.

 Finished FROMFASTA with 9 files written.
 5420 sequence characters were reformatted.

%

OUTPUT

Here is part of the first output file, egmsmg.pep, from the example above:


!!AA_SEQUENCE 1.0
EGMSMG Epidermal growth factor precursor - Mouse

egmsmg.pep  Length: 1217  September 29, 1998 18:29  Type: P  Check: 9280  ..

       1  MPWGRRPTWL LLAFLLVFLK ISILSVTAWQ TGNCQPGPLE RSERSGTCAG

      51  PAPFLVFSQG KSISRIDPDG TNHQQLVVDA GISADMDIHY KKERLYWVDV

    ////////////////////////////////////////////////////////////

    1151  PHIDGMGTGQ SCWIPPSSDR GPQEIEGNSH LPSYRPVGPE KLHSLQSANG

    1201  SCHERAPDLP RQTEPVK
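The layout of the sequence lines shown above (a right-aligned start position, then five blocks of ten characters) can be reproduced with a short Python sketch. The function name gcg_sequence_lines is hypothetical and the column widths are read off the sample output, so treat them as assumptions. The sketch does not write the heading lines or compute the Check value, and it omits the blank line that separates sequence lines in the file.

```python
def gcg_sequence_lines(seq: str, per_line: int = 50, block: int = 10) -> list:
    """Format a sequence as numbered lines of space-separated blocks."""
    lines = []
    for start in range(0, len(seq), per_line):
        row = seq[start:start + per_line]
        blocks = [row[i:i + block] for i in range(0, len(row), block)]
        # Position of the first character on the line, right-aligned in
        # 8 columns and followed by two spaces (widths taken from the sample).
        lines.append(f"{start + 1:>8}  " + " ".join(blocks))
    return lines


# First 100 residues of EGMSMG, copied from the input file shown in this manual.
first_100 = (
    "MPWGRRPTWLLLAFLLVFLKISILSVTAWQTGNCQPGPLERSERSGTCAG"
    "PAPFLVFSQGKSISRIDPDGTNHQQLVVDAGISADMDIHYKKERLYWVDV"
)
formatted = gcg_sequence_lines(first_100)
```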

INPUT FILES

FromFastA accepts one or more files containing sequences in FastA format as input. You can specify multiple input files as a list file, for example @fastaseqs.list, or by using a file specification with an asterisk (*) wildcard, for example fasta*.seq. Each input file may contain one or more sequences. Here is part of the input file used for the example above:


>EGMSMG Epidermal growth factor precursor - Mouse
MPWGRRPTWLLLAFLLVFLKISILSVTAWQTGNCQPGPLERSERSGTCAGPAPFLVFSQGKSISRIDPDG
TNHQQLVVDAGISADMDIHYKKERLYWVDVERQVLLRVFLNGTGLEKVCNVERKVSGLAIDWIDDEVLWV
DQQNGVITVTDMTGKNSRVLLSSLKHPSNIAVDPIERLMFWSSEVTGSLHRAHLKGVDVKTLLETGGISV
LTLDVLDKRLFWVQDSGEGSHAYIHSCDYEGGSVRLIRHQARHSLSSMAFFGDRIFYSVLKSKAIWIANK
HTGKDTVRINLHPSFVTPGKLMVVHPRAQPRTEDAAKDPDPELLKQRGRPCRFGLCERDPKSHSSACAEG
YTLSRDRKYCEDVNECATQNHGCTLGCENTPGSYHCTCPTGFVLLPDGKQCHELVS
CPGNVSKCSHGCVLTSDGPRCICPAGSVLGRDGKTCTGCSSPDNGGCSQICLPLRPGSWECDCFPGYDLQ
SDRKSCAASGPQPLLLFANSQDIRHMHFDGTDYKVLLSRQMGMVFALDYDPVESKIYFAQTALKWIERAN
MDGSQRERLITEGVDTLEGLALDWIGRRIYWTDSGKSVVGGSDLSGKHHRIIIQERISRPRGIAVHPRAR
RLFWTDVGMSPRIESASLQGSDRVLIASSNLLEPSGITIDYLTDTLYWCDTKRSVIEMANLDGSKRRRLI
QNDVGHPFSLAVFEDHLWVSDWAIPSVIRVNKRTGQNRVRLQGSMLKPSSLVVVHPLAKPGADPCLYRNG
GCEHICQESLGTARCLCREGFVKAWDGKMCLPQDYPILSGENADLSKEVTSLSNST
QAEVPDDDGTESSTLVAEIMVSGMNYEDDCGPGGCGSHARCVSDGETAECQCLKGFARDGNLCSDIDECV
LARSDCPSTSSRCINTEGGYVCRCSEGYEGDGISCFDIDECQRGAHNCAENAACTNTEGGYNCTCAGRPS

//////////////////////////////////////////////////////////////////////

When FromFastA writes GCG sequence files, it assigns the sequence type based on the composition of the sequence characters. This method is not foolproof, so to ensure that the output files are written with the correct sequence type, use -PROtein or -NUCleotide on the command line when running FromFastA.

If FromFastA is run interactively, you can watch the program monitor to see if the sequences are assigned the correct type. As each new file is written, its name and the number of bases (bp) or amino acids (aa) appear on the screen. If the wrong abbreviation appears (for example, bp appears for a protein sequence), the sequence file was assigned the wrong type. The sequence type also appears in the sequence file. Look on the last line of the text heading just above the sequence itself for Type: N or Type: P.

If the sequence type was incorrectly assigned, see Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

The following programs convert sequences between other formats and GCG format: FromEMBL, FromGenBank, FromIG, FromPIR, FromStaden, FromFastA, ToIG, ToPIR, ToStaden and ToFastA.

DataSet creates a GCG data library from any set of sequences in GCG format. GCGToBLAST creates a database that can be searched by the BLAST program from any set of sequences in GCG format.

RESTRICTIONS

FastA format is not rigorously defined, so FastA files from different sources may not have exactly the same format. Please call us at (608) 231-5200 or send us e-mail at Help@GCG.Com if you encounter problems converting FastA sequences using FromFastA.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % fromfasta [-INfile=]fasta.aa -Default

Prompted Parameters: None

Local Data Files: None

Optional Parameters:

-PROtein                    insists that the input sequences are proteins
-NUCleotide                 insists that the input sequences are nucleic acids
-LIStfile[=fromfasta.list]  writes a list file of output sequence names
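For scripted use, the command line summarized above can be assembled programmatically. The Python sketch below only builds the argument list from the parameters documented here and does not run the program. The helper name build_fromfasta_command and its arguments are hypothetical.

```python
def build_fromfasta_command(infile, seq_type=None, listfile=None):
    """Return a fromfasta argument list built from documented parameters.

    seq_type: None, "protein", or "nucleotide".
    listfile: None (no list file), True (default name), or a file name.
    """
    cmd = ["fromfasta", f"-INfile={infile}", "-Default"]
    if seq_type == "protein":
        cmd.append("-PROtein")
    elif seq_type == "nucleotide":
        cmd.append("-NUCleotide")
    elif seq_type is not None:
        raise ValueError("seq_type must be 'protein', 'nucleotide', or None")
    if listfile is True:
        # Without a value, the program falls back to fromfasta.list.
        cmd.append("-LIStfile")
    elif listfile:
        cmd.append(f"-LIStfile={listfile}")
    return cmd


command = build_fromfasta_command("fasta.aa", seq_type="protein", listfile=True)
```

The resulting list could be passed to subprocess.run on a system where the Wisconsin Package is installed.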

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-PROtein and -NUCleotide

set the program to expect protein or nucleic acid sequences, respectively. Normally, FromFastA determines whether an input sequence is protein or nucleic acid by looking at its composition. If the first 300 alphabetic characters in a sequence are composed entirely of IUB-IUPAC nucleotide codes (see Appendix III), it is reformatted as a nucleic acid sequence in GCG format; otherwise it is reformatted as a protein sequence. Using these command-line parameters, you can insist that your sequences are proteins (-PROtein) or nucleic acids (-NUCleotide).
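The composition test described above can be sketched in Python as follows. This is a minimal illustration, not the program's source. The function name guess_sequence_type is hypothetical, the code set is the standard IUPAC nucleotide alphabet (the exact set in Appendix III may differ), and treating an empty sequence as protein is an assumption.

```python
# Standard IUPAC nucleotide codes; Appendix III may list a slightly different set.
IUPAC_NUCLEOTIDE_CODES = frozenset("ACGTURYMKSWBDHVN")


def guess_sequence_type(seq: str, window: int = 300) -> str:
    """Return "N" if the first `window` letters are all nucleotide codes, else "P"."""
    letters = [c.upper() for c in seq if c.isalpha()][:window]
    # Assumption: a sequence with no letters is reported as protein.
    if letters and all(c in IUPAC_NUCLEOTIDE_CODES for c in letters):
        return "N"
    return "P"


dna_guess = guess_sequence_type("GATTACAGATTACA")
protein_guess = guess_sequence_type("MPWGRRPTWLLLAFLLVFLK")
# Every letter of this peptide is also a nucleotide ambiguity code,
# so the composition test alone reports it as nucleic acid.
ambiguous_guess = guess_sequence_type("MKVGHSWARN")
```

The last case shows why the method can misassign a type: a peptide made up only of letters such as M, K, V, G, H, S, W, A, R and N passes the nucleotide test, which is the situation -PROtein is meant to handle.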

-LIStfile=fromfasta.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases in the User's Guide). If you don't specify a file name, FromFastA uses fromfasta for the file name and .list for the file name extension.

Printed: December 9, 1998 16:27 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.