FromFastA reformats one or more sequences from FastA format into single sequence files in GCG format.
Use FromFastA when you want to convert sequences that are in FastA format into a format suitable for use with programs in the Wisconsin Package(TM). FastA format may maintain many sequences in one file; in such a case FromFastA writes many output files, one for each sequence in the FastA file. Each output file is named according to the first word (following the > character) on the documentation line just above the sequence data in the FastA file. The documentation line from the FastA input file(s) is preserved in the GCG output file(s).
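The naming rule described above can be illustrated with a minimal Python sketch. It is not FromFastA's own code: the function names read_fasta and output_name are hypothetical, and the mapping to a lowercase name with a .pep extension is inferred from the example session (EGMSMG becomes egmsmg.pep). The .seq extension used here for nucleic acid output is an assumption.

```python
from typing import Iterable, Iterator, List, Tuple


def _record(doc: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The first word of the documentation line names the output file.
    words = doc.split()
    first_word = words[0] if words else "unnamed"
    return first_word, doc, "".join(chunks)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (first_word, documentation_line, sequence) for each record."""
    doc = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new documentation line closes the previous record, if any.
            if doc is not None:
                yield _record(doc, chunks)
            doc = line[1:].strip()
            chunks = []
        elif doc is not None and line:
            # Sequence data: drop embedded whitespace and join the lines.
            chunks.append("".join(line.split()))
    if doc is not None:
        yield _record(doc, chunks)


def output_name(first_word: str, seq_type: str) -> str:
    # Assumption: lowercase name; .pep for protein, .seq for nucleic acid.
    extension = ".pep" if seq_type == "P" else ".seq"
    return first_word.lower() + extension


# Hypothetical two-record input; the second record is made up.
sample = [
    ">EGMSMG Epidermal growth factor precursor - Mouse",
    "MPWGRRPTWL",
    "LLAFLLVFLK",
    ">LCBO hypothetical second record",
    "ACDEFGHIKL",
]
for first_word, doc, seq in read_fasta(sample):
    print(output_name(first_word, "P"), len(seq), "aa.")
```

Each record keeps its full documentation line, which mirrors how the documentation line is carried over into the GCG output file.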
The command % seqformat FastA sets a global switch to make Wisconsin Package programs accept sequences in FastA format without running FromFastA. (See "Using Global Switches" in Chapter 3, Using Programs of the User's Guide.) If this switch is set, only the first sequence in a FastA-format file containing multiple sequences is read by Wisconsin Package programs. Use the FromFastA program when you want to access other sequences in a FastA-format file or to convert sequences that you wish to keep in GCG format.
Here is a session using FromFastA to convert the FastA sequence file fasta.aa into separate sequence files in GCG format.
% fromfasta

 FROMFASTA of what FastA sequence file(s) ?  fasta.aa

  egmsmg.pep     1217 aa.
  hshua1.pep      129 aa.
  lcbo.pep        230 aa.
  mchu.pep        149 aa.
  musplfm.pep     224 aa.
  mwkw.pep       1966 aa.
  mwrtc1.pep      428 aa.
  gt87.pep        217 aa.
  qrhuld.pep      860 aa.

 Finished FROMFASTA with 9 files written.
 5420 sequence characters were reformatted.

%
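The closing totals in the session can be cross-checked against the per-file lines. The following Python sketch assumes that each monitor line has the form "name count aa." or "name count bp.", as in the session shown here; the names MONITOR_LINE and tally are hypothetical.

```python
import re

# One monitor entry: file name, length, and unit (aa or bp) followed by a period.
MONITOR_LINE = re.compile(r"(\S+)\s+(\d+)\s+(aa|bp)\.")


def tally(session_text):
    """Return the parsed (name, length, unit) entries and the summed length."""
    entries = [
        (name, int(count), unit)
        for name, count, unit in MONITOR_LINE.findall(session_text)
    ]
    total = sum(count for _, count, _ in entries)
    return entries, total


session = """
 FROMFASTA of what FastA sequence file(s) ?  fasta.aa

  egmsmg.pep 1217 aa.
  hshua1.pep 129 aa.
  lcbo.pep 230 aa.
  mchu.pep 149 aa.
  musplfm.pep 224 aa.
  mwkw.pep 1966 aa.
  mwrtc1.pep 428 aa.
  gt87.pep 217 aa.
  qrhuld.pep 860 aa.

 Finished FROMFASTA with 9 files written.
 5420 sequence characters were reformatted.
"""

entries, total = tally(session)
print(len(entries), "files,", total, "sequence characters")
```

For this session the tally gives 9 files and 5420 characters, matching the two summary lines that FromFastA prints.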
Here is part of the first output file, egmsmg.pep, from the example above:
!!AA_SEQUENCE 1.0
EGMSMG Epidermal growth factor precursor - Mouse
egmsmg.pep  Length: 1217  September 29, 1998 18:29  Type: P  Check: 9280  ..

       1  MPWGRRPTWL LLAFLLVFLK ISILSVTAWQ TGNCQPGPLE RSERSGTCAG
      51  PAPFLVFSQG KSISRIDPDG TNHQQLVVDA GISADMDIHY KKERLYWVDV
////////////////////////////////////////////////////////////
    1151  PHIDGMGTGQ SCWIPPSSDR GPQEIEGNSH LPSYRPVGPE KLHSLQSANG
    1201  SCHERAPDLP RQTEPVK
FromFastA accepts one or more files containing sequences in FastA format as input. You can specify multiple input files with a list file, for example @fastaseqs.list, or with a file specification that contains an asterisk (*) wildcard, for example fasta*.seq. Each input file may contain one or more sequences. Here is part of the input file used for the example above:
>EGMSMG Epidermal growth factor precursor - Mouse
MPWGRRPTWLLLAFLLVFLKISILSVTAWQTGNCQPGPLERSERSGTCAGPAPFLVFSQGKSISRIDPDG
TNHQQLVVDAGISADMDIHYKKERLYWVDVERQVLLRVFLNGTGLEKVCNVERKVSGLAIDWIDDEVLWV
DQQNGVITVTDMTGKNSRVLLSSLKHPSNIAVDPIERLMFWSSEVTGSLHRAHLKGVDVKTLLETGGISV
LTLDVLDKRLFWVQDSGEGSHAYIHSCDYEGGSVRLIRHQARHSLSSMAFFGDRIFYSVLKSKAIWIANK
HTGKDTVRINLHPSFVTPGKLMVVHPRAQPRTEDAAKDPDPELLKQRGRPCRFGLCERDPKSHSSACAEG
YTLSRDRKYCEDVNECATQNHGCTLGCENTPGSYHCTCPTGFVLLPDGKQCHELVS
CPGNVSKCSHGCVLTSDGPRCICPAGSVLGRDGKTCTGCSSPDNGGCSQICLPLRPGSWECDCFPGYDLQ
SDRKSCAASGPQPLLLFANSQDIRHMHFDGTDYKVLLSRQMGMVFALDYDPVESKIYFAQTALKWIERAN
MDGSQRERLITEGVDTLEGLALDWIGRRIYWTDSGKSVVGGSDLSGKHHRIIIQERISRPRGIAVHPRAR
RLFWTDVGMSPRIESASLQGSDRVLIASSNLLEPSGITIDYLTDTLYWCDTKRSVIEMANLDGSKRRRLI
QNDVGHPFSLAVFEDHLWVSDWAIPSVIRVNKRTGQNRVRLQGSMLKPSSLVVVHPLAKPGADPCLYRNG
GCEHICQESLGTARCLCREGFVKAWDGKMCLPQDYPILSGENADLSKEVTSLSNST
QAEVPDDDGTESSTLVAEIMVSGMNYEDDCGPGGCGSHARCVSDGETAECQCLKGFARDGNLCSDIDECV
LARSDCPSTSSRCINTEGGYVCRCSEGYEGDGISCFDIDECQRGAHNCAENAACTNTEGGYNCTCAGRPS
//////////////////////////////////////////////////////////////////////
When FromFastA writes GCG sequence files, it assigns the sequence type based on the composition of the sequence characters. This method is not foolproof, so to ensure that the output files are written with the correct sequence type, use -PROtein or -NUCleotide on the command line when running FromFastA.
If you run FromFastA interactively, you can watch the program monitor to see whether the sequences are assigned the correct type. As each new file is written, its name and the number of bases (bp) or amino acids (aa) appear on the screen. If the wrong abbreviation appears (for example, bp for a protein sequence), the sequence file was assigned the wrong type. The sequence type also appears in the sequence file itself: look for Type: N or Type: P on the last line of the text heading, just above the sequence.
If the sequence type was incorrectly assigned, see Appendix VI for information on how to change or set the type of a sequence.
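Checking the Type: field of many output files by eye can be tedious, so the lookup can be scripted. The following Python sketch assumes that the last line of the text heading is the first line ending in two periods (..), as in the example output file, and that the type is written as Type: N or Type: P. The function name sequence_type is hypothetical.

```python
import re
from typing import Iterable, Optional

# The type field as it appears on the last line of the text heading.
TYPE_FIELD = re.compile(r"\bType:\s*([NP])\b")


def sequence_type(gcg_lines: Iterable[str]) -> Optional[str]:
    """Return 'N' or 'P' from a GCG file's heading, or None if not found."""
    for line in gcg_lines:
        # Assumption: the heading ends at the first line that ends with "..".
        if line.rstrip().endswith(".."):
            match = TYPE_FIELD.search(line)
            return match.group(1) if match else None
    return None


heading = [
    "!!AA_SEQUENCE 1.0",
    "EGMSMG Epidermal growth factor precursor - Mouse",
    "egmsmg.pep  Length: 1217  September 29, 1998 18:29  Type: P  Check: 9280  ..",
    "",
    "       1  MPWGRRPTWL LLAFLLVFLK ISILSVTAWQ TGNCQPGPLE RSERSGTCAG",
]
print(sequence_type(heading))
```

A result of None means that no heading line with a Type: field was found, which is worth treating as a warning rather than as a valid type.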
The following programs convert sequences between other formats and GCG format: FromEMBL, FromGenBank, FromIG, FromPIR, FromStaden, FromFastA, ToIG, ToPIR, ToStaden and ToFastA.
DataSet creates a GCG data library from any set of sequences in GCG format. GCGToBLAST creates a database that can be searched by the BLAST program from any set of sequences in GCG format.
FastA format is not rigorously defined, so FastA files from different sources may not have exactly the same format. Please call us at (608) 231-5200 or send us e-mail at Help@GCG.Com if you encounter problems converting FastA sequences using FromFastA.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
Minimal Syntax: % fromfasta [-INfile=]fasta.aa -Default

Prompted Parameters: None

Local Data Files: None

Optional Parameters:

-PROtein                     insists that the input sequences are proteins
-NUCleotide                  insists that the input sequences are nucleic acids
-LIStfile[=fromfasta.list]   writes a list file of output sequence names
FromFastA uses no local data files.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
-PROtein and -NUCleotide set the program to expect protein or nucleic acid sequences, respectively. Normally, FromFastA determines whether an input sequence is protein or nucleic acid by looking at its composition. If the first 300 alphabetic characters in a sequence are composed entirely of IUB-IUPAC nucleotide codes (see Appendix III), it is reformatted as a nucleic acid sequence in GCG format; otherwise it is reformatted as a protein sequence. Using these command-line parameters, you can insist that your sequences are proteins (-PROtein) or nucleic acids (-NUCleotide).
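The composition test described above can be sketched in Python to show why it sometimes guesses wrong. This is an illustration of the stated rule, not FromFastA's own code. The set of nucleotide codes below is the standard IUPAC list and is an assumption; Appendix III is the authority for the codes the program actually accepts. Treating an empty sequence as protein is also an assumption, and the name guess_type is hypothetical.

```python
# Assumption: standard IUPAC nucleotide codes, including ambiguity codes.
IUPAC_NUCLEOTIDE_CODES = set("ACGTUMRWSYKVHDBN")


def guess_type(sequence: str, window: int = 300) -> str:
    """Return 'N' if the first `window` letters are all nucleotide codes, else 'P'."""
    # Only alphabetic characters count toward the window.
    letters = [c.upper() for c in sequence if c.isalpha()][:window]
    if letters and all(c in IUPAC_NUCLEOTIDE_CODES for c in letters):
        return "N"
    # Assumption: a sequence with no letters falls through to protein.
    return "P"


print(guess_type("MPWGRRPTWLLLAFLLVFLK"))  # contains P, L, F: not nucleotide codes
print(guess_type("ACGTNNRYACGT"))          # nucleotide codes only
print(guess_type("GATTACA"))               # a peptide spelled with A, C, G, T
```

The last call shows the weakness: many amino acid symbols (A, C, D, G, H, K, M, N, R, S, T, V, W, Y) are also nucleotide codes, so a protein whose first 300 letters use only those symbols is classified as a nucleic acid under this rule. That is the case -PROtein is meant to handle.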
-LIStfile[=fromfasta.list] writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases, in the User's Guide). If you don't specify a file name, FromFastA names the list file fromfasta.list.
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.