COMPOSITION


Table of Contents

FUNCTION

DESCRIPTION

NAMING SETS OF SEQUENCES

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE

FUNCTION

Composition determines the composition of sequence(s). For nucleotide sequence(s), Composition also determines dinucleotide and trinucleotide content.

DESCRIPTION

Composition measures the composition of one sequence or of a group of sequences. If you specify only one sequence, you can choose a range within the sequence. Lowercase letters are converted to uppercase and counted with their uppercase equivalents. If you specify a group of sequences, Composition displays the name of each sequence as it finishes the measurement for that sequence.
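
The counting rule described above can be sketched in a few lines of Python. This is only an illustration, not the program's own code; the function name composition and its begin and end arguments are hypothetical, chosen to mirror the -BEGin and -END parameters.

```python
from collections import Counter

def composition(sequence, begin=1, end=None):
    """Count each symbol in sequence[begin..end] (1-based, inclusive)."""
    if end is None:
        end = len(sequence)
    region = sequence[begin - 1:end]
    # Lowercase letters are counted with their uppercase equivalents.
    return Counter(region.upper())

if __name__ == "__main__":
    counts = composition("acgtACGTnn", begin=1, end=8)
    for symbol in sorted(counts):
        print(f"{symbol}: {counts[symbol]}")
```

Restricting the range to positions 1 through 8 leaves the two trailing n symbols out of the count.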

EXAMPLE

Here is a session using Composition to measure the composition and di- and trinucleotide content for all of the bacterial sequences in GenEMBL:


% composition

  COMPOSITION on what sequence(s) ?  Bacterial:*

  What should I call the output file (* bacterial.composition *) ?

  A33344
  A33349
  A34992

  /////////

  ZSSSURNAS1
  ZSSSURNAS2
  ZYP16SRNA

  COMPOSITION complete.

        Sequences: 48,022
     Total Length: 120,532,386
         CPU time: 54.37
      Output file: bacterial.composition

%

OUTPUT

Here is the output file:


 COMPOSITION of: Bacterial:*  October 7, 1998 11:13

 Sequences: 48,022  Total_Length: 120,532,386  CPU_Time: 54.37

                            *****

     A: 31,313,460   B: 102          C: 28,735,001   D: 95
     G: 30,961,705   H: 314          K: 849          M: 777
     N: 91,852       R: 1,571        S: 1,568        T: 29,422,612
     V: 102          W: 964          Y: 1,414

                          Other: 0

                          Total: 120,532,386

                            *****

     GG: 8,050,342   GA: 7,862,220   GT: 6,360,816   GC: 8,666,422
     AG: 7,038,203   AA: 9,998,328   AT: 7,864,675   AC: 6,393,582
     TG: 7,923,448   TA: 6,046,193   TT: 8,693,358   TC: 6,736,239
     CG: 7,924,019   CA: 7,381,735   CT: 6,486,257   CC: 6,920,823

                          Other: 137,704

                          Total: 120,484,364

                            *****

     GGG: 1,809,031  GGA: 1,960,301  GGT: 1,901,998  GGC: 2,371,501
     GAG: 1,672,334  GAA: 2,489,765  GAT: 2,125,037  GAC: 1,570,445
     GTG: 1,747,910  GTA: 1,331,828  GTT: 1,828,370  GTC: 1,446,886
     GCG: 2,339,125  GCA: 2,092,399  GCT: 2,019,219  GCC: 2,210,163

     AGG: 1,758,829  AGA: 1,840,304  AGT: 1,406,725  AGC: 2,027,036
     AAG: 2,315,186  AAA: 3,373,559  AAT: 2,308,923  AAC: 1,993,654
     ATG: 1,990,442  ATA: 1,691,234  ATT: 2,235,792  ATC: 1,942,982
     ACG: 1,661,442  ACA: 1,628,545  ACT: 1,365,670  ACC: 1,733,627

     TGG: 2,225,297  TGA: 2,195,490  TGT: 1,492,723  TGC: 2,005,470
     TAG: 1,051,637  TAA: 1,930,346  TAT: 1,737,112  TAC: 1,323,841
     TTG: 2,036,860  TTA: 1,910,148  TTT: 2,790,885  TTC: 1,949,165
     TCG: 1,767,614  TCA: 1,899,594  TCT: 1,527,840  TCC: 1,535,419

     CGG: 2,250,632  CGA: 1,857,615  CGT: 1,553,902  CGC: 2,257,754
     CAG: 1,993,100  CAA: 2,196,518  CAT: 1,686,654  CAC: 1,502,049
     CTG: 2,143,874  CTA: 1,109,501  CTT: 1,833,157  CTC: 1,393,229
     CCG: 2,151,571  CCA: 1,757,503  CCT: 1,568,927  CCC: 1,436,717

                          Other: 173,936

                          Total: 120,436,342

                            *****
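
The three totals in this output are related. The dinucleotide total is the total length minus the number of sequences (120,532,386 - 48,022 = 120,484,364), and the trinucleotide total is the total length minus twice the number of sequences (120,532,386 - 96,044 = 120,436,342). That is what you would expect if overlapping windows are counted within each sequence and never across sequence boundaries. The Python sketch below illustrates that assumption; the function name kmer_counts is hypothetical and the code is not taken from the program.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count overlapping k-letter windows, never spanning two sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A sequence of length n contributes n - k + 1 windows.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

if __name__ == "__main__":
    seqs = ["ACGTAC", "GGA"]
    total_length = sum(len(s) for s in seqs)
    di = kmer_counts(seqs, 2)
    tri = kmer_counts(seqs, 3)
    print(total_length, sum(di.values()), sum(tri.values()))
```

The relation total = length - (k - 1) x sequences holds as long as every sequence contains at least k - 1 symbols.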

RESTRICTIONS

Unknown.

CONSIDERATIONS

You can infer the composition of the bottom strand of a nucleic acid sequence from the composition of the top strand. The -BOTHstrands parameter measures both strands, but information is lost because the combined counts are necessarily symmetric: G=C, A=T, and so on for each pair of complementary symbols.
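
As a sketch of why this is so, the Python below derives the bottom-strand counts from the top-strand counts by swapping each symbol for its complement, using the standard IUPAC nucleotide codes. The names COMPLEMENT and bottom_strand_composition are hypothetical; this is an illustration, not the program's implementation.

```python
from collections import Counter

# Standard IUPAC complements: A<->T, G<->C, R<->Y, K<->M, B<->V, D<->H;
# S, W, and N are their own complements.
COMPLEMENT = dict(zip("ATGCRYKMBVDHSWN", "TACGYRMKVBHDSWN"))

def bottom_strand_composition(top_counts):
    """Infer bottom-strand counts by relabeling each top-strand count."""
    bottom = Counter()
    for symbol, n in top_counts.items():
        # Symbols outside the table are left unchanged (an assumption).
        bottom[COMPLEMENT.get(symbol, symbol)] += n
    return bottom

if __name__ == "__main__":
    top = Counter("AAGCGT")
    bottom = bottom_strand_composition(top)
    both = top + bottom
    print(dict(top), dict(bottom), dict(both))
```

In the combined counts every symbol has the same count as its complement, so the asymmetry between the strands can no longer be recovered.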

INPUT FILES

Composition takes either a single or a multiple sequence file specification. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of Composition depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, see Appendix VI for information on how to change or set the type of a sequence.
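
The type check described above can be illustrated with a short Python sketch. It assumes that the heading has already been split into lines and looks only at the last non-blank line for Type: N or Type: P. The function name sequence_type and the example heading lines are hypothetical, and the sketch makes no claim about the rest of the file layout.

```python
import re

def sequence_type(heading_lines):
    """Return 'N', 'P', or None from the last non-blank heading line."""
    last = next((line for line in reversed(heading_lines) if line.strip()), "")
    match = re.search(r"Type:\s*([NP])\b", last)
    return match.group(1) if match else None

if __name__ == "__main__":
    # Hypothetical heading text, for illustration only.
    heading = [
        "Example annotation text.",
        "",
        "example.seq  Length: 12  Type: N",
    ]
    print(sequence_type(heading))
```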

RELATED PROGRAMS

CodonFrequency tabulates codon frequencies for any range of a sequence in a particular reading frame, as opposed to counting all trinucleotides.
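
The difference between the two kinds of count can be sketched in Python. Counting all trinucleotides moves the window one position at a time, while counting codons in a reading frame moves it three positions at a time. The function names and the frame argument are hypothetical, and the sketch is not the algorithm of either program.

```python
from collections import Counter

def trinucleotides(seq):
    """Count every overlapping three-letter window."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))

def codons(seq, frame=1):
    """Count non-overlapping triplets starting at position `frame` (1-based)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3))

if __name__ == "__main__":
    seq = "ATGAAATGA"
    print(sum(trinucleotides(seq).values()))  # 7 windows
    print(sum(codons(seq, frame=1).values()))  # 3 codons
```

For the nine-letter example, ATG and TGA each appear twice among the seven trinucleotides but only once among the three codons of frame 1.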

<CTRL>C

If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C.

BATCH QUEUE

You can run this program in the batch queue using a script that we supply. Use Fetch with a filename that starts with this program's name and ends with the filename extension .csh. Modify the file with any text editor so that it specifies the experiment you want to do and queue the script.

NAMING SETS OF SEQUENCES

See the sections on specifying sequences in Chapter 2, Using Sequence Files and Databases of the User's Guide.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
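
The capitalization convention can be illustrated with a small Python sketch. It treats the leading capitalized part of a parameter name as the minimum you must type. It also assumes that longer spellings and lowercase input are accepted, which this section does not state. The function names are hypothetical, and values such as =1 are not handled.

```python
def min_abbreviation(name):
    """Return the required leading part, e.g. '-BOTHstrands' -> '-BOTH'."""
    i = 0
    while i < len(name) and not name[i].islower():
        i += 1
    return name[:i]

def matches(typed, name):
    """True if `typed` includes the required letters and is a prefix of `name`.

    Case-insensitive prefix matching is an assumption of this sketch.
    """
    typed_u, name_u = typed.upper(), name.upper()
    required = min_abbreviation(name).upper()
    return typed_u.startswith(required) and name_u.startswith(typed_u)

if __name__ == "__main__":
    for typed in ["-BOTH", "-bothstr", "-BO"]:
        print(typed, matches(typed, "-BOTHstrands"))
```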


Minimal Syntax: % composition [-INfile=]Bacterial:* -Default

Prompted Parameters:

-BEGin=1 -END=1000                sets the range of interest (single seqs only)
[-OUTfile=]bacterial.composition  names the output file

Local Data Files: None

Optional Parameters:

-BOTHstrands  determines composition of both strands of nucleic acids
-NOCOMmas     removes the commas from the numbers in the output
-NOMONitor    suppresses the screen monitor showing each sequence
-NOSUMmary    suppresses the screen summary at the end of the program

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-BOTHstrands

measures the composition of both strands of a nucleic acid sequence.

-NOCOMmas

Composition normally displays numbers greater than 999 with commas to make them easier to read; for example, the number 1234567 would look like 1,234,567. Most programs cannot read numbers that contain commas. If you are going to use the output file from this program as input to another program, you can suppress the commas with this parameter.
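
If you already have an output file written with commas, a downstream script can strip them itself. The Python sketch below is one way to do that for the count lines shown in the OUTPUT section; the function name parse_counts is hypothetical, and the pattern is meant only for lines of "label: count" pairs.

```python
import re

def parse_counts(line):
    """Parse 'A: 31,313,460   B: 102' into {'A': 31313460, 'B': 102}."""
    pairs = re.findall(r"(\w+):\s*([\d,]+)", line)
    # Remove the thousands separators before converting to int.
    return {label: int(value.replace(",", "")) for label, value in pairs}

if __name__ == "__main__":
    line = "     A: 31,313,460   B: 102          C: 28,735,001   D: 95"
    print(parse_counts(line))
```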

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: December 9, 1998 16:26 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.