Corrupt randomly introduces small numbers of substitutions, insertions, and deletions into nucleotide sequence(s).
Corrupt uses a random number generator to add errors to nucleotide sequences. You can set the number of substitutions and the number of length errors independently. Length errors can be either insertions or deletions; these two changes are now collectively referred to as indels in the literature of mathematical biology. The position of each error is picked at random within the range and on the strand that you chose. The length of each indel is chosen at random from one to the maximum indel size. If the indel is positive (an insertion), then the symbols added are also chosen at random.
The output files contain a complete record of the errors introduced. The actual number of substitutions may differ from the number you chose, since on average about one in four substitutions will not change the sequence. The output file also shows the total amount of length added (or subtracted) when all of the indels are taken together. The current time is used to seed the random number generator, so each run of Corrupt yields different results.
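The "one in four" figure can be turned into a small calculation. The sketch below is illustrative only and is not part of Corrupt. It assumes that each substitution independently leaves the base unchanged with probability 1/4 and that no two substitutions land on the same position; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def actual_substitution_distribution(requested: int) -> list[Fraction]:
    """Return P(k substitutions really change a base) for k = 0..requested.

    Assumption: the replacement base is drawn uniformly from A, C, G, T,
    so each substitution changes the base with probability 3/4.
    """
    p_change = Fraction(3, 4)
    return [
        comb(requested, k) * p_change**k * (1 - p_change) ** (requested - k)
        for k in range(requested + 1)
    ]


dist = actual_substitution_distribution(3)
expected = sum(k * p for k, p in enumerate(dist))
print(expected)  # 9/4
print(dist[3])   # 27/64
```

Under these assumptions, a request for 3 substitutions changes 2.25 bases on average, and all 3 take effect in 27 of 64 runs.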
If you give Corrupt a single input sequence, you can choose the range, strand, and output file name. Otherwise, Corrupt uses the top strand of the whole sequence and names the output file with the sequence's name followed by the file name extension .corrupt.
Here is a session using Corrupt to corrupt the first 200 bases of gamma.seq:
% corrupt

 Corrupt what sequence(s) ?  gamma.seq

                  Begin (* 1 *) ?
                    End (* 11375 *) ?  200
                Reverse (* No *) ?

 How many substitutions do you want (* 1 *) ?  3

 How many length errors do you want (* 1 *) ?  3

 What should I call the output file (* gamma.corrupt *) ?

%
The file gamma.corrupt would contain the corrupted contents of the first 200 symbols in gamma.seq. Here is the output from this session:
!!NA_SEQUENCE 1.0
CORRUPT of: gamma.seq check: 6474 from: 1 to: 200

Human fetal beta globins G and A gamma
from Shen, Slightom and Smithies, Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.

Substitutions: G at 188, T at 115, G at 170,
InDels: C inserted at 161, TCA removed at 116, G inserted at 16,

InDels: 3 Substitutions: 3 MaxIndel: 3
Actual substitutions: 3 Length change from indels: -1

gamma.corrupt Length: 199 August 20, 1998 13:02 Type: N Check: 2187 ..

  1 GGATCCTAGA TATTCGCTTA GTCTGAGGAG GAGCAATTAA GATTCACTTG
 51 TTTAGAGGCT GGGAGTGGTG GCTCACGCCT GTAATCCCAG AATTTTGGGA
101 GGCCAAGGCA GGCAGTCCTG AGGTCAAGAG TTCAAGACCA ACCTGGCCAA
151 CATGGTGACA ATCCCATCGC TACAAAAATA CAAAAAGTAG ACAGGCATG
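The length bookkeeping in this output can be checked by hand: two single-base insertions and one three-base deletion give a net change of -1, so the 200-base range becomes 199 symbols. The sketch below does the same arithmetic on the InDels trace line. It assumes the trace always follows the "SYMBOLS inserted at N" / "SYMBOLS removed at N" pattern seen in this one sample, and `net_length_change` is a hypothetical helper, not a GCG tool.

```python
import re

# InDels trace line copied from the sample output above.
TRACE = "C inserted at 161, TCA removed at 116, G inserted at 16,"


def net_length_change(indel_trace: str) -> int:
    """Add the length of each insertion and subtract the length of each deletion."""
    change = 0
    pattern = r"([A-Za-z]+) (inserted|removed) at \d+"
    for symbols, action in re.findall(pattern, indel_trace):
        change += len(symbols) if action == "inserted" else -len(symbols)
    return change


print(net_length_change(TRACE))        # -1
print(200 + net_length_change(TRACE))  # 199
```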
Corrupt accepts a single nucleotide sequence or multiple nucleotide sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If Corrupt rejects your nucleotide sequence, see Appendix VI for information on how to change or set the type of a sequence.
Sample extracts sequence fragments randomly from sequence(s). You can set a sampling rate to determine how many fragments Sample extracts. Shuffle randomizes the order of the symbols in a sequence without changing the composition. SeqEd is an interactive editor for entering and modifying sequences and for assembling parts of existing sequences into new genetic constructs. You can enter sequences from the keyboard or from a digitizer.
Corrupt only works on nucleotide sequences. Contact us if you would like to have it upgraded to also work with proteins. The output is renumbered to start at one.
If an indel is longer than 250 nucleotides, only the first 250 nucleotides of the indel are shown in the output file.
Corrupt makes the substitutions first and then the insertions and deletions. The substitution algorithm is this: one of the four bases is chosen at random and placed at a randomly chosen position in the sequence. This means that, on average, about one in four substitutions will not change the nucleotide.
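The order of operations described here can be sketched in a few lines of Python. This is an illustrative re-implementation under stated assumptions, not the GCG source: the text does not say how Corrupt chooses between insertion and deletion (a 50/50 choice is assumed), how deletions near the end of the range are handled (they are simply truncated here), or how trace positions are numbered after earlier indels. The function name and the `rng` parameter are hypothetical.

```python
import random

BASES = "ACGT"


def corrupt(seq, substitutions=1, indels=1, max_indel=3, rng=None):
    """Return (corrupted sequence, trace, actual substitutions, length change)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    rng = rng or random.Random()  # seeded from the system by default
    bases = list(seq)
    trace = []

    # Step 1: substitutions. A random base goes into a random position,
    # so roughly one in four substitutions leaves the base unchanged.
    actual = 0
    for _ in range(substitutions):
        pos = rng.randrange(len(bases))
        new_base = rng.choice(BASES)
        if bases[pos] != new_base:
            actual += 1
        bases[pos] = new_base
        trace.append(f"{new_base} at {pos + 1}")

    # Step 2: indels, each of random length 1..max_indel.
    length_change = 0
    for _ in range(indels):
        size = rng.randint(1, max_indel)
        if bases and rng.random() < 0.5:  # assumed 50/50 split
            pos = rng.randrange(len(bases))
            removed = bases[pos:pos + size]  # truncated at the end
            del bases[pos:pos + size]
            length_change -= len(removed)
            trace.append(f"{''.join(removed)} removed at {pos + 1}")
        else:
            pos = rng.randrange(len(bases) + 1)
            inserted = [rng.choice(BASES) for _ in range(size)]
            bases[pos:pos] = inserted
            length_change += size
            trace.append(f"{''.join(inserted)} inserted at {pos + 1}")

    return "".join(bases), trace, actual, length_change
```

Passing a seeded `random.Random` makes a run repeatable, which the real program does not offer since it seeds from the current time.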
You may find what happened hard to understand if you make a lot of indels. The best way we know of to reconstruct a corruption is to start with the original sequence and, using SeqEd, make the changes in exactly the same order as they appear in the output file trace. You can use Gap to display the original and corrupted sequences next to one another.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
Minimal Syntax: % corrupt [-INfile=]gamma.seq -Default

Prompted Parameters: (for single sequences only)

-BEGin=1 -END=11375        sets the range of interest
-REVerse                   uses the back strand
[-OUTfile=]gamma.corrupt   specifies the output file name

Other Prompted Parameters:

-SUBstitutions=1           sets the number of substitutions to introduce
-INDels=1                  sets the number of length errors to introduce

Local Data Files: None

Optional Parameters:

-MAXindel=3                sets the size of maximum insertion/deletion
-NOTRAce                   suppresses the record of errors in the output file
-EXTension=.corrupt        sets the output file name extension
-LIStfile[=corrupt.list]   writes a list file of output sequence names
-NOMONitor                 suppresses screen monitor (of input sequence names)
-NOSUMmary                 suppresses the screen summary
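For scripted runs, the parameters in this summary can be assembled into a command line. The sketch below only builds and prints the command and does not run Corrupt. The parameter names come from the summary above; the helper `build_corrupt_command`, its defaults, and the placement of -Default at the end are assumptions made for illustration.

```python
import shlex


def build_corrupt_command(infile, begin=None, end=None, substitutions=1,
                          indels=1, max_indel=3, outfile=None):
    """Assemble a non-interactive Corrupt command line as an argument list."""
    args = ["corrupt", f"-INfile={infile}"]
    if begin is not None:
        args.append(f"-BEGin={begin}")
    if end is not None:
        args.append(f"-END={end}")
    args += [
        f"-SUBstitutions={substitutions}",
        f"-INDels={indels}",
        f"-MAXindel={max_indel}",
    ]
    if outfile is not None:
        args.append(f"-OUTfile={outfile}")
    args.append("-Default")  # suppress the remaining prompts
    return args


cmd = build_corrupt_command("gamma.seq", end=200, substitutions=3, indels=3)
print(shlex.join(cmd))
# corrupt -INfile=gamma.seq -END=200 -SUBstitutions=3 -INDels=3 -MAXindel=3 -Default
```

The printed command corresponds to the example session shown earlier, with the prompts answered on the command line instead.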
Local Data Files: None.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
-SUBstitutions=1
specifies the number of character substitutions to introduce.
-INDels=1
sets the number of insertions and deletions (length errors) to introduce.
-MAXindel=3
sets the maximum size of an insertion or deletion. The maximum is three unless you change it with this parameter.
-NOTRAce
Normally Corrupt writes a complete record in the output file of each substitution, insertion, and deletion. You can suppress this information with -NOTRAce.
-EXTension=.corrupt
This program normally creates output file names by using the original input file name for the base name and the program name for the name extension. Use this parameter to specify some other file name extension.
-LIStfile[=corrupt.list]
writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases in the User's Guide). If you don't specify a file name, then Corrupt makes one up using corrupt for the file name and .list for the file name extension.
-MONitor
This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.
-SUMmary
writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.
You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997, 1998 Genetics Computer Group Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.