Gen_sequence is a program for generating random sequences of amino acids with lengths and compositions typical of those found in real protein databases.
The program comes with a small library of open-source routines for generating random variates according to normal, beta, Dirichlet, and mixture of Dirichlet distributions.
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; version 2.1 of the License. This library is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Lesser General Public License for more details.
The random-variate algorithms in this library were selected more for robustness and simplicity of implementation than for raw speed. Despite that, generation seems to be quite efficient, taking about 1 microsecond per beta variate and 0.6 microseconds per normal variate on a DEC Alpha XP1000.
The random number generator can be replaced by editing the DRAND macro definitions in the .c files. Since all the generators rely on successive pairs of uniformly distributed random numbers, a high-quality generator should be used. The additive random number generator "random" in the standard UNIX libraries is such a generator, so it was chosen for this application.
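As an illustration of the kind of definition involved, here is a minimal sketch of a uniform generator built on the POSIX random() call. The exact form of the DRAND macro in the distributed .c files is not shown on this page, so the function-like form and the scaling constant below are assumptions, as is the helper name draw_uniform_pair.

```c
#define _XOPEN_SOURCE 600   /* expose random() and srandom() under strict C modes */
#include <assert.h>
#include <stdlib.h>

/* Assumed form of the macro: a uniform double in [0, 1).
 * POSIX random() returns a long in the range 0 .. 2^31 - 1. */
#define DRAND() ((double) random() / 2147483648.0)

/* Hypothetical helper: draw one pair of uniforms, the way a
 * pair-based generator would consume them. */
void draw_uniform_pair(double *u1, double *u2)
{
    *u1 = DRAND();
    *u2 = DRAND();
    assert(*u1 >= 0.0 && *u1 < 1.0);
    assert(*u2 >= 0.0 && *u2 < 1.0);
}
```

Swapping in another generator would then mean changing only the macro body, provided the replacement also returns values in [0, 1).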
Test programs are provided for each of the generators. The tests are far from exhaustive, checking only the first two moments for a few parameter values (covering each of the different algorithms for gen_beta). The test programs do not decide whether the generators are working or not; they simply report the first and second moments of the sample alongside the analytically expected values. It is up to the user to decide whether the match is adequate. Although the test programs were written for debugging the random-variate generators, their main function now is to measure the speed of the generators.
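The comparison such a test makes can be sketched as follows. This is not the code of the distributed test programs; the function names beta_moments and sample_moments are hypothetical, and the sketch reports mean and variance, which carry the same information as the first two moments. The analytic values for Beta(a, b) are the standard ones: mean a/(a+b) and variance ab/((a+b)^2 (a+b+1)).

```c
#include <stddef.h>

/* Analytic mean and variance of a Beta(a, b) distribution. */
void beta_moments(double a, double b, double *mean, double *var)
{
    double s = a + b;
    *mean = a / s;
    *var = (a * b) / (s * s * (s + 1.0));
}

/* Sample mean and population variance of x[0..n-1] (n must be > 0). */
void sample_moments(const double *x, size_t n, double *mean, double *var)
{
    double sum = 0.0, sumsq = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
        sumsq += x[i] * x[i];
    }
    *mean = sum / (double) n;
    *var = sumsq / (double) n - (*mean) * (*mean);
}
```

A test in this style would fill an array with generated variates, call both functions, and print the two pairs of numbers side by side, leaving the judgment to the user.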
The length of each sequence is taken from a discretized log-normal distribution that was fit to the sequences in RSDB-60 (see Park J, Holm L, Heger A, Chothia C. RSDB: representative protein sequence databases have high information content. Bioinformatics 2000 May;16(5):458-64).
The amino acids of a sequence are generated by an independent, identically distributed process. The probabilities for that distribution are selected from a mixture of Dirichlet densities. The mixture of Dirichlet densities was also trained on RSDB-60. The particular mixture chosen here is not the best fit, but a compromise between the number of components and the fit.
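The i.i.d. step can be sketched as inverse-CDF sampling from a probability vector. The sketch assumes the vector p has already been drawn from the chosen Dirichlet component (that draw is left to the library's generators and is not shown), and it takes the uniform variates as arguments so the behavior is reproducible. The function names and the alphabet ordering are hypothetical, not taken from gen_sequence.

```c
#include <stddef.h>

/* Return the index whose cumulative bin contains u.
 * p[0..n-1] must be non-negative and sum to 1; u must lie in [0, 1). */
size_t pick_index(const double *p, size_t n, double u)
{
    double cum = 0.0;
    for (size_t i = 0; i + 1 < n; i++) {
        cum += p[i];
        if (u < cum)
            return i;
    }
    return n - 1;  /* also covers rounding that leaves the sum just short of 1 */
}

/* Fill seq with len residues drawn independently from p[0..19], using the
 * uniforms u[0..len-1]. seq must have room for len + 1 characters. */
void fill_sequence(char *seq, size_t len, const double *p, const double *u)
{
    static const char alphabet[] = "ACDEFGHIKLMNPQRSTVWY";  /* assumed ordering */
    for (size_t i = 0; i < len; i++)
        seq[i] = alphabet[pick_index(p, 20, u[i])];
    seq[len] = '\0';
}
```

The same pick_index routine could be used one level up, with the mixture weights in place of p, to choose which Dirichlet component supplies the probabilities for a given sequence.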
UCSC Bioinformatics research
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building