"Random" Sequence Data


Two types of random sequence databases were created to test false positive rates. The first database was generated by a fifth-order Markov chain based on six-mer frequencies within the first 54 Mbp of genomic sequence from the C. elegans genome project. Two thousand cosmid-sized sequences of 50 kilobase pairs (Kbp) each were generated from these frequencies, totaling 100 Mbp of random, tRNA-free sequence.

The second random database was created to roughly simulate the human genome in size and GC content. Because not enough human genomic sequence is available to parameterize a fifth-order Markov chain model, human sequence was instead simulated based on isochore proportion and %GC content. Ten thousand 300 Kbp sequences were generated (3,000 Mbp in total), each with a GC content approximating one of the five isochore types (L1 or L2 = 40% GC, H1 = 45% GC, H2 = 49% GC, H3 = 53% GC; [Green & Vold, 1993]). The isochore identities of these random sequences were chosen to approximate the proportion each isochore represents in the human genome (L1 + L2 60%, H1 20%, H2 10%, H3 5%). The remaining 5% of the human genome, attributed to ALU-type repeat elements, was not included because ALU sequences were tested separately; the absent 5% was distributed proportionally among the other isochore types.
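The two generation schemes described above can be illustrated with a minimal Python sketch. It is not the code used to build the databases. All function names are hypothetical, and several details are assumptions not stated in the text: the starting context of the Markov chain is picked uniformly from the observed five-mers, an unseen context falls back to a uniform base, the bases of the isochore-style sequences are drawn independently with P(G) = P(C) and P(A) = P(T), and the isochore of each sequence is drawn at random according to the rescaled proportions.

```python
import random
from collections import defaultdict

def kmer_counts(training, order=5):
    """Count (order+1)-mers as: context of `order` bases -> next base."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(training) - order):
        counts[training[i:i + order]][training[i + order]] += 1
    return counts

def markov_sequence(counts, length, order=5, rng=random):
    """Emit `length` bases from the Markov chain defined by `counts`."""
    # Assumption: start from a uniformly chosen observed context.
    seq = list(rng.choice(sorted(counts)))
    while len(seq) < length:
        nxt = counts.get("".join(seq[-order:]))
        if not nxt:
            # Assumption: unseen context falls back to a uniform base.
            seq.append(rng.choice("ACGT"))
            continue
        bases = sorted(nxt)
        seq.append(rng.choices(bases, weights=[nxt[b] for b in bases])[0])
    return "".join(seq[:length])

# Isochore classes from the text: (GC fraction, genome share before rescaling).
ISOCHORES = {
    "L1+L2": (0.40, 0.60),
    "H1": (0.45, 0.20),
    "H2": (0.49, 0.10),
    "H3": (0.53, 0.05),
}

def isochore_weights(isochores=ISOCHORES):
    """Rescale shares to sum to 1, spreading the omitted 5% proportionally."""
    total = sum(share for _, share in isochores.values())
    return {name: share / total for name, (_, share) in isochores.items()}

def gc_sequence(length, gc, rng=random):
    """Draw independent bases with the requested expected GC fraction."""
    # Assumption: P(G) = P(C) = gc / 2 and P(A) = P(T) = (1 - gc) / 2.
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def simulated_human_sequence(length, rng=random):
    """Pick an isochore class by rescaled share, then generate one sequence."""
    weights = isochore_weights()
    names = list(weights)
    name = rng.choices(names, weights=[weights[n] for n in names])[0]
    return name, gc_sequence(length, ISOCHORES[name][0], rng)
```

Under these assumptions, the first database would correspond to 2,000 calls to `markov_sequence(counts, 50_000)` with `counts` built from the training sequence, and the second to 10,000 calls to `simulated_human_sequence(300_000)`. Dividing each share by the 95% total is one way to realize the proportional redistribution of the omitted 5%.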

Todd M. Lowe
2000-03-31