FPG stands for “Full Parental Genotyping”. That is the name we have given to efforts underway to develop a panel of SNPs to be used for very large scale parentage inference in hatchery populations of salmon. Currently this software suite consists of two C programs and a number of scripts. The theory underlying these software programs is described in:
The above paper will hereafter be referred to as “A&G2006”.
The two C programs are:
snpSumPed — a program that computes the probability of configurations of SNP genotypes on trios of individuals which are related in ways specified by the user using a combination of pedigree links and pairwise relationships.TrioTests — a program that employs an importance sampling algorithm that can be used to compute false positive and false negative rates for doing parentage inference. This program requires input from snpSumPed in order to work.
The scripts simplify the use of the above two programs. They are described below amongst descriptions of snpSumPed and TrioTests. This document provides a simple tutorial with some examples on how to use the programs.
Note that the two programs above are concerned with the calculation of power for parentage inference. As yet, there is not a released program in the suite for conducting the actual inference, i.e., for computing the likelihood of actual, observed trios of individuals.
snpSumPed computes the probability of genotypic configurations on trios of individuals. It has not been written in a way that is terribly efficient, so, if the individuals in the trio are related by a very large pedigree—even if it does not have loops—it will take a very long time. It is, however, suitable for small pedigrees of the sort one will typically be interested in for parentage inference.
The probability of different genotypic configurations at a single diallelic locus between three individuals depends on the frequency of the two alleles, the genotyping error rate, and the relationship between the individuals. All this information is provided to snpSumPed on the command line using command line options that are either a single dash followed by a single letter, or a double dash followed by a multi-letter word. Sometimes there are two ways of invoking an option. A short list of the options available in snpSumPed can be obtained by calling snpSumPed --help which yields the following output:
SNP_SumPed -- probability of pedigree subsets using binary loci
--help short listing of options
--help-full long listing of options
--help-nroff long listing of options in nroff man-page format
--version prints program version information
--version-history prints history of different versions
--command-file [S] inserts contents of file [S] into the command line
**** Command Line Switches for inputting variables ****
-p , --minor-freq [R] minor allele freq of the locus
-e , --gtyp-error [R1] [R2] ... per-gene-copy genotyping error rate for all loci
-f , --fixed-geno [J-kid] [J-pa] [J-ma] Indices of youth/father/mother
--ped [J-kid] [J-pa] [J-ma] pedigree entry. Individual names are positive integers.
--kappa [J1] [J2] [R0] [R1] [R2] pairwise relatedness between J1 and J2
--pair-k [J1] [J2] [R2] [R1] [R0] OLD STYLE pairwise relatedness between J1 and J2
--verbose spout out ridiculous amounts of output
The first 6 options shown are common to all the programs using Eric Anderson's options processing package. The remaining ones are used to give snpSumPed information to compute probabilities of genotypic configurations. For a more detailed description of all these options, you can enter snpSumPed --help-full on the command line. This full help should provide you with everything you need to know to run the program. To help with some examples, though, I give a few simple examples below, and discuss the output.
Let's imagine that we want to compute the probability of genotype configurations for an allele with a minor allele frequency of .23, in a trio that is totally unrelated, with a genotyping error rate of .005 per gene-copy. The first part of our command line would look like:
/eriq/--% snpSumPed -p .23 -e .005
Note that the yellow-green part that says /eriq/--% is my command line prompt. You would not type that part! The same command line could have been written like:
/eriq/--% snpSumPed --minor-freq .23 --gtyp-error .005
If you try to execute either of the above command lines, you will get an error message that says:
Error! Lacking required option -f or --fixed-geno
Error! Lacking required option --ped
Errors encountered processing command line. Use
snpSumPed --help
or
snpSumPed --help-full
for information on available options
This is because the program needs to know who the individuals in the trio are, and how they are related. Individuals are specified with positive integer identifiers (1,2,3,4,...). The relationships between them are specified using the --ped option and also the --kappa option (which we will get to later). The integer identifier 0 is reserved for an unknown parent, i.e. if an individual has parents (specified with --ped) who are 0's, then that individual is considered a founder in the pedigree. It is also necessary to tell the program who is the putative youth, the putative mother, and the putative father in the trio. This is specified using the -f or --fixed-geno option.
OK, let's specify a simple trio in which 1 is the putative kid, 2 is the putative father, and 3 is the putative mother, but the three of them are totally unrelated. The arguments to the --fixed-geno option will then be 1 2 3 and the pedigree statements giving the relationship between the individuals is very easy—they are all founders! So the command looks like:
/eriq/--% snpSumPed --minor-freq .23 --gtyp-error .005 --fixed-geno 1 2 3 --ped 1 0 0 --ped 2 0 0 --ped 3 0 0
If you run that, the output should look like:
span class="re7"> ALLELE_FREQ : 0.230000 GTYPING_ERROR_RATES : 0.005000 SUM_OVER_ALL_STATES_FIXED_ANO_OTHERWISE : 1.000000 TRIO_TRUE_STATES : 0 0 0 0.000148 TRIO_TRUE_STATES : 0 0 1 0.000991 TRIO_TRUE_STATES : 0 0 2 0.001659 TRIO_TRUE_STATES : 0 1 0 0.000991 TRIO_TRUE_STATES : 0 1 1 0.006637 TRIO_TRUE_STATES : 0 1 2 0.011109 TRIO_TRUE_STATES : 0 2 0 0.001659 TRIO_TRUE_STATES : 0 2 1 0.011109 TRIO_TRUE_STATES : 0 2 2 0.018596 TRIO_TRUE_STATES : 1 0 0 0.000991 TRIO_TRUE_STATES : 1 0 1 0.006637 TRIO_TRUE_STATES : 1 0 2 0.011109 TRIO_TRUE_STATES : 1 1 0 0.006637 TRIO_TRUE_STATES : 1 1 1 0.044437 TRIO_TRUE_STATES : 1 1 2 0.074384 TRIO_TRUE_STATES : 1 2 0 0.011109 TRIO_TRUE_STATES : 1 2 1 0.074384 TRIO_TRUE_STATES : 1 2 2 0.124512 TRIO_TRUE_STATES : 2 0 0 0.001659 TRIO_TRUE_STATES : 2 0 1 0.011109 TRIO_TRUE_STATES : 2 0 2 0.018596 TRIO_TRUE_STATES : 2 1 0 0.011109 TRIO_TRUE_STATES : 2 1 1 0.074384 TRIO_TRUE_STATES : 2 1 2 0.124512 TRIO_TRUE_STATES : 2 2 0 0.018596 TRIO_TRUE_STATES : 2 2 1 0.124512 TRIO_TRUE_STATES : 2 2 2 0.208422 TRIO_OBS_STATES : 0.005000 : 0 0 0 0.000159 TRIO_OBS_STATES : 0.005000 : 0 0 1 0.001047 TRIO_OBS_STATES : 0.005000 : 0 0 2 0.001726 TRIO_OBS_STATES : 0.005000 : 0 1 0 0.001047 TRIO_OBS_STATES : 0.005000 : 0 1 1 0.006905 TRIO_OBS_STATES : 0.005000 : 0 1 2 0.011385 TRIO_OBS_STATES : 0.005000 : 0 2 0 0.001726 TRIO_OBS_STATES : 0.005000 : 0 2 1 0.011385 TRIO_OBS_STATES : 0.005000 : 0 2 2 0.018770 TRIO_OBS_STATES : 0.005000 : 1 0 0 0.001047 TRIO_OBS_STATES : 0.005000 : 1 0 1 0.006905 TRIO_OBS_STATES : 0.005000 : 1 0 2 0.011385 TRIO_OBS_STATES : 0.005000 : 1 1 0 0.006905 TRIO_OBS_STATES : 0.005000 : 1 1 1 0.045538 TRIO_OBS_STATES : 0.005000 : 1 1 2 0.075078 TRIO_OBS_STATES : 0.005000 : 1 2 0 0.011385 TRIO_OBS_STATES : 0.005000 : 1 2 1 0.075078 TRIO_OBS_STATES : 0.005000 : 1 2 2 0.123781 TRIO_OBS_STATES : 0.005000 : 2 0 0 0.001726 TRIO_OBS_STATES : 0.005000 : 2 0 1 0.011385 TRIO_OBS_STATES : 0.005000 : 2 0 2 0.018770 TRIO_OBS_STATES : 0.005000 : 2 1 0 0.011385 TRIO_OBS_STATES : 0.005000 : 2 1 1 0.075078 TRIO_OBS_STATES : 0.005000 : 2 1 2 0.123781 TRIO_OBS_STATES : 0.005000 : 2 2 0 0.018770 TRIO_OBS_STATES : 0.005000 : 2 2 1 0.123781 TRIO_OBS_STATES : 0.005000 : 2 2 2 0.204076 OUTPUT_FOR_ENSUING_COMMAND_LINE : p 0.230000 mu 0.005000 probs 0.0001587736028230 0.0010470733600865 0.0017262986445947 0.0010470733600865 0.0069051945783786 0.0113845204125267 0.0017262986445947 0.0113845204125267 0.0187695369843827 0.0010470733600865 0.0069051945783786 0.0113845204125267 0.0069051945783786 0.0455380816501068 0.0750781479375310 0.0113845204125267 0.0750781479375310 0.1237805391329341 0.0017262986445947 0.0113845204125267 0.0187695369843828 0.0113845204125267 0.0750781479375310 0.1237805391329341 0.0187695369843827 0.1237805391329341 0.2040756503581872 PROB_LOC_IS_INCOMPAT : 0.231853
The purple parts at the beginning of each line are tags indicating what is on that line. For now, we will focus on the TRIO_OBS_STATES lines. It has fields separated by colons (shown in pink). The second field shows the genotyping error rate, which is 0.005. The third field starts with three columns that show the genotypic configuration. For example, 0 2 2 says that the kid (indiv 1) has a genotype with zero “1” alleles, the putative father has a genotype with two “1” alleles, and so does the mother. The fourth column of the third field is the probability of that genotypic configuration.
The most important line in the output is the OUTPUT_FOR_ENSUING_COMMAND_LINE which holds all the genotypic configuration probabilities in a single line that can be used as input for TrioTests. We'll come back to this when we start using TrioTests. In the meantime, we will do two more examples with snpSumPed.
Let's do the same thing now, but have the three individuals be related as a nuclear family. To do this, we merely change the --ped statement in which individual 1 is the child, to reflect such a relationship, i.e., the command line becomes:
/eriq/--% snpSumPed --minor-freq .23 --gtyp-error .005 --fixed-geno 1 2 3 --ped 1 2 3 --ped 2 0 0 --ped 3 0 0
Simple enough. Note that it is tempting to drop the --ped 2 0 0 and the --ped 3 0 0 parts of the command line. In fact, this still works, but it does not do so in all cases. So, you should always include lines like --ped X 0 0 to explicitly declare that X is a founder.
As the relationships get more complex, it will become cumbersome to specify the pedigree connecting the three individuals on the command line. We can use the --command-file option to put the pedigree statements into a file and call that from the command line. The directory pedfiles has the pedigree files for all the relationships used in A&G2006. Let's look at the file for the H_Si relationship. It is called H_Si.ped. Note that in all the files in pedfiles we follow the convention that 1 is the youth, 2 is the father, and 3 is the mother. This is a convention you will want to get into the habit of, because the program TrioTests, which can use output from snpSumPed expects to get data ordered in just the same way.
& One putative parent is a half sib of the youth and the other is a full sib of the true parent that the youth does not share with the other putative parent & --ped 1 4 5 --ped 2 4 6 --ped 3 0 0 --ped 4 0 0 --ped 5 0 0 --ped 6 0 0 --kappa 3 5 .25 .5 .25
The parts between successive ”&” symbols in a command file are ignored. Thus, you can put comments in there. This file also shows the use of the --kappa option. Recall from the --help output that the syntax for the kappa option is:
--kappa [J1] [J2] [R0] [R1] [R2]
In the command file, the kappa option says that individuals 3 (the putative mother) and 5 (the true mother) are related via identity coefficients (.25,.5,.25), which are the coefficients describing a full sib relationship. These coefficients means that with probability 25% the pair shares no genes IBD at a locus; with probability 50% they share 1 gene IBD at the locus, and with 25% they share two genes IBD. To run snpSumPed using this pedfile, we merely use the --command-file option and specify the path to the file H_Si.ped. Depending on which directory you are working in, the command might look like:
/distrib/--% snpSumPed --minor-freq .23 --gtyp-error .005 --fixed-geno 1 2 3 --command-file pedfiles/H_Si.ped
The program TrioTests simulates the genotypes of trios of individuals and it computes the probability of those genotypes under various hypotheses of the relationship between the trio. It is to be used in conjunction with snpSumPed for approximating false positive rates and false negative rates given a certain set of loci. It is written with flexibility in mind, which means that it is necessary to give it a lot of information (most of it being output from snpSumPed) in order to make it run, and the output typically has to be processed in order to get a result out of it. To make the process of running TrioTests more straightforward, I supply some scripts in the scripts directory that help to automate the process.
This tutorial will not attempt to cover all the options in the program, but will show the simple application of TrioTests and some of the scripts that come with it. More complete documentation is available using the --help-full option to TrioTests.
The most important option (and the only one required) to TrioTests is the -l or --new-locus-group option. This allows you to specify that you wish to include in the simulated genotypes for each individual a collection of loci that all have the same frequency of the 0 allele. The full documentation for this option is:
-l , --new-locus-group [Complex Format]
Enter information about a group of loci that all have the same probs
of the different trios. Can be used up to 5000 times. The syntax is
--new-locus-group [J1] p [R1] mu [R2] probs [R3]-[R29] p [R30] mu
[R31] probs [R32]-[R58] p [R59] mu [R60] probs [R61]-[R87]. J1 is the
number of loci that you want to have with these properties. p and mu
are just string tags preceding the freqs R1 and R30 and mutation
rates R2 and R31 respectively. R3-R29 are the 27 probabilities of
genotypic states---cycling from 0 to 2 over youth then pa then
ma---under the alternative hypothesis and R32-R58 are the same
probabilities under the null hypothesis and R61-R87 are the actual
probs for the trio under their true relationship. The program will
check to make sure that R1==R30==R59 and R2==R31==R60---i.e. it
checks to make sure that the value of p and mu you think you are
dealing with is the same for the three different sets of
probabilities. Typically the alternative hypothesis is that the trio
consists of two parents and their child. The null hypothesis is
typically that the trio is unrelated. I include them on the command
line like this so we can look at the distribution of the test
statistic under different null and alternative hypotheses. Output of
the program snpSumPed can be used to fill out these numbers. Note
that the order of the probabilities is such that the first one
corresponds to youth=0; pa=0; ma=0. The second one is youth=0; pa=0;
ma=1. The third is youth=0; pa=0; ma=2. The fourth is youth=0; pa=1;
ma=0; and so forth. Note that the program snpSumPed outputs data in
this format for use on the TrioTests command line.
This is a big mouthful, but all it is saying is that this option provides a place to use one of the output lines from snpSumPed on the command line of TrioTests. Recall the output line tagged with OUTPUT_FOR_ENSUING_COMMAND_LINE from the program snpSumPed—the part after the colon is exactly in the right format for the -l option to TrioTests, so long as you have followed the convention that individual 1 is the kid, 2 is the putative father, and 3 is putative mother.
The other important thing the above piece of documentation is telling you is that for each locus, you need to specify three different sets of probabilities. The first is the probabilities of the 27 genotypic configurations under the Alternative hypothesis. This is typically the hypothesis that the kid and the putative parents represent a true kid-mother-father trio. The second set of probabilities are for the 27 genotypic configurations under the Null Hypothesis. This is typically the hypothesis that the kid and the putative parents are all unrelated to one another. The last set of probabilities is for the True Relationship, which is merely the true relationship between the three individuals.
In order to put all these sets of probabilities in the correct place on the command line, you could cut and paste them from the output of snpSumPed. That would, however, be very tedious, and bad for your wrists. A better way to do it is to learn how to use Unix, and one of its more useful shells, like bash. In the following, I will show some useful scripts, and how to use them. It assumes a familiarity with Unix.
I provide the shell script NewLocusGroup.sh for forming the -l commands for TrioTests. It takes as arguments in this order:
--ped and --kappa statements which define the true relationship between the members of the trio.--ped and --kappa statements which define the null hypothesis relationship between the members of the trio. If this is omitted, the null hypothesis is that of a totally unrelated trio.--ped and --kappa statements which define the alternative hypothesis relationship between the members of the trio. If this is omitted, the alternative hypothesis is that of a true kid-mother-father trio. Obviously, since these are positional parameters, this optional option cannot be given without also giving the preceding optional option.
It returns a full --new-locus-group option command string.
The script looks like this:
#!/bin/bash if [ $# -lt 4 ]; then echo "Syntax: NewLocusGroup.sh <numloci> <zero-allele-freq> <gtyp-error-rate-per-allele> <TrueRelatFile> [NullRelatFile] [AltRelatFile]" echo exit 1; fi L=$1 MU=$3 P=$2 TRUEFILE=$4 NULLFILE=$5 ALTFILE=$6 # if a NullFile was included, use it. Otherwise it is just unrelated if [ $# -lt 5 ]; then NULL=$(snpSumPed -p $P --ped 1 0 0 --ped 2 0 0 --ped 3 0 0 -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2); else NULL=$(snpSumPed -p $P --command-file $NULLFILE -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2); fi # if an alt file was included, use it. Otherwise it is just parental if [ $# -lt 6 ]; then ALT=$( snpSumPed -p $P --ped 1 2 3 --ped 2 0 0 --ped 3 0 0 -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2); else ALT=$(snpSumPed -p $P --command-file $ALTFILE -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2); fi TRUE=$(snpSumPed -p $P --command-file $TRUEFILE -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2); # in the end we just print out the command line to stdout echo "--new-locus-group $L $ALT $NULL $TRUE";
Note that it assumes that bash is available on your system in the /bin path. It also assumes that you have put snpSumPed somewhere in the search path of your system.
OK, we now can quickly formulate the arguments to the --new-locus-group (same as -l) option...now, what do we do with that? If we feed TrioTests just a --new-locus-group option, then it processes it and starts generating simulated values of Lambda and importance weights. But, we probably have an idea of how we want to run these things, so, let's look at setting values other than the defaults. The other important options I will cover in this tutorial are:
-R or --reps sets the number of Monte Carlo samples to take. This is N in equation 6 of A&G2006.-P or --prop-imp-samp sets the proportion of --reps which will be done following the importance sampling procedure of equation 7 in A&G2006, the remainder will be done following equation 6. In reality, there is little reason not to set this to be 1.0, but the default is .7.--none-from-alt: issuing this option makes TrioTests compute only terms in the sum for determining the false positive rate given the true relationship. By default, however, it also simulates genotypes from the probabilities associated with the alternative hypothesis. These can be used to compute the false negative rates—i.e., if the alternative hypothesis is true, how often will it be rejected?--import-meth: this controls which method of importance sampling is used. It must be followed by a single word which is either “use-parental” or “logistic-weighted”. The former gives you what we refer to as I1 in the paper, and the latter gives what we refer to as I2. In the case of I2, you may set the parameter M (see equation 9 in A&G2006) using the --max-adj option. The default importance sampling method is I1. Note that it doesn't work well for some relationships. See A&G2006.
Here we do a simple example with 10 loci having frequency of the zero allele equal to .25 with the true relationship being totally unrelated. We will do only 5 reps and 100% of them will be using the default importance sampling method of I1. In the following example this is done from the directory (here called distrib) that contains the pedfiles and scripts directories. The command looks like:
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 10 .25 .005 ./pedfiles/C_U_U.ped) -R 5 -P 1.0
There is a considerable amount of output, but we will only pay attention to the last few lines:
MC_IMP : 4.418367 T 1.205390e-02 MC_VANILLA : 4.601536 A 1.0 MC_IMP : 3.346053 T 3.522310e-02 MC_VANILLA : 3.175920 A 1.0 MC_IMP : 4.209573 T 1.485271e-02 MC_VANILLA : 4.889234 A 1.0 MC_IMP : -2.562268 T 1.296519e+01 MC_VANILLA : 3.921787 A 1.0 MC_IMP : 4.569243 T 1.036581e-02 MC_VANILLA : 3.909000 A 1.0
The output tagged by MC_IMP or MC_VANILLA is what we are most interested in. In each such line, there are three columns in the second field (after the colon). These columns are
T or A denoting whether the genotypes of the trio were simulated under the (T)rue distribution or the (A)lternative hypothesis.
This is the output needed to compute the averages shown as sums in Equations 6 and 7 in A&G2006.1) The lines marked with A's had the genotypes simulated from the true distribution, so they are useful for computing the false negative rate. The others, marked with a T are useful for computing false positive rates. You could probably do these simple calculations by cutting and pasting the output into some spreadsheet program like Excel, but that would be inefficient and inelegant. In the next section I show a simple perl script that will compute the sums from the output of TrioTests.
The output shown above contains the individual values of the importance weights (in column 3) and in column 1 the observed value of the log of the likelihood ratio for each simulated genotype. TrioTests kicks these values out in this format so that you can easily plot their distribution if you so desire. However, often all you want to do is compute the average of the importance weights, each multiplied by one or by zero according to whether or not the value of the likelihood ratio in the second column is greater than (for computing false positive rates) or less than (for computing false negative rates) some critical value of Lambda, <math tex>\Lambda_c</math>. So, you may first want to figure out what <math tex>\Lambda_c</math> should be. Remember that if you increase <math tex>\Lambda_c</math>, you decrease your false positive rate, but you also increase your false negative rate.
To help process the output from TrioTests in illuminating ways, I provide the script LambdaDist.pl in the scripts folder. It operates directly on TrioTests output, so you can either pipe the output directly into it, or you can store the output from TrioTests into a file and then redirect that into LambdaDist.pl. If you invoke LambdaDist.pl with the -h option, it gives you instructions for its use:
Usage
LamdaDist [-h] [prob1] ... [probn] ... [lambda1] ... [lambdan]
Options
-h
Displays help information and exits if -h is the first option.
If -h is not the first option, then unpredictable, possibly
undesirable results may occur.
If called with no arguments with output of TrioTests directed into stdin,
this script will return 7 columns of output. The first is the value of
lambda. The second is the probability (approximated by Monte Carlo) that
a randomly drawn trio formed under the true trio relationship has a
likelihood ratio exceeding lambda. The third is the number of genotypes
simulated under the true relationship that exceeded lambda, the
fourthth is the standard error of the probability in column 2.
The fifth is the probability (approximated by Monte
Carlo) that a genotype simulated under the alternative hypothesis
(the one in the numerator of the likelihood ratio) exceeds lambda. The
sixth is the number of genotypes simulated under the alternative
hypothesis exceeding that value of lambda. The seventh is the standard
error of the estimate in column five. Note that column 2 holds
the value of alpha [see equation 6 in Anderson and Garza (2006)]
corresponding to the lambda in column 1, and column 4 holds the value of
1 - Beta [see equation 5 in Anderson and Garza (2006)].
Otherwise, LambdaDist can be called with a list of numbers. For each
number X between 0 and 1, the script will return the line (formatted
as above) that corresponds to the largest value of lambda for which either alpha
or 1-beta still exceeds X. If X>1, it will return the line corresponding to
the largest value of lambda less than X. In either case, the value of
X is prepended to the line, so all the columns are shifted over by one.
I give a complete listing of LambdaDist.pl in the appendix at the end. Following are some examples of its use, which also show how to use TrioTests and snpSumPed together.
Let's reproduce some of the results in Table 1 of A&G2006. We'll use 60 loci with zero-allele frequency of .2 and the per-allele genotyping error rate of .005. Since we are interested in <math tex>\beta=0.1</math>, we call LambdaDist.pl with .9 (= 1-.1) as the argument. We'll also use 100,000 Monte Carlo samples.
For the truth being “totally unrelated trio” we have:
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_U_U.ped) -R 100000 -P 1.0 | ./scripts/LambdaDist.pl .9
which returns the line
9.00000e-01 13.68605 5.4937e-08 89968 5.1040e-10 9.0001e-01 90001 9.4865e-04
Note that columns 3 and 5 there are the same (give or take some random variation) as what appears in Table 1 of A&G2006. It is worth pointing out that the SD columns in that table are giving the standard error of the mean in terms of percentages of the mean—somehow that point was omitted from the text and caption of the table.
Suppose we wanted to know what the relationship between <math tex>\alpha</math> and <math tex>1-\beta</math> was for a few more values of <math tex>1-\beta</math>. We could do like this:
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_U_U.ped) -R 100000 -P 1.0 | ./scripts/LambdaDist.pl .1 .2 .3 .4 .5 .6 .7 .8 .9
which returns the series of lines:
1.00000e-01 25.47267 2.4572e-13 10193 3.3637e-15 1.0001e-01 10001 9.4873e-04 2.00000e-01 23.47864 3.0575e-12 20272 3.0847e-14 2.0000e-01 20000 1.2649e-03 3.00000e-01 22.10978 1.6719e-11 30227 1.4293e-13 3.0000e-01 30000 1.4491e-03 4.00000e-01 20.82366 6.6123e-11 40312 5.0597e-13 4.0000e-01 40000 1.5492e-03 5.00000e-01 19.72696 2.2401e-10 50326 1.5974e-12 5.0000e-01 50000 1.5811e-03 6.00000e-01 18.66416 7.2063e-10 60236 5.0262e-12 6.0001e-01 60001 1.5492e-03 7.00000e-01 17.37130 2.4470e-09 70352 1.7428e-11 7.0001e-01 70001 1.4491e-03 8.00000e-01 15.77251 9.1759e-09 80205 7.0631e-11 8.0001e-01 80001 1.2649e-03 9.00000e-01 13.69318 5.3848e-08 90184 5.0442e-10 9.0001e-01 90001 9.4865e-04
Now, what if the truth is: “the putative mother is the true mother, and the putative father is a double-first-cousin of the true father?”. That corresponds to the relationship described in C_Se_DFC.pdf which looks like:
& C-type with one putative parent self and the other double first cousin & --ped 1 4 5 --ped 2 0 0 --ped 3 0 0 --ped 4 0 0 --ped 5 0 0 --kappa 2 4 0 0 1 --kappa 3 5 .5625 .375 .0625
From Table 1 in A&G2006, it is clear that for this relationship, it is best to use I2. We can do that using the --import-meth option and making --max-adj equal to .85 to match the simulations done in A&G2006. Our command line then looks like:
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_Se_DFC.ped) -R 100000 -P 1.0 --import-meth logistic-weighted --max-adj .85 | ./scripts/LambdaDist.pl .9
which yields the line
9.00000e-01 13.69975 8.5335e-03 87217 7.8853e-05 9.0001e-01 90001 9.4865e-04
which, again, accords very well with Table 1 in A&G2006 as it should.
Now, let's reproduce the result for the PSi case in which we know the true mother, and the putative father is a full sibling of the true father. Thus the true relationship is one in which the mother is correct, but the father is a full sib, hence C_Se_Si.ped. However, by equation 4, when doing paternity inference, the null hypothesis is the relationship specified in C_Se_U.ped. We can specify that in option 5 (the first OPTIONAL option) to NewLocusGroup.sh. Also, we note from Table 4 that this is an inference problem for which I2 outperforms I1, so we will use I2, and we will set M to be .85. The command line we wish to use is:
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_Se_Si.ped ./pedfiles/C_Se_U.ped) \ -R 100000 -P 1.0 --import-meth logistic-weighted --max-adj .85 | ./scripts/LambdaDist.pl .9
which produces the result:
9.00000e-01 7.17151 3.6583e-02 86924 2.1988e-04 9.0001e-01 90001 9.4865e-04
When we compare this to the result in Table 1 of A&G2006, we find that...hmm...it doesn't look quite right. Table 1 shows a false positive rate of 1.08e-04 while in the above, we find a false positive rate more than two orders of magnitude higher: 3.6583e-02. That is odd. Let's check at a series of values for <math tex>1-\beta</math>:
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_Se_Si.ped ./pedfiles/C_Se_U.ped) \ -R 100000 -P 1.0 --import-meth logistic-weighted --max-adj .85 | \ ./scripts/LambdaDist.pl .1 .2 .3 .4 .5 .6 .7 .8 .9
yields in the first portion of its output:
1.00000e-01 17.02699 1.1193e-04 13283 1.2003e-06 1.0001e-01 10001 9.4873e-04 2.00000e-01 15.36009 4.1060e-04 23857 3.3012e-06 2.0000e-01 20000 1.2649e-03 3.00000e-01 14.19333 9.6593e-04 33574 6.6161e-06 3.0000e-01 30000 1.4491e-03 4.00000e-01 13.16825 1.8589e-03 42484 1.1455e-05 4.0000e-01 40000 1.5492e-03 5.00000e-01 12.21484 3.3735e-03 51539 1.9214e-05 5.0000e-01 50000 1.5811e-03 6.00000e-01 11.26640 5.7247e-03 60038 3.1039e-05 6.0001e-01 60001 1.5492e-03 7.00000e-01 10.20174 9.9320e-03 68880 5.3142e-05 7.0001e-01 70001 1.4491e-03 8.00000e-01 8.93196 1.7736e-02 77758 9.6346e-05 8.0001e-01 80001 1.2649e-03 9.00000e-01 7.12819 3.6625e-02 87198 2.1974e-04 9.0001e-01 90001 9.4865e-04
and what we see is...
Holy Smokes!!
For the paternity inference cases, Table 1 in A&G2006 reports false positive rates that correspond to false negative rates of .90 instead of .1. Well, there is a good lesson in the importance of triple checking results. The paternity scenarios were done separately from the rest, and before I had written a handy perl script to make the process as simple as it is now. (Previously I accomplished the same with a series of awk, sed, and bash scripts that I cobbled together—obviously incorrectly in the case of paternity inference—to get the same results).
Well, perhaps the slightest of silver linings may be found in the fact that, since I need to redo the simulations to correct the last five lines of Table 1 in an erratum sent to Genetics, we can really see how to string all these programs and files together to make some calculations. That will be the topic of the next section.
I must redo the simulations to obtain correct values in Table 1 for PU, PDFC, PSi, PF, and PH. These paternity scenarios correspond to trio relationships described in the files C_Se_U.ped, C_Se_DFC.ped, C_Se_Si.ped, B_Se.ped, and H_Se.ped, respectively. In all cases, the null hypothesis is that the trio has the relationship specified in the file C_Se_U.ped.
For each true relationship, we must first perform 100 replicate simulations with N=# of Monte Carlo Reps=100,000, for each of the two different importance sampling schemes use-parental and logistic-weighted. We want to extract from each simulation the line of output corresponding to 1-Beta = .90.
We can do this sort of repetitious task easily using the scripting capabilities of the bash shell in Unix, to tie together all the scripts that have been described in this tutorial up to this point. We'll do this so that each line of output has a column to say what the true relationship is and what the importance sampling method was. Then we will be able to summarize the results easily using the R statistical language.
There is no reason to run simulated values from the alternative hypothesis all the time. In re-doing the paternity inference simulations, we note that the false negative rates depend solely on the alternative hypothesis, which, looking at Equation 4 in A&G2006 we see corresponds to C_Se_Se.ped, i.e., a true parental trio. We can use a trick with our scripts to generate output from TrioTests that corresponds only to this alternate hypothesis: we simulate with the true relationship being C_Se_Se.ped, and then we change all the T's into A's in the output like so:
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_Se_Se.ped ./pedfiles/C_Se_U.ped) \ -R 500000 -P 1.0 | \ awk '/^MC_IMP/ || /^MC_VANILLA/ {print}' | \ sed 's/T/A/g;'
The output looks like:
MC_IMP : 8.928114 A 1.000000e+00 MC_VANILLA : 11.080037 A 1.0 MC_IMP : 11.339063 A 1.000000e+00 MC_VANILLA : 10.420669 A 1.0 MC_IMP : 14.771421 A 1.000000e+00 MC_VANILLA : 14.547935 A 1.0 MC_IMP : 17.978395 A 1.000000e+00 MC_VANILLA : 6.443150 A 1.0 MC_IMP : 13.711394 A 1.000000e+00 MC_VANILLA : 6.082480 A 1.0 [...]
and consists of 1 million realizations. Then we use LambdaDist.pl to get the critical value of Lambda associated with 1-beta = .9:
/distrib/--% LambdaDist.pl .9 < PatFromAlt.txt
which yields
9.00000e-01 7.15897 0.0000e+00 0 9.0000e-01 900000 3.0000e-04
which tells us that the critical value of Lambda that we are interested in is 7.15897. So, it is for this value that we will compute corresponding alphas for all the different relationships.
Note that from the perspective of computing variability in the estimates, it is important to do it this way—otherwise you incorporate variability in the value of Lambda used into the variability in the estimate of the variability in alpha.
The first thing that we will want to do is make sure that all the scripts and programs mentioned above are in our shell's search $PATH. If you don't know what that means, pick up a good O'Reilly book about Unix and have a quick read through it.
Then, our script which is called RedoPaternitySims.sh, and is included in the scripts directory, looks like:
#!/bin/bash # here is the location of the .ped files on my system PEDDIR="/Users/eriq/Documents/eca_code/FPG/distrib/pedfiles" # here are the relationship files we want to use RELATS="C_Se_U.ped C_Se_DFC.ped C_Se_Si.ped B_Se.ped H_Se.ped"; # here are the --imp-method arguments we'll use IMPS="use-parental logistic-weighted"; for relat in $RELATS; do for imp in $IMPS; do MAXA=" "; if [ $imp = "logistic-weighted" ]; then MAXA="--max-adj .86 "; # note that .85 is too low for these runs. # so I used .86. will note that in the # erratum. fi # do 100 reps of this for((rep=1;rep<=100;rep++)); do # here we throw out the columns telling us which relationship and imp method # we are using: echo -n "$relat $imp $rep "; # here, we actually run TrioTests TrioTests $(NewLocusGroup.sh 60 .20 .005 $PEDDIR/$relat $PEDDIR/C_Se_U.ped ) \ -R 100000 --none-from-alt -P 1.0 --import-meth $imp $MAXA | LambdaDist.pl 7.15897 # note the use of a critical value of lambda as an argument to # LambdaDist.pl done done done
Note that we have used the --none-from-alt option. This is because there is no need to simulate from the alternative hypothesis, because we already know that our critical value of lambda is going to be 7.15897. This will speed things up by a factor of two or more.
I am going to make a new directory for these simulations in:
/Users/eriq/Documents/work/prj/FPG/PaternityRedo/
Launching them on the G5 at work looks like:
/PaternityRedo/--% pwd /Users/eriq/Documents/work/prj/FPG/PaternityRedo /PaternityRedo/--% date Wed May 10 15:18:34 PDT 2006 /PaternityRedo/--% RedoPaternitySims.sh > PaternityRedoOUTPUT.txt & [1] 12468
It looks like it is going to take about 2.5 hours. Interestingly, the perl script part of the process appears to take many times longer than the TrioTests part of the process. Not surprising—compiled code tends to have a speed advantage. Will consider that in future rewrites.
The output of the simulations looks like:
C_Se_U.ped use-parental 1 7.15897e+00 7.15894 4.5105e-05 89974 3.6694e-07 0.0000e+00 0 C_Se_U.ped use-parental 2 7.15897e+00 7.15895 4.3684e-05 89932 3.6082e-07 0.0000e+00 0 C_Se_U.ped use-parental 3 7.15897e+00 7.15892 4.5095e-05 89871 3.6617e-07 0.0000e+00 0 C_Se_U.ped use-parental 4 7.15897e+00 7.15869 4.4701e-05 89955 3.6405e-07 0.0000e+00 0 C_Se_U.ped use-parental 5 7.15897e+00 7.15891 4.3989e-05 89992 3.5959e-07 0.0000e+00 0 C_Se_U.ped use-parental 6 7.15897e+00 7.15879 4.4367e-05 89974 3.6460e-07 0.0000e+00 0 C_Se_U.ped use-parental 7 7.15897e+00 7.15881 4.3889e-05 89860 3.5894e-07 0.0000e+00 0 C_Se_U.ped use-parental 8 7.15897e+00 7.15881 4.4411e-05 90154 3.6534e-07 0.0000e+00 0 C_Se_U.ped use-parental 9 7.15897e+00 7.15877 4.3671e-05 90068 3.5806e-07 0.0000e+00 0 C_Se_U.ped use-parental 10 7.15897e+00 7.15895 4.4981e-05 90145 3.6706e-07 0.0000e+00 0
Note that since we had no A type realizations, the last two columns are meaningless. Not only that, but every once in a while LambdaDist.pl picked out another line to print. Those spots look like the third line below:
B_Se.ped logistic-weighted 8 7.15897e+00 7.15870 6.7412e-02 90554 4.0441e-04 0.0000e+00 0 B_Se.ped logistic-weighted 9 7.15897e+00 7.15890 6.8184e-02 90627 4.0965e-04 0.0000e+00 0 7.15897e+00 -15.57053 1.1820e+01 100000 1.1060e+01 0.0000e+00 0 B_Se.ped logistic-weighted 10 7.15897e+00 7.15870 6.8046e-02 90695 4.0919e-04 0.0000e+00 0 B_Se.ped logistic-weighted 11 7.15897e+00 7.15881 6.8491e-02 90639 4.1302e-04 0.0000e+00 0 B_Se.ped logistic-weighted 12 7.15897e+00 7.15880 6.8113e-02 90367 4.0838e-04 0.0000e+00 0
it appears this comes from some unstable importance sampling elements for very low lambdas (a region of the distribution that is not of concern to us). So, we apply an awk script to remove those lines, and to remove the last two columns of the lines we keep, and also put a header line in there to name the columns for us:
/PaternityRedo/--% awk ' \ BEGIN {print "relat imp rep OmB Lambda alpha n.obs.a sd.a"} \ NF==10 {for(i=1;i<=8;i++) printf("%s ",$i); printf("\n");} \ ' PaternityRedoOUTPUT.txt > PaternityRedo4R.txt
The output is called PaternityRedo4R.txt which can be used in the R-script below.
In the scripts directory is a script called SummarizeSims.R for the R statistical language that reads in PaternityRedo4R.txt and summarizes these simulations for us. The script looks like:
# read the file in. x <- read.table("/Users/eriq/Documents/work/prj/FPG/PaternityRedo/PaternityRedo4R.txt", header=T); # make a list of the relationships in the order we want them in: Relats <- c("C_Se_U.ped","C_Se_DFC.ped","C_Se_Si.ped","B_Se.ped","H_Se.ped"); # then make a list of their LaTeX typographical equivalents Names <- c('$P_\\mathrm{U}$',"$P_\\mathrm{DFC}$","$P_\\mathrm{Si}$","$P_\\mathrm{F}$","$P_\\mathrm{H}$"); # imp-methods in the order we want them Imps <- c("use-parental", "logistic-weighted"); # the following is not very elegant "R-style" but it gets # the job done... c <- 0; for (r in Relats) { c <- c+1; outstr <- Names[c]; for (i in Imps) { pick <- x$relat==r & x$imp==i; N.obs <- sum(pick); mean.a <- mean(x$alpha[pick]); SD <- sd(x$alpha[pick]) / mean.a; # compute it first SD.MAX <- max(x$sd.a[pick]) / mean.a; SD.MIN <- min(x$sd.a[pick]) / mean.a; # make a vector of upper CI limits ci.hi <- x$alpha[pick] + qnorm(.95) * x$sd.a[pick]; ci.lo <- x$alpha[pick] - qnorm(.95) * x$sd.a[pick]; # then count how many times mean.a is outside of # those CI limits; ci.out <- sum( ci.hi < mean.a | ci.lo > mean.a ); # here we add those values to the output string outstr <- sprintf("%s & %.2e & %.2f & %.2f & %.2f & %d",outstr,mean.a,SD*100,SD.MIN*100,SD.MAX*100,ci.out); } outstr <- sprintf("%s\n",outstr); cat(outstr); }
The script produces output that looks like:
$P_\mathrm{U}$ & 4.44e-05 & 0.86 & 0.80 & 0.83 & 12 & 5.15e-05 & 210.00 & 3.41 & 1813.93 & 58
$P_\mathrm{DFC}$ & 2.07e-03 & 0.89 & 0.81 & 0.88 & 17 & 2.06e-03 & 2.08 & 1.66 & 4.86 & 12
$P_\mathrm{Si}$ & 3.67e-02 & 0.92 & 0.83 & 1.02 & 10 & 3.67e-02 & 0.58 & 0.58 & 0.61 & 9
$P_\mathrm{F}$ & 6.44e-02 & 173.34 & 3.51 & 1471.54 & 60 & 6.80e-02 & 0.57 & 0.59 & 0.61 & 7
$P_\mathrm{H}$ & 3.45e-01 & 42.76 & 2.73 & 300.29 & 38 & 3.83e-01 & 0.28 & 0.31 & 0.31 & 5
That is almost ready to go. To match the format in Table 1 we also need to have a final column for the speed-up factor (ISI, as it was called). So, we save the above text in the file PrelimTable.txt and then operate on it with awk to compute the ISI.
/PaternityRedo/--% awk ' \ BEGIN {FS="&"} \ {a=$2; X=$3; ISI1=(1-a) / ( 10 * a * X*X); \ a=$7; X=$8; ISI2=(1-a) / ( 10 * a * X*X); \ if(ISI1>ISI2) ISI=ISI1; else ISI=ISI2; \ printf("%s & %.2e \\\\\n",$0,ISI); \ } \ ' PrelimTable.txt
which produces the output:
$P_\mathrm{U}$ & 4.44e-05 & 0.86 & 0.80 & 0.83 & 12 & 5.15e-05 & 210.00 & 3.41 & 1813.93 & 58 & 3.05e+03 \\
$P_\mathrm{DFC}$ & 2.07e-03 & 0.89 & 0.81 & 0.88 & 17 & 2.06e-03 & 2.08 & 1.66 & 4.86 & 12 & 6.09e+01 \\
$P_\mathrm{Si}$ & 3.67e-02 & 0.92 & 0.83 & 1.02 & 10 & 3.67e-02 & 0.58 & 0.58 & 0.61 & 9 & 7.80e+00 \\
$P_\mathrm{F}$ & 6.44e-02 & 173.34 & 3.51 & 1471.54 & 60 & 6.80e-02 & 0.57 & 0.59 & 0.61 & 7 & 4.22e+00 \\
$P_\mathrm{H}$ & 3.45e-01 & 42.76 & 2.73 & 300.29 & 38 & 3.83e-01 & 0.28 & 0.31 & 0.31 & 5 & 2.05e+00 \\
which is exactly the output we need to replace those last 5 lines of the table using LaTeX, after changing the appropriate columns to italic font.
So, it is clear that there is less power for paternity inference than was reported in Table 1 in A&G2006.
Here is the listing of the perl script LambdaDist.pl:
#!/usr/bin/perl use strict; if(defined(@ARGV) && $ARGV[0] eq "-h") { &Usage(); exit 1; } my @vals = @ARGV; my %valCheck; # declare lists to store lots of values my @AllOfEm; my %Num; # now, put those values in a single list of 3-vectors foreach (<STDIN>) { if(m/^MC_IMP/ || m/^MC_VANILLA/) { my ($lam,$type,$imp) = (split)[2..4]; push(@AllOfEm, [$lam,$imp,$type]); $Num{$type}+=1.0; } } my $prev_A = sprintf("%.4e %9d",0,0); my $prev_T = sprintf("%.4e %9d",0,0); my ($curT,$curA); my ($sum_A,$sum_T,$ssA,$ssT); # to compute running mean and sd. my $cnt_T=0; my $cnt_A=0; my $printStr; # then get the alphas and 1-Betas from the T entries by sorting # in reverse by lambda and summing as appropriate foreach (sort {@{$b}[0]<=>@{$a}[0]} @AllOfEm) { if(${$_}[2] eq "T") { $sum_T += @{$_}[1]/$Num{"T"}; $cnt_T++; $ssT += @{$_}[1]**2; my $n = $Num{"T"}; my $var = $ssT/($n-1.0) - ( ($n/($n-1.0)) * ( $sum_T**2) ); my $sd = sqrt($var/$n); $prev_T = sprintf("%.4e %9d %.4e",$sum_T,$cnt_T,$sd); } elsif(${$_}[2] eq "A") { $sum_A += @{$_}[1]/$Num{"A"}; $cnt_A++; $ssA += (@{$_}[1])**2; my $n = $Num{"A"}; my $var = $ssA/($n-1.0) - ( ($n/($n-1.0)) * ( $sum_A**2) ); my $sd = sqrt($var/$n); $prev_A = sprintf("%.4e %9d %.4e",$sum_A,$cnt_A,$sd); } if( !defined(@vals)) { printf("%-11.5f %s %s\n",@{$_}[0],$prev_T,$prev_A); } else { # otherwise we go through the list of numbers and take the # first value of either that exceeds it foreach my $val (@vals) { if($sum_A>=$val && !defined($valCheck{"A"}{$val}) ) { printf("%.5e %-11.5f %s %s\n",$val, @{$_}[0], $prev_T,$prev_A); $valCheck{"A"}{$val} = 1; } if($sum_T>=$val && !defined($valCheck{"T"}{$val}) ) { printf("%.5e %-11.5f %s %s\n",$val, @{$_}[0], $prev_T,$prev_A); $valCheck{"T"}{$val} = 1; } if($val>1 && @{$_}[0]<$val && !defined($valCheck{"Lambda"}{$val}) ) { printf("%.5e %-11.5f %s %s\n",$val, @{$_}[0], $prev_T,$prev_A); $valCheck{"Lambda"}{$val} = 1; } } } } sub Usage { print <<__USAGE__; Usage LamdaDist [-h] [prob1] ... [probn] ... [lambda1] ... [lambdan] Options -h Displays help information and exits if -h is the first option. If -h is not the first option, then unpredictable, possibly undesirable results may occur. If called with no arguments with output of TrioTests directed into stdin, this script will return 7 columns of output. The first is the value of lambda. The second is the probability (approximated by Monte Carlo) that a randomly drawn trio formed under the true trio relationship has a likelihood ratio exceeding lambda. The third is the number of genotypes simulated under the true relationship that exceeded lambda, the fourthth is the standard error of the probability in column 2. The fifth is the probability (approximated by Monte Carlo) that a genotype simulated under the alternative hypothesis (the one in the numerator of the likelihood ratio) exceeds lambda. The sixth is the number of genotypes simulated under the alternative hypothesis exceeding that value of lambda. The seventh is the standard error of the estimate in column five. Note that column 2 holds the value of alpha [see equation 6 in Anderson and Garza (2006)] corresponding to the lambda in column 1, and column 4 holds the value of 1 - Beta [see equation 5 in Anderson and Garza (2006)]. Otherwise, LambdaDist can be called with a list of numbers. For each number X between 0 and 1, the script will return the line (formatted as above) that corresponds to the largest value of lambda for which either alpha or 1-beta still exceeds X. If X>1, it will return the line corresponding to the largest value of lambda less than X. In either case, the value of X is prepended to the line, so all the columns are shifted over by one. __USAGE__ }