FPG stands for “Full Parental Genotyping”. That is the name we have given to efforts underway to develop a panel of SNPs to be used for very large scale parentage inference in hatchery populations of salmon. Currently this software suite consists of two C programs and a number of scripts. The theory underlying these software programs is described in:

- Anderson and Garza (2006) The power of single nucleotide polymorphisms for parentage inference. *Genetics* XX:xx-xx.

The above paper will hereafter be referred to as “A&G2006”.

The two C programs are:

`snpSumPed` — a program that computes the probability of configurations of SNP genotypes on trios of individuals that are related in ways specified by the user using a combination of pedigree links and pairwise relationships.

`TrioTests` — a program that employs an importance sampling algorithm to compute false positive and false negative rates for parentage inference. This program requires input from `snpSumPed` in order to work.

The scripts simplify the use of the above two programs. They are described below alongside the descriptions of `snpSumPed` and `TrioTests`. This document provides a simple tutorial with some examples of how to use the programs.

Note that the two programs above are concerned with the calculation of power for parentage inference. As yet, there is not a released program in the suite for conducting the actual inference, i.e., for computing the likelihood of actual, observed trios of individuals.

`snpSumPed` computes the probability of genotypic configurations on trios of individuals. It has not been written in a way that is terribly efficient, so if the individuals in the trio are related by a very large pedigree—even if it does not have loops—it will take a very long time. It is, however, suitable for small pedigrees of the sort one will typically be interested in for parentage inference.

The probability of different genotypic configurations at a single diallelic locus among three individuals depends on the frequency of the two alleles, the genotyping error rate, and the relationship between the individuals. All this information is provided to `snpSumPed` on the command line, using options that are either a single dash followed by a single letter, or a double dash followed by a multi-letter word. Sometimes there are two ways of invoking an option. A short list of the options available in `snpSumPed` can be obtained by calling `snpSumPed --help`, which yields the following output:

```
SNP_SumPed -- probability of pedigree subsets using binary loci

     --help                   short listing of options
     --help-full              long listing of options
     --help-nroff             long listing of options in nroff man-page format
     --version                prints program version information
     --version-history        prints history of different versions
     --command-file [S]       inserts contents of file [S] into the command line

   ****  Command Line Switches for inputting variables  ****

-p , --minor-freq [R]        minor allele freq of the locus
-e , --gtyp-error [R1] [R2] ...
                             per-gene-copy genotyping error rate for all loci
-f , --fixed-geno [J-kid] [J-pa] [J-ma]
                             Indices of youth/father/mother
     --ped [J-kid] [J-pa] [J-ma]
                             pedigree entry. Individual names are positive integers.
     --kappa [J1] [J2] [R0] [R1] [R2]
                             pairwise relatedness between J1 and J2
     --pair-k [J1] [J2] [R2] [R1] [R0]
                             OLD STYLE pairwise relatedness between J1 and J2
     --verbose               spout out ridiculous amounts of output
```

The first 6 options shown are common to all programs using Eric Anderson's options processing package. The remaining ones give `snpSumPed` the information it needs to compute probabilities of genotypic configurations. For a more detailed description of all these options, you can enter `snpSumPed --help-full` on the command line. This full help should provide you with everything you need to know to run the program. To get you started, though, I give a few simple examples below and discuss the output.

Let's imagine that we want to compute the probability of genotype configurations at a locus with a minor allele frequency of .23, in a trio that is totally unrelated, with a genotyping error rate of .005 *per gene copy*. The first part of our command line would look like:

```
/eriq/--% snpSumPed -p .23 -e .005
```

Note that the part that says `/eriq/--%` is my command-line prompt. You would not type that part! The same command line could have been written like:

```
/eriq/--% snpSumPed --minor-freq .23 --gtyp-error .005
```

If you try to execute either of the above command lines, you will get an error message that says:

```
Error! Lacking required option -f or --fixed-geno
Error! Lacking required option --ped
Errors encountered processing command line.  Use snpSumPed --help or
snpSumPed --help-full for information on available options
```

This is because the program needs to know who the individuals in the trio are, and how they are related. Individuals are specified with positive integer identifiers (1, 2, 3, 4, …). The relationships between them are specified using the `--ped` option and also the `--kappa` option (which we will get to later). The integer identifier 0 is reserved for an unknown parent, i.e., if an individual has parents (specified with `--ped`) who are 0's, then that individual is considered a founder in the pedigree. It is also necessary to tell the program who is the putative youth, the putative mother, and the putative father in the trio. This is specified using the `-f` or `--fixed-geno` option.

OK, let's specify a simple trio in which 1 is the putative kid, 2 is the putative father, and 3 is the putative mother, but the three of them are totally unrelated. The arguments to the `--fixed-geno` option will then be `1 2 3`, and the pedigree statements giving the relationships between the individuals are very easy—they are all founders! So the command looks like:

```
/eriq/--% snpSumPed --minor-freq .23 --gtyp-error .005 --fixed-geno 1 2 3 --ped 1 0 0 --ped 2 0 0 --ped 3 0 0
```

If you run that, the output should look like:

```
ALLELE_FREQ : 0.230000
GTYPING_ERROR_RATES : 0.005000
SUM_OVER_ALL_STATES_FIXED_AND_OTHERWISE : 1.000000
TRIO_TRUE_STATES :  0 0 0  0.000148
TRIO_TRUE_STATES :  0 0 1  0.000991
TRIO_TRUE_STATES :  0 0 2  0.001659
TRIO_TRUE_STATES :  0 1 0  0.000991
TRIO_TRUE_STATES :  0 1 1  0.006637
TRIO_TRUE_STATES :  0 1 2  0.011109
TRIO_TRUE_STATES :  0 2 0  0.001659
TRIO_TRUE_STATES :  0 2 1  0.011109
TRIO_TRUE_STATES :  0 2 2  0.018596
TRIO_TRUE_STATES :  1 0 0  0.000991
TRIO_TRUE_STATES :  1 0 1  0.006637
TRIO_TRUE_STATES :  1 0 2  0.011109
TRIO_TRUE_STATES :  1 1 0  0.006637
TRIO_TRUE_STATES :  1 1 1  0.044437
TRIO_TRUE_STATES :  1 1 2  0.074384
TRIO_TRUE_STATES :  1 2 0  0.011109
TRIO_TRUE_STATES :  1 2 1  0.074384
TRIO_TRUE_STATES :  1 2 2  0.124512
TRIO_TRUE_STATES :  2 0 0  0.001659
TRIO_TRUE_STATES :  2 0 1  0.011109
TRIO_TRUE_STATES :  2 0 2  0.018596
TRIO_TRUE_STATES :  2 1 0  0.011109
TRIO_TRUE_STATES :  2 1 1  0.074384
TRIO_TRUE_STATES :  2 1 2  0.124512
TRIO_TRUE_STATES :  2 2 0  0.018596
TRIO_TRUE_STATES :  2 2 1  0.124512
TRIO_TRUE_STATES :  2 2 2  0.208422
TRIO_OBS_STATES : 0.005000 :  0 0 0  0.000159
TRIO_OBS_STATES : 0.005000 :  0 0 1  0.001047
TRIO_OBS_STATES : 0.005000 :  0 0 2  0.001726
TRIO_OBS_STATES : 0.005000 :  0 1 0  0.001047
TRIO_OBS_STATES : 0.005000 :  0 1 1  0.006905
TRIO_OBS_STATES : 0.005000 :  0 1 2  0.011385
TRIO_OBS_STATES : 0.005000 :  0 2 0  0.001726
TRIO_OBS_STATES : 0.005000 :  0 2 1  0.011385
TRIO_OBS_STATES : 0.005000 :  0 2 2  0.018770
TRIO_OBS_STATES : 0.005000 :  1 0 0  0.001047
TRIO_OBS_STATES : 0.005000 :  1 0 1  0.006905
TRIO_OBS_STATES : 0.005000 :  1 0 2  0.011385
TRIO_OBS_STATES : 0.005000 :  1 1 0  0.006905
TRIO_OBS_STATES : 0.005000 :  1 1 1  0.045538
TRIO_OBS_STATES : 0.005000 :  1 1 2  0.075078
TRIO_OBS_STATES : 0.005000 :  1 2 0  0.011385
TRIO_OBS_STATES : 0.005000 :  1 2 1  0.075078
TRIO_OBS_STATES : 0.005000 :  1 2 2  0.123781
TRIO_OBS_STATES : 0.005000 :  2 0 0  0.001726
TRIO_OBS_STATES : 0.005000 :  2 0 1  0.011385
TRIO_OBS_STATES : 0.005000 :  2 0 2  0.018770
TRIO_OBS_STATES : 0.005000 :  2 1 0  0.011385
TRIO_OBS_STATES : 0.005000 :  2 1 1  0.075078
TRIO_OBS_STATES : 0.005000 :  2 1 2  0.123781
TRIO_OBS_STATES : 0.005000 :  2 2 0  0.018770
TRIO_OBS_STATES : 0.005000 :  2 2 1  0.123781
TRIO_OBS_STATES : 0.005000 :  2 2 2  0.204076
OUTPUT_FOR_ENSUING_COMMAND_LINE : p 0.230000 mu 0.005000 probs 0.0001587736028230 0.0010470733600865 0.0017262986445947 0.0010470733600865 0.0069051945783786 0.0113845204125267 0.0017262986445947 0.0113845204125267 0.0187695369843827 0.0010470733600865 0.0069051945783786 0.0113845204125267 0.0069051945783786 0.0455380816501068 0.0750781479375310 0.0113845204125267 0.0750781479375310 0.1237805391329341 0.0017262986445947 0.0113845204125267 0.0187695369843828 0.0113845204125267 0.0750781479375310 0.1237805391329341 0.0187695369843827 0.1237805391329341 0.2040756503581872
PROB_LOC_IS_INCOMPAT : 0.231853
```

The tags at the beginning of each line indicate what is on that line. For now, we will focus on the `TRIO_OBS_STATES` lines. Each has fields separated by colons. The second field shows the genotyping error rate, which is 0.005. The third field starts with three columns that show the genotypic configuration. For example, `0 2 2` says that the kid (individual 1) has a genotype with zero “1” alleles, the putative father has a genotype with two “1” alleles, and so does the mother. The fourth column of the third field is the probability of that genotypic configuration.
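These unrelated-trio numbers are easy to check by hand, since the three individuals' observed genotypes are independent: Hardy-Weinberg frequencies for the true genotypes, with each gene copy mis-read independently with probability .005. The following is a minimal Python sketch of that check (for illustration only; it is not part of the software suite):

```python
from itertools import product

p, mu = 0.23, 0.005   # freq of the "0" allele, and per-gene-copy error rate

# Hardy-Weinberg frequencies of true genotypes, indexed by the number of "1" alleles
hw = [p * p, 2 * p * (1 - p), (1 - p) * (1 - p)]

def obs_given_true(j, i):
    """P(observe j "1" alleles | true genotype has i), each copy flipping
    independently with probability mu."""
    if i == 1:
        # one "0" copy and one "1" copy; each may flip independently
        return {0: mu * (1 - mu),
                1: (1 - mu) ** 2 + mu ** 2,
                2: mu * (1 - mu)}[j]
    stay, flip, one = (1 - mu) ** 2, mu ** 2, 2 * mu * (1 - mu)
    # i == 0 or i == 2: both gene copies are the same allele
    return {0: stay, 1: one, 2: flip}[j] if i == 0 else {2: stay, 1: one, 0: flip}[j]

# marginal distribution of one individual's *observed* genotype
obs = [sum(hw[i] * obs_given_true(j, i) for i in range(3)) for j in range(3)]

# for a totally unrelated trio, the individuals are independent
trio_obs = {(k, f, m): obs[k] * obs[f] * obs[m]
            for k, f, m in product(range(3), repeat=3)}

print(trio_obs[(0, 0, 0)])   # matches the TRIO_OBS_STATES 0 0 0 entry above
print(trio_obs[(2, 2, 2)])   # matches the TRIO_OBS_STATES 2 2 2 entry above
```

The two printed values agree with the `0 0 0` and `2 2 2` entries of the `TRIO_OBS_STATES` output shown above.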

The most important line in the output is `OUTPUT_FOR_ENSUING_COMMAND_LINE`, which holds all the genotypic configuration probabilities in a single line that can be used as input for `TrioTests`. We'll come back to this when we start using `TrioTests`. In the meantime, we will do two more examples with `snpSumPed`.

Let's do the same thing now, but have the three individuals be related as a nuclear family. To do this, we merely change the `--ped` statement in which individual 1 is the child to reflect such a relationship, i.e., the command line becomes:

```
/eriq/--% snpSumPed --minor-freq .23 --gtyp-error .005 --fixed-geno 1 2 3 --ped 1 2 3 --ped 2 0 0 --ped 3 0 0
```

Simple enough. Note that it is tempting to drop the `--ped 2 0 0` and `--ped 3 0 0` parts of the command line. In this case that still works, but it does not work in all cases, so you should always include lines like `--ped X 0 0` to explicitly declare that X is a founder.

As the relationships get more complex, it becomes cumbersome to specify the pedigree connecting the three individuals on the command line. We can use the `--command-file` option to put the pedigree statements into a file and call that file from the command line. The directory `pedfiles` has the pedigree files for all the relationships used in A&G2006. Let's look at the file for the *H_Si* relationship, called `H_Si.ped`. Note that in all the files in `pedfiles` we follow the convention that 1 is the youth, 2 is the father, and 3 is the mother. *This is a convention you will want to get into the habit of following, because the program TrioTests, which can use output from snpSumPed, expects to get data ordered in just this way.*

```
& One putative parent is a half sib of the youth and the other is a
  full sib of the true parent that the youth does not share with the
  other putative parent &

--ped 1 4 5
--ped 2 4 6
--ped 3 0 0
--ped 4 0 0
--ped 5 0 0
--ped 6 0 0
--kappa 3 5 .25 .5 .25
```

The parts between successive “&” symbols in a command file are ignored, so you can put comments there. This file also shows the use of the `--kappa` option. Recall from the `--help` output that the syntax for the kappa option is:

```
--kappa [J1] [J2] [R0] [R1] [R2]
```

In the command file, the kappa option says that individuals 3 (the putative mother) and 5 (the true mother) are related via the identity coefficients (.25, .5, .25), which are the coefficients describing a full-sib relationship. These coefficients mean that with probability 25% the pair shares no genes IBD at a locus, with probability 50% they share one gene IBD at the locus, and with probability 25% they share two genes IBD. To run `snpSumPed` using this pedfile, we merely use the `--command-file` option and specify the path to the file `H_Si.ped`. Depending on which directory you are working in, the command might look like:

```
/distrib/--% snpSumPed --minor-freq .23 --gtyp-error .005 --fixed-geno 1 2 3 --command-file pedfiles/H_Si.ped
```
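Incidentally, the full-sib coefficients (.25, .5, .25) used in the `--kappa` statement above can be checked by brute force: each sib inherits one of two gene copies from each shared parent, and the pair is IBD for a parent's contribution exactly when both happen to inherit the same copy. A quick Python enumeration (illustration only, not part of the software):

```python
from itertools import product

# Each full sibling receives one of the two gene copies ('a' or 'b') from the
# shared father and one of the two copies ('c' or 'd') from the shared mother.
# Count, over all 16 equally likely outcomes, how many parental contributions
# are identical by descent between the two sibs.
counts = {0: 0, 1: 0, 2: 0}
for s1_pa, s1_ma, s2_pa, s2_ma in product('ab', 'cd', 'ab', 'cd'):
    ibd = (s1_pa == s2_pa) + (s1_ma == s2_ma)
    counts[ibd] += 1

total = sum(counts.values())
kappa = [counts[k] / total for k in (0, 1, 2)]
print(kappa)   # the (R0, R1, R2) arguments given to --kappa 3 5
```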

The program `TrioTests` simulates the genotypes of trios of individuals and computes the probability of those genotypes under various hypotheses about the relationship among the trio. It is used in conjunction with `snpSumPed` for approximating false positive and false negative rates given a certain set of loci. It is written with flexibility in mind, which means that it is necessary to give it a lot of information (most of it being output from `snpSumPed`) in order to make it run, *and* the output typically has to be processed in order to get a result out of it. To make the process of running `TrioTests` more straightforward, I supply some scripts in the `scripts` directory that help automate the process.

This tutorial will not attempt to cover all the options in the program, but will show the simple application of `TrioTests` and some of the scripts that come with it. More complete documentation is available using the `--help-full` option to `TrioTests`.

The most important option (and the only required one) to `TrioTests` is the `-l` or `--new-locus-group` option. This allows you to specify that you wish to include, in the simulated genotypes for each individual, a collection of loci that all have the same frequency of the 0 allele. The full documentation for this option is:

```
-l , --new-locus-group [Complex Format]
        Enter information about a group of loci that all have the same probs
        of the different trios.  Can be used up to 5000 times.  The syntax is
        --new-locus-group [J1] p [R1] mu [R2] probs [R3]-[R29] p [R30] mu
        [R31] probs [R32]-[R58] p [R59] mu [R60] probs [R61]-[R87].  J1 is
        the number of loci that you want to have with these properties.  p
        and mu are just string tags preceding the freqs R1 and R30 and
        mutation rates R2 and R31 respectively.  R3-R29 are the 27
        probabilities of genotypic states---cycling from 0 to 2 over youth
        then pa then ma---under the alternative hypothesis and R32-R58 are
        the same probabilities under the null hypothesis and R61-R87 are the
        actual probs for the trio under their true relationship.  The program
        will check to make sure that R1==R30==R59 and R2==R31==R60---i.e. it
        checks to make sure that the value of p and mu you think you are
        dealing with is the same for the three different sets of
        probabilities.  Typically the alternative hypothesis is that the trio
        consists of two parents and their child.  The null hypothesis is
        typically that the trio is unrelated.  I include them on the command
        line like this so we can look at the distribution of the test
        statistic under different null and alternative hypotheses.  Output of
        the program snpSumPed can be used to fill out these numbers.  Note
        that the order of the probabilities is such that the first one
        corresponds to youth=0; pa=0; ma=0.  The second one is youth=0; pa=0;
        ma=1.  The third is youth=0; pa=0; ma=2.  The fourth is youth=0;
        pa=1; ma=0; and so forth.  Note that the program snpSumPed outputs
        data in this format for use on the TrioTests command line.
```

This is a big mouthful, but all it is saying is that this option provides a place to use one of the output lines from `snpSumPed` on the command line of `TrioTests`. Recall the output line tagged with **OUTPUT_FOR_ENSUING_COMMAND_LINE** from the program `snpSumPed`—the part after the colon is exactly in the right format for the `-l` option to `TrioTests`, so long as you have followed the convention that individual 1 is the kid, 2 is the putative father, and 3 is the putative mother.

The other important thing the above piece of documentation tells you is that for each locus you need to specify three different sets of probabilities. The first is the set of probabilities of the 27 genotypic configurations under the *alternative hypothesis*; this is typically the hypothesis that the kid and the putative parents represent a true kid-mother-father trio. The second set is the probabilities of the 27 genotypic configurations under the *null hypothesis*; this is typically the hypothesis that the kid and the putative parents are all unrelated to one another. The last set of probabilities is for the *true relationship*, i.e., the actual relationship among the three individuals.
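The ordering of the 27 probabilities is easy to get wrong, so here is a tiny Python sketch (illustration only) of the correspondence between position in the list and the (youth, pa, ma) genotypes, with the mother's genotype cycling fastest:

```python
# enumerate the genotypic configurations in the order snpSumPed prints them:
# youth cycles slowest, then pa, with ma cycling fastest
order = [(kid, pa, ma) for kid in range(3) for pa in range(3) for ma in range(3)]

def index_of(kid, pa, ma):
    """Position of a given configuration in the 27-element probs list."""
    return 9 * kid + 3 * pa + ma

assert order[0] == (0, 0, 0)   # first prob:  youth=0 pa=0 ma=0
assert order[1] == (0, 0, 1)   # second prob: youth=0 pa=0 ma=1
assert order[3] == (0, 1, 0)   # fourth prob: youth=0 pa=1 ma=0
assert all(index_of(*cfg) == i for i, cfg in enumerate(order))
```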

In order to put all these sets of probabilities in the correct place on the command line, you could cut and paste them from the output of `snpSumPed`. That would, however, be very tedious, and bad for your wrists. A better way to do it is to use Unix and one of its more useful shells, like `bash`. In the following, I show some useful scripts and how to use them. It assumes a familiarity with Unix.

I provide the shell script `NewLocusGroup.sh` for forming the `-l` commands for `TrioTests`. It takes as arguments, in this order:

- the number of loci in the group
- the frequency of the 0 allele
- the genotyping error rate
- the path to a command file that holds the `--ped` and `--kappa` statements which define the true relationship between the members of the trio
- OPTIONAL: the path to a command file that holds the `--ped` and `--kappa` statements which define the null-hypothesis relationship between the members of the trio. If this is omitted, the null hypothesis is that of a totally unrelated trio.
- OPTIONAL: the path to a command file that holds the `--ped` and `--kappa` statements which define the alternative-hypothesis relationship between the members of the trio. If this is omitted, the alternative hypothesis is that of a true kid-mother-father trio. Obviously, since these are positional parameters, this optional argument cannot be given without also giving the preceding optional argument.

It returns a full `--new-locus-group` option command string.

The script looks like this:

```bash
#!/bin/bash

if [ $# -lt 4 ]; then
    echo "Syntax: NewLocusGroup.sh <numloci> <zero-allele-freq> <gtyp-error-rate-per-allele> <TrueRelatFile> [NullRelatFile] [AltRelatFile]"
    echo
    exit 1;
fi

L=$1
MU=$3
P=$2
TRUEFILE=$4
NULLFILE=$5
ALTFILE=$6

# if a NullFile was included, use it.  Otherwise it is just unrelated
if [ $# -lt 5 ]; then
    NULL=$(snpSumPed -p $P --ped 1 0 0 --ped 2 0 0 --ped 3 0 0 -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2);
else
    NULL=$(snpSumPed -p $P --command-file $NULLFILE -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2);
fi

# if an alt file was included, use it.  Otherwise it is just parental
if [ $# -lt 6 ]; then
    ALT=$(snpSumPed -p $P --ped 1 2 3 --ped 2 0 0 --ped 3 0 0 -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2);
else
    ALT=$(snpSumPed -p $P --command-file $ALTFILE -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2);
fi

TRUE=$(snpSumPed -p $P --command-file $TRUEFILE -f 1 2 3 -e $MU | grep ENSUING | cut -d: -f 2);

# in the end we just print out the command line to stdout
echo "--new-locus-group $L $ALT $NULL $TRUE";
```

Note that it assumes that `bash` is available on your system at `/bin/bash`. It also assumes that you have put `snpSumPed` somewhere in the search path of your system.

OK, we can now quickly formulate the arguments to the `--new-locus-group` (same as `-l`) option…now, what do we do with that? If we feed TrioTests just a `--new-locus-group` option, then it processes it and starts generating simulated values of Lambda and importance weights. But we probably have an idea of how we want to run these things, so let's look at setting values other than the defaults. The other important options I will cover in this tutorial are:

- `-R` or `--reps` sets the number of Monte Carlo samples to take. This is *N* in equation 6 of A&G2006.
- `-P` or `--prop-imp-samp` sets the proportion of `--reps` that will be done following the importance sampling procedure of equation 7 in A&G2006; the remainder will be done following equation 6. In reality, there is little reason not to set this to 1.0, but the default is .7.
- `--none-from-alt`: issuing this option makes TrioTests compute only the terms in the sum for determining the false positive rate given the true relationship. By default, however, it also simulates genotypes from the probabilities associated with the alternative hypothesis. These can be used to compute the false negative rates—i.e., if the alternative hypothesis is true, how often will it be rejected?
- `--import-meth`: this controls which method of importance sampling is used. It must be followed by a single word, either “use-parental” or “logistic-weighted”. The former gives you what we refer to as *I*_{1} in the paper, and the latter gives what we refer to as *I*_{2}. In the case of *I*_{2}, you may set the parameter *M* (see equation 9 in A&G2006) using the `--max-adj` option. The default importance sampling method is *I*_{1}. Note that it does not work well for some relationships; see A&G2006.

Here we do a simple example with 10 loci having frequency of the zero allele equal to .25, with the true relationship being totally unrelated. We will do only 5 reps, and 100% of them will use the default importance sampling method of *I*_{1}. In the following example this is done from the directory (here called `distrib`) that contains the `pedfiles` and `scripts` directories. The command looks like:

```
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 10 .25 .005 ./pedfiles/C_U_U.ped) -R 5 -P 1.0
```

There is a considerable amount of output, but we will only pay attention to the last few lines:

```
MC_IMP : 4.418367 T 1.205390e-02
MC_VANILLA : 4.601536 A 1.0
MC_IMP : 3.346053 T 3.522310e-02
MC_VANILLA : 3.175920 A 1.0
MC_IMP : 4.209573 T 1.485271e-02
MC_VANILLA : 4.889234 A 1.0
MC_IMP : -2.562268 T 1.296519e+01
MC_VANILLA : 3.921787 A 1.0
MC_IMP : 4.569243 T 1.036581e-02
MC_VANILLA : 3.909000 A 1.0
```

The output tagged by **MC_IMP** or **MC_VANILLA** is what we are most interested in. In each such line, there are three columns in the second field (after the colon). These columns are:

- the log of the ratio of the probability of the simulated trio genotypes under the alternative hypothesis to their probability under the null hypothesis, i.e., the log-likelihood ratio
- `T` or `A`, denoting whether the genotypes of the trio were simulated under the (T)rue relationship or the (A)lternative hypothesis
- the importance weight associated with the Monte Carlo realization

This is the output needed to compute the averages shown as sums in Equations 6 and 7 in A&G2006. The lines marked with an A had the genotypes simulated under the alternative hypothesis, so they are useful for computing the false negative rate. The others, marked with a T, had the genotypes simulated under the true relationship and are useful for computing false positive rates. You could probably do these simple calculations by cutting and pasting the output into some spreadsheet program like Excel, but that would be inefficient and inelegant. In the next section I show a simple `perl` script that will compute the sums from the output of `TrioTests`.

The output shown above contains the individual values of the importance weights (in column 3) *and*, in column 1, the observed value of the log of the likelihood ratio for each simulated genotype. TrioTests kicks these values out in this format so that you can easily plot their distribution if you so desire. However, often all you want to do is compute the average of the importance weights, each multiplied by one or by zero according to whether the value of the log likelihood ratio in the first column is greater than (for computing false positive rates) or less than (for computing false negative rates) some critical value of Lambda, <math tex>\Lambda_c</math>. So, you may first want to figure out what <math tex>\Lambda_c</math> should be. Remember that if you increase <math tex>\Lambda_c</math>, you decrease your false positive rate, but you also increase your false negative rate.
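To make the mechanics of that average concrete, here is a small Python sketch using the five `MC_IMP` lines from the toy run above (illustration only: with 5 reps the resulting estimate is statistically meaningless):

```python
# (log-likelihood ratio, importance weight) pairs taken from the MC_IMP lines
# of the toy TrioTests run above, all simulated under the true relationship
samples = [
    (4.418367, 1.205390e-02),
    (3.346053, 3.522310e-02),
    (4.209573, 1.485271e-02),
    (-2.562268, 1.296519e+01),
    (4.569243, 1.036581e-02),
]

def alpha_hat(samples, lambda_c):
    """Importance-sampling estimate of the false positive rate: the average of
    weight * indicator(log-likelihood ratio >= lambda_c) over all reps."""
    return sum(w for lam, w in samples if lam >= lambda_c) / len(samples)

print(alpha_hat(samples, 0.0))   # estimated P(Lambda >= 0) for this trio
```

Raising `lambda_c` drops more terms from the sum, which is the false-positive/false-negative trade-off described above.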

To help process the output from TrioTests in illuminating ways, I provide the script `LambdaDist.pl` in the `scripts` folder. It operates directly on TrioTests output, so you can either pipe the output directly into it, or you can store the output from `TrioTests` in a file and then redirect that into `LambdaDist.pl`. If you invoke `LambdaDist.pl` with the `-h` option, it gives you instructions for its use:

```
Usage LamdaDist [-h] [prob1] ... [probn] ... [lambda1] ... [lambdan]

Options
    -h    Displays help information and exits if -h is the first option.  If
          -h is not the first option, then unpredictable, possibly undesirable
          results may occur.

If called with no arguments with output of TrioTests directed into stdin,
this script will return 7 columns of output.  The first is the value of
lambda.  The second is the probability (approximated by Monte Carlo) that a
randomly drawn trio formed under the true trio relationship has a likelihood
ratio exceeding lambda.  The third is the number of genotypes simulated under
the true relationship that exceeded lambda, the fourth is the standard error
of the probability in column 2.  The fifth is the probability (approximated
by Monte Carlo) that a genotype simulated under the alternative hypothesis
(the one in the numerator of the likelihood ratio) exceeds lambda.  The sixth
is the number of genotypes simulated under the alternative hypothesis
exceeding that value of lambda.  The seventh is the standard error of the
estimate in column five.

Note that column 2 holds the value of alpha [see equation 6 in Anderson and
Garza (2006)] corresponding to the lambda in column 1, and column 5 holds the
value of 1 - Beta [see equation 5 in Anderson and Garza (2006)].

Otherwise, LambdaDist can be called with a list of numbers.  For each number
X between 0 and 1, the script will return the line (formatted as above) that
corresponds to the largest value of lambda for which either alpha or 1-beta
still exceeds X.  If X>1, it will return the line corresponding to the
largest value of lambda less than X.  In either case, the value of X is
prepended to the line, so all the columns are shifted over by one.
```

I give a complete listing of `LambdaDist.pl` in the appendix at the end. Following are some examples of its use, which also show how to use `TrioTests` and `snpSumPed` together.

Let's reproduce some of the results in Table 1 of A&G2006. We'll use 60 loci with zero-allele frequency of .2 and a per-allele genotyping error rate of .005. Since we are interested in <math tex>\beta=0.1</math>, we call `LambdaDist.pl` with .9 (= 1 - .1) as the argument. We'll also use 100,000 Monte Carlo samples.

For the truth being “totally unrelated trio” we have:

```
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_U_U.ped) -R 100000 -P 1.0 | ./scripts/LambdaDist.pl .9
```

which returns the line

```
9.00000e-01 13.68605 5.4937e-08 89968 5.1040e-10 9.0001e-01 90001 9.4865e-04
```

Note that columns 3 and 5 there are the same (give or take some random variation) as what appears in Table 1 of A&G2006. It is worth pointing out that the `SD` columns in that table give the standard error of the mean as a *percentage* of the mean—somehow that point was omitted from the text and caption of the table.
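As an aside, the last column (the standard error of the <math tex>1-\beta</math> estimate) can be sanity-checked. Those genotypes are simulated directly under the alternative hypothesis with importance weight 1, so the standard error of the proportion should be close to the usual binomial Monte Carlo formula. A quick Python check (this is my own observation, assuming *N* equals the `-R` value of 100,000, not documented program behavior):

```python
from math import sqrt

n_reps = 100_000              # the -R value used on the command line
one_minus_beta = 9.0001e-01   # column 6 of the output line above

# binomial standard error of a plain Monte Carlo proportion: sqrt(p(1-p)/N)
se = sqrt(one_minus_beta * (1 - one_minus_beta) / n_reps)
print(se)   # close to the 9.4865e-04 reported in the last column
```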

Suppose we wanted to know what the relationship between <math tex>\alpha</math> and <math tex>1-\beta</math> was for a few more values of <math tex>1-\beta</math>. We could do it like this:

```
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_U_U.ped) -R 100000 -P 1.0 | ./scripts/LambdaDist.pl .1 .2 .3 .4 .5 .6 .7 .8 .9
```

which returns the series of lines:

```
1.00000e-01 25.47267 2.4572e-13 10193 3.3637e-15 1.0001e-01 10001 9.4873e-04
2.00000e-01 23.47864 3.0575e-12 20272 3.0847e-14 2.0000e-01 20000 1.2649e-03
3.00000e-01 22.10978 1.6719e-11 30227 1.4293e-13 3.0000e-01 30000 1.4491e-03
4.00000e-01 20.82366 6.6123e-11 40312 5.0597e-13 4.0000e-01 40000 1.5492e-03
5.00000e-01 19.72696 2.2401e-10 50326 1.5974e-12 5.0000e-01 50000 1.5811e-03
6.00000e-01 18.66416 7.2063e-10 60236 5.0262e-12 6.0001e-01 60001 1.5492e-03
7.00000e-01 17.37130 2.4470e-09 70352 1.7428e-11 7.0001e-01 70001 1.4491e-03
8.00000e-01 15.77251 9.1759e-09 80205 7.0631e-11 8.0001e-01 80001 1.2649e-03
9.00000e-01 13.69318 5.3848e-08 90184 5.0442e-10 9.0001e-01 90001 9.4865e-04
```

Now, what if the truth is “the putative mother is the true mother, and the putative father is a double first cousin of the true father”? That corresponds to the relationship described in `C_Se_DFC.ped`, which looks like:

```
& C-type with one putative parent self and the other double first cousin &

--ped 1 4 5
--ped 2 0 0
--ped 3 0 0
--ped 4 0 0
--ped 5 0 0
--kappa 2 4 0 0 1
--kappa 3 5 .5625 .375 .0625
```

From Table 1 in A&G2006, it is clear that for this relationship it is best to use *I*_{2}. We can do that using the `--import-meth` option and setting `--max-adj` to .85 to match the simulations done in A&G2006. Our command line then looks like:

```
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_Se_DFC.ped) -R 100000 -P 1.0 --import-meth logistic-weighted --max-adj .85 | ./scripts/LambdaDist.pl .9
```

which yields the line

```
9.00000e-01 13.69975 8.5335e-03 87217 7.8853e-05 9.0001e-01 90001 9.4865e-04
```

which, again, accords very well with Table 1 in A&G2006 as it should.

Now, let's reproduce the result for the *P*_{Si} case, in which we know the true mother and the putative father is a full sibling of the true father. Thus the true relationship is one in which the mother is correct but the father is a full sib, hence `C_Se_Si.ped`. However, by equation 4, when doing paternity inference the null hypothesis is the relationship specified in `C_Se_U.ped`. We can specify that via argument 5 (the first OPTIONAL argument) to `NewLocusGroup.sh`. Also, we note from Table 4 that this is an inference problem for which *I*_{2} outperforms *I*_{1}, so we will use *I*_{2} and set *M* to .85. The command line we wish to use is:

```
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_Se_Si.ped ./pedfiles/C_Se_U.ped) \
    -R 100000 -P 1.0 --import-meth logistic-weighted --max-adj .85 | ./scripts/LambdaDist.pl .9
```

which produces the result:

```
9.00000e-01 7.17151 3.6583e-02 86924 2.1988e-04 9.0001e-01 90001 9.4865e-04
```

When we compare this to the result in Table 1 of A&G2006, we find that…hmm…it doesn't look quite right. Table 1 shows a false positive rate of `1.08e-04`, while in the above we find a false positive rate more than two orders of magnitude higher: `3.6583e-02`. That is odd. Let's check a series of values for <math tex>1-\beta</math>:

```
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_Se_Si.ped ./pedfiles/C_Se_U.ped) \
    -R 100000 -P 1.0 --import-meth logistic-weighted --max-adj .85 | \
    ./scripts/LambdaDist.pl .1 .2 .3 .4 .5 .6 .7 .8 .9
```

yields in the first portion of its output:

```
1.00000e-01 17.02699 1.1193e-04 13283 1.2003e-06 1.0001e-01 10001 9.4873e-04
2.00000e-01 15.36009 4.1060e-04 23857 3.3012e-06 2.0000e-01 20000 1.2649e-03
3.00000e-01 14.19333 9.6593e-04 33574 6.6161e-06 3.0000e-01 30000 1.4491e-03
4.00000e-01 13.16825 1.8589e-03 42484 1.1455e-05 4.0000e-01 40000 1.5492e-03
5.00000e-01 12.21484 3.3735e-03 51539 1.9214e-05 5.0000e-01 50000 1.5811e-03
6.00000e-01 11.26640 5.7247e-03 60038 3.1039e-05 6.0001e-01 60001 1.5492e-03
7.00000e-01 10.20174 9.9320e-03 68880 5.3142e-05 7.0001e-01 70001 1.4491e-03
8.00000e-01 8.93196 1.7736e-02 77758 9.6346e-05 8.0001e-01 80001 1.2649e-03
9.00000e-01 7.12819 3.6625e-02 87198 2.1974e-04 9.0001e-01 90001 9.4865e-04
```

and what we see is…**Holy Smokes!!** For the paternity inference cases, Table 1 in A&G2006 reports false positive rates that correspond to false negative rates of .90 instead of .1. Well, there is a good lesson here in the importance of triple checking results. The paternity scenarios were done separately from the rest, before I had written a handy `perl` script to make the process as simple as it is now. (Previously I accomplished the same with a series of `awk`, `sed`, and `bash` scripts that I cobbled together—obviously incorrectly in the case of paternity inference—to get the same results.)

Well, perhaps the slightest of silver linings may be found in the fact that, since I need to redo the simulations to correct the last five lines of Table 1 in an erratum sent to *Genetics*, we can really see how to string all these programs and files together to make some calculations. That will be the topic of the next section.

I must redo the simulations to obtain correct values in Table 1 for *P*_{U}, *P*_{DFC}, *P*_{Si}, *P*_{F}, and *P*_{H}. These paternity scenarios correspond to trio relationships described in the files `C_Se_U.ped`, `C_Se_DFC.ped`, `C_Se_Si.ped`, `B_Se.ped`, and `H_Se.ped`, respectively. In all cases, the null hypothesis is that the trio has the relationship specified in the file `C_Se_U.ped`.

For each true relationship, we must first perform 100 replicate simulations with *N* = # of Monte Carlo Reps = 100,000, for each of the two different importance sampling schemes `use-parental` and `logistic-weighted`. We want to extract from each simulation the line of output corresponding to 1-Beta = .90.

We can do this sort of repetitious task easily using the scripting capabilities of the `bash` shell in Unix to tie together all the scripts that have been described in this tutorial up to this point. We'll do this so that each line of output has a column to say what the true relationship is and what the importance sampling method was. Then we will be able to summarize the results easily using the `R` statistical language.

There is no reason to simulate values from the alternative hypothesis every time. In re-doing the paternity inference simulations, we note that the false negative rates depend solely on the alternative hypothesis which, looking at Equation 4 in A&G2006, we see corresponds to `C_Se_Se.ped`, i.e., a true parental trio. We can use a trick with our scripts to generate output from TrioTests that corresponds only to this alternative hypothesis: we simulate with the true relationship being `C_Se_Se.ped`, and then we change all the `T`'s into `A`'s in the output like so:

```
/distrib/--% TrioTests $(./scripts/NewLocusGroup.sh 60 .20 .005 ./pedfiles/C_Se_Se.ped ./pedfiles/C_Se_U.ped) \
    -R 500000 -P 1.0 | \
    awk '/^MC_IMP/ || /^MC_VANILLA/ {print}' | \
    sed 's/T/A/g;'
```

The output looks like:

```
MC_IMP : 8.928114 A 1.000000e+00
MC_VANILLA : 11.080037 A 1.0
MC_IMP : 11.339063 A 1.000000e+00
MC_VANILLA : 10.420669 A 1.0
MC_IMP : 14.771421 A 1.000000e+00
MC_VANILLA : 14.547935 A 1.0
MC_IMP : 17.978395 A 1.000000e+00
MC_VANILLA : 6.443150 A 1.0
MC_IMP : 13.711394 A 1.000000e+00
MC_VANILLA : 6.082480 A 1.0
[...]
```

and consists of 1 million realizations. Then we use `LambdaDist.pl` to get the critical value of Lambda associated with 1-beta = .9:

```
/distrib/--% LambdaDist.pl .9 < PatFromAlt.txt
```

which yields

```
9.00000e-01 7.15897 0.0000e+00 0 9.0000e-01 900000 3.0000e-04
```

which tells us that the critical value of Lambda that we are interested in is 7.15897. So, it is for this value that we will compute corresponding alphas for all the different relationships.

Note that from the perspective of computing variability in the estimates, it is important to do it this way—otherwise the variability in the value of Lambda used gets incorporated into the estimated variability of alpha.
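A toy Monte Carlo experiment illustrates the point. This is purely illustrative and not part of the FPG suite; the distributions and sample sizes are made up. When the critical lambda is re-estimated from each replicate's own alternative-hypothesis sample, the alpha estimates vary considerably more across replicates than when lambda is held fixed.

```python
import random
import statistics

random.seed(1)

def alpha_hat(fixed_lambda=None):
    # pretend log likelihood ratios: null trios ~ N(0,1), and trios from
    # the alternative (true parental) hypothesis ~ N(3,1)
    null = [random.gauss(0.0, 1.0) for _ in range(2000)]
    alt = [random.gauss(3.0, 1.0) for _ in range(200)]
    if fixed_lambda is None:
        # re-estimate the critical value from this replicate's own
        # alternative sample: the 10th percentile, i.e. 1-beta = .9
        lam = sorted(alt)[int(0.1 * len(alt))]
    else:
        lam = fixed_lambda
    # alpha estimate: fraction of null realizations exceeding lambda
    return sum(x > lam for x in null) / len(null)

fixed_lam = 3.0 - 1.2816  # the true 10th percentile of N(3,1)
sd_fixed = statistics.stdev(alpha_hat(fixed_lam) for _ in range(100))
sd_reest = statistics.stdev(alpha_hat() for _ in range(100))
print(sd_fixed < sd_reest)  # expect the fixed-lambda alphas to vary less
```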

The first thing that we will want to do is make sure that all the scripts and programs mentioned above are in our shell's search $PATH. If you don't know what that means, pick up a good O'Reilly book about Unix and have a quick read through it.

Then our script, which is called `RedoPaternitySims.sh` and is included in the `scripts` directory, looks like:

```
#!/bin/bash

# here is the location of the .ped files on my system
PEDDIR="/Users/eriq/Documents/eca_code/FPG/distrib/pedfiles"

# here are the relationship files we want to use
RELATS="C_Se_U.ped C_Se_DFC.ped C_Se_Si.ped B_Se.ped H_Se.ped";

# here are the --imp-method arguments we'll use
IMPS="use-parental logistic-weighted";

for relat in $RELATS; do
    for imp in $IMPS; do
        MAXA=" ";
        if [ $imp = "logistic-weighted" ]; then
            MAXA="--max-adj .86 ";  # note that .85 is too low for these runs.
                                    # so I used .86.  will note that in the
                                    # erratum.
        fi
        # do 100 reps of this
        for((rep=1;rep<=100;rep++)); do
            # here we throw out the columns telling us which relationship
            # and imp method we are using:
            echo -n "$relat $imp $rep ";
            # here, we actually run TrioTests
            TrioTests $(NewLocusGroup.sh 60 .20 .005 $PEDDIR/$relat $PEDDIR/C_Se_U.ped ) \
                -R 100000 --none-from-alt -P 1.0 --import-meth $imp $MAXA | LambdaDist.pl 7.15897
            # note the use of a critical value of lambda as an argument to
            # LambdaDist.pl
        done
    done
done
```

Note that we have used the `--none-from-alt` option. This is because there is no need to simulate from the alternative hypothesis, because we already know that our critical value of lambda is going to be 7.15897. This will speed things up by a factor of two or more.

I am going to make a new directory for these simulations in:

`/Users/eriq/Documents/work/prj/FPG/PaternityRedo/`

Launching them on the G5 at work looks like:

```
/PaternityRedo/--% pwd
/Users/eriq/Documents/work/prj/FPG/PaternityRedo
/PaternityRedo/--% date
Wed May 10 15:18:34 PDT 2006
/PaternityRedo/--% RedoPaternitySims.sh > PaternityRedoOUTPUT.txt &
[1] 12468
```

It looks like it is going to take about 2.5 hours. Interestingly, the `perl` script part of the process appears to take many times longer than the `TrioTests` part of the process. Not surprising—compiled code tends to have a speed advantage. I will consider that in future rewrites.

The output of the simulations looks like:

```
C_Se_U.ped use-parental 1 7.15897e+00 7.15894 4.5105e-05 89974 3.6694e-07 0.0000e+00 0
C_Se_U.ped use-parental 2 7.15897e+00 7.15895 4.3684e-05 89932 3.6082e-07 0.0000e+00 0
C_Se_U.ped use-parental 3 7.15897e+00 7.15892 4.5095e-05 89871 3.6617e-07 0.0000e+00 0
C_Se_U.ped use-parental 4 7.15897e+00 7.15869 4.4701e-05 89955 3.6405e-07 0.0000e+00 0
C_Se_U.ped use-parental 5 7.15897e+00 7.15891 4.3989e-05 89992 3.5959e-07 0.0000e+00 0
C_Se_U.ped use-parental 6 7.15897e+00 7.15879 4.4367e-05 89974 3.6460e-07 0.0000e+00 0
C_Se_U.ped use-parental 7 7.15897e+00 7.15881 4.3889e-05 89860 3.5894e-07 0.0000e+00 0
C_Se_U.ped use-parental 8 7.15897e+00 7.15881 4.4411e-05 90154 3.6534e-07 0.0000e+00 0
C_Se_U.ped use-parental 9 7.15897e+00 7.15877 4.3671e-05 90068 3.5806e-07 0.0000e+00 0
C_Se_U.ped use-parental 10 7.15897e+00 7.15895 4.4981e-05 90145 3.6706e-07 0.0000e+00 0
```

Note that since we had no `A`-type realizations, the last two columns are meaningless. Not only that, but every once in a while `LambdaDist.pl` picked out another line to print. Those spots look like the third line below:

```
B_Se.ped logistic-weighted 8 7.15897e+00 7.15870 6.7412e-02 90554 4.0441e-04 0.0000e+00 0
B_Se.ped logistic-weighted 9 7.15897e+00 7.15890 6.8184e-02 90627 4.0965e-04 0.0000e+00 0
7.15897e+00 -15.57053 1.1820e+01 100000 1.1060e+01 0.0000e+00 0
B_Se.ped logistic-weighted 10 7.15897e+00 7.15870 6.8046e-02 90695 4.0919e-04 0.0000e+00 0
B_Se.ped logistic-weighted 11 7.15897e+00 7.15881 6.8491e-02 90639 4.1302e-04 0.0000e+00 0
B_Se.ped logistic-weighted 12 7.15897e+00 7.15880 6.8113e-02 90367 4.0838e-04 0.0000e+00 0
```

It appears this comes from some unstable importance sampling elements at very low lambdas (a region of the distribution that is not of concern to us). So, we apply an `awk` script to remove those lines, to remove the last two columns of the lines we keep, and to put a header line in there to name the columns for us:

```
/PaternityRedo/--% awk ' \
    BEGIN {print "relat imp rep OmB Lambda alpha n.obs.a sd.a"} \
    NF==10 {for(i=1;i<=8;i++) printf("%s ",$i); printf("\n");} \
' PaternityRedoOUTPUT.txt > PaternityRedo4R.txt
```

The output is called `PaternityRedo4R.txt`, which can be used in the R script below.
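If `awk` is unfamiliar, the same filtering step can be sketched in Python (a hypothetical equivalent; the function name and the sample rows are made up for illustration): keep only the well-formed 10-field rows, trim each to its first 8 columns, and prepend a header.

```python
def keep_clean_rows(lines):
    """Mimic the awk filter: keep only lines with exactly 10 fields
    (the stray low-lambda lines have fewer), keep just the first 8
    columns of each, and prepend a header naming the columns."""
    out = ["relat imp rep OmB Lambda alpha n.obs.a sd.a"]
    for line in lines:
        fields = line.split()
        if len(fields) == 10:
            out.append(" ".join(fields[:8]))
    return out

rows = [
    "C_Se_U.ped use-parental 1 7.15897e+00 7.15894 4.5105e-05 89974 3.6694e-07 0.0000e+00 0",
    "7.15897e+00 -15.57053 1.1820e+01 100000 1.1060e+01 0.0000e+00 0",  # stray: only 7 fields
]
for r in keep_clean_rows(rows):
    print(r)
```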

In the `scripts` directory is a script called `SummarizeSims.R` for the `R` statistical language that reads in `PaternityRedo4R.txt` and summarizes these simulations for us. The script looks like:

```
# read the file in.
x <- read.table("/Users/eriq/Documents/work/prj/FPG/PaternityRedo/PaternityRedo4R.txt", header=T);

# make a list of the relationships in the order we want them in:
Relats <- c("C_Se_U.ped","C_Se_DFC.ped","C_Se_Si.ped","B_Se.ped","H_Se.ped");

# then make a list of their LaTeX typographical equivalents
Names <- c('$P_\\mathrm{U}$',"$P_\\mathrm{DFC}$","$P_\\mathrm{Si}$","$P_\\mathrm{F}$","$P_\\mathrm{H}$");

# imp-methods in the order we want them
Imps <- c("use-parental", "logistic-weighted");

# the following is not very elegant "R-style" but it gets
# the job done...
c <- 0;
for (r in Relats) {
  c <- c+1;
  outstr <- Names[c];
  for (i in Imps) {
    pick <- x$relat==r & x$imp==i;
    N.obs <- sum(pick);
    mean.a <- mean(x$alpha[pick]);
    SD <- sd(x$alpha[pick]) / mean.a;       # compute it first
    SD.MAX <- max(x$sd.a[pick]) / mean.a;
    SD.MIN <- min(x$sd.a[pick]) / mean.a;
    # make a vector of upper CI limits
    ci.hi <- x$alpha[pick] + qnorm(.95) * x$sd.a[pick];
    ci.lo <- x$alpha[pick] - qnorm(.95) * x$sd.a[pick];
    # then count how many times mean.a is outside of
    # those CI limits;
    ci.out <- sum( ci.hi < mean.a | ci.lo > mean.a );
    # here we add those values to the output string
    outstr <- sprintf("%s & %.2e & %.2f & %.2f & %.2f & %d",outstr,mean.a,SD*100,SD.MIN*100,SD.MAX*100,ci.out);
  }
  outstr <- sprintf("%s\n",outstr);
  cat(outstr);
}
```

The script produces output that looks like:

```
$P_\mathrm{U}$ & 4.44e-05 & 0.86 & 0.80 & 0.83 & 12 & 5.15e-05 & 210.00 & 3.41 & 1813.93 & 58
$P_\mathrm{DFC}$ & 2.07e-03 & 0.89 & 0.81 & 0.88 & 17 & 2.06e-03 & 2.08 & 1.66 & 4.86 & 12
$P_\mathrm{Si}$ & 3.67e-02 & 0.92 & 0.83 & 1.02 & 10 & 3.67e-02 & 0.58 & 0.58 & 0.61 & 9
$P_\mathrm{F}$ & 6.44e-02 & 173.34 & 3.51 & 1471.54 & 60 & 6.80e-02 & 0.57 & 0.59 & 0.61 & 7
$P_\mathrm{H}$ & 3.45e-01 & 42.76 & 2.73 & 300.29 & 38 & 3.83e-01 & 0.28 & 0.31 & 0.31 & 5
```
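The statistical heart of `SummarizeSims.R` (the count of replicates whose 90% confidence interval excludes the overall mean) can be sketched in Python. This is a hypothetical mock-up with made-up replicate values; `summarize` is not part of the suite, and 1.6449 is the usual approximation to R's `qnorm(.95)`.

```python
import statistics

Z90 = 1.6449  # z for a two-sided 90% interval, i.e. qnorm(.95) in R

def summarize(alphas, sds):
    """Per-relationship summary in the spirit of SummarizeSims.R:
    mean alpha, relative SD across replicates, and the number of
    replicates whose 90% CI excludes the overall mean."""
    mean_a = statistics.mean(alphas)
    rel_sd = statistics.stdev(alphas) / mean_a
    # a replicate "misses" if the overall mean falls outside its CI
    ci_out = sum(1 for a, s in zip(alphas, sds)
                 if mean_a < a - Z90 * s or mean_a > a + Z90 * s)
    return mean_a, rel_sd, ci_out

# toy replicates: three estimates with wide CIs that cover the mean,
# and one tight outlier whose CI excludes it
alphas = [1.0e-4, 1.1e-4, 0.9e-4, 2.0e-4]
sds = [3e-5, 3e-5, 3e-5, 1e-5]
print(summarize(alphas, sds))
```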

That is almost ready to go. To match the format in Table 1 we also need to have a final column for the speed-up factor (ISI, as it was called). So, we save the above text in the file `PrelimTable.txt` and then operate on it with `awk` to compute the ISI.

```
/PaternityRedo/--% awk ' \
    BEGIN {FS="&"} \
    {a=$2; X=$3; ISI1=(1-a) / ( 10 * a * X*X); \
     a=$7; X=$8; ISI2=(1-a) / ( 10 * a * X*X); \
     if(ISI1>ISI2) ISI=ISI1; else ISI=ISI2; \
     printf("%s & %.2e \\\\\n",$0,ISI); \
    } \
' PrelimTable.txt
```

which produces the output:

```
$P_\mathrm{U}$ & 4.44e-05 & 0.86 & 0.80 & 0.83 & 12 & 5.15e-05 & 210.00 & 3.41 & 1813.93 & 58 & 3.05e+03 \\
$P_\mathrm{DFC}$ & 2.07e-03 & 0.89 & 0.81 & 0.88 & 17 & 2.06e-03 & 2.08 & 1.66 & 4.86 & 12 & 6.09e+01 \\
$P_\mathrm{Si}$ & 3.67e-02 & 0.92 & 0.83 & 1.02 & 10 & 3.67e-02 & 0.58 & 0.58 & 0.61 & 9 & 7.80e+00 \\
$P_\mathrm{F}$ & 6.44e-02 & 173.34 & 3.51 & 1471.54 & 60 & 6.80e-02 & 0.57 & 0.59 & 0.61 & 7 & 4.22e+00 \\
$P_\mathrm{H}$ & 3.45e-01 & 42.76 & 2.73 & 300.29 & 38 & 3.83e-01 & 0.28 & 0.31 & 0.31 & 5 & 2.05e+00 \\
```

which is exactly the output we need to replace those last 5 lines of the table using LaTeX, after changing the appropriate columns to italic font.
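As a check on the `awk` arithmetic, here is the same ISI computation in Python (a hypothetical helper, not part of the suite). The factor of 10 in the `awk` formula appears to come from expressing the relative standard deviation X as a percentage while using *N* = 100,000 Monte Carlo reps: vanilla Monte Carlo needs roughly (1-α)/(α x²) realizations to reach relative standard deviation x = X/100, and dividing by 100,000 gives (1-α)/(10 α X²).

```python
def isi(alpha, rel_sd_pct, n_reps=100_000):
    """Speed-up over vanilla Monte Carlo: the number of vanilla
    realizations needed to match the sampler's relative standard
    deviation, divided by the reps actually used.  For n_reps=100,000
    and X in percent this reduces to the awk script's (1-a)/(10*a*X*X)."""
    x = rel_sd_pct / 100.0
    return (1.0 - alpha) / (alpha * x * x * n_reps)

# first row of the table above: use-parental gives the larger ISI
isi_up = isi(4.44e-05, 0.86)    # use-parental
isi_lw = isi(5.15e-05, 210.00)  # logistic-weighted
print(max(isi_up, isi_lw))      # about 3.05e+03, matching the awk output
```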

So, it is clear that there is less power for paternity inference than was reported in Table 1 in A&G2006.

Here is the listing of the `perl` script `LambdaDist.pl`:

```
#!/usr/bin/perl
use strict;

if(defined(@ARGV) && $ARGV[0] eq "-h") {
    &Usage();
    exit 1;
}

my @vals = @ARGV;
my %valCheck;

# declare lists to store lots of values
my @AllOfEm;
my %Num;

# now, put those values in a single list of 3-vectors
foreach (<STDIN>) {
    if(m/^MC_IMP/ || m/^MC_VANILLA/) {
        my ($lam,$type,$imp) = (split)[2..4];
        push(@AllOfEm, [$lam,$imp,$type]);
        $Num{$type}+=1.0;
    }
}

my $prev_A = sprintf("%.4e %9d",0,0);
my $prev_T = sprintf("%.4e %9d",0,0);
my ($curT,$curA);
my ($sum_A,$sum_T,$ssA,$ssT);   # to compute running mean and sd.
my $cnt_T=0;
my $cnt_A=0;
my $printStr;

# then get the alphas and 1-Betas from the T entries by sorting
# in reverse by lambda and summing as appropriate
foreach (sort {@{$b}[0]<=>@{$a}[0]} @AllOfEm) {
    if(${$_}[2] eq "T") {
        $sum_T += @{$_}[1]/$Num{"T"};
        $cnt_T++;
        $ssT += @{$_}[1]**2;
        my $n = $Num{"T"};
        my $var = $ssT/($n-1.0) - ( ($n/($n-1.0)) * ( $sum_T**2) );
        my $sd = sqrt($var/$n);
        $prev_T = sprintf("%.4e %9d %.4e",$sum_T,$cnt_T,$sd);
    }
    elsif(${$_}[2] eq "A") {
        $sum_A += @{$_}[1]/$Num{"A"};
        $cnt_A++;
        $ssA += (@{$_}[1])**2;
        my $n = $Num{"A"};
        my $var = $ssA/($n-1.0) - ( ($n/($n-1.0)) * ( $sum_A**2) );
        my $sd = sqrt($var/$n);
        $prev_A = sprintf("%.4e %9d %.4e",$sum_A,$cnt_A,$sd);
    }
    if( !defined(@vals)) {
        printf("%-11.5f %s %s\n",@{$_}[0],$prev_T,$prev_A);
    }
    else {
        # otherwise we go through the list of numbers and take the
        # first value of either that exceeds it
        foreach my $val (@vals) {
            if($sum_A>=$val && !defined($valCheck{"A"}{$val}) ) {
                printf("%.5e %-11.5f %s %s\n",$val, @{$_}[0], $prev_T,$prev_A);
                $valCheck{"A"}{$val} = 1;
            }
            if($sum_T>=$val && !defined($valCheck{"T"}{$val}) ) {
                printf("%.5e %-11.5f %s %s\n",$val, @{$_}[0], $prev_T,$prev_A);
                $valCheck{"T"}{$val} = 1;
            }
            if($val>1 && @{$_}[0]<$val && !defined($valCheck{"Lambda"}{$val}) ) {
                printf("%.5e %-11.5f %s %s\n",$val, @{$_}[0], $prev_T,$prev_A);
                $valCheck{"Lambda"}{$val} = 1;
            }
        }
    }
}

sub Usage {
    print <<__USAGE__;
Usage  LambdaDist [-h] [prob1] ... [probn] ... [lambda1] ... [lambdan]

Options
    -h    Displays help information and exits if -h is the first option.
          If -h is not the first option, then unpredictable, possibly
          undesirable results may occur.

If called with no arguments with output of TrioTests directed into stdin,
this script will return 7 columns of output.  The first is the value of
lambda.  The second is the probability (approximated by Monte Carlo) that
a randomly drawn trio formed under the true trio relationship has a
likelihood ratio exceeding lambda.  The third is the number of genotypes
simulated under the true relationship that exceeded lambda; the fourth is
the standard error of the probability in column 2.  The fifth is the
probability (approximated by Monte Carlo) that a genotype simulated under
the alternative hypothesis (the one in the numerator of the likelihood
ratio) exceeds lambda.  The sixth is the number of genotypes simulated
under the alternative hypothesis exceeding that value of lambda.  The
seventh is the standard error of the estimate in column five.

Note that column 2 holds the value of alpha [see equation 6 in Anderson
and Garza (2006)] corresponding to the lambda in column 1, and column 5
holds the value of 1 - Beta [see equation 5 in Anderson and Garza (2006)].

Otherwise, LambdaDist can be called with a list of numbers.  For each
number X between 0 and 1, the script will return the line (formatted as
above) that corresponds to the largest value of lambda for which either
alpha or 1-beta still exceeds X.  If X>1, it will return the line
corresponding to the largest value of lambda less than X.  In either
case, the value of X is prepended to the line, so all the columns are
shifted over by one.
__USAGE__
}
```