SibShip Programs

Introduction

Almudevar and Field (1999. JABES 4:136-165) introduced an algorithm for enumerating feasible sibling groups on the basis of genetic data at L genetic markers showing diploid inheritance. This is a great piece of work. They distribute software implementing the algorithm, but it is apparently only available for Windows PC platforms. It also seems that their implementation may not allow for missing data.

I implemented their algorithm in a way that deals with missing data (in what I believe is a correct fashion :-)) using a combination of the Unix utility awk and the bash shell. This is a brief tutorial that describes how to use the resulting software. This software is intended for use by people who know what they are doing in a Unix or Linux environment. It has not been optimized for fast execution. In fact, because I hacked it hastily, and it relies on interpreted code rather than compiled code, it is quite slow for some applications—orders of magnitude slower than Almudevar's implementation!—but for analyzing a clutch from only a small number of parents it seems to work pretty quickly.

For my own memory, the associated files and scripts that I will be using here are stored on my hard drive in:
file:///Users/eriq/Documents/work/prj/SibShip/SibShipScriptsForDistribution/

You may DOWNLOAD all the necessary files in a gzipped tar file here: sibshipscripts_feb22_07.tar.gz

The author of this script is Eric C. Anderson, Southwest Fisheries Science Center, Santa Cruz, CA. You may email me at mailto://eric.anderson@noaa.gov

Program Components

The core of this implementation is in the script DoNextLocus.awk. It takes as input a genotype data file and a list of currently defined sibships (where individuals are listed according to their index (order) in the data file). This function is primarily useful when applied repeatedly over a bunch of loci. This is carried out with the bash shell script called DoAllLoci.sh. In most situations, DoAllLoci.sh is the only script that you will directly interact with. You will need a sane Unix environment to get it to work.

The script DoAllLoci.sh also relies on a script called CondenseFSGs.awk. That script takes the output of DoNextLocus.awk and removes any possible sibships which are subsets of a larger sibship that is already present.

DoAllLoci.sh must know where the scripts it relies on are located. Therefore you will have to change the lines near the top of the script which look like:

# here are the full path to the DoNextLocus and Condense scripts:                                                                           
DONEXT=/Users/eriq/Documents/work/prj/SibShip/SibShipScriptsForDistribution/DoNextLocus.awk
CONDENSE=/Users/eriq/Documents/work/prj/SibShip/SibShipScriptsForDistribution/CondenseFSGs.awk

to be appropriate to your installation.

If you try running DoAllLoci.sh with no arguments, you get the usage message:

Syntax:  DoAllLoci.sh <InputFileOfGenotypes> <MinSizeOfFSG>

     Note that the MinSizeOfFSG should always be 3 or greater

    After running, the file FSGs.X holds the resulting FSGs.  
    X is the number of loci in the data set.

The InputFileOfGenotypes is a whitespace delimited file in which each row is 
an individual and the l-th field in each row is the genotype at the l-th locus.  
Alleles can be named by any string that does not contain an "a", an "e" or an "s", 
but when you name them in a genotype you precede each allele name with an "a" and 
you catenate them together with an "e".  For example, an individual that has 
microsatellite alleles 117 and 121 would have genotype a117ea121 at that locus.  
There must be a convention so that an individual carrying such alleles is always 
called a117ea121 and not a121ea117.  A simple way is to always name the shortest 
allele first. Missing data at a locus should be denoted by a single "0" (zero) 
for the whole locus.  

A Sample Analysis

We will imagine that our data are in the file called SampleData.txt. The first 10 lines of this file correspond to individuals 1 through 10 and they look like:

a223ea231 a115ea123 a155ea155 a113ea128 a184ea184 a85ea89 a278ea278 
a223ea231 a115ea123 a155ea155 a123ea128 a184ea186 a85ea89 a278ea278 
a221ea233 a123ea139 a155ea173 a113ea128 a184ea184 a85ea89 a275ea275 
a223ea233 a123ea123 a151ea173 a113ea128 a184ea186 a87ea87 a275ea278 
a221ea233 a123ea123 a155ea173 a123ea123 a184ea184 a85ea87 a275ea278 
a221ea245 a123ea139 a155ea173 a123ea123 a184ea184 a87ea89 a275ea278 
a223ea245 a123ea139 a151ea155 a123ea128 a184ea184 a85ea89 a275ea275 
a221ea231 a115ea123 a155ea155 a123ea128 a184ea184 a85ea87 a278ea278 
a223ea231 a123ea139 a155ea173 a113ea128 a184ea186 a85ea89 a275ea278 
a223ea245 a123ea139 0 0 a184ea184 a87ea89 a275ea278 

There are 7 loci, and you can confirm that they all have genotypes recorded according to the conventions required (as listed in the previous section). Note that missing data are merely denoted by 0.

Now, we will run this data set. We will allow all sibships of size 3 or larger to be investigated. We do this by issuing the following command:

/SibShipScriptsForDistribution/--% ./DoAllLoci.sh SampleData.txt 3 

Note that the /SibShipScriptsForDistribution/--% is a command prompt (you don't type that part in!). While it is running, the program writes lines to standard output to tell how far it has gotten in the process:

Working on Locus 1
Working on Locus 2
Working on Locus 3
Working on Locus 4
Working on Locus 5
Working on Locus 6
Working on Locus 7

When it is done. The file FSGs.7 holds the feasible groups into which the members of the sample could fall. If you had 13 loci, the final results would be in FSGs.13. The other FSGs.X files contain the feasible sibling groups up to and including locus X.

In this case, the final output in FSGs.7 looks like:

FSG: 1 2 3 4 5 8 9 11 12 13 14 15 17 18 19 22 23 24 25 26 27 29 30 33 34 35 36 37 38 39 41 42 43 44 46 47 50 51 52 56 57 58 60 61 62 63 64 65 66 67 69 70 72 73 76 77 78 79 80 81 82 83 85 86 87 88 89 92 93 94 95 96 
FSG: 3 4 5 6 7 10 11 12 13 14 15 16 19 20 21 23 24 25 28 31 32 35 36 37 40 42 45 46 48 49 51 52 53 54 55 59 61 62 64 66 67 68 69 70 71 72 73 74 75 76 77 80 81 82 83 84 85 87 88 89 90 91 94 95 96 
FSG: 1 2 6 7 8 9 10 16 17 18 20 21 22 26 27 28 29 30 31 32 33 34 38 39 40 41 43 44 45 47 48 49 50 53 54 55 56 57 58 59 60 63 65 68 71 74 75 78 79 84 86 90 91 92 93 

There are three lines. Each one corresponds to a feasible sibling group and includes the indices of all the individuals that are in that sibship.

A Boneheaded way of Collapsing FSGs into a Partition

Here is a totally silly greedy algorithm to take the FSGs and use them to find a partition of the data set into a group of sibships. This is NOT how I would typically choose to do this problem. It is just a simple awk script that goes through the FSGs the way they would be printed in FSGs.X. Note that in the FSGs.X file, the FSGs will be printed in descending order of FSG size. The following script just starts tiling FSGs over the individuals. If an FSG contains any individuals that were already recorded in a larger FSG, it just rejects the whole FSG. The script is an awk script:

#!awk
 
/FSG/ {
  l = NF-1;  # size of sibship
  n = split($0,mem);
 
  already=0
  for(i=2;i<=n;i++) {
    if(mem[i] in Prev) {
      already = 1;
      break;
    }
  }
 
  if(already==0) {
    print;
 
    # add those guys to Prev
    for(i=2;i<=n;i++) {
      Prev[mem[i]]++;
    }
  }
 
}
 
END {
  # print out who wasn't in a sibship
  printf("NOT_IN_SIBSHIP : ");
  for(i=1;i<=NUM_INDS;i++) {
    if( !(i in Prev) ) printf("%d ",i);
  }
  printf("\n");
 
  # then get ready for the next one
  for(i in Prev) delete Prev[i];
 
}

You might cut and paste it as the file CollapseFSGs.awk and then use it as follows:
Imagine that the output from DoAllLoci.sh is FSGs.10 which looks like:

FSG: 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
FSG: 31 32 33 34 35 36 37 38 39 40
FSG: 21 22 23 24 25 26 27 28 29 30
FSG: 10 12 16 18 20 43
FSG: 28 33 35 37 39
FSG: 10 12 18 20 41
FSG: 1 2 3 4 5
FSG: 6 9 11 43
FSG: 4 24 27 29
FSG: 4 21 22 26
FSG: 30 33 35 40
FSG: 30 32 36 38
FSG: 30 31 35 39
FSG: 3 31 33 39
FSG: 28 34 36 38

Let's imagine this came from a data set with 44 individuals in it.

Then we would collapse this into a boneheaded partition by doing:

/temp/--% awk -v NUM_INDS=44 -f CollapseFSGs.awk FSGs.10

and the result looks like:

FSG: 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
FSG: 31 32 33 34 35 36 37 38 39 40
FSG: 21 22 23 24 25 26 27 28 29 30
FSG: 1 2 3 4 5
NOT_IN_SIBSHIP : 41 42 43 44 

Note that you have to supply the number of individuals in the data set with awk's -v option.

There are countless ways of attacking this problem that would be considerably more intelligent. I think Anthony Almudevar has a method that has had a little more thought gone into it. I just did this because it sufficed for a simple task I had with simulated data that was ridiculously informative. I wouldn't really use it in standard practice.

software/sibshipawk.txt · Last modified: 2008/08/26 10:54 (external edit)
 
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki