Aligning Sanger Chr 22 Annotations to UCSC Assembly

Procedure Used

  • First tried to reconstruct genes from the Sanger gff file, but this was unsuccessful because the file wouldn't parse (it doesn't use tab delimiters and doesn't obey other parts of the gff syntax).
  • As a second attempt, aligned the 832 sequences in the Chr22.2.3.genes.dna file to the UCSC assembly version hg6 using gfClient. This generated a large number of psls in Chr22.2.3.genes.psl, which were searched for unique best alignments as scored by pslScore(). This left 820 alignments. The pseudogenes were then removed from this list using tossPseudogenes.pl and pseudogenes.txt, leaving 611 alignments in best/Chr22.2.3.genesVsUCSCChr22_no_pg.psl. These were loaded into the table sanger22Align and are displayed as track sanger22Psls in hg6.
  • A very similar operation was performed with Chr22.2.3.cds.dna, which led to 513 unique best alignments in Chr22.2.3.cds.best.psl from the 524 sequences in Chr22.2.3.cds.dna. These psls were loaded into the sanger22CDSAlign table and are displayed in the sanger22CDSPsls track on hg6.
  • To try to reconstruct the 5' and 3' UTRs, the Chr22.2.3.cds.dna file was aligned to the Chr22.2.3.gene.dna using blat. The resulting tStarts, tStops, and tSizes from the alignments were used to construct the UTRs. The resulting genePrediction structures were loaded into the sanger22WCds table and are displayed on the sanger22WCds track.
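
The "unique best alignment" filtering used in the second and third steps can be sketched roughly as follows. This is only an illustration, not the code that was actually run: the function names psl_score and unique_best are hypothetical, the input is assumed to be standard 21-column tab-separated psl, and the scoring formula is an assumption modeled on the DNA case of pslScore() (matches plus half the repeat matches, minus mismatches and insert counts), which may differ from the version used here. Queries whose top score is tied are dropped, matching the behavior described for non-unique alignments.

```python
from collections import defaultdict


def psl_score(fields):
    """Score one psl row. ASSUMED formula, DNA alignments only."""
    match, mis_match, rep_match = int(fields[0]), int(fields[1]), int(fields[2])
    q_num_insert, t_num_insert = int(fields[4]), int(fields[6])
    return match + (rep_match >> 1) - mis_match - q_num_insert - t_num_insert


def unique_best(psl_lines):
    """Return {qName: fields} for queries with exactly one top-scoring row."""
    by_query = defaultdict(list)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip psl header lines and anything that is not a 21-column data row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        by_query[fields[9]].append((psl_score(fields), fields))

    best = {}
    for q_name, scored in by_query.items():
        top = max(score for score, _ in scored)
        winners = [fields for score, fields in scored if score == top]
        # A tie for the top score means no unique best, so the query is dropped.
        if len(winners) == 1:
            best[q_name] = winners[0]
    return best
```

Called on the lines of a file such as Chr22.2.3.genes.psl, this would return one row per query sequence that has a single top-scoring alignment.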

Please note that the sanger22WCds track is not necessarily a subtraction of the other two tracks, as those alignments weren't done in genomic space. Also note that there may be off-by-one errors, as some of the alignments (on the order of tens) don't have qStart=0 and qEnd=qSize. I feel that the next step would be to get the gff file parsed and use it as a QC device to ensure that the exon starts and stops are correct. It could also be used to decide between alignments that have the same score and are currently being dropped as not unique.
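
As a starting point for that QC, the alignments that don't cover their whole query (qStart=0 and qEnd=qSize) could be listed with something like the sketch below. It assumes standard 21-column tab-separated psl input; the function name partial_queries is hypothetical and this is not part of the procedure above.

```python
def partial_queries(psl_lines):
    """List (qName, bases missing at start, bases missing at end) for
    alignments that do not span the full query sequence."""
    partial = []
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip psl header lines and anything that is not a 21-column data row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name = fields[9]
        q_size, q_start, q_end = int(fields[10]), int(fields[11]), int(fields[12])
        if q_start != 0 or q_end != q_size:
            partial.append((q_name, q_start, q_size - q_end))
    return partial
```

The sequences it reports are the ones whose reconstructed exon or UTR boundaries are most worth checking by hand.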