The seminar will be a journal club, in which students take turns presenting papers from the literature (or their own research). I would like a title and abstract from each presenter at least a week ahead of time to put on this web page.
Experimental SNP discovery consists of a number of laborious steps that make the process complex and expensive. In-silico discovery has therefore been proposed to overcome this problem. However, to apply the in-silico method successfully to large data sets, the following challenges need to be addressed. First, it is necessary to build an integrated SNP pipeline that handles the data processing steps smoothly from beginning (collecting sequence information) to end (storing SNP information in a database). Second, the parameters of the SNP detection tool have to be optimized to satisfy the specific goals of the project. Finally, SNP data cannot be fully used until the in-silico method is validated experimentally.
This work presents the design and implementation of an in-silico SNP detection software pipeline that exploits the existence of large EST (expressed sequence tag) data sets and addresses the above challenges. First, the pipeline allows for smooth data transition between its components by implementing data interfaces that translate between the data formats of the tools used at the different stages. Second, we optimized PolyBayes parameters for SNP detection in maize ESTs. Finally, we implemented a user interface that, along with the database structure we created, allows the scientist to perform preliminary analysis of the data and to compute basic statistics on the SNP data prior to experimental validation.
The pipeline works with two sequence assemblers, PHRAP and CAT (from DoubleTwist). It uses a Bayesian engine for SNP detection (PolyBayes) and selects the relevant polymorphism information, which is then uploaded into a database. We detected 2439 SNPs and 822 insertions/deletions (INDELs) with a PolyBayes probability higher than 0.99 in the public set of 68,000 maize ESTs from ZmDB (Zea maize DB).
The user interface allowed us to analyze the polymorphism information in several ways right after discovery, giving us insight into the distribution and significance of the newly acquired data.
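The probability cutoff described in this abstract can be illustrated with a minimal sketch. The record layout (`Candidate`), the helper names, and the example values are all hypothetical and are not taken from PolyBayes or from the pipeline's database; the sketch only shows the idea of keeping candidates whose probability is higher than 0.99 and tallying them by type.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical polymorphism record; not the pipeline's real schema."""
    contig: str
    position: int
    kind: str           # "SNP" or "INDEL"
    probability: float  # probability reported by the detection tool


def select_high_confidence(candidates, threshold=0.99):
    """Keep candidates whose probability is strictly higher than the threshold."""
    return [c for c in candidates if c.probability > threshold]


def count_by_kind(candidates):
    """Tally candidates by polymorphism type."""
    counts = {}
    for c in candidates:
        counts[c.kind] = counts.get(c.kind, 0) + 1
    return counts


# Made-up example data, for illustration only.
candidates = [
    Candidate("contig1", 120, "SNP", 0.999),
    Candidate("contig1", 340, "SNP", 0.95),
    Candidate("contig2", 57, "INDEL", 0.995),
    Candidate("contig3", 12, "SNP", 0.99),  # equal to the cutoff, so excluded
]

kept = select_high_confidence(candidates)
summary = count_by_kind(kept)
print(summary)
```

A strict comparison is used here because the abstract says "higher than 0.99"; whether the original pipeline treated the boundary value this way is not stated.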
In addition, we consider the multi-class problem of classifying a known kinase into one of a set of families and sub-families, based on Hanks' classification hierarchy. In this experiment we compare the one-vs-one approach to one-vs-rest for multi-class classification using the Pfam-Vector and a simple nearest-neighbor classifier. However, in this case the SAM-T99 HMMs built specially for the Hanks families and sub-families are clearly the most accurate.
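The difference between the one-vs-rest and one-vs-one schemes can be sketched as follows. This is only an illustration under stated assumptions: the three-dimensional feature vectors are invented, the binary decision is a plain nearest-neighbor distance comparison, and nothing here reproduces the Pfam-Vector representation or the classifiers actually used in the paper.

```python
from itertools import combinations
import math


def nearest_distance(x, points):
    """Euclidean distance from x to its nearest neighbor among points."""
    return min(math.dist(x, p) for p in points)


def one_vs_rest(x, train):
    """One binary problem per class: the class versus all other classes.

    train maps a label to a list of feature vectors. The score is the margin
    between the nearest negative and the nearest positive example.
    """
    best_label, best_score = None, -math.inf
    for label, positives in train.items():
        negatives = [p for other, pts in train.items() if other != label for p in pts]
        score = nearest_distance(x, negatives) - nearest_distance(x, positives)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def one_vs_one(x, train):
    """One binary problem per pair of classes; the class with most votes wins."""
    votes = {label: 0 for label in train}
    for a, b in combinations(train, 2):
        closer_to_a = nearest_distance(x, train[a]) <= nearest_distance(x, train[b])
        votes[a if closer_to_a else b] += 1
    return max(votes, key=votes.get)


# Invented toy vectors; the group names are used only as labels.
train = {
    "AGC": [(0.9, 0.1, 0.0), (0.8, 0.2, 0.1)],
    "CMGC": [(0.1, 0.9, 0.1), (0.2, 0.8, 0.0)],
    "PTK": [(0.0, 0.1, 0.9), (0.1, 0.2, 0.8)],
}
query = (0.15, 0.85, 0.05)

ovr_label = one_vs_rest(query, train)
ovo_label = one_vs_one(query, train)
print(ovr_label, ovo_label)
```

With k classes, one-vs-rest solves k binary problems while one-vs-one solves k(k-1)/2, each on a smaller subset of the training data.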
Abstract:
Results: A novel application of generic programming techniques in the form of a library of C++ components called the Bioinformatics Template Library (BTL) is presented. This library will facilitate the rapid development of efficient programs by providing efficient code for many algorithms and data-structures that are commonly used in biocomputing, in a generic form that allows them to be flexibly combined with application specific object-oriented class libraries.
Availability: The BTL is available free of charge from our web site http://www.embl-ebi.ac.uk/FTP/index.html.
Contact: d.moss@mail.cryst.bbk.ac.uk, m.williams@biochemistry.ucl.ac.uk
Motivation: The efficiency of bioinformatics programmers can be greatly increased through the provision of ready-made software components that can be rapidly combined, with additional bespoke components where necessary, to create finished programs. The new standard for C++ includes an efficient and easy-to-use library of generic algorithms and data-structures, designed to facilitate low-level component programming. The extension of this library to include functionality that is specifically useful in compute-intensive tasks in bioinformatics and molecular modelling could provide an effective standard for the design of reusable software components within the biocomputing community.
In contrast to a recent study (Broome B.M. and Hecht M.H., Nature disfavors sequences of alternating polar and non-polar amino acids: implications for amyloidogenesis, J Mol Biol 2000 Mar 3;296(4):961-8) that suggested that alternating polar/non-polar patterns are disfavored and under-represented in natural proteins to avoid aggregation, we found that the most frequent patterns in beta-strands are the purely alternating patterns (PNPNP and NPNPN). Moreover, we observed a highly significant preference for association between complementary patterns, in which the hydrophobic and polar residues pair with one another. To examine Broome and Hecht's hypothesis, we investigated the occurrence of binary patterns in amyloidogenic proteins and in short fragments involved directly in amyloid formation. Based on our results we propose that alternating patterns are important for the natural formation of beta-sheets in proteins and are not strongly associated with their self-assembly in pathological situations.
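Counting binary patterns of the kind mentioned in this abstract can be sketched as follows. The polar/non-polar split in `NONPOLAR`, the window width of five, and the two example strands are assumptions made for illustration; the abstract does not state which residue classification or data set the authors used.

```python
from collections import Counter

# Assumed non-polar residue set; any other character is treated as polar.
NONPOLAR = set("AVLIMFWC")


def to_binary(sequence):
    """Encode an amino-acid sequence as N (non-polar) and P (polar)."""
    return "".join("N" if residue in NONPOLAR else "P" for residue in sequence.upper())


def count_patterns(strands, width=5):
    """Count every binary pattern of the given width across all strands."""
    counts = Counter()
    for strand in strands:
        encoded = to_binary(strand)
        for i in range(len(encoded) - width + 1):
            counts[encoded[i:i + width]] += 1
    return counts


# Invented example strands, not real beta-strand data.
strands = ["KVTVEVK", "TVKVYVE"]
counts = count_patterns(strands)
print(counts.most_common())
```

Comparing such counts with those expected from the overall polar/non-polar composition would be the next step in judging whether a pattern is over- or under-represented; that statistical comparison is not shown here.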