UCSC BME 205 Fall 2012

Bioinformatics: models and algorithms

(Last Update: 17:04 PST 24 November 2012 )

This is a required course for bioinformatics students—both undergraduate and graduate students (also pre-requisite to BME 230). Those who need permission codes to register can get them on the first day of class.

For catalog copy and pre-requisites, see the main page for BME205.

Who, When, and Where:

Instructor: Kevin Karplus ( karplus@soe.ucsc.edu) http://www.soe.ucsc.edu/~karplus
Office hours: PSB 318, W 4-5:30, Th 2-3
+1-831-459-4250
TA: None this year

Lectures: MWF 11–12:10 PSB 305

Online discussion forums:

old BME 205 forum probably not very useful any more
Forum for just this semester https://courses.soe.ucsc.edu/courses/bme205/Fall12/01/forum
Google Groups mailing list bme205@soe.ucsc.edu or view on the web at https://groups.google.com/a/soe.ucsc.edu/d/forum/bme205

Texts

There will be no required texts, two optional texts, plus additional readings that will be distributed via the Web:

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids from Cambridge University Press by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison.

strongly recommended In past years I've required this textbook, and many students have found it to be an invaluable reference, both during the course and afterwards.

This book is a tutorial introduction to the use of hidden Markov models and other probabilistic models for sequence analysis problems in computational molecular biology, but is aimed mainly at a graduate-student audience. We've been using it for years in this class, and have not yet found as detailed a text.

This is a text and reference book that every bioinformatics programmer should have. I don't follow the book very closely, so you will have to figure out for yourself when it is appropriate to read various sections.

Get the third revision, if you can, which has made corrections as indicated on the errata page.

Python in a Nutshell, Second Edition By Alex Martelli July 2006 ISBN 10: 0-596-10046-9 | ISBN 13: 9780596100469

OPTIONAL

BME 205 will be using Python. You can use whatever book or online resource you need to learn the language. I recommend not buying Learning Python nor Programming Python by Mark Lutz, as both are really terrible for this class. It takes Lutz forever to get to the point of anything, and material is scattered in random order. I read over a hundred pages of Learning Python, was still not prepared to write even a short Python program, and was heartily sick of the Python boosterism. Python in a Nutshell is much better organized and gets to the point immediately, but is short on examples.

The best source I've found is the online documentation at http://docs.python.org/tutorial/, http://docs.python.org/reference/, and http://docs.python.org/library/. One Python user highly recommends the index http://docs.python.org/genindex.html, but I've done better using Google with "python" followed by the subject I'm interested in.

If you need a more tutorial introduction, I see that the undergrad course CMPS 5P has used Python for Software Design: How to think like a computer scientist by Allen Downey (Green Tea Press) as a text. An earlier manuscript of the book is available for free.

One of the Fall 2009 students recommended Dive into Python, a free on-line book for experienced programmers.

An Introduction to Bioinformatics Algorithms Neil Jones and Pavel Pevzner MIT Press

OPTIONAL

This is book came out in summer 2004. It looks like it may be a valuable supplementary text, as it seems to be easier to read and at a slightly less advanced level than the Durbin et al. book. The description of sequence-sequence alignment and HMMs does not seem quite detailed enough for this class though.

Reading assignments

Reading assignments in Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison.

Date Have read these sections

1 Oct 2012 1.1–1.4

15 Oct 2012 11.1–11.6

22 Oct 2012 2.1–2.9

29 Oct 2012 3.1–3.6

5 Nov 2012 5.1–5.8

12 Nov 2012 4.1–4.5

19 Nov 2012 6.1–6.5

26 Nov 2012 7.1–7.6

Homework

Date (to be) released Assignment Date Due

26 Sept 2012 prereq survey 28 Sept 2012

28 Sept 2012 Python scaffold 5 Oct 2012

4 Oct 2012 parsing FASTA and FASTQ 12 Oct 2012

12 Oct 2012 fellowship application 19 Oct 2012

17 Oct 2012 Markov chains 26 Oct 2012

26 Oct 2012 finding under/over-represented palindromes 2 Nov 2012

30 Oct 2012 null models 9 Nov 2012

8 Nov 2012 bioinformatics/library research paper (pointer is to last year's assignment—only the due date has changed) Wed 21 Nov 2012

18 Nov 2012 affine-gap alignment 30 Nov 2012

24 Nov 2012 peptide libraries from degenerate codons 7 Dec 2012

Date	Have read these sections
1 Oct 2012	1.1–1.4
15 Oct 2012	11.1–11.6
22 Oct 2012	2.1–2.9
29 Oct 2012	3.1–3.6
5 Nov 2012	5.1–5.8
12 Nov 2012	4.1–4.5
19 Nov 2012	6.1–6.5
26 Nov 2012	7.1–7.6

Date (to be) released	Assignment	Date Due
26 Sept 2012	prereq survey	28 Sept 2012
28 Sept 2012	Python scaffold	5 Oct 2012
4 Oct 2012	parsing FASTA and FASTQ	12 Oct 2012
12 Oct 2012	fellowship application	19 Oct 2012
17 Oct 2012	Markov chains	26 Oct 2012
26 Oct 2012	finding under/over-represented palindromes	2 Nov 2012
30 Oct 2012	null models	9 Nov 2012
8 Nov 2012	bioinformatics/library research paper (pointer is to last year's assignment—only the due date has changed)	Wed 21 Nov 2012
18 Nov 2012	affine-gap alignment	30 Nov 2012
24 Nov 2012	peptide libraries from degenerate codons	7 Dec 2012

Every student in the class will need a School of Engineering computer account. I will want assignments turned in by providing me with a publicly readable file (PDF for written assignments) or directory (for multi-file assignments) containing the assignment on the SoE machines. All Python programs must execute correctly on the SoE machines, without needing to install additional Python modules. I will run the programs using python2.7, which is not the default python on most SoE machines (some of which have archaic versions like 2.4.3 as the default). I would prefer to get paper copies of assignments in addition to the electronic ones (to save me the time of printing them), but I will accept electronic-only submissions from those who are too ill to attend class.

To get an SoE computer account see http://support.soe.ucsc.edu/new-accounts

Evaluation

There will be three types of assignments for the class:

reading assignments,
programming exercises to learn Python, and
bioinformatics exercises/papers using real data.

As has been my practice since Fall 2001, there will be no exams, and we will probably not meet during the final exam period (Wed 12 Dec 2012,8–11 a.m.) It turns out to be very difficult to make up small enough problems for examination—almost all the homework exercises are much larger problems than could reasonably be given on a timed exam.

The assignments will be distributed on the web.

The relative weights of the different types of assignment in the evaluation has not been determined yet—it should be roughly proportional to how much time the different assignments take to do well. I expect that most of the assignments will be similar to ones given in previous years, with a few parts tweaked to update them, but I may replace one or more assignments with new ones, if I can think of new problems at the appropriate level of difficulty.

Academic Integrity

Anyone caught cheating in the class will be reported to their college provost (see UCSC policy on academic integrity) and may fail the class. Cheating includes any attempt to claim someone else's work as your own. Plagiarism in any form (including close paraphrasing) will be considered cheating. Use of any source without proper citation will be considered cheating. If you are not certain about citation standards, please ask, as I hate having to fail students because they were improperly taught how to cite sources.

Collaboration without explicit written acknowledgment will be considered cheating. Collaboration on lab assignments with explicit written acknowledgment is encouraged—guidelines for the extent of reasonable collaboration will be given in class.

Classroom Accommodations for Disabilities

If you qualify for classroom accommodations because of a disability, please submit your Accommodation Authorization from the Disability Resource Center (DRC) to me during my office hours in a timely manner, preferably within the first two weeks of the quarter. Contact DRC at 459-2089 (voice), 459-4806 (TTY).

Rough list of topics we'll probably cover (not necessarily in order)

Note: The schedule will be updated throughout the quarter to reflect what really happens.

Quick review of the fundamental dogma of biology: DNA->RNA->protein, bases, codons, amino acids, ...
(2 hours)
Guest Lecture, 1 Oct 2012 Andrew Uzilov, Introduction to Python. Read the material at http://users.soe.ucsc.edu/~auzilov/BME205/Fall2011/ over the weekend and try to undertand the code in the tarballs. Andrew is willing to run over until you get kicked out of the classroom, and says that even with extra time he usually doesn't manage to cover everything he'd like to.
Stochastic models, Bayes Rule, 0-order Markov chain, first-order Markov chain, length model versus stop character for finite strings, use of log-probability for computations, adding probabilities in log-prob representation (efficient computation of log(exp(x)+exp(y)) ). (1.5 hour)
Constructing a model from data. Training, cross-training, and testing. Maximum-likelihood estimate. Pseudocounts to get mean posterior estimate. (1.5 hours)
Converting arbitrary scores to stochastic models: P-value and E-value. Brief discussion of Z-scores (Gaussian dist.) and fat tails of extreme-value (Gumbel dist.) (1.5 hour)
Entropy, relative entropy, Mutual information, sequence logos. (1.5 hour)
What fellowship reviewers look for. Relationship between relative entropy and difference in encoding cost in a train/test framework (clarification for homework exercise). Interpreting classification results: true/false positives, specificity, sensitivity, ROC curves, ROC_n numbers What is a substitution matrix? (1.5 hour)
Substitution matrices and sequence alignment scores. Aligning sequences to sequences, dynamic programming We'll do the the simple, but inefficient algorithm (for arbitrary gap costs) first. (1 hour: Blosum substitution matrices and gapless scoring) (1 hour: the alignment problem and global dynamic programming with arbitrary gap costs) (1 hour: global dynamic programming with linear gap costs, traceback) (1 hour: affine gap costs. Global and local dynamic programming)
Introduction to Hidden Markov models (1.5 hour on HMMs and profiles) (1.5 hours on profile HMMs giving Viterbi algorithm and forward-backward)
Pair HMMs as a way of explaining alignment algorithms (3 hours?)
Fri, 2 Nov. Guest Lecture. Science librarian Katie Fortney will give a presentation on bioinformatics resources available through the library, as well as talking about some of the challenges that face the UCSC library in building an adequate collection in new fields like bioinformatics. We'll meet in the Science Library for the guest lecture.
Protein secondary structure (DSSP and STRIDE), in order to explain second track of 2-track HMM. Discuss secondary structure prediction using neural nets. (1.5 hours)
Sequence weighting (Henikoff's technique for relative weighting and target bit savings for total weight) (1 hour) Multiple alignment techniques Overview and progressive alignment (0.5 hour)
Fast methods for searching (BLAST and BLAT). (In some years, Jim Kent has given an excellent lecture on these.)
Multiple alignment techniques: Muscle, Probcons, AMAP
documentation on MUSCLE:
http://www.drive5.com/muscle/docs.htm Refereed paper: Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5), 1792-97.
PROBCONS web site (including overview of algorithm): http://probcons.stanford.edu
AMAP http://bio.math.berkeley.edu/amap
Ariel S. Schwartz and Lior Pachter
Multiple alignment by sequence annealing
Bioinformatics 2007 23(2):e24-e29; doi:10.1093/bioinformatics/btl311
Oher multiple alignment programs:
paper on T-coffee:
T-Coffee: A novel method for fast and accurate multiple sequence alignment.
Notredame C, Higgins DG, Heringa J.
J Mol Biol 2000 Sep 8;302(1):205-17
doi:10.1006/jmbi.2000.4042
paper on MAFFT:
Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30(14):3059-3066, 2002.
Phylogeny: brief mention of maximum-likelihood and parsimony. Additivity assumption. UPGMA algorithm presented, ultrametric assumption and molecular clocks, intro to neighbor-joining (no proofs) (1.5 hour)
RNA structure and Stochastic Context-Free Grammars (1.5 hour)
A protocol for evaluating local structure alphabets. This talk ( http://www.soe.ucsc.edu/~karplus/papers/local-structure-germany02.pdf) presented some of the main results from Rachel Karchin's PhD thesis. The following two papers are the guts of the method:
Rachel Karchin, Melissa Cline, Yael Mandel-Gutfreund, and Kevin Karplus. Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins: Structure, Function, and Genetics, 51(4):504–514, June 2003. doi:10.1002/prot.10369 reprint

Rachel Karchin, Melissa Cline, and Kevin Karplus. Evaluation of local structure alphabets based on residue burial. Proteins: Structure, Function, and Genetics, 55(3):508–518, 5 March 2004. doi:10.1002/prot.20008 reprint

Topics for last week requested by students

Rough list of topics we didn't have enough time to do more than briefly mention in recent years.

Contact order and folding rate. In 2001, I handed out paper on contact order:
Contact order, transition state placement and the refolding rates of single domain proteins.
Plaxco KW, Simons KT, Baker D.
J Mol Biol 1998 Apr 10;277(4):985-94
Phylogenetic analysis
DNA microarrays and expression data
Gene finding
Proteomics
RNA structure
DNA assembly
Structure-structure alignment
RNA genes, DNA microarrays, computational and functional genomics Look at the BME 210 course, the BME 230 course, and the lab pages for Todd Lowe and Josh Stuart.
Combining secondary structure, fold-recognition, and new-fold methods for protein structure prediction. Using the transparencies given at Schloss Dagstuhl. I could have handed out book chapter on SAM-T2K, but didn't.

Other resources on the web

http://genome.ucsc.edu/: UCSC Genome Browser—gateway to at least 47 eukaryotic genomes.
http://microbes.ucsc.edu/: UCSC Microbial Browser—gateway to more than 300 archeal and bacterial genomes.
User's Guide to the Human Genome (in Nature Genetics).

SoE home

Kevin Karplus's home page

Biomolecular Engineering Department

BME 205 home page

old BME 205 discussion forum

Fall 2012 BME 205 discussion forum

UCSC Bioinformatics research

Questions about page content should be directed to Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building