UCSC BME 205 HW6: annotation

Due Fri 6 Nov 2015 (Last Update: 20:16 PDT 31 October 2015 )

Annotate the banana slug mitochondrial genome

I decided to try a completely new assignment this year, though I've kept some description of tools and web resources from previous years, even if some of the tools are not particularly relevant to this year's task.

The broad goal of this assignment is simple: to provide annotation for the mitochondrial genome of Ariolimax dolichophallus, the banana slug that is endemic to the UCSC campus and is the campus mascot.

Complete annotation is more than a one-week homework assignment, but a lot can be accomplished in one week, especially if different students choose different approaches for tackling the problem. Because the overall goal is far larger than the time available, do the best job you can in the 10–11 hours you have for this assignment. Write up as you work, so that you can stop at any point and turn in what has been done—don't wait until you are "finished" to write things up!

Currently, the best mitochondrial genome we have for the slug is the one assembled in 2011: https://banana-slug.soe.ucsc.edu/_media/computer_resources:assemblies:mitochondrion-draft2.fasta.gz It is this assembly that we should annotate.

Here are some possible approaches to annotation—it is not a list of instructions, but ideas to get you thinking about what you want to do and write about for this assignment:

Find a mitochondrion annotation tool on the web and apply it.
Do a small literature search to find what protein-coding genes are commonly found on metazoan mitochondria, find the protein sequences for a closely related organism, and use blastx to look for those proteins in the genome. You'll probably have to refine the boundaries of the found genes, because the ends of the proteins may not be highly conserved. Are any of the standard mitochondrial genes missing from this assembly?
Find the ORFs in the genome (on both strands), and try to identify each ORF with searches of the non-redundant protein database. Are there any genes found that are different from the standard mitochondrial genes but that seem to be real genes (based on conservation across species or other tests)?
Look for adjacent ORFs in different reading frames, as possible frameshift errors in the assembly. Are there any indications of a frameshift error in any of the genes?
Scan the mitochondrial genome with tools that look for RNA genes (like tRNA, which mitochondrial genomes usually have many genes for).
Do detailed descriptions of the mitochondrial genes based on the literature and check for any sequence features that might be interesting (mutations to normally conserved residues, insertions or deletions in structurally important regions, ...).
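
The ORF-scanning idea in the list above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not a prescribed method: it treats the assembly as linear (so an ORF that wraps around the join of the circular genome is missed), it reports stop-free stretches rather than start-to-stop ORFs, and it uses TAA and TAG as the only stop codons, as in the invertebrate mitochondrial code. The function name and the `min_codons` default are my own choices for illustration.

```python
STOPS = {"TAA", "TAG"}  # stop codons of the invertebrate mitochondrial code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_stretches(seq, min_codons=50):
    """Yield (strand, frame, start, end) for each stop-free run of codons.

    start and end are 0-based, half-open, plus-strand coordinates.
    frame is the offset (0, 1, 2) on the strand being scanned.
    """
    seq = seq.upper()
    n = len(seq)
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            runs = []
            run_start = frame
            for i in range(frame, n - 2, 3):
                if s[i:i + 3] in STOPS:
                    runs.append((run_start, i))
                    run_start = i + 3
            # The run after the last stop ends at the last complete codon.
            runs.append((run_start, frame + ((n - frame) // 3) * 3))
            for a, b in runs:
                if (b - a) // 3 >= min_codons:
                    if strand == "+":
                        yield strand, frame, a, b
                    else:
                        yield strand, frame, n - b, n - a
```

Each reported stretch still needs its boundaries refined (for example, by looking for a plausible start codon) before it can be called a gene.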

There are a few things you need to know before applying tools to the sequence:

The mitochondrial genome is AT-rich (you can determine how much by counting 1-mers).
The genetic code is not the standard genetic code, but the invertebrate mitochondrial code (which you can find at ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt). Note that this genetic code has several start codons but fewer stop codons than the standard genetic code.
Given the AT-richness of the genome and the unusual genetic code, the probability of a sequence of 3-mers being an ORF by chance is different from what it would be with more balanced composition and the standard code. How long does an ORF need to be in this genome to have a decently small E-value in a random base model?
A version of BLAST that runs on the "protein" server can be found in /projects/compbio/programs/blast-2.2.18/bin/ You can format a fasta file as a blast database with the "formatdb" command and do searches with "blastall". Documentation for BLAST can be found on the web.
We download the non-redundant protein database weekly. It can be found in /projects/compbio/data/nrp/nr
You may want to set up a .ncbirc file containing
```
[NCBI]
Data=/projects/compbio/programs/blast2/data
```
And then run blast with commands like
```
blastall -p blastp -d /projects/compbio/data/nrp/nr -i protein.fasta
```
(blastp searches a protein database with protein queries, tblastn searches a nucleotide database with protein queries—look up blastn, blastx, and tblastx as well).
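
For the composition and ORF-length points in the list above, here is one possible way to set up the calculation in Python. It is a sketch, not the only reasonable null model: it assumes independent bases drawn from the observed 1-mer frequencies, counts only TAA and TAG as stops, and approximates the expected number of stop-free runs of at least L codons on one strand as N * p_stop * (1 - p_stop)^L, where N is the sequence length. The function names and the default E-value threshold are illustrative assumptions.

```python
import math
from collections import Counter


def base_frequencies(seq):
    """Return the fraction of A, C, G, and T among the unambiguous bases."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def stop_probability(freq, stops=("TAA", "TAG")):
    """Probability that a random codon is a stop, given base frequencies."""
    return sum(math.prod(freq[b] for b in codon) for codon in stops)


def min_orf_codons(seq, e_value=0.01):
    """Smallest run length (in codons) expected fewer than e_value times."""
    freq = base_frequencies(seq)
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    # Base frequencies as seen when reading the reverse strand.
    rev_freq = {complement[b]: f for b, f in freq.items()}
    n = len(seq)
    lengths = []
    for f in (freq, rev_freq):
        p = stop_probability(f)
        if p == 0:
            raise ValueError("no stop codons possible with this composition")
        # Give each strand half of the E-value budget and solve
        # n * p * (1 - p)**L <= e_value / 2 for L.
        lengths.append(math.log(e_value / 2 / (n * p)) / math.log(1 - p))
    return math.ceil(max(lengths))
```

The AT content is then `freq["A"] + freq["T"]` for `freq = base_frequencies(seq)`. A 3-mer-based null model would be a natural refinement if the independent-base model seems too crude.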

Alternatively, you can do what biologists around the world do and use the NCBI website: http://blast.ncbi.nlm.nih.gov/Blast.cgi Searching using BLAST locally is a bit slower than searching the nr database at NCBI, unless the NCBI server is heavily loaded—then searching using our computers is faster.

What to turn in

What you turn in should be a stand-alone paper that a biologist or bioinformatician can read without having any prior knowledge of this class or of specialized mitochondrial genomics. Remember that biologists like to look at figures. If you can show alignments, structure predictions, repeat structure, domain structure, or anything else pictorially, it will probably make your paper more attractive to a biologist. Biologists are much more likely to read and understand a report if there are pictures illustrating the key points!

Be sure to provide proper citations for all papers and web sites that you get information from. You should cite a paper for each tool you use (they generally tell you what to cite). A bare URL is not an adequate citation for a web site—you need to provide enough information that someone can find it with Google if it has moved without being changed—title, URL, and date of publication or date of access is minimal, and author or corporate author should be provided whenever possible.

This paper really should look like a report on the annotation, not like a homework exercise. I have given some suggestions below to help you get started, but these are not questions to answer sequentially, nor are they necessarily the most productive directions for your search.

Don't just print out the results of web searches, but interpret the results to see what (if anything) they say. Please be precise in your descriptions of what you did: Don't just say "blast" but give what version of blast searching what database with what parameters.

If you create annotations in standard formats (like BED or GFF files), please provide those files electronically, as well as a PDF file of your paper report. We may be creating a browser for the mitochondrial sequence, and annotation files would be useful for such an effort.
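
As an illustration of the BED format mentioned above, the following Python sketch writes one six-column BED line per feature. BED coordinates are 0-based and half-open. The sequence name, feature names, and coordinates in the usage example are placeholders invented for illustration, not results for this genome.

```python
def bed_line(chrom, start, end, name, strand, score=0):
    """Format one six-column BED record (0-based, half-open coordinates)."""
    if not 0 <= start < end:
        raise ValueError("need 0 <= start < end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    fields = (chrom, start, end, name, score, strand)
    return "\t".join(str(f) for f in fields)


def write_bed(path, features):
    """Write an iterable of (chrom, start, end, name, strand) tuples."""
    with open(path, "w") as out:
        for chrom, start, end, name, strand in features:
            out.write(bed_line(chrom, start, end, name, strand) + "\n")


if __name__ == "__main__":
    # Placeholder features only; replace with your own annotations.
    example = [
        ("mito_draft2", 0, 300, "example_orf_1", "+"),
        ("mito_draft2", 450, 900, "example_orf_2", "-"),
    ]
    write_bed("example_annotation.bed", example)
```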

Below this line are suggestions only, not all of which are necessarily appropriate for this year's assignment. Look through them to get ideas for the sorts of things you might consider doing when annotating a gene, but don't treat it as an instruction list—you're to design your own mix of literature searching and tool using.

Literature search

Look for information about mitochondrial genomes and mitochondrial genes. The literature is vast, so don't bog down in reading a lot of it, but get a quick overview of what is important about mitochondrial genomes and what tools and databases are available.

You may find mitochondrial annotation tools in your literature search or on the web—trying them out is a reasonable thing to do for this assignment, but it should not be the only thing you do.

Be aware of standard resources for sequence data like Swissprot http://us.expasy.org/, the UCSC genome browser http://genome.ucsc.edu, the archaeal and prokaryotic browser http://microbes.ucsc.edu, and organism-specific databases (SGD for yeast http://www.yeastgenome.org/, flybase for Drosophila http://flybase.org/, ...) to find information about the sequences you found with BLAST.

The NCBI blast search conveniently provides links to Unigene, Entrez Gene, Medline, and even PubChem BioAssay databases, which makes the web search much easier. Remember that PubMed is a medical database, so will tend to have more articles about human proteins and pathogen proteins than about similar proteins from other organisms (though popular model organisms may have quite a few articles).

Do Google searches using each protein's name and its accession number or database identifier(s) to try to find web pages about the protein.

Use PUBMED and other databases at Entrez (now mysteriously renamed GQuery) http://www.ncbi.nlm.nih.gov/gquery to find papers that talk about the protein.

For some proteins, you may want to use BIOSIS from the library website http://library.ucsc.edu/ to see if there are articles there. (BIOSIS is better at plant biology and non-pathogenic microbiology, for example, than PUBMED is.)

Remember that the reference list should contain all and only those papers cited in the main body of your paper. Don't pad your reference list with papers that you didn't actually cite. (LaTeX and BibTeX take care of this for you automatically, and I've heard that EndNote, Zotero, and Mendeley also work.) If you do use BibTeX, remember that \cite can take a comma-separated list of citations, and that this is the right way to do multiple citations at a single location.

Find out what else you can get from the protein sequences

Once you've identified ORFs that seem to be protein-coding genes, you can translate them into protein sequences and apply various bioinformatic tools to get finer-grained annotation. This could include such things as looking for homologs, looking for internal repeats, splitting up into domains, looking for transmembrane helices or other special features, doing protein-structure prediction, and so forth.
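
Translating an ORF requires the invertebrate mitochondrial code rather than the standard one. The Python sketch below builds the codon table for NCBI translation table 5, which differs from the standard code in that TGA codes for Trp, ATA for Met, and AGA and AGG for Ser. It is a minimal sketch: codons with ambiguous bases become X, and it does not handle the convention that an alternative start codon is read as Met when it initiates a protein.

```python
import itertools

BASES = "TCAG"
# Amino acids of NCBI translation table 5 (invertebrate mitochondrial),
# in the order TTT, TTC, TTA, TTG, TCT, ... used by NCBI's gc.prt file.
TABLE5_AAS = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIMMTTTTNNKKSSSS"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), TABLE5_AAS)
}


def translate(dna):
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)
```

For example, `translate("ATGATATGAAGATAA")` gives `MMWS*`, whereas the standard code would stop at the TGA.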

The blast suite has several other programs (psi-blast, for more remote protein homology; rpsblast, for conserved domains; ...).

If you get a "hypothetical protein" annotation for a gene, remember that "hypothetical protein" does not tell you anything about how "real" a protein is, just that there was no direct experimental evidence for the protein at the time of the annotation. Annotators are encouraged to be rather cautious in putting functional identification of proteins into the database, since false positives are much more damaging than false negatives. Since the annotation is rarely updated, even proteins that have now had extensive experimental work may still be labeled as "hypothetical" in some databases.

One popular thing to do is to check for known protein domains, using tools like Pfam (available on-line at http://pfam.janelia.org/) and SUPERFAMILY (available on-line at http://supfam.cs.bris.ac.uk/SUPERFAMILY/). Prosite http://prosite.expasy.org/ can also be useful, though you have to be aware of the high probability of false positives.

If you find some good hits to domains or prosite motifs, do some literature search on them also, so that you know roughly what they do and what they tell you about the structure or function of the protein. Summarize your findings.

Another popular thing to do is to check for transmembrane helices and secretion signals. There is a good suite of tools at the Technical University of Denmark: http://www.cbs.dtu.dk/services/ and I've found TMHMM and SignalP to be particularly useful. You should be aware that TMHMM does a good job of identifying transmembrane helices, but is not much better than random at deciding what is inside and what is outside the cell. I believe that Phobius at http://phobius.sbc.su.se/ gets the inside/outside prediction somewhat better, but it believes that TM helices near the beginning of the sequence are all signal peptides, which is a different sort of error.

Finding homologs

It is often useful to get a large number of putative homologs to your target sequence—both to find annotation about the function and to make multiple alignments for looking for conservation signals. You can get a quick list with BLAST, but this will only provide sequences that are rather similar, and you can get some confusion with multiple-domain proteins that only match on one or two of the domains.

Your best bet (usually) is to break the protein up into domains, and do searches for homologs on each domain separately. If you restrict yourself to domains that do not contain transmembrane helices or transmembrane beta barrels, then you can try submitting the domains to structure prediction servers also.

One of my favorite ways of finding homologs for a protein (or protein domain) used to be to use the SAM-T08 server (at http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html), which not only found the probable homologs and aligned them, but produced sequence logos, secondary structure predictions, and tertiary structure predictions. Unfortunately, as of summer 2014, the SAM-T08 server is broken, because the old bmecluster ran out of space for the NR database. I have not gotten around to reinstalling the web service to run on "protein".

SAM-T08 is also a bit slow, particularly on long proteins, so you might want to also try the more popular PSI-BLAST method (at the BLAST website: http://blast.ncbi.nlm.nih.gov/Blast.cgi).

You might want to do a fast HMM-HMM search like HHPred (or use the other tools there, like CS-BLAST and HHBlits).

You might also want to give the sequence to a metaserver such as http://pcons.net/.

Multiple alignments

Once you have a collection of homologous sequences, it is useful to make a multiple alignment of them. There are many methods for doing this (indeed, psi-blast, CS-blast, HHBlits, and SAM-T08 provide multiple alignments that may be all you need to work with). Multiple alignments tend to be more useful if there is a moderate diversity of sequences: many almost identical sequences tell you little when aligned, and a few very different sequences may be difficult to align accurately.

If you are given a set of sequences without a multiple alignment, or if you do not quite believe the multiple alignment you got from psi-blast or SAM-T08, you may wish to realign the sequences with a different tool.

One very popular (though no longer considered very good) tool is CLUSTALW. This is a progressive method of multiple alignment. It will do all-pairs scoring on a sequence set, then build a guide tree with the sequences on the leaves. Sequences with a high similarity score are assigned to nodes with a common parent on this tree. The alignment is built from the bottom of the tree by merging sibling sequences into pairwise alignments, and then progressively merging the most similar pairwise alignments into multiple alignments.

Since ClustalW is rather ancient code, with poor performance relative to newer tools, I recommend using Clustal Omega instead. You can try it out at the EBI web server at http://www.ebi.ac.uk/Tools/msa/clustalo/. Clustal Omega can handle huge numbers of sequences, but the web server may limit you.

If you generate a Clustal Omega alignment, you can compare it to alignments found by other methods (BLAST, PSI-BLAST, SAM, MUSCLE, ...). Where the alignments differ, which one looks more reasonable to you? What positions contain highly conserved residues? Do the sequence logos from the SAM site suggest any conservation to look for that you did not expect from having just looked at the multiple alignments? (The SAM site may be too slow for a long protein. If you have a multiple alignment in A2M format, you can produce the sequence logos with /projects/compbio/bin/makelogo (online documentation). All aligners seem to use different output formats, many of which do not support the notion of insertions between alignment columns, so you might have some difficulty getting an alignment into A2M format.)

Another good multiple alignment program is MUSCLE (see http://www.drive5.com/muscle/). You can use MUSCLE to align your sequences and see how it differs from Clustal Omega or psi-blast.
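
To get a first look at which positions are highly conserved in a multiple alignment, a crude per-column score can be computed as follows. This Python sketch assumes the alignment has already been read into equal-length strings (as in aligned FASTA output from Clustal Omega or MUSCLE, not A2M with lower-case insertions), and it scores each column by the fraction of non-gap residues that match the most common residue. That score is my own simple choice for illustration; sequence logos and entropy-based measures are more informative.

```python
from collections import Counter


def column_conservation(aligned, gap_chars="-."):
    """Return, per column, the fraction of residues matching the most common one.

    Gap characters are ignored; an all-gap column scores 0.0.
    """
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [c.upper() for c in column if c not in gap_chars]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(residues))
    return scores
```

Columns that score near 1.0 across a reasonably diverse set of sequences are the ones worth checking against the literature.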

Viewing with Rasmol

If you got any strongly predicted protein structures, you can look at them with rasmol, pymol, vmd, jmol, or some other structure-viewing tool.

If you are on a School of Engineering machine, you can download a protein from PDB with

/programs/compbio/bin/pdb-get 1foo

where 1foo should be replaced by the proper pdb identifier. This program returns the name of the file that has been downloaded, so you can use

rasmol `pdb-get 1foo`

to look at proteins, assuming that your paths are correctly set up.

If you need to download Rasmol for your home computer, there are several sources, including http://www.bernstein-plus-sons.com/software/rasmol/. Rasmol is a command-based viewer, and you will have to use "help" a lot while learning to use it. The download site listed above also has pointers to the web-based Rasmol manual.

Note: there are many other protein viewers on the web (DeepView=Swiss-pdbviewer, molmol, chime, protein explorer, molscript, vmd, jmol, firstview, cn3d, pymol, kinemage, ...). If you wish, you may substitute some other viewer for rasmol. Pymol is probably the most popular for journal-quality images, but the user interface is difficult to learn.

Look at the protein in various ways (as cartoons, as ball-and-stick models, as a backbone trace, ...). For example, in rasmol, with the protein in cartoon view, use "Select hetero and not HOH and not MSE" to select ligands (if there are any), and view them in space-filling mode.

Where are there insertions or deletions in the target relative to the template you chose? Are these in sensible places?

Microarray data

If it seems appropriate, look for microarray data on expression patterns for the gene associated with this protein. What information (if any) can you glean from the databases? I don't know which microarray databases are the easiest to use or the most informative, as I have rarely used them. I have found that the SGD database for yeast has good links to an expression database that does some useful clustering, but I have not found a really good clustering site that uses the public databases.

Note: there is often a strong possibility that a protein being studied is not closely related to proteins from any of the model organisms, or that it is related to lots of proteins which don't all share the same function. Discuss the difficulties as well as the successes!

RNA genes

Finding and annotating RNA genes tends to be more difficult than finding and annotating protein-coding genes and may be too much to attempt for this assignment. One of the more popular approaches is to look for annotations in closely related organisms and try to find homologous regions in this genome. Other popular methods involve using sophisticated models (like stochastic context-free grammars and other covariance models) to look for specific RNA structures that are more highly conserved than sequence in RNA genes.

Things learned after assignment

Questions about page content should be directed to Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building