Scaffolding assignment
Due Friday, 2015 Oct 2

UCSC BME 205 Fall 2015
Bioinformatics: models and algorithms

(Last Update: 12:20 PDT 2 October 2015 )


The purpose of this assignment is to build a program that doesn't do very much, but which has a structure similar to other programs you will write later in the quarter. You will likely start writing those programs by gutting this one and putting in new code to replace the trivial functions of this program with ones that do the tasks of the specific assignment.

A program intended to be reused in this way is sometimes called a template or a scaffold. (The term scaffolding in the pedagogic literature refers to the teacher breaking a complex task into easier subtasks for the student, which is also appropriate for this assignment.)

The task

The basic task for this program is a fairly trivial one in Python: break input text into words and count the frequency of each word. It is, perhaps, not a very bioinformatic task, but we will later do similar things in more bioinformatic contexts.

The program should be a UNIX filter, that is, it should read from standard input (stdin) and output to standard output (stdout). If there are any information or error messages, they should be sent to the standard error output (stderr), not stdout. To use these from Python, you need to import the "sys" module.

The input will be plain English text using the standard ASCII character set (not some proprietary format like Microsoft .doc files, and not including ligatures, accented characters, or Unicode). Obviously, more complicated versions of this program could be written that handle different languages, more complete character sets, and a variety of file types, but our intent is to build a scaffold that is easily gutted and re-purposed, so we are keeping the functionality to a minimum here.

The output should be two columns separated by a tab: the first column containing a word, the second column containing the number of occurrences of the word in the input. Because I'll be checking the output of the turned in homework with simple comparison programs, there should no comments or decorations of the output. Anything other than word-tab-number pairs, one pair per line, will be invalid output.

Aside: As a general rule of thumb for writing bioinformatics programs, you should be flexible and forgiving of minor variations from a standard when reading input, but very scrupulous to follow standards precisely when producing output. That way your program will not be the obstruction that causes a pipeline to fail.

There are several possible orders for the word-count pairs in the output: unsorted, ascending counts, descending counts, and alphabetical. The order of the words in the output will be specified by a command-line argument to the program. This part of the assignment is to make sure that your scaffold has an argparse command-line parser in it, as you will need that in most subsequent assignments. You will need to import the "argparse" module to parse the command-line arguments.

So that I can test everyone's program automatically, your program file must be named "wordcount" (not "word_count", not "hw1", not ""). I will issue one of the following commands:

wordcount --ascend < foo.txt > foo-ascending  
wordcount --descend < foo.txt > foo-decending  
wordcount --alpha < foo.txt > foo-alphabetical
wordcount --alpha --descend < foo.txt > foo-alphabetical
If both --alpha and --descend are specified, then the output should be in reverse alphabetical order. If the program does not provide identical outputs to my "correct" program, you will be required to redo the assignment. There are no partial points on this assignment for "almost working" programs.

You may have noticed that I did not specify what happens if you specify both --alpha and --ascend. This does not mean that you can do whatever you like in that case&emdash;it means that you need to think out what a reasonable user would expect the program to do and fill in the missing specification. Bioinformatics programming is full of incompletely specified requests, and good bioinformaticians make sure that the missing specifications are filled in reasonably, and document their choices.

Because different correct outputs are possible with the same input, I will not test an "unsorted" option, and you need not provide one. Unsorted as a default makes sense in some contexts, where the output will be used by another program, and sorting would be unnecessary overhead.

Aside:The astute programmer will have noticed that I have provided a nearly impossible task above. When sorting in ascending or descending order, there can be several ties—how can you guarantee identical output when the order is not completely specified? This sort of ambiguity in specifications is extremely common, so you should always be looking for it—it is one place bugs often creep in. Don't worry, I won't leave you hanging this time.

As a tie-breaker, when sorting by counts, words with identical counts should be output in alphabetical order. The ascending/descending choice applies only to the counts in this case, so the output

alpha	10
beta	10
aaa	3
would be a correct output file for a --descend option.

Aside:The astute programmer would still be irritated with me for incomplete specs, as I've never defined what I mean by a "word", nor exactly what I mean by "alphabetical". These are not trivial questions—library schools spend a fair amount of time discussing different alphabetization schemes and what the standards are in different systems. Because this assignment is about building a scaffold, not about natural-language processing, we are going to use a trivial definition.

For the purposes of this assignment, a "word" is a contiguous sequence of characters from the set {abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'}. All other characters in the input will serve as separators. Note that there is a character at the end that is not a letter of the alphabet. Apostrophes are part of the definition of word for the assignments, and failure to include them will require redoing the assignment.

For the purposes of this assignment, words are "alphabetically" ordered by the Python string comparison operator. Note that this is not the normal notion of alphabetical order, since the case of the letters matters, and "Z" comes before "a".

Program structure

Because the purpose of this assignment is for you to build a reusable template, I'm going to require that certain features of Python be used and to strongly recommend a particular structure for the program.

I will require

I strongly recommend that that file start with the line

#!/usr/bin/env python2.7
or the line
#!/usr/bin/env python3.4
so that the UNIX, Linux, or OS X operating system can find the right version of python to run the program. Note that Mac OS X systems are likely to have an older version of Python installed, but it is fairly easy to install Python 2.7 (installers for the most systems are available at ).

You can write the program to run under either Python 3 or Python 2.7 if you include
from __future__ import print_function, division
at the beginning of the program (after the #! line and the docstring), and use print functions as in Python 3, rather than print statements as in Python 2.

Warning for Windows users: Windows uses a different line-ending convention from other systems, using the old teletype convention of a separate carriage return and line feed, rather than a single new-line character. When you transfer a Windows text file to a unix-like system, the extraneous carriage returns can cause problems. Some transfer programs will remove the extra characters, but if you encounter a Windows text file on a unix-like system, you can strip out the junk with the "tr" command:

tr -d '\r' < file-with-returns > file-without-returns

If you edit your program on a Windows machine, you may need to strip out the returns before it will run on other systems. You must test on a Linux or Mac OS X system. Windows files that don't run on Linux systems will be returned ungraded and treated as late assignments.

I strongly recommend that the program be broken into four major functions:

Here is an example of what the main function might look like:

def main(args):
    """parses command line arguments, reads text from stdin, and
    prints two-column word-count pairs to stdout in the order specified
    in the command line arguments
    options = parse_args(args)
    counts = collections.defaultdict(int)
        # counts['abc'] is the number of times the word 'abc' occurs 
        # in the input.
        # defaultdict used to make counts of unseen words be 0
    for word in read_word(sys.stdin):
        counts[word] += 1
    print_output(sys.stdout, counts, options.order)

if __name__ == "__main__" :
The mysterious boilerplate at the end is commonly found in Python files (or something similar is used). If the Python file is executed directly by Python, then the "main" function is called with the system-provided command-line arguments and the program exits with the value returned by main. But if the Python file is imported into another Python program, then the "main" function is not automatically called, though all the functions in the file are available for use in the importing program. We'll look at building modules for importing into other programs in later assignments.

Late added notes

You can use the compute server "protein", which I'll be doing the program testing on. You can't currently log into the server directly, but if you first log in on some other BSoE machine, you should be able to use

ssh protein
to get to the compute server.

On protein, there are two python2 installations and one python3 installation. Use

#!/usr/bin/env python2.7
#!/usr/bin/env python3.4
to get the correct version of python. (Not "python", "python2", or "python3", despite what I said in class.)

The assignment doesn't seem to say where to submit the files! They should be on the School of Engineering fileserver in ~yourid/BME205/hw1/wordcount (where "yourid" is replaces by your login id on the School of Engineering computers). It must be the School of Engineering machines, not the campus-wide instructional machines. Email me the file name when your assignment is done.

Make sure that your home directory is readable by others, that BME205 and BME205/hw1 are readable and executable by others, and that wordcount is readable and exacutable by others.