This assignment practices
The Dinky Database Dantai has had complaints from their customers about the slowness of the zip code sorting operation. You have been hired as a consultant to diagnose the problem. They can't afford your consulting fees for the time it would take you to code and debug the routines yourself, so you have to prepare a report to their programmers explaining the fixes to be done. You cannot talk with the programmers who wrote the code, because they are all working for Bit Bucket Enterprises now. The new DDD programmers are mainly entry-level programmers (a euphemism for high school and junior college students working in their spare time).
On examining the code, you find that
Because Canadian and other foreign addresses use non-numeric mail routing codes, and even US codes vary in length, you decide that the data representation is appropriate, and that alphabetical order is the correct ordering for zip codes (00002-4455 is very close to 00002, not to 24455). The comparison and exchange operations seem to be correctly written, with only minor gains in efficiency possible from re-implementing them more carefully.
After checking with DDD, you find that customers have not complained about the number of entries they can sort, only about how long it takes. This means that internal sorting is a reasonable choice, and the extra complexity and speed penalty for external sorting is not justifiable.
You cannot find any reason for using bubble sort, and decide to recommend a faster sorting technique (quicksort, heapsort, merge sort, shell sort, radix sort, ...). You have to explain briefly why you are recommending a change of algorithms, then give a detailed description that the programmers can understand. You do not have to give a full theoretical analysis of the algorithm, but a few sentences to explain why you chose the algorithm you did would be helpful.
Your report will be read by two audiences: the managers of DDD and the programmers. The managers have no formal training in computer science, but want to know that you've solved their problem. The programmers know a little computer science, but are not familiar with sorting algorithms. The programmers may know as much as you do, but because they can't write nice reports, they get sweatshop wages while you earn consultant rates.
Your first task is to choose an algorithm. You may already know one that is good enough. If not, Knuth's Volume 3--Sorting and Searching [Knu73] should provide the information you need. Of course, you don't want to read all of Volume 3 before making up your mind. If you don't have a copy of Knuth, almost any sophomore or junior data structures book will describe a few of the more important sorting algorithms. You may want to check the serial published by the ACM containing nothing but algorithms for various problems [ACM].
Choose your sorting algorithm to be fast and to require minimal changes to the data structures in the existing program. You know that they use bubble sort, and so the information is stored in an array, with comparison and exchange operations available. If your algorithm requires some other data structure, you will have to describe how to build that data structure as well. For example, merge sort may be somewhat easier to describe than quicksort, and less subject to ``off-by-one'' errors, but the simplest implementation requires linked lists, which would increase the memory requirements. If you do describe merge sort, be sure to include the appropriate information about error-checking and pointer manipulation: in the past, many students have used x->next->next in their pseudo-code without warnings about odd list lengths and checking for null pointers.
When describing an algorithm, you have to pay careful attention to the level of abstraction. You do not want to describe everything in gory detail--that's the programmer's job. You also don't want to treat the whole problem as a single black box either. If you knew the programmers were highly skilled, it might be enough to tell them ``Use quicksort!'', but for this crew you'll need more detail than that.
The main parts of an algorithm description are
Different audiences and different purposes demand attention to different aspects of the algorithm description. When presenting a new algorithm for a well-known problem to the research community, the emphasis is usually on technique, justification, and analysis. In that situation, you're trying to show how much better you've done than previous researchers. When presenting a new data structure (say a better representation of a priority queue), the data structure and analysis might be stressed.
For this report, you will be presenting a well-known algorithm as the solution to a specific problem. Here data structure and technique are most important. You can point to the literature for a proof of correctness and the details of the analysis.
Some of the tools used for proving correctness are also useful for explaining the technique, so don't ignore them entirely. For example, the termination condition for a loop needs to be known both for the justification and for the coding. Loop invariants (statements that describe a property that is unchanged as the loop is repeated) are essential for correctness proofs, and handy for generating the loop in the first place. David Gries gives a detailed presentation on using loop invariants to help write programs [Gri83].
A manager reading your report will look at the executive summary and at the figures, to see what the report is about and whether it is worth passing on to the programmers who'll read the complete report. It should be possible to get a fair idea of the content of the report just from looking at the figures and reading the captions. Look at the articles in Scientific American for an excellent example of how figure captions can be used to convey most of the substance of a technical article.
You have several opportunities for figures in this report--you can use them for the pseudo-code for the algorithm, for showing the change in the data structure as the sorting progresses, for displaying the relative speed of different algorithms as a function of the number of items to sort, and so forth.
There are two ways to insert non-textual material into a report: figures and displays.
A figure can appear anywhere in the report, and usually appears on the same page as, or slightly after, the first reference to it. Figures are numbered and have a caption explaining what the figure is illustrating. Each figure must be referred to somewhere in the text, either as part of a sentence, like this: ``Figure 37 illustrates the relationship between ...''; or as a parenthetical remark, like this: ``(See Figure 38.)''. Remember to capitalize the word ``figure'' when it is used as part of the name for a particular figure.
A display differs from a figure in that it is inserted in a fixed place, and is not allowed to ``float'' to a different place in the text. Displays are used for math formulas, very small program fragments, and quotations that are too long just to put in quotes in the middle of a paragraph. Some empty space is usually inserted before and after each display to set it off from the surrounding text, and displays are often centered on the page. Mathematical formulæ are usually inserted as displays, and are grammatically part of the sentence before them. For example, I can define the golden ratio as the positive solution to the quadratic equation
and continue writing the same sentence.
Pictures are often the clearest way to explain data structures, particularly when pointers are required. Whenever a picture is used, it should be a numbered, captioned figure. Figures should be (nearly) comprehensible without the accompanying text. Make sure the caption explains what the figure means. Something like ``Figure 3. Step 2 of the algorithm.'' is nearly useless. A better caption might be ``Figure 3. Swapping out-of-order data elements pointed to by i and j.''
The pictures you are likely to draw are block diagrams, symbolic representation of pointer structures, maps of contiguous sections of memory, and graphs showing how something (time, cost, current, voltage, ...) varies when changing various parameters. The rest of this section will talk only about numerical graphs, though you probably will need to generate other sorts of graphics, even for this assignment.
Proper treatment of graphics is really beyond the scope of this class. A fairly good introduction is included in Chapters 8 and 9 of Huckin and Olsen [HO91, 137-184]. If you want a comprehensive treatment of presenting numerical (mainly statistical) data in a professional form, the best book is Tufte's The Visual Display of Quantitative Information [Tuf83] on reserve in the Science Library. Other books can be found under subject headings like ``Engineering Graphics,'' ``Graphic Methods,'' and ``Computer Graphics.'' For examples of how not to present numerical information, see almost any mass-market magazine or newspaper. USA Today is particularly good at distorting tiny data sets into fancy pictures that hide the data. Another particularly good collection of bad examples can be found in a book by Time Magazine illustrator Nigel Holmes [Hol84], although they are presented as if they were the best way to illustrate data.
Tufte summarizes his theory of data graphics as follows [Tuf83, 105]:
Five principles in the theory of data graphics produce substantial changes in graphical design. The principles apply to many graphics and yield a series of design options through cycles of graphical revision and editing.
- Above all else, show the data.
- Maximize the data-ink ratio.
- Erase non-data ink.
- Erase redundant data ink.
- Revise and edit.
Tufte also has some strong comments about ``chartjunk'', the cluttering of graphs with non-informative decorations [Tuf83, 121]:
Chartjunk does not achieve the goals of its propagators. The overwhelming fact of data graphics is that they stand or fall on their content, gracefully displayed. Graphics do not become attractive and interesting through the addition of ornamental hatching and false perspective to a few bars. Chartjunk can turn bores into disasters, but it can never rescue a thin data set. The best designs (for example, Minard on Napolean in Russia, Marey's graphical train schedule, the cancer maps, the Times weather history of New York, the chronicle of the annual adventures of the Japanese beetle, the new view of the galaxies) are intriguing and curiosity-provoking, drawing the viewer into the wonder of the data, sometimes by narrative power, sometimes by immense detail, and sometimes by elegant presentation of simple but interesting data. But no information, no sense of discovery, no wonder, no substance is generated by chartjunk.Forgo chartjunk, including
the grid, and the duck.
One problem that has appeared repeatedly is that students have given us plots of n2 and for n ranging from 1 to 10, to show the advantages of one of the algorithms over bubble sort. There are two flaws: internal sorting of 10 elements is so fast that no one cares which algorithm is used, and for such short lists some of the O(n2) algorithms are faster than the ones. If the plots showed from 100 to 10,000 records being sorted, the graph would be both more meaningful to the reader and more honest.
Also, the graphs are often misleading, in that they are labeled with seconds or some other inappropriate unit. If you had done some real speed tests, then seconds would be the right unit, but if you have just done a theoretical analysis, then it is better to be more explicit about what you are plotting--in most analyses the number would be the number of comparisons or the number of exchanges done. If you want to provide a good graph, you might look up the formula for the expected number of comparisons actually done by each algorithm. The best place to find accurate formulas for the standard sorting algorithms is Knuth's Sorting and Searching [Knu73].
Here are some quick hints for creating usable graphs:
Most of you have seen flowcharts for describing techniques. Although touted as a cure-all to programming woes in the 1960's, they have essentially disappeared from professional programming. They encouraged unstructured ``spaghetti'' code, and didn't help much in understanding the resulting mess. The small boxes in flow charts give far too low an information density, using lots of paper to say very little.
Instead of flowcharts, most algorithms are now described using pseudo-code. The pseudo-code syntax can be based on any structured language (Algol, Pascal, C, Ada, ...), but should be readable by someone familiar only with a different language. Tests and statements are not written in full detail, instead the intent of the statement or test is given. For example, pseudo-code for finding the minimum of an array might look like
minfor each element of array if element<min then minelement
Pseudo-code should be properly indented and either displayed like the example above, or put into a figure with a figure number and caption. Figures are generally easier for magazine and book typesetters to handle, and are essential for big chunks of pseudo-code, but displays are easier to read for small pieces of code.
If you do put the pseudo-code in a figure, try to arrange the layout so that the figure comes after the explanation has started. It is unfair to the reader to dump a big chunk of unexplained code in front of him or her. It can be very frustrating to slog through the code only to find an explanation after you turn the page.
The procedure sort(arr,low,high) sorts the elements of the array arr[low] through arr[high] into increasing order.
Sorting is accomplished by dividing the sequence to be sorted into two smaller sequences, sorting each sequence independently, then merging the two sorted subsequences.Note that the above description could apply equally well to quicksort or merge sort. The differences between the two algorithms come in how the partitioning and merging are done. Merge sort uses a trivial partition and a more complex merging operation, while quicksort uses a fairly complex partitioning operation and a trivial merge.
A sequence with only a single element is already sorted, so no partitioning and merging are necessary. Fast variants of divide-and-conquer sorting algorithms often use insertion sorting when the sequence has fewer than five or six elements, as the overhead for recursion is quite high on conventional machines.
When you prepare a formal report, it is worth taking a little extra care to make sure that the report will reach the intended audience, and that they will decide to read it. There are several things you can do to make it more likely that your report ends up in the right place:
Frequently, students misinterpret a statement that an algorithm takes O(n2) steps to mean that it takes exactly n2. You can't claim that switching from an O(n2) algorithm to an algorithm will give you a 100-fold speedup for n>1000, because you don't know what the multiplicative constants are. The only way you can make such strong claims is to do a more detailed analysis of the algorithms.
If you look in Volume 3 of Knuth [Knu73, 107-110], you'll find that bubble-sort requires at most comparisons and exchanges to sort n items, and that on the average it takes exchanges and comparisons. All of these numbers are O(n2), but they do differ significantly. Two popular algorithms, heapsort and quicksort, are both on the average, but heapsort is guaranteed to have fewer than comparisons [Knu73, 149], while quicksort could take as many comparisons as bubble sort. On random data, quicksort is usually the fastest sorting algorithm, but on non-random data it can be slower than bubble sort!
If you have trouble understanding one author's version of quicksort, check another's. But, please, don't get two different versions mixed together! Warning: many authors pick a pivot for quicksort that gives the worst possible behavior with an already sorted list. Because zip code lists are likely to be re-sorted after a few new entries are added, you want to be sure that the algorithm you recommend will be fast even if the list is already sorted.
If you are having trouble with definite and indefinite articles, read Chapters 29 and 30 of [HO91]. Particularly watch out for the word data; it used to be a plural noun (the plural of datum), but has come to be an uncountable one in American usage. Under no circumstances is it a singular, countable noun, so it can never be used with the indefinite article.
A divide-and-conquer sorting algorithm has three steps: partitioning the sequence into two shorter sequences, recursively sorting each subsequence, and merging the two sorted subsequences to form a complete sorted sequence.Italicizing words tells the reader that you believe the word is important, that the particular use you are making of the word may be unfamiliar to the reader, and that you are clarifying it here. Do not italicize jargon words every time you use them, but only when you are defining them. Italics are also used for foreign words that have not yet been Anglicized, and for general emphasis. In some documents, italics are also used for comments on program fragments, to make it clear that the comments are not part of the program itself.
Your report will be read by executives who understand the problem, but not programming. The format of the report can aid them considerably in finding the information they need. A jargon-free executive summary at the beginning, explaining the problem and its solution, may be all they need to read.
Use good section headings and captions for your figures. Read Sections 9.3 and 9.4 of Huckin and Olsen [HO91, 176-183] for suggestions about formatting a report for maximum readability.
Prepare your report as neatly as you can. A professional consulting job may produce a 10-20 page report from a $2000 study. At more than $100 a page, a consultant can afford to do it right. You do not need to spend enormous amounts on type-setting or professional illustrators. A typescript is still acceptable in industry, though laser-printing is becoming the standard, rather than the exception. Figures can be prepared on separate pages, pasted in, and photocopied. Photoreducing hand-drawn figures before pasting them in can hide the wiggly lines somewhat. The software available on the Macintosh and Windows machines makes combining graphics and text easy.
Choose an appropriate font size for your document. Unless you are writing a document that you don't want anyone to read (a warranty, perhaps?), choose at least a 10-point font. Unless you are writing for young children or people with bad eyesight, use at most a 12-point font. Standard elite and pica typewriters fall in the correct range. Using too large a font usually strikes readers as childish, which is not the effect you want in a report that you are charging hundreds of dollars for. If you want to impress someone with the bulk of a report, use 24-pound rag bond paper, a moderately wide font (say Palatino or Bookman instead of Times), slightly wider margins, and slightly more leading (space between lines), not 14-point fonts.
Check your spelling and punctuation carefully. Check even the spaces around your punctuation:
Remember to double-space all assignments turned in in this class, so that we can insert our comments easily. Also, remember to keep the returned assignment to turn in at the end of the quarter.