CMPS 13H: Honors Introduction to Programming and Data Structures

Programming Assignment #6: Reading and Sorting Words


Remember: your programming assignment must be turned in online.

The Basics

The goals of this program are to a) do some file I/O, and b) do some sorting.

The purpose of the assignment is to write a program that will read in a text file, count the words, then count the unique words in the file, and report on the results.  This is useful in a variety of situations.  For instance, someone once told me that the works of Shakespeare contain about  6 times as many unique words as the bible, which was long considered the standard for the English language.  The only way to know this is to count the unique words in both works and compare them.  I haven't tried it myself, but it would be interesting.

For this assignment, you should  use a doubly linked list to hold the individual words.  You should read the words in, and place each one in a Node in your linked list.  When all of the words are inserted, you should print out the size of the list - that's how many words are in the file.  Then you should sort the list.  After you sort it, any duplicate words will be adjacent in the list.  So to filter out duplicate words, you can just run through the list and remove any element of the list that is the same as the one before it.  Then you can print out the number of unique words in the file, which is just the number of elements remaining in the list.


The Details

First, you have to read in the file and convert it into a series of Strings.  This code shows you how to do that.  Feel free to copy this code and use it, with appropriate changes, in your program.  If you are interested, this code also shows how to write words to a file.

After reading the words, you should store them in a doubly linked list.  One word of warning: any method or variable called or accessed by a static method must also be declared as a static.  If your design correctly discusses why this is the case, you will earn the elusive 10th point. Once you have read all of the words in, you can print out the number of elements in your list.

Next, you should sort the words in the list.  You may use any of the sorts we have discussed.  Whichever sort you choose, be sure to discuss the expected running time in big-O notation.  Correctly implementing an O(nlogn) sort is also worth the "something extra" point.

Next, you should run through the list and remove duplicate entries.  Once you have done that, you should print out how many elements remain in the list. That's the number of unique words.

To test your program, I have provided about 1/5 of Homer's Odyssey as a test case.  This file has 23,156 words, 18,469 of which are duplicates, which means that only 4,687 of them are unique.  FYI, my program outputs the following:

        There are 23156 words in odyssey.txt
        Next we'll sort the list
        Deleted 18469 duplicate words
        There are 4687 unique words in odyssey.txt

After you have implemented your program, time how long it takes using the unix "time" utility.  All you have to do is put the word time in front of the word "java" when you go to run your program, like this (assuming you call your program CountWords):

        time java CountWords odyssey.txt

You will get output that looks like this:

        39.170u 0.040s 0:39.33 99.6%    0+0k 0+0io 1665pf+0w

The sum of the first two numbers is how much CPU time your program took.  In this case, my implementation using quicksort took 39.21 seconds running on my linux desktop machine.  The third number is how much real time your program took.  Mine took 39.33 second.  This is slightly longer than 39.21 seconds because the computer was doing some other stuff besides just running my program (like reading the text file from the hard drive).

Extra Credit (4 points possible): Reimplement the assignment using binary search trees. My binary search tree code is available in the java directory. Grading: Design exists: 1 point, Design and Code exist: 2 points, Design exists and Code compiles and looks like it should work: 3 points, Design exists and Code works: 4 points.

Your program must be accompanied by a Makefile and an ASCII Design document (.txt format) that shows:

  1. Inputs
  2. Outputs
  3. Data objects and all non-trivial operations on those data objects
  4. Algorithm(s)

Your program must also contain clear and descriptive variable names and comments, and good formatting that enhances the clarity of the program.


What to turn in

Your java code, your makefile, and your design document.

REMEMBER: Do not submit object files, assembler files, or executables. Every file in the submit directory that could be generated automatically by the compiler or assembler will result in a 5 point deduction from your programming assignment grade.


sbrandt@cse.ucsc.edu