The goals of this program are to a) do some file I/O, and b) do some sorting.
The purpose of the assignment is to write a program that will read in
a text file, count the words, then count the unique words in the file, and
report on the results. This is useful in a variety of situations. For
instance, someone once told me that the works of Shakespeare contain about
6 times as many unique words as the bible, which was long considered
the standard for the English language. The only way to know this is
to count the unique words in both works and compare them. I haven't
tried it myself, but it would be interesting.
For this assignment, you should use a doubly linked list to hold
the individual words. You should read the words in, and place each
one in a Node in your linked list. When all of the words are inserted,
you should print out the size of the list - that's how many words are in
the file. Then you should sort the list. After you sort it, any
duplicate words will be adjacent in the list. So to filter out duplicate
words, you can just run through the list and remove any element of the list
that is the same as the one before it. Then you can print out the number
of unique words in the file, which is just the number of elements remaining
in the list.
First, you have to read in the file and convert it into a series of Strings.
This code shows you how to do that. Feel
free to copy this code and use it, with appropriate changes, in your program.
If you are interested, this code also shows
how to write words to a file.
After reading the words, you should store them in a doubly linked list.
One word of warning: any method or variable called or accessed by a
static method must also be declared as a static. If your design correctly
discusses why this is the case, you will earn the elusive 10th point. Once
you have read all of the words in, you can print out the number of elements
in your list.
Next, you should sort the words in the list. You may use any of
the sorts we have discussed. Whichever sort you choose, be sure to
discuss the expected running time in big-O notation. Correctly implementing
an O(nlogn) sort is also worth the "something extra" point.
Next, you should run through the list and remove duplicate entries. Once
you have done that, you should print out how many elements remain in the
list. That's the number of unique words.
To test your program, I have provided about 1/5
of Homer's Odyssey as a test case. This file has 23,156 words,
18,469 of which are duplicates, which means that only 4,687 of them are unique.
FYI, my program outputs the following:
After you have implemented your program, time how long it takes using
the unix "time" utility. All you have to do is put the word time in
front of the word "java" when you go to run your program, like this (assuming
you call your program CountWords):
time java CountWords odyssey.txt
You will get output that looks like this:
39.170u 0.040s 0:39.33 99.6%
0+0k 0+0io 1665pf+0w
The sum of the first two numbers is how much CPU time your program took. In this case, my implementation using quicksort took 39.21 seconds running on my linux desktop machine. The third number is how much real time your program took. Mine took 39.33 second. This is slightly longer than 39.21 seconds because the computer was doing some other stuff besides just running my program (like reading the text file from the hard drive).
Extra Credit (4 points possible): Reimplement the assignment using binary search
trees. My binary search tree code is available in the java directory.
Grading: Design exists: 1 point, Design and Code exist: 2 points, Design exists
and Code compiles and looks like it should work: 3 points, Design exists and
Code works: 4 points.
Your program must be accompanied by a Makefile and an ASCII Design document (.txt format) that shows:
Your program must also contain clear and descriptive variable names and comments, and good formatting that enhances the clarity of the program.
REMEMBER: Do not submit object files, assembler files, or executables. Every file in the submit directory that could be generated automatically by the compiler or assembler will result in a 5 point deduction from your programming assignment grade.
sbrandt@cse.ucsc.edu |