Today, there are only two main tasks, EPGYAuthor and Final Projects, with the optional ImageClusterer on the side. Spend your time wisely! If you are not interested in EPGYAuthor and would rather just work on your final project, feel free; just make sure that you take a peek at EPGYAuthor and get a sense of what is going on. Also, if you really want to catch up on previous labs, feel free to spend some time on them, but don’t get too caught up in older projects at the cost of not working on your final project!
Before getting started, make sure that you have updated your submissions directory fully and you have as much of your work in it as possible.
Remember the demos we saw of using clustering to find image features? We are going to do that ourselves! Get excited to implement k-means clustering! Download the starter code here. Read through the project and look at the part you need to implement, the clusterPixels() method. The general steps are outlined in the comments, but as far as the actual implementation is concerned, you need to think carefully about a few parts, especially how you will keep track of which pixels wind up clustered to which centers. Remember, this is totally optional! If you would rather devote your time to your final project, work on that instead.
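To give a concrete picture of the algorithm, here is a minimal k-means sketch over one-dimensional grayscale values. The starter code’s actual types and the clusterPixels() signature may well differ; the method shape, the sample pixels, and the initial centers below are all our own assumptions, so treat this as an illustration of the idea rather than a drop-in implementation.

```java
import java.util.*;

// A minimal k-means sketch over grayscale pixel values (0-255).
// The real starter code's API is assumed here, not known.
public class KMeansSketch {
    // Returns, for each pixel, the index of the center it ends up assigned to.
    static int[] clusterPixels(int[] pixels, double[] centers, int iterations) {
        int[] assignment = new int[pixels.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: attach each pixel to its nearest center.
            for (int i = 0; i < pixels.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(pixels[i] - centers[c]) < Math.abs(pixels[i] - centers[best])) {
                        best = c;
                    }
                }
                assignment[i] = best;
            }
            // Update step: move each center to the mean of the pixels assigned to it.
            for (int c = 0; c < centers.length; c++) {
                long sum = 0;
                int count = 0;
                for (int i = 0; i < pixels.length; i++) {
                    if (assignment[i] == c) { sum += pixels[i]; count++; }
                }
                if (count > 0) centers[c] = (double) sum / count;
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        int[] pixels = {10, 12, 11, 200, 205, 198};
        double[] centers = {0.0, 255.0}; // crude initial guesses
        int[] labels = clusterPixels(pixels, centers, 5);
        System.out.println(Arrays.toString(labels)); // prints [0, 0, 0, 1, 1, 1]
    }
}
```

Notice that this is exactly the bookkeeping question the comments warn about: the `assignment` array is one simple way to track which pixels wind up clustered to which centers.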
Python example: http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html
(Deliverable) When you are done, take SCREENSHOTS! Post on blog.
(Read the entire part before starting.)
In this short project, we’re going to give a computer a source text or book, and then have the computer generate some text that’s similar to it. For example, if you give it the complete works of Shakespeare, this program should spit out text that looks like it could have been written by Shakespeare, as long as you don’t read too closely.
Sounds hard, right?
Not if you use the correct model! We learned in lecture that supervised learning has two main steps: feature/model selection, and the actual learning (populating our model). If we select a great model, the learning step should be pretty simple. In this project, we’re going to build a Markov model, then use a trivial learning technique: just pick each next character based on the probability that it appeared after the preceding characters in the source text.
The model we’re going to use here is called the Markov model. Markov models were created by Andrey Markov, and were subsequently used by the computer scientist Claude Shannon in 1948 to model language. Simply put, a Markov model derives the current state from some probability distribution over the state(s) that immediately precede it. Markov models are widely used in machine learning, data mining, and statistics.
So, how can we use Markov models in modeling English? As we discussed in lecture, the distribution of letters in English is not random. Certain letters are much more likely to appear after certain sequences of letters than others. For example, the letters ‘e’ and ‘o’ are much more likely to follow the sequence of letters “indicat” than, say, ‘y’ or ‘b’. We can therefore build a Markov model to model which letters follow other sequences of letters, given a text, and then use that model to generate some new text.
Markov models have a notion of “order,” which is the number of previous states the model uses to generate the current state. For example, an order 0 Markov model would not use any previous states, so it would just use the distribution of single characters in a text to generate a new text. With an order 1 Markov model, we would be representing common bigrams, or two-letter sequences, since we would be picking letters that commonly follow other single letters. An order 2 Markov model would then look at each 2-character sequence in the text, and record which letters are likely to follow each 2-character sequence. For example, in English, the 2-character sequence “tw” is much more likely to be followed by a vowel than a consonant.
To generate text, we could just take that model, start with a random letter or letters, and then use the Markov model to generate letters one at a time, based on the preceding letters. With order 0, you will just get a bunch of nonsense. With order 1, you’ll get sequences of letters that may actually look like text in some weird language. With order 2, you can definitely see traces of English; it could even read as some strange form of long-lost English. At order 4, words start becoming apparent, but the sentence structure and the writing style still differ from the original text. Between orders 5 and 7, you definitely start seeing the original style. You can go longer, but if you pick an order that’s too long, it becomes increasingly unlikely that your source text will have many examples of each long letter sequence, so you risk copying the source text exactly instead of generating relatively novel text.
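The model-building and generation described above can be sketched as follows. This is only an illustration under assumed names (the class, the tiny sample text, and the fixed ORDER are our own, not part of any starter code): each k-character state maps to the list of characters observed after it, and a uniform random pick from that list automatically matches the observed probabilities, since repeats are stored as duplicates.

```java
import java.util.*;

// Sketch of a k-order character-level Markov model.
public class MarkovSketch {
    // Learning: for every k-character sequence in the text, record the
    // character that follows it. Duplicates in the list make a uniform
    // random pick reproduce the observed probability distribution.
    static Map<String, List<Character>> buildModel(String text, int order) {
        Map<String, List<Character>> followers = new HashMap<>();
        for (int i = 0; i + order < text.length(); i++) {
            String state = text.substring(i, i + order);
            followers.computeIfAbsent(state, s -> new ArrayList<>())
                     .add(text.charAt(i + order));
        }
        return followers;
    }

    public static void main(String[] args) {
        final int ORDER = 2;
        String text = "the theory of the thing";
        Map<String, List<Character>> model = buildModel(text, ORDER);

        // Generation: seed with the first k characters, then repeatedly
        // append a randomly chosen follower of the current k-character state.
        Random rng = new Random();
        StringBuilder out = new StringBuilder(text.substring(0, ORDER));
        for (int i = 0; i < 40; i++) {
            String state = out.substring(out.length() - ORDER);
            List<Character> next = model.get(state);
            if (next == null) break; // state only ever seen at the very end
            out.append(next.get(rng.nextInt(next.size())));
        }
        System.out.println(out); // e.g. "the theory of the thing the theo..."
    }
}
```

With a real source text you would seed from a random position rather than the start, but the structure of the loop is the same.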
So, your mission, if you choose to accept it, is to build a k-order Markov model for some source text, then generate random text output that uses this model. Take the following steps:
For each of the following, declare data structure(s) to effectively store the information. Give some brief reasoning for your choice of data structures, and state any assumptions. Remember that the data will be used! Make sure to think of common ways that the data could be accessed in addition to how the data is stored. If you can, write out how you might declare your data structure in Java. Make sure to give your variables appropriate names.
(Extension) Write out a full class definition for the data structure. These progress approximately from easier to harder, and some might require creative use of data structures (or, if you want, data structures you have seen outside of this class). An example: the locations of cities around the world:
Possible data structure: Mapping from city name to coordinate pair (a pair of doubles).
Map<String,ArrayList<Double>> cityCoordinates = new HashMap<String,ArrayList<Double>>();
Extension: This is just an example of how you might declare a Java class to represent this information. You do not need to do this for each of the examples.
public class City {
    private double latitude, longitude;
    private String name;

    // maybe even include a constructor?
    public City(double latitude, double longitude, String name) {
        this.latitude = latitude;
        this.longitude = longitude;
        this.name = name;
    }

    // or getters and setters
    public void setName(String name) { this.name = name; }
    public String getName() { return this.name; }
}

// and now the full cities class!
public class Cities {
    private Map<String, City> cities;
    // you can imagine the rest!
}
Another possible (maybe less good?) data structure: a list of strings, each pairing a city name with precise directions for traveling to that city from Stanford.
ArrayList<String> cityDirections = new ArrayList<String>();