Kip Turner: CMPS 161: Winter 2012: Project Proposal

Terms:

Kanji - The Japanese name for ideographic characters of Chinese origin. Totaling around 6,000 characters, the kanji system is very difficult to master, even for native speakers of the language.
Stroke Count - The number of strokes it takes to draw a kanji. I use it as a measure of how complex the kanji is to learn.
Kana - The Japanese syllabary, the phonetic way of spelling words. Every kanji has a phonetic spelling that can be written in kana. All kana take an equivalent duration to sound, making Japanese a moraic language.
Zipf's law - An empirical law stating that the frequency of a word is inversely related to its rank in the frequency table. George Zipf further postulated that language is spoken according to the Principle of Least Effort. Essentially, short words dominate over longer words, and longer words are spoken far more rarely than shorter ones.

The Concept:

Zipf's law implies the Principle of Least Effort when it comes to spoken language: across languages, the most commonly used words tend to be the shortest. Written language is another matter. In most languages, a word's written length is directly proportional to its spoken length. Some Asian languages differ in this regard, because the written form for a single syllable (or mora) can be very complex and time consuming to write. Thus, I would like to graphically represent how Zipf's law extends to written Japanese and its writing system.

The Implementation:

Legend:

Italics indicate the method I am using to display a particular dimension of information.
Bold indicates what that particular dimension is.

For the y-axis in my visualization, I will represent the Zipf complexity (how long the word takes to speak) of the kanji. To estimate the Zipf complexity of a kanji:

I will first count how many characters each word takes when spelled in the Japanese syllabary. The Japanese syllabary is inherently moraic, meaning that the number of characters is directly proportional to how long the word takes to speak.
After finding the phonetic length of each word, I can compute, for every kanji, the average phonetic length of the words it is used in.
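
The two steps above can be sketched in Python roughly as follows. This is only a sketch under assumptions: the `readings` mapping (written word to kana spelling), the helper names `is_kanji` and `zipf_complexity`, and the sample words are all hypothetical, and the kana length is a plain character count (small kana such as ゃ, which merge with the preceding kana into one mora, are not treated specially).

```python
from collections import defaultdict


def is_kanji(ch):
    # Assumption: treat the CJK Unified Ideographs block as "kanji".
    return "\u4e00" <= ch <= "\u9fff"


def zipf_complexity(readings):
    """Average kana length of the words each kanji appears in.

    readings: dict mapping a word (as written) to its kana spelling.
    """
    lengths = defaultdict(list)
    for word, kana in readings.items():
        # Count each kanji once per word, even if it repeats in that word.
        for ch in set(word):
            if is_kanji(ch):
                lengths[ch].append(len(kana))
    return {k: sum(v) / len(v) for k, v in lengths.items()}


# Hypothetical sample data, for illustration only.
sample = {
    "日本": "にほん",        # 3 kana
    "日曜日": "にちようび",  # 5 kana
    "本": "ほん",            # 2 kana
}
complexity = zipf_complexity(sample)
```

With this sample, 日 appears in a 3-kana word and a 5-kana word, so its estimated complexity is their average.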

The x-axis is the Newspaper Frequency Ranking of the kanji character, where 1 is the most frequently appearing kanji in newspapers. This will be used as the metric for how often a kanji is used in the language.

The kanji themselves will be plotted on the graph as bubbles, each containing its character. The size of a bubble will correspond to the sum of the normalized frequencies of all the words that contain that kanji. This shows the total use of that particular kanji.
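
One possible way to compute the bubble size is sketched below. It assumes that "normalized frequency" means a word's count divided by the total count of all words in the corpus, which is one interpretation among several. The function name `kanji_usage` and the sample counts are hypothetical.

```python
from collections import defaultdict


def kanji_usage(word_freqs):
    """Sum, for each kanji, the normalized frequency of every word containing it.

    word_freqs: dict mapping a word to its raw corpus count.
    """
    total = sum(word_freqs.values())
    usage = defaultdict(float)
    for word, count in word_freqs.items():
        share = count / total  # assumed normalization: share of all word counts
        for ch in set(word):
            # Assumption: CJK Unified Ideographs block = kanji.
            if "\u4e00" <= ch <= "\u9fff":
                usage[ch] += share
    return dict(usage)


# Hypothetical counts, for illustration only.
counts = {"日本": 50, "日曜日": 30, "本": 20}
usage = kanji_usage(counts)
```

Note that because a word can contain several kanji, these per-kanji totals do not sum to 1 across all kanji.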

The tint of the bubble will correspond to the number of strokes in the kanji. This will show how difficult the kanji is for practitioners of the language to learn and remember. I will use tint instead of color because the values need to be compared relative to one another, and a perceptual ordering cannot easily be achieved by mapping values to different colors.
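
A minimal sketch of such a tint mapping is shown below, using a single fixed hue and varying only the lightness. The function name `stroke_tint`, the hue, the lightness range, and the default stroke bounds are all illustrative assumptions, not settled design choices.

```python
import colorsys


def stroke_tint(strokes, min_strokes=1, max_strokes=29, hue=0.6):
    """Return an (r, g, b) tuple in 0..1; more strokes gives a darker tint.

    The stroke bounds and hue are assumed defaults for illustration.
    """
    # Clamp the stroke count, then scale it to the range 0..1.
    s = min(max(strokes, min_strokes), max_strokes)
    t = (s - min_strokes) / (max_strokes - min_strokes)
    # Lightness runs from 0.9 (few strokes) down to 0.3 (many strokes).
    lightness = 0.9 - 0.6 * t
    return colorsys.hls_to_rgb(hue, lightness, 1.0)


light = stroke_tint(1)   # simplest kanji: palest tint
dark = stroke_tint(29)   # most complex kanji in the assumed range: darkest tint
```

Keeping the hue constant means that two bubbles can be ordered by eye from lighter to darker, which is the relative comparison the stroke count needs.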

Timeline:

2/13 Parse Kanji, Newspaper Frequency Rank
2/20 Parse corpus of words and their respective frequencies into SQL database
2/22 Compute Zipf complexity by averaging lengths of words that include the kanji.
Extra: Refine the Zipf complexity by translating the kanji characters into their kana counterparts. This is done using a dictionary file that already maps kanji spellings to kana.
2/25 Compute the cumulative frequency of words that contain the kanji
Extra: Shade by stroke count of the kanji
3/5 Render Graph to Screen
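
As a rough illustration of the 2/20 and 2/25 steps, the sketch below loads word rows into a database and sums the frequencies of words containing a given kanji. It assumes SQLite (via Python's standard `sqlite3` module) as the SQL database; the table layout `words(word, kana, freq)`, the function names, and the sample rows are hypothetical.

```python
import sqlite3


def load_corpus(rows):
    """Load (word, kana, freq) rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE words (word TEXT PRIMARY KEY, kana TEXT, freq INTEGER)"
    )
    conn.executemany("INSERT INTO words VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def cumulative_frequency(conn, kanji):
    """Total frequency of all words whose written form contains the kanji."""
    row = conn.execute(
        "SELECT COALESCE(SUM(freq), 0) FROM words WHERE word LIKE ?",
        (f"%{kanji}%",),
    ).fetchone()
    return row[0]


# Hypothetical rows, for illustration only.
conn = load_corpus([
    ("日本", "にほん", 50),
    ("日曜日", "にちようび", 30),
    ("本", "ほん", 20),
])
total_for_hi = cumulative_frequency(conn, "日")
```

A real corpus would be read from the data files rather than from an inline list, and a substring scan with LIKE may be too slow at scale; a separate table linking each word to its kanji is one alternative.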

Using Zipf's Law to Analyze Japanese Kanji Frequency

Data Files

Looking at research from: