Using Zipf's Law to Analyze Japanese Kanji Frequency

CMPS 161: Winter 2012

Kip Turner


Terms:

The Concept:

Zipf's law is commonly tied to the Principle of Least Effort in spoken language: spoken languages tend to evolve so that the most commonly used words are the shortest. Written language is another matter. In most languages, the written length of a word is roughly proportional to its spoken length. Some Asian languages, Japanese among them, differ in this regard: the written form for a single syllable (or mora) can be very complex and time-consuming to write. I would therefore like to represent graphically how Zipf's Law extends to written Japanese and its writing system.

The Implementation:

Legend:

Italics indicate the method I am using to display a particular dimension of information.
Bold indicates what that particular dimension is.

For the y-axis in my visualization, I will represent the Zipf complexity (how long the word takes to speak) of the kanji. To estimate the Zipf complexity of a kanji:

  1. I will first compute how many characters each word has when written in the Japanese syllabary. The Japanese syllabary is inherently moraic, meaning that the number of characters is directly proportional to how long the word takes to speak.
  2. After finding the phonetic length of each word, I can compute for every kanji the average phonetic length of words that it is used in.
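
The two steps above can be sketched as follows. This is only a minimal illustration, not the project's actual code: the function names, the `(written form, reading)` input format, and the sample words are assumptions, and the phonetic length is taken to be the plain character count of the reading, as step 1 describes.

```python
from collections import defaultdict

def is_kanji(ch):
    # Assumption: treat the CJK Unified Ideographs block as "kanji".
    return "\u4e00" <= ch <= "\u9fff"

def average_phonetic_length(words):
    """words: iterable of (written_form, syllabary_reading) pairs (hypothetical format)."""
    totals = defaultdict(int)   # summed phonetic length per kanji
    counts = defaultdict(int)   # number of words containing each kanji
    for written, reading in words:
        length = len(reading)   # step 1: characters in the syllabary reading
        # Count each kanji once per word, even if it appears twice in it.
        for kanji in {ch for ch in written if is_kanji(ch)}:
            totals[kanji] += length
            counts[kanji] += 1
    # Step 2: average phonetic length of the words each kanji is used in.
    return {k: totals[k] / counts[k] for k in totals}

# Hypothetical sample data, for illustration only.
sample = [("日本", "にほん"), ("日", "ひ"), ("本", "ほん"), ("学生", "がくせい")]
avg = average_phonetic_length(sample)
```

Here `avg` maps each kanji to its estimated Zipf complexity, which would be the y-coordinate of its bubble.
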
The x-axis is the Newspaper Frequency Ranking of the kanji character, where 1 is the most frequently appearing kanji in newspapers. This ranking will be used as the metric for how often a kanji is used in the language.

The kanji themselves will be plotted on the graph as bubbles, each containing its character. The size of a bubble will correspond to the sum of the normalized frequencies of all the words that contain that kanji. This shows the total use of that particular kanji.
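
As a rough sketch of the bubble-size computation, the snippet below sums normalized word frequencies per kanji. The proposal does not say how frequencies are normalized, so dividing each word count by the total count of all words is an assumption here, as are the function name and the sample counts.

```python
from collections import defaultdict

def kanji_total_use(word_counts):
    """word_counts: dict mapping a written word to its raw count (hypothetical format)."""
    total = sum(word_counts.values())
    use = defaultdict(float)
    for word, count in word_counts.items():
        freq = count / total            # assumed normalization: share of all word occurrences
        for ch in set(word):            # each kanji is credited once per word
            if "\u4e00" <= ch <= "\u9fff":   # assumption: CJK Unified Ideographs block
                use[ch] += freq
    return dict(use)

# Hypothetical counts, for illustration only.
counts = {"日本": 50, "日": 30, "本": 20}
use = kanji_total_use(counts)
```

The resulting value per kanji could then be scaled to a bubble radius or area in whatever plotting tool is used.
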

The tint of the bubble will correspond to the number of strokes in the kanji. This will indicate how difficult the kanji is for practitioners of the language to learn and remember. I will use tint rather than distinct colors because the values need to be compared relative to one another, and a perceptual ordering cannot be easily achieved by color-mapping.
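
One possible way to turn a stroke count into a tint is to interpolate linearly between white and a single base color, so that kanji with more strokes get a more saturated bubble. The sketch below assumes this mapping; the base color, the function name, and the choice of a linear scale are illustrative and not taken from the project.

```python
def stroke_tint(strokes, max_strokes, base_rgb=(0, 70, 160)):
    """Return an (r, g, b) tint: white for 0 strokes, base_rgb at max_strokes."""
    # Clamp the fraction to [0, 1] so out-of-range counts stay valid colors.
    t = min(max(strokes / max_strokes, 0.0), 1.0)
    # Blend each channel from white (255) toward the base color.
    return tuple(round(255 + (c - 255) * t) for c in base_rgb)

light = stroke_tint(5, 30)    # few strokes: close to white
dark = stroke_tint(20, 30)    # many strokes: closer to the base color
```

Because only one hue is used, the ordering from light to dark matches the ordering of stroke counts.
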

Timeline:


Data Files