Kip Turner: CMPS 161: Winter 2012: Final Project

Terms:

Kanji - The Japanese name for ideographic characters of Chinese origin. With around 6,000 characters in total, the kanji system is very difficult to master, even for native speakers of the language.
Stroke Count - The number of strokes it takes to draw a kanji. This is treated here as a measure of how complex the kanji is to learn.
Kana - The Japanese syllabary, used to spell words phonetically. Every kanji has a phonetic spelling that can be represented in kana. All kana take an equivalent duration to pronounce, making Japanese a moraic language.
Zipf's law - An empirical law stating that the frequency of a word is inversely related to its rank in the frequency table. George Zipf further postulated that language is spoken according to the Principle of Least Effort. Essentially, short words dominate over longer words, and longer words are spoken extremely rarely by comparison.
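
In its usual idealized form, Zipf's law can be written as below, where f(r) is the frequency of the word with rank r and s is an exponent close to 1. The symbols f, r, and s are introduced here for illustration and are not taken from the project itself.

```latex
f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1
```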

The Concept:

Zipf's law implies the Principle of Least Effort when it comes to spoken language: universally, spoken language has evolved so that the most commonly used words are the shortest. Written language is another matter. In most languages, the written length of a word is roughly proportional to its spoken length. Some Asian languages differ in this regard: the written form for a single syllable (or mora) can be very complex and time-consuming to write. Thus, I would like to graphically represent how Zipf's law extends to written Japanese and its writing system.
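
One rough way to check whether a set of frequency counts behaves in a Zipf-like way is to rank the counts from most to least frequent and fit a straight line to log(frequency) against log(rank). A slope near -1 corresponds to frequency being inversely proportional to rank. The following is a minimal sketch of that check, not the project's actual code; the class name ZipfSlope and the sample counts are made up for illustration.

```java
import java.util.Arrays;

/** Sketch: estimate the log-log slope of frequency against rank. */
final class ZipfSlope {

    /**
     * Sorts the counts in descending order, assigns ranks 1..n, and returns the
     * least-squares slope of log(frequency) against log(rank).
     * Requires at least two counts, all of them positive.
     */
    static double slope(double[] counts) {
        if (counts.length < 2) {
            throw new IllegalArgumentException("need at least two counts");
        }
        double[] sorted = counts.clone();
        Arrays.sort(sorted); // ascending; read from the end for descending order
        int n = sorted.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int rank = 1; rank <= n; rank++) {
            double freq = sorted[n - rank];
            if (freq <= 0) {
                throw new IllegalArgumentException("counts must be positive");
            }
            double x = Math.log(rank);
            double y = Math.log(freq);
            sumX += x;
            sumY += y;
            sumXY += x * y;
            sumXX += x * x;
        }
        // Standard least-squares slope for the points (x, y).
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    public static void main(String[] args) {
        // Made-up counts that follow frequency = 1000 / rank exactly.
        double[] counts = new double[10];
        for (int r = 1; r <= 10; r++) {
            counts[r - 1] = 1000.0 / r;
        }
        System.out.println("estimated slope: " + slope(counts));
    }
}
```

For the written-language question, the same idea could be applied with stroke count in place of word length, but how the project actually relates the two is not specified here.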

Implementation Stuff

The data is stored in an SQL database. The program queries a static database manager to insert and retrieve data from the database. Once the data is retrieved, data visualization techniques are applied to it to create scatter plots. Renderers describe how the data should be rendered.
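
The renderer idea can be sketched as a small interface that only describes how an item should look and leaves the drawing to the plot. This is an illustrative guess at the pattern, not the project's code: the names RendererSketch, Marker, ItemRenderer, and StrokeCountRenderer are hypothetical, and the size and colour rules are arbitrary.

```java
/** Sketch: a renderer describes a mark; something else draws it. */
final class RendererSketch {

    /** Plain description of one scatter-plot mark. */
    static final class Marker {
        final int radius; // radius in pixels
        final int rgb;    // packed 0xRRGGBB colour

        Marker(int radius, int rgb) {
            this.radius = radius;
            this.rgb = rgb;
        }
    }

    /** Maps a data item of type T to the description of its mark. */
    interface ItemRenderer<T> {
        Marker markerFor(T item);
    }

    /** Example renderer: more strokes give a larger, redder mark (arbitrary rule). */
    static final class StrokeCountRenderer implements ItemRenderer<Integer> {
        @Override
        public Marker markerFor(Integer strokeCount) {
            int radius = 2 + strokeCount / 5;           // integer division
            int red = Math.min(255, strokeCount * 10);  // clamp to one byte
            return new Marker(radius, red << 16);       // red channel only
        }
    }

    public static void main(String[] args) {
        Marker m = new StrokeCountRenderer().markerFor(20);
        System.out.println("radius=" + m.radius + " rgb=" + Integer.toHexString(m.rgb));
    }
}
```

Keeping the description separate from the drawing means a different renderer can be swapped in without touching the scatter plot itself.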

Readme stuff

To compile the program: Open the project in Eclipse and run a build. This ensures that the required dependencies (external .jar files) are included in the compile.
Running the program:

It should be as simple as clicking the .jar file.
It is IMPORTANT that the /database folder is in the same directory as the running executable.
The database files inside the database folder need to be named jpdb.
If the database is not in the correct file path, the visualization will crash.
If recompiling the project, make sure to copy the /database folder from the .jar file into the folder that is running the java executable.
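
Because a missing database folder crashes the visualization, a start-up check along these lines could report the problem more clearly. This is only a sketch under stated assumptions: the class DatabaseCheck and its method are hypothetical, it assumes the program is launched from the directory containing the .jar, and it checks only for the folder, not for the jpdb files inside it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: verify the database folder exists before the visualization starts. */
final class DatabaseCheck {

    /** Folder name taken from the instructions above. */
    static final String DATABASE_FOLDER = "database";

    /** Returns true if a "database" directory sits directly inside baseDir. */
    static boolean databaseFolderPresent(Path baseDir) {
        return Files.isDirectory(baseDir.resolve(DATABASE_FOLDER));
    }

    public static void main(String[] args) {
        // Assumption: the working directory is the one that holds the .jar.
        Path workingDir = Paths.get("").toAbsolutePath();
        if (databaseFolderPresent(workingDir)) {
            System.out.println("Found database folder in " + workingDir);
        } else {
            System.err.println("Missing " + DATABASE_FOLDER + " folder in " + workingDir);
        }
    }
}
```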

Using the program: The applet has three main buttons. Each creates its respective scatter plot in a separate JFrame.

Once the scatter plots have launched, the user can change which characteristic of the data is mapped to each axis using the top toolbar.
This is somewhat analogous to having a scatter plot matrix.
Hovering over a data item in the scatter plot will show information about the item.
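
Remapping characteristics to axes can be sketched as choosing one value extractor per axis, so that each pair of choices gives one cell of a scatter plot matrix. The code below is a hypothetical illustration, not the project's implementation: the KanjiItem fields, the Characteristic names, and the sample values are all assumptions.

```java
import java.util.function.ToDoubleFunction;

/** Sketch: map a data item to plot coordinates for the selected axes. */
final class AxisMappingSketch {

    /** Hypothetical data item; the real program's fields are not documented here. */
    static final class KanjiItem {
        final String label;
        final int strokeCount;
        final int frequencyRank;

        KanjiItem(String label, int strokeCount, int frequencyRank) {
            this.label = label;
            this.strokeCount = strokeCount;
            this.frequencyRank = frequencyRank;
        }
    }

    /** Characteristics a toolbar could offer for either axis. */
    enum Characteristic {
        STROKE_COUNT(item -> item.strokeCount),
        FREQUENCY_RANK(item -> item.frequencyRank);

        private final ToDoubleFunction<KanjiItem> extractor;

        Characteristic(ToDoubleFunction<KanjiItem> extractor) {
            this.extractor = extractor;
        }

        double extract(KanjiItem item) {
            return extractor.applyAsDouble(item);
        }
    }

    /** Returns the {x, y} data coordinates of an item for the chosen axes. */
    static double[] plotPosition(KanjiItem item, Characteristic xAxis, Characteristic yAxis) {
        return new double[] { xAxis.extract(item), yAxis.extract(item) };
    }

    public static void main(String[] args) {
        // Illustrative values only, not read from the project's database.
        KanjiItem item = new KanjiItem("example", 4, 1);
        double[] p = plotPosition(item, Characteristic.STROKE_COUNT, Characteristic.FREQUENCY_RANK);
        System.out.println(item.label + " -> x=" + p[0] + ", y=" + p[1]);
    }
}
```

Swapping the two arguments swaps the axes, which is what changing a toolbar selection would amount to in this sketch.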

Notes: 'Render 3D Scatter' is particularly slow because it executes 2500 SQL queries. The visualization is also not very insightful, so it should be used last.

Using Zipf's Law to Analyze Japanese Kanji Frequency

Looking at research from: