Using Zipf's Law to Analyze Japanese Kanji Frequency

CMPS 161: Winter 2012

Kip Turner


Links

Visualizing Zipf's Law in Japanese Paper
Executable
Source code

Terms:

The Concept:

Zipf's law is often explained by the Principle of Least Effort: spoken language has evolved so that the most commonly used words tend to be the shortest. Written language is another matter. In most languages, the written length of a word is roughly proportional to its spoken length. Asian languages differ in this regard: the written form of a single syllable (or mora) can be very complex and time-consuming to write. Thus, I would like to graphically represent how Zipf's Law extends to written Japanese and its writing system.
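As a rough illustration of the rank/frequency relationship behind Zipf's Law, the sketch below counts the kanji in a short sample string and lists them by rank. It is not taken from the project source: the class name KanjiRankFrequency, its methods, and the sample sentence are assumptions made for this example, and kanji are identified simply as code points in the Unicode Han script. Under an idealized Zipf distribution with exponent 1, the product of rank and count would stay roughly constant; a sample this small will not show that.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original project.
public class KanjiRankFrequency {

    /** Counts how often each kanji (Han-script code point) occurs in the text. */
    public static Map<Integer, Integer> countKanji(String text) {
        Map<Integer, Integer> counts = new HashMap<>();
        text.codePoints()
            .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
            .forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }

    /** Sorts entries by descending count (ties by code point); index 0 is rank 1. */
    public static List<Map.Entry<Integer, Integer>> rank(Map<Integer, Integer> counts) {
        List<Map.Entry<Integer, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : Integer.compare(a.getKey(), b.getKey());
        });
        return ranked;
    }

    public static void main(String[] args) {
        // Sample sentence "nihongo no hon wo nihon de yomu hi", written with
        // Unicode escapes so the source file stays ASCII-only.
        String sample = "\u65E5\u672C\u8A9E\u306E\u672C\u3092\u65E5\u672C\u3067\u8AAD\u3080\u65E5";
        List<Map.Entry<Integer, Integer>> ranked = rank(countKanji(sample));
        for (int i = 0; i < ranked.size(); i++) {
            int rank = i + 1;
            int count = ranked.get(i).getValue();
            String kanji = new String(Character.toChars(ranked.get(i).getKey()));
            System.out.printf("rank %d  %s  count %d  rank*count %d%n",
                    rank, kanji, count, rank * count);
        }
    }
}
```

The hiragana in the sample are skipped by the Han-script filter, so only the kanji are ranked.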

Implementation Stuff

The data is stored in an SQL database. The program queries a static database manager to insert and retrieve data from the database. Once the data is retrieved, the program applies data visualization techniques to it to create scatter plots. Renderers are used to describe how the data should be rendered.
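One step a scatter-plot renderer has to perform is mapping a data value to a position along an axis. The sketch below shows one way to do that on a base-10 logarithmic axis, which is a common choice for rank/frequency data. Whether the project's renderers use logarithmic axes is not stated here, so treat that as an assumption; the class LogLogScale and its toPixel method are hypothetical names, not the project's API.

```java
// Hypothetical helper, not part of the original project.
public class LogLogScale {

    /**
     * Maps a positive value onto the pixel range [0, pixels - 1] along a
     * base-10 logarithmic axis that spans [min, max]. Values outside the
     * axis range are clamped to the nearest end.
     */
    public static int toPixel(double value, double min, double max, int pixels) {
        if (value <= 0 || min <= 0 || max <= min || pixels < 2) {
            throw new IllegalArgumentException("need 0 < min < max, value > 0, pixels >= 2");
        }
        double t = (Math.log10(value) - Math.log10(min))
                 / (Math.log10(max) - Math.log10(min));
        t = Math.max(0.0, Math.min(1.0, t));
        return (int) Math.round(t * (pixels - 1));
    }

    public static void main(String[] args) {
        int width = 301;
        int height = 301;
        // Idealized Zipf data (assumed for the demo): frequency = 1000 / rank.
        for (int rank = 1; rank <= 1000; rank *= 10) {
            double frequency = 1000.0 / rank;
            int x = toPixel(rank, 1, 1000, width);
            // Screen y grows downward, so flip the axis.
            int y = (height - 1) - toPixel(frequency, 1, 1000, height);
            System.out.printf("rank %d -> pixel (%d, %d)%n", rank, x, y);
        }
    }
}
```

On such axes, data that follows a power law falls along a straight line, which is why this mapping is often used when checking for Zipf-like behavior.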

Readme stuff

To compile the program: Open the project in Eclipse and run a build. This ensures that the required dependencies (external .jar files) are included in the compile.
Running the program:

  1. It should be as simple as clicking the .jar file.
  2. It is IMPORTANT that the /database folder is in the same directory as the running executable.
  3. The database files in the database folder need to be named jpdb.
  4. If the database is not in the correct file path, the visualization will crash.
  5. If recompiling the project, make sure to copy the /database folder from the .jar file into the folder from which the Java executable is run.
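Because a missing database folder crashes the visualization, it can help to check the layout before launching. The sketch below is one way to do that and is not part of the project: the class DatabaseFolderCheck is a hypothetical name, it assumes the program is started from the directory that contains the .jar (so the working directory stands in for "the directory of the running executable"), and it assumes the database files have names beginning with jpdb.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical pre-launch check, not part of the original project.
public class DatabaseFolderCheck {

    /**
     * Returns true if baseDir contains a "database" folder holding at least
     * one file whose name starts with "jpdb" (an assumed naming pattern).
     */
    public static boolean hasDatabase(Path baseDir) {
        Path dbDir = baseDir.resolve("database");
        if (!Files.isDirectory(dbDir)) {
            return false;
        }
        try (Stream<Path> files = Files.list(dbDir)) {
            return files.anyMatch(p -> p.getFileName().toString().startsWith("jpdb"));
        } catch (IOException e) {
            // Treat an unreadable folder the same as a missing one.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the working directory is the folder containing the .jar.
        Path baseDir = Paths.get("").toAbsolutePath();
        if (hasDatabase(baseDir)) {
            System.out.println("Found database files under " + baseDir.resolve("database"));
        } else {
            System.out.println("No jpdb files found under " + baseDir.resolve("database"));
        }
    }
}
```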

Using the program: The applet has three main buttons; each creates its respective scatter plot in a separate JFrame.
  1. Once the scatter plots have launched, the user can change which characteristic of the data is mapped to each axis using the top toolbar.
  2. This is somewhat analogous to having a scatter plot matrix.
  3. Hovering over a data item in the scatter plot will show information about the item.
Notes: 'Render 3D Scatter' is particularly slow because it executes 2500 SQL queries. Its visualization is also not very insightful, so it should be used last.


Video



Data Files