CMPS 161 Final Project Proposal

David Seagal
Winter 2009
dseagal@ucsc.edu


Project Description

This project attempts to visualize information from the Internet Movie Database (IMDB) using a series of directed and undirected network graphs. Each node in a graph represents either a celebrity (actor, director, etc.) or a project (movie, TV show, etc.). The size of a node indicates the "success" of that celebrity or project.

The output files generated by this project are Pajek .net files. The output was initially intended for the NodeXL extension to Microsoft Excel, but technical difficulties required a switch to Pajek's standalone graphing program.


Instructions

First, download and extract the zipped file containing all the code and data files for this program. The file is here

There are two data files: actors.list and ratings.list. Feel free to modify these files to see how the graphs change. Make sure any changes you make still adhere to the format of the data.

Unfortunately, time constraints prevented a proper GUI from being built, so running the program automatically generates four .net files, each one showing a different feature. To change which celebrities and projects are processed, edit the source code.

To view the graph stored in a .net file, you will need to download and install Pajek. In Pajek, choose File->Network->Read and select the desired .net file. To view the graph, choose Draw->Draw. Depending on your settings, some configuration may be needed to improve the look of the graph.


Features

This program generates four types of graphs, one for each of the following queries:

Choosing a celebrity and viewing all associated projects.


Choosing a project and viewing all associated celebrities.


Choosing two celebrities and viewing all projects shared between them.


Choosing two projects and viewing all celebrities shared between them.
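
The four queries above can be sketched in Python as lookups and set intersections over two indexes. This is only an illustration, not the project's actual code: the CREDITS sample data and all function names are hypothetical, and the sketch assumes the data has already been parsed into (celebrity, project) pairs.

```python
from collections import defaultdict

# Hypothetical sample data: (celebrity, project) credit pairs.
CREDITS = [
    ("Actor A", "Movie X"),
    ("Actor A", "Movie Y"),
    ("Actor B", "Movie Y"),
    ("Actor B", "Movie Z"),
    ("Director C", "Movie Y"),
]


def build_indexes(credits):
    """Return (projects_by_celebrity, celebrities_by_project) as dicts of sets."""
    by_celebrity = defaultdict(set)
    by_project = defaultdict(set)
    for celebrity, project in credits:
        by_celebrity[celebrity].add(project)
        by_project[project].add(celebrity)
    return by_celebrity, by_project


def projects_of(by_celebrity, celebrity):
    """Graph type 1: all projects associated with one celebrity."""
    return set(by_celebrity.get(celebrity, set()))


def celebrities_of(by_project, project):
    """Graph type 2: all celebrities associated with one project."""
    return set(by_project.get(project, set()))


def shared_projects(by_celebrity, a, b):
    """Graph type 3: projects that celebrities a and b both worked on."""
    return projects_of(by_celebrity, a) & projects_of(by_celebrity, b)


def shared_celebrities(by_project, x, y):
    """Graph type 4: celebrities who appear in both project x and project y."""
    return celebrities_of(by_project, x) & celebrities_of(by_project, y)
```

Each result is a set of labels that would become the nodes of the corresponding graph, with one edge from every result node to each chosen celebrity or project.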


The shape of a node indicates its type: celebrities are drawn as triangles and projects as circles.

The size of a node indicates the "success" of a celebrity or project: the larger the node, the more successful.
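
As a rough illustration of how shape and size might be encoded in a Pajek .net file, here is a minimal Python sketch. It is not the project's actual writer. It assumes the commonly documented Pajek vertex syntax (number, quoted label, x y z coordinates, a shape keyword such as triangle or ellipse, and x_fact/y_fact size factors), which should be checked against the Pajek manual for your version. The function names, the 0 to 10 score range, and the scaling formula are all hypothetical.

```python
def vertex_line(index, label, kind, score):
    """Format one Pajek vertex line.

    kind is "celebrity" or "project"; score is an assumed success value
    in the range 0..10 that is scaled to a node size factor.
    """
    shape = "triangle" if kind == "celebrity" else "ellipse"
    factor = 1.0 + score / 5.0  # assumed scaling: 0 -> 1.0, 10 -> 3.0
    # Placeholder coordinates: every node starts at the center of the canvas.
    return (f'{index} "{label}" 0.5 0.5 0.5 {shape} '
            f'x_fact {factor:.2f} y_fact {factor:.2f}')


def to_pajek(nodes, edges, directed=False):
    """Build the text of a .net file.

    nodes: list of (label, kind, score) tuples.
    edges: list of (label, label) pairs referring to node labels.
    directed: write *Arcs (directed) instead of *Edges (undirected).
    """
    # Pajek vertex numbers start at 1.
    index = {label: i for i, (label, _, _) in enumerate(nodes, start=1)}
    lines = [f"*Vertices {len(nodes)}"]
    for label, kind, score in nodes:
        lines.append(vertex_line(index[label], label, kind, score))
    lines.append("*Arcs" if directed else "*Edges")
    for a, b in edges:
        lines.append(f"{index[a]} {index[b]} 1")  # trailing 1 is the line weight
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_nodes = [("Actor A", "celebrity", 10), ("Movie Y", "project", 5)]
    sample_edges = [("Actor A", "Movie Y")]
    print(to_pajek(sample_nodes, sample_edges), end="")
```

Because the sketch gives every node the same placeholder coordinates, the nodes will overlap until a layout is applied in Pajek's Draw window.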


Problems

The actual IMDB files were huge and difficult to parse. As a result, dummy files and a simplified parser were used. This means that the program currently does not work with actual IMDB data.

There was no time to build a GUI.

The Pajek format is frustratingly difficult to write correctly. This led to problems when attempting to implement some features.