An N-gram Based Cheat Checker

Overview

A cheat checker is a program that compares programs to one another and raises a red flag when two of them are similar. What counts as similar depends on the algorithm the cheat checker uses to compare files. This cheat checker preprocesses the input files and looks for sequences of characters (n-grams) that they share, weighting a sequence more heavily the fewer files it occurs in. It then compares the resulting profiles of every pair of files and ranks the pairs by similarity. The output is a list of pairs, with the most similar pairs listed first.
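As a rough illustration of the approach described above, here is a minimal Python sketch. It is not the cheat checker's actual code, which is written in Perl and isn't shown on this page. The normalization step, the log-based weighting, the cosine similarity measure, and all function names are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter
from itertools import combinations


def normalize(text):
    """Hypothetical preprocessing: lowercase and drop all whitespace."""
    return re.sub(r"\s+", "", text.lower())


def ngram_counts(text, n=5):
    """Count every run of n consecutive characters in the normalized text."""
    s = normalize(text)
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def build_profiles(files, n=5):
    """Map each file name to a weighted n-gram profile.

    Assumed weighting: count * log(N / df), so an n-gram that occurs in
    fewer files counts more, and one that occurs in every file counts 0.
    """
    counts = {name: ngram_counts(text, n) for name, text in files.items()}
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    total = len(files)
    return {
        name: {g: k * math.log(total / doc_freq[g]) for g, k in c.items()}
        for name, c in counts.items()
    }


def cosine(p, q):
    """Cosine similarity of two sparse profiles (0.0 if either has no weight)."""
    dot = sum(w * q.get(g, 0.0) for g, w in p.items())
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0


def rank_pairs(files, n=5):
    """Return (similarity, name_a, name_b) for every pair, most similar first."""
    profiles = build_profiles(files, n)
    scored = [(cosine(profiles[a], profiles[b]), a, b)
              for a, b in combinations(sorted(profiles), 2)]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    sample = {
        "alice.c": "int total = 0; for (i = 0; i < n; i++) total += a[i];",
        "bob.c": "int total = 0;  for (i = 0; i < n; i++)  total += a[i];",
        "carol.c": "while (fgets(buf, sizeof buf, fp)) puts(buf);",
    }
    for score, a, b in rank_pairs(sample):
        print(f"{score:.3f}  {a}  {b}")
```

In this sketch, two files that differ only in whitespace rank at the top of the list, while n-grams shared by every file in the set (for example, instructor-supplied code) contribute nothing to any pair's score.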

The goal of the cheat checker is to identify pairs of files that are similar enough to warrant a manual check. An instructor should examine the files personally before drawing any conclusions about cheating. This program is a tool, and its output alone shouldn't be used to accuse students of cheating.

Languages

The cheat checker works on several common computer languages (C/C++, Java, Pascal), on some less common ones (Esim, a hardware simulation language developed at UMBC), and on plain English text. It should also work on text in other human languages, particularly those that use a Roman alphabet.

If you'd like to use it on a computer language not listed above, please let me know by emailing elm «at» cs·ucsc·edu. If there's enough demand, I'll add support for other computer languages where possible. Candidates include Python, Perl and Tcl. Perl in particular can be difficult to deal with because stripping its comments isn't easy.
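The remark about comments suggests that comments are removed during preprocessing. As a minimal sketch of why this takes some care even for C-style languages, the hypothetical Python function below removes `//` and `/* */` comments while leaving comment markers inside string and character literals alone. It is an illustration under those assumptions, not the cheat checker's own preprocessing.

```python
import re

# Literals are matched first so that comment markers inside them are kept.
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"     # character literal
    r"|//[^\n]*"              # line comment, up to the end of the line
    r"|/\*.*?\*/",            # block comment (non-greedy, may span lines)
    re.DOTALL,
)


def strip_c_comments(source):
    """Replace each C/C++ comment with one space; keep literals unchanged."""
    def replace(match):
        token = match.group(0)
        # Only comments start with "/"; literals start with a quote.
        return " " if token.startswith("/") else token
    return _TOKEN.sub(replace, source)


if __name__ == "__main__":
    print(strip_c_comments('s = "// kept"; /* removed */ t = 1; // removed'))
```

A language such as Perl is harder to handle this way because `#` can also appear inside regular expressions, quote-like operators and here-documents, so a single pattern like the one above would not be enough.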

Limitations

Like any other mechanism that tries to detect similarity between two documents, the cheat checker isn't perfect. As a result, you should manually check the first few results flagged by the program as being similar.

Note, too, that similarity levels vary with the type of document. If you're comparing student projects where 90% of the code was supplied by the instructor, it's normal (and expected) that most programs will be similar to one another. Thousand-line programs written entirely by different groups, however, ought not to be similar at all, so a similarity level that would be innocuous in the first case could indicate large-scale copying in the second. Similarity is relative to the set of programs being compared, so scores from one set shouldn't be compared with scores from another.

Performance

This program is written in Perl, so it's not the fastest thing around. However, it can handle up to about 200 assignments of 40 to 50 KB each, as long as your machine has enough memory and you're willing to wait a few minutes. If there's enough interest, I can rewrite the code in C++ to make it run faster, at the expense of making it harder to improve and maintain.

How do I run the cheat checker?

Setting things up

  1. Before running the cheat checker, you'll need to download it and set up a few Perl modules.
  2. Once you've got everything installed and set up, you'll need the password to the cheat checker. Please don't distribute this password; I'd prefer that people get it directly from me.

Running the cheat checker on a set of files

Now you're ready to run the cheat checker! Just follow these simple steps:

  1. Create a single file for each student's assignment. If there are multiple files per student, use cat (or something similar) to create a single file. Don't worry about the order in which the multiple files appear in the single file—the cheat checker is relatively immune to switching the order of large chunks of code or text. You may want to use a simple script to do this; a sample script is available.
  2. Run the cheat checker as follows:
        cheatchk password options file1 file2 ... fileN
    Available options are:
      -language language
          Selects the language the input files are in. Choices are: C (also handles C++), Java, Pascal, esim, and English / text. Default is C.
      -c, -java, -pascal, -esim, -english, -text, -html
          Shortcuts for specifying a particular language.
      -ngramlen length
          Length of n-grams to use. Default is 5, but you can also try values from 3–7.
      -top num
          Only print the top num similarity results.
      -skipps
          Only useful in -text mode. Skips PostScript files (similarity on them is pretty tough to do...).
      -help
          Prints a help message.
  3. Hand check the pairs of documents that come out near the top of the list. Based on your assessment, decide whether the assignments are similar enough to warrant taking action.
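Step 1 above can be automated. The sample script mentioned there isn't reproduced on this page, so the following is only a minimal Python sketch under stated assumptions: each student's files sit in a separate subdirectory of one submissions directory, and the combined file is named after that subdirectory. The function name, parameters and directory layout are all hypothetical.

```python
from pathlib import Path


def combine_submissions(submissions_dir, output_dir, suffix=".c"):
    """Write one combined file per student into output_dir.

    Assumes submissions_dir/<student>/ holds that student's files and that
    output_dir is not located inside submissions_dir. Only files ending in
    `suffix` are included. Returns the list of files written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    students = sorted(p for p in Path(submissions_dir).iterdir() if p.is_dir())
    for student in students:
        parts = sorted(f for f in student.rglob("*" + suffix) if f.is_file())
        if not parts:
            continue  # nothing to combine for this student
        target = out / (student.name + suffix)
        with target.open("w", encoding="utf-8") as dst:
            for part in parts:
                # Undecodable bytes are replaced rather than aborting the run.
                dst.write(part.read_text(encoding="utf-8", errors="replace"))
                dst.write("\n")
        written.append(target)
    return written


if __name__ == "__main__":
    for path in combine_submissions("submissions", "combined"):
        print(path)
```

The files written to the output directory can then be passed to cheatchk as file1 ... fileN. The order in which a student's files are concatenated shouldn't matter much, for the reason given in step 1.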

How can I get a copy?

You can download a copy of the cheat checker Perl script right here!

Why does this file I downloaded not look like a Perl program?

The cheat checker is distributed in encrypted form to prevent students from gaining access to its code. It's designed to run encrypted, with the password entered each time it's run. To support this, you need several Perl modules:

The first three modules are available from CPAN (the Comprehensive Perl Archive Network). The fourth is a filter module that decrypts the executable on the fly, using the first three modules. You may already have some of these modules installed on your system; if so, no need to reinstall them.

Where can I get a password to run the program?

I'd be happy to send a password to any instructor who wants to run the program — just send me email (elm «at» cs·ucsc·edu), and your password will be on its way within a day or two.

Feedback

I'm interested in hearing what you think of this program, and in any suggestions you might have for improving it. Please send comments to elm «at» cs·ucsc·edu.