YOW User Study Data: Implicit and Explicit Feedback for News Recommendation

 

This data set was originally collected by the PI at Carnegie Mellon University. It has not been shared before. Because it is related to the PIIR project and may be of interest to people doing related research, we are making it publicly available.

The data contained in this directory was collected in a user study. The goal of the study was to collect a data set from a variety of users with explicit feedback, implicit feedback, a wider range of information about documents and topics, and a heterogeneous set of documents.

 

To achieve these goals, we developed a web-based news story filtering system that continuously gathers information and recommends it to the users. The system has six major components: a news crawler, a text indexer, a database server, an adaptive filtering system, a web server, and a browser. The crawler frequently crawls about 8000 candidate RSS news feeds. RSS is a format for syndicating news and the content of news-like sites. The adaptive filtering system learns from users' explicit feedback and recommends documents to the users. An indexer, Lemur, parses each incoming news story and incrementally builds a text index of documents to facilitate the computation of the filtering system. Users log into the system daily through a special browser to read and evaluate what the system has delivered to them. The special browser captures a user's implicit and explicit feedback and sends it to the web server in real time. The information is saved in a central database and used by the filtering system to learn user preferences. The interface also provides simple search functions: users can search the articles in Yow-now's news archive by keyword to get a list of news stories, or search the news sources by keyword to get a list of RSS feeds.

 

21 paid subjects participated in the study for about 4 weeks. The subjects were otherwise not affiliated with our research. They were required to read the news for about 1 hour per day and provide explicit feedback for each page they visited. In the last week of the study, some subjects read 2 hours per day; they were encouraged but not required to do so. In total, 28 users tried the system, but only 21 were official paid subjects, of whom one worked for only 2 weeks and 20 worked for about 4 weeks.

 

We collected 7000+ feedback entries from all users. Each entry contains several different forms of evidence for a news story a user clicked. Our intention in collecting the evidence was to be representative rather than exhaustive. The evidence can be roughly classified into the following five categories:

 

Explicit User Feedback: After finishing reading a news story, a user clicked a button on the toolbar of the browser to bring up an evaluation interface. Through this interface, the user provided explicit feedback describing otherwise hidden properties of the current story: the topics the news belongs to (classes), how much the user liked the news (user_like), how relevant the news was to the chosen class(es) (relevant), how novel the news was (novel), whether the news matched the readability level of the user (readable), and whether the news was authoritative (authoritative). user_like, relevant and novel were recorded as integers ranging from 1 (least) to 5 (most). readable and authoritative were recorded as 0 or 1. A user could provide partial rather than complete explicit feedback. A user could also create new classes and choose multiple classes for one document.
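As an illustration of the value ranges described above, a single explicit feedback entry could be modeled roughly as follows. This is only a sketch: the class name ExplicitFeedback, the use of None for a rating the user skipped, and the is_complete helper are assumptions made for the example and are not taken from the data file, whose actual column layout and missing-value encoding may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExplicitFeedback:
    """One explicit feedback entry (hypothetical model, not the file's schema).

    None marks a rating the user chose not to provide, since partial
    feedback was allowed. How the real file encodes this is an assumption.
    """
    classes: List[str] = field(default_factory=list)  # a document may have several classes
    user_like: Optional[int] = None      # 1 (least) to 5 (most)
    relevant: Optional[int] = None       # 1 (least) to 5 (most)
    novel: Optional[int] = None          # 1 (least) to 5 (most)
    readable: Optional[int] = None       # 0 or 1
    authoritative: Optional[int] = None  # 0 or 1

    def __post_init__(self) -> None:
        # Graded ratings must fall on the 1..5 scale when present.
        for name in ("user_like", "relevant", "novel"):
            value = getattr(self, name)
            if value is not None and value not in range(1, 6):
                raise ValueError(f"{name} must be in 1..5, got {value!r}")
        # Binary judgments must be 0 or 1 when present.
        for name in ("readable", "authoritative"):
            value = getattr(self, name)
            if value is not None and value not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1, got {value!r}")

    def is_complete(self) -> bool:
        """True if at least one class and all five ratings were provided."""
        ratings = (self.user_like, self.relevant, self.novel,
                   self.readable, self.authoritative)
        return bool(self.classes) and all(r is not None for r in ratings)
```

A check such as is_complete can be useful when an analysis needs only those entries for which the user filled in every field.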

 

User Actions: The special browser recorded some user actions, such as mouse activities, scroll bar activities, and keyboard activities. TimeOnPage is the number of seconds the user spent on a page, and EventOnScroll is the number of clicks on the scroll bars. When the mouse is outside the browser window or the browser window is not focused, the browser does not capture any activities.
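The recording rule described above can be sketched as a small per-page accumulator. The class PageActivity and its method names are hypothetical and are not the study browser's code. The sketch also assumes that TimeOnPage is simply the elapsed time between opening and leaving a page; whether the real browser paused the timer while the window was unfocused is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageActivity:
    """Hypothetical accumulator for the two measures named above."""
    opened_at: float                    # timestamp (seconds) when the page was opened
    closed_at: Optional[float] = None   # timestamp (seconds) when the user left the page
    event_on_scroll: int = 0            # analogous to EventOnScroll
    capturing: bool = True              # False while the mouse is outside or the window is unfocused

    def set_capturing(self, active: bool) -> None:
        """Switch event capture on or off (focus gained or lost)."""
        self.capturing = active

    def on_scroll_click(self) -> None:
        """Count a scroll-bar click only while activities are being captured."""
        if self.capturing:
            self.event_on_scroll += 1

    def close(self, at: float) -> None:
        """Record the moment the user left the page."""
        self.closed_at = at

    def time_on_page(self) -> int:
        """Whole seconds between opening and leaving the page (assumed definition)."""
        if self.closed_at is None:
            raise ValueError("page has not been closed yet")
        return int(self.closed_at - self.opened_at)
```

For example, a scroll-bar click that arrives after set_capturing(False) is ignored, while clicks before it and after set_capturing(True) are counted.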

 

The data is in an Excel file. Please download the data here.

 

More information about the data set can be found in Chapter 5 of Yi Zhang, Bayesian Graphical Models for Adaptive Information Filtering (Ph.D. dissertation, 2005) [pdf].