Mining Intent from Search Results
Jan Pedersen


Abstract: The current generation of Web search services derives most of its quality signals from two sources: the Web graph and query session logs. For example, PageRank is mined from the Web graph, while query understanding features, such as spelling correction, rely heavily on query log analysis. The next generation of Web search services will be distinguished more by presentation than by conventional matching and ranking. Sophisticated presentations require understanding the intent behind a query: for example, knowing that the query [Rancho San Antonio] names a particular place (an open space reserve in the Bay Area), not the class of ranches near San Antonio. Interestingly, Web results typically contain enough information to infer search intents in many cases. I will outline how this can be used, through post-result processing, to produce both improved results and improved presentations.

Bio: Jan O. Pedersen is currently Chief Scientist for Core Search at Microsoft. Pedersen began his career at Xerox's Palo Alto Research Center (PARC), where he managed a research program on information access technologies. In 1996 he joined Verity (recently purchased by Autonomy) to manage their advanced technology program. In 1998 Dr. Pedersen joined Infoseek/Go Network, a first-generation Internet search engine, as Director of Search and Spidering. In 2002 he joined AltaVista as Chief Scientist. AltaVista was later acquired by Yahoo!, where Dr. Pedersen served as Chief Scientist for the Search and Advertising Technology Group. Prior to joining Microsoft, Dr. Pedersen was Chief Scientist at A9, an Amazon company. Dr. Pedersen holds a Ph.D. in Statistics from Stanford University and an A.B. in Statistics from Princeton University. He is credited with more than ten issued patents and has authored more than twenty refereed publications on information access topics, seven of which appear in the Special Interest Group on Information Retrieval (SIGIR) proceedings.

 

Viral genomics and the semantic web

Carla Kuiken

The “HIV database and analysis platform” has been maintained at Los Alamos for 22 years and has grown into an internationally renowned resource for HIV data analysis. It is in the process of expanding to include hepatitis C virus and hemorrhagic fever viruses; the eventual goal is to make it a universal viral resource. This expansion necessitates much greater reliance on external data and information sources. These resources rarely use the same identifiers and frequently contain annotator- and submitter-specific language. While efforts have been underway for some time to standardize and cross-link biological information on the web, there is still a long way to go. I will describe the current status of the “Viral data analysis platform”, the (semantic) problems we have grappled with, and the local and global efforts at amelioration.

Bio: Carla Kuiken is a staff scientist in the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory. She received a PhD in Medicine from the University of Amsterdam in 1995. For several years, she split her time between the Department of Microbiology at the University of Amsterdam and the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory. In addition, she spent some time at the Australian National University (ANU) with Prof. Adrian Gibbs. In 1996 she became a full-time postdoc at LANL, then a staff scientist at the HIV Database and Analysis Project in 1999. In 2002 NIAID agreed to fund a new sequence and immunology database for hepatitis C virus, which was subsumed into the Viral Bioinformatics Resource Center in 2007, only to be moved again into the Viral Pathogen Research (ViPR) database when the number of BRCs was reduced to five in 2009. In 2009 a database for hemorrhagic fever viruses was started, funded by the Department of Defense.

 

Patterns of Spam in Twitter

Aleksander Kolcz

The growing popularity of Twitter has been attracting significant attention from spammers. The 140-character constraint, as well as other characteristics of the Twitter service, affects legitimate users and spammers alike, forcing spammers to adopt unique tactics. In this talk we will offer a glimpse of the various techniques employed by service abusers, contrast them with other types of spam, and discuss the challenges they pose to automatic detection systems.

Bio: Aleksander Kolcz is a Software Engineer at Twitter, focusing on applying Machine Learning and Data Mining to modeling user interests and preventing service abuse. He has 12 years of R&D experience at Microsoft, AOL, and Personalogy. He received his PhD in 1996 from the University of Manchester Institute of Science and Technology.

 

High-dimensional problems in Computational Advertising
Andrei Broder

The central problem of Computational Advertising is to find the "best match" between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine ("sponsored search"), a user reading a web page ("content match" and "display ads"), a user interacting with a portable device, and so on. The information about the user can vary from scarily detailed to practically nil. The number of potential advertisements might be in the billions. The number of contexts is unbounded. Thus, depending on the definition of "best match", this problem leads to a variety of massive optimization and search problems with complicated constraints. The solutions to these problems provide the scientific and technical underpinnings of the online advertising industry, an industry estimated to surpass 28 billion dollars in the US alone in 2011.

An essential aspect of this problem is predicting the impact of an ad on users’ behavior, whether immediate and easily quantifiable (e.g. clicking on an ad or buying a product online) or delayed and harder to measure (e.g. off-line buying or changes in brand perception). To this end, the three components of the problem -- users, contexts, and ads -- are represented as high-dimensional objects, and terabytes of data documenting the interactions among them are collected every day. Nevertheless, considering the representation difficulty, the dimensionality of the problem, and the rarity of the events of interest, the prediction problem remains a huge challenge. The goal of this talk is twofold: to present a short introduction to Computational Advertising and to survey several high-dimensional problems at the core of this emerging scientific discipline.
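
As a toy illustration of the setting, sparse high-dimensional click prediction is commonly built from a hashing trick plus online logistic regression. The sketch below is a generic, simplified version of that pattern, not the system described in the talk; the feature names and dimensions are invented.

```python
import math

DIM = 2 ** 20  # size of the fixed sparse index space

def hash_features(raw_features, dim=DIM):
    """Hashing trick: map arbitrary user/context/ad feature strings to
    bounded integer indices, keeping each impression very sparse."""
    return [hash(f) % dim for f in raw_features]

def predict_click(weights, idxs):
    """Logistic regression over the active indices of one impression."""
    z = sum(weights.get(i, 0.0) for i in idxs)
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(weights, idxs, clicked, lr=0.1):
    """One stochastic-gradient step on a single (impression, outcome) pair."""
    err = predict_click(weights, idxs) - clicked
    for i in idxs:
        weights[i] = weights.get(i, 0.0) - lr * err

# With no data, the model predicts the uninformed 0.5; repeated clicked
# impressions push the prediction for that feature combination upward.
weights = {}
impression = hash_features(["user:frequent-flyer", "ctx:travel-blog", "ad:hotel-chain"])
for _ in range(100):
    sgd_update(weights, impression, clicked=1.0)
```

Storing weights in a dictionary keyed by hashed index means memory grows with the number of features actually observed, not with the nominal dimensionality, which is what makes billion-dimensional feature spaces workable at all.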

Bio: Andrei Broder is a Yahoo! Fellow and Vice President for Computational Advertising. Previously he was an IBM Distinguished Engineer and the CTO of the Institute for Search and Text Analysis at IBM Research. From 1999 until 2002 he was Vice President for Research and Chief Scientist at the AltaVista Company. He graduated summa cum laude from the Technion, the Israel Institute of Technology, and obtained his M.Sc. and Ph.D. in Computer Science at Stanford University under Don Knuth. His current research interests are centered on computational advertising, web search, context-driven information supply, and randomized algorithms. He has authored more than a hundred papers and has been awarded over thirty patents. He is a member of the US National Academy of Engineering, a fellow of the ACM and of the IEEE, and past chair of the IEEE Technical Committee on Mathematical Foundations of Computing.

Machine Learning on Big Data for Personalized Internet Advertising

Michael Recce

Marketers have long sought more effective ways to reach their audience: to show the right ad to the right person at the right time. Huge volumes of Internet activity data, advances in machine learning methods, new hardware and software for large-scale distributed computing, and developments in real-time decisioning have finally made this possible. Increasingly, the particular advertisement seen on a web page is decided in an auction that takes place in a fraction of a second, while the page is loading. In this presentation I will discuss how we, at Quantcast, meet the challenges of personalizing advertising. This process involves multiple machine learning methods to evaluate about 15 billion individual media events daily, and leverages this data to make precise bids in almost 100,000 auctions every second.

Bio: Dr. Michael Recce has been managing the Modeling team at Quantcast for the past year and a half. Prior to Quantcast, he led Fortent's transaction monitoring and risk assessment systems. For seven years, Michael worked extensively with financial institutions, devising improved methods for detecting unusual activity in financial transaction data. Early in his career, Michael was a product engineering manager at Intel Corporation, where he led the development of new memory products for the company. Other projects he has worked on include the design of a control system for a space-based robot for Daimler-Benz, developed to run scientific and engineering experiments on the space station. Michael holds six patents, including one for a behavioral biometric called dynamic grip recognition, and was a recipient of the Thomas A. Edison Award in 2005. He has been a lecturer at University College London and a professor of information systems at the New Jersey Institute of Technology. He received his bachelor's degree from the University of California, Santa Cruz and his doctorate from University College London.

 

Privacy and Effectiveness in Internet Advertising

Qi Zhao

With the proliferation of diverse Internet services and applications, individuals are confronted with the risk of losing their privacy by providing personal information in exchange for the service. In this talk, we will first address typical scenarios where privacy breaches occur and then provide a brief overview of existing approaches to handling privacy issues. Finally, we focus on privacy preservation for advertising data-sharing platforms. This data setting is distinguished from previous ones by a much larger number of records and a much higher-dimensional attribute vector, which pose great challenges to existing approaches and motivate the idea of reducing the certainty of individuals' profiles via noise injection. The feasibility and effectiveness of the proposed method are demonstrated by applying it to simulated campaigns for Expedia.
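
The noise-injection idea can be sketched, under simplifying assumptions, as perturbing each count in a user's attribute profile with Laplace-style noise before the profile is shared. The attribute names, noise scale, and clipping rule below are hypothetical illustrations, not the method proposed in the talk.

```python
import random

def noisy_profile(profile, scale=1.0, seed=None):
    """Return a copy of an attribute-count profile with Laplace-style
    noise added to each count, so that an individual's exact counts
    cannot be recovered from the shared data."""
    rng = random.Random(seed)
    noisy = {}
    for attr, count in profile.items():
        # A Laplace(0, scale) draw is the difference of two exponentials.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[attr] = max(0.0, count + noise)  # keep counts non-negative
    return noisy

profile = {"age:25-34": 1.0, "visits:travel": 7.0, "visits:sports": 2.0}
print(noisy_profile(profile, scale=1.0, seed=42))
```

A larger `scale` gives individuals more deniability at the cost of noisier aggregate statistics; tuning that trade-off against campaign effectiveness is exactly the tension the talk addresses.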

Bio: Qi Zhao is currently a second-year Ph.D. student in the Information Retrieval and Knowledge Management Lab at UCSC. His research interests lie in applying statistical knowledge and machine learning techniques to problems involving large-scale data; specifically, Qi is engaged in developing algorithms for Internet privacy preservation. Prior to coming to UCSC, he obtained his M.S. and B.S. degrees at Fudan University, China.

 

Utilizing Marginal Net Utility for Recommendation in E-commerce

Jian Wang

The main goal of a recommender system in e-commerce is to help potential consumers find products to purchase. In order to achieve this goal, the system needs to learn how the consumer makes a purchase decision. Earlier research in economics and marketing can be utilized to better understand the consumer's purchase intention, and this research helps us design the recommender system accordingly. The system can learn from the consumer's history and make better predictions during the recommendation stage. Here we present our recent work in this direction. Traditional recommendation algorithms often select the products with the highest predicted ratings to recommend. However, earlier research in economics and marketing indicates that a consumer usually makes purchase decisions based on the product's marginal net utility (i.e., the marginal utility minus the product price). Utility is defined as the satisfaction or pleasure a user gets when purchasing the corresponding product. A rational consumer chooses the product to purchase so as to maximize the total net utility. In contrast to the predicted rating, the marginal utility of a product depends on the user's purchase history and changes over time.

To better match users' purchase decisions in the real world, we explore how to recommend products with the highest marginal net utility in e-commerce sites. Inspired by the Cobb-Douglas utility function in consumer behavior theory, we propose a novel utility-based recommendation framework. The framework can be used to revamp a family of existing recommendation algorithms. To demonstrate the idea, we use Singular Value Decomposition (SVD) as an example and revamp it with the framework. We evaluate the proposed algorithm on an e-commerce (shop.com) data set. The new algorithm significantly outperforms the base algorithm, largely due to its ability to recommend both products that are new to the user and products that the user is likely to re-purchase.
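
The contrast with rating-based ranking can be sketched with a one-product-at-a-time utility u(q) = q**alpha (a single-good slice of a Cobb-Douglas form, with diminishing returns for alpha < 1). The catalogue, parameter values, and product names below are invented for illustration; they are not from the paper.

```python
def marginal_net_utility(alpha, quantity, price):
    """Marginal net utility of buying one more unit of a product whose
    utility follows u(q) = q**alpha: the marginal utility of the next
    unit minus the product price."""
    marginal_utility = (quantity + 1) ** alpha - quantity ** alpha
    return marginal_utility - price

# Hypothetical catalogue: product -> (alpha, units already bought, price).
catalogue = {
    "coffee": (0.9, 5, 0.05),  # consumable: cheap, so re-purchase still pays off
    "camera": (0.8, 1, 0.90),  # already owned: second-unit marginal utility is small
    "tripod": (0.7, 0, 0.30),  # new to the user: full first-unit utility
}
ranked = sorted(catalogue, key=lambda p: -marginal_net_utility(*catalogue[p]))
```

Note how the ranking depends on purchase history: a highly rated camera the user already owns drops below a cheap consumable and a never-purchased accessory, which a pure predicted-rating ranker would not capture.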

Bio: Jian Wang is a third-year Ph.D. student at the University of California, Santa Cruz. She works with Prof. Yi Zhang in the Information Retrieval and Knowledge Management Lab at UCSC. Her research interests include recommender systems, information retrieval, and data mining. She has published in ACM SIGIR, ACM RecSys, and ACM CIKM, among other venues. Jian received a master's degree from Lehigh University in Pennsylvania in 2009 and a bachelor's degree from Fudan University in 2007. She previously worked at the eBay research lab, helping build the post-purchase recommendation engine, and on the IBM WebSphere team.

 

Recommendation System for the Facebook Open Graph

Wei Xu

The Open Graph at Facebook contains very rich connections between hundreds of millions of users and billions of objects. Recommendation technology is important for finding the most interesting objects for users among the huge number of objects in the Graph. In this talk, I will cover: 1) an overview of the different recommendation tasks we face at Facebook; 2) the tools we provide to developers for accessing object recommendations from the Graph; and 3) the challenges and solutions involved in building such a recommendation system.

Bio: Wei Xu is a research scientist at Facebook. He initiated and is a leading member of the Facebook recommendation platform “Taste”, a key technology behind the Facebook Open Graph. Before joining Facebook, he was a senior research staff member at NEC Labs America, where he was the lead architect for the video event detection system, a top performer in the TRECVID’08/09 event detection evaluations. He has received the Technology Commercialization Award and the Technology Impact Award of NEC Laboratories America. He has written a book, “Machine Learning for Multimedia Content Analysis”, and has published 40+ research papers in venues such as ICML, SIGIR, NIPS, CVPR, and ICCV. He received his B.S. from Tsinghua University and M.S. from Carnegie Mellon University.

 

Recommender Systems at the Long Tail

Neel Sundaresan

Online recommender systems are essential to e-commerce. A complex marketplace like eBay poses unique challenges and opportunities. The large diversity in the item, buyer, and seller spaces introduces super-sparsity at scale. However, the elaborate transaction flow offers opportunities for a wide class of recommender applications. In this talk we will discuss these challenges, opportunities, and systems for recommendations.

Bio: Neel Sundaresan is a Senior Director of eBay Research Labs. He has led the labs for the past five years. He directs research at eBay in areas including search, recommender systems, social networks, vision, economics, and large-scale data science. His new mission is Science for Empowerment. He has over 50 research publications and over 65 patents to his credit. He is a frequent speaker at national and international conferences. He holds bachelor's and master's degrees in Mathematics and Computer Science from IIT Mumbai. His PhD dissertation is in the area of compilers and runtime systems for modeling data and control parallelism in object-oriented languages.

 

Filtering Semi-Structured Documents Based on Faceted Feedback

Lanbo Zhang

Existing adaptive filtering systems learn user profiles based on users' relevance judgments on documents. In some cases, users have prior knowledge about what features are important for a document to be relevant. For example, a Spanish speaker may only want news written in Spanish, and thus a relevant document should contain the feature "Language: Spanish"; a researcher working on HIV knows that an article with the medical subject "MeSH: AIDS" is very likely to be interesting to him/her.

Semi-structured documents with rich faceted metadata are increasingly prevalent on the Internet. Motivated by the faceted search interfaces commonly used in e-commerce, we study whether users' prior knowledge about faceted features can be exploited for filtering semi-structured documents. We envision two faceted feedback solicitation mechanisms and propose a novel profile-learning algorithm that can incorporate user feedback on features. To evaluate the proposed work, we use two data sets from the TREC filtering track and conduct a user study on Amazon Mechanical Turk. Our experimental results show that user feedback on faceted features is useful for filtering. The new profile-learning algorithm can effectively learn from user feedback on faceted features, and it performs better than several other methods adapted from the feature-based feedback techniques proposed for retrieval and text classification tasks in previous work.
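
The general shape of learning a profile from both document judgments and feature feedback can be sketched as a Rocchio-style weight update plus a direct boost for user-asserted facets. The boosting scheme, feature names, and constants below are illustrative assumptions, not the algorithm proposed in the talk.

```python
from collections import defaultdict

def learn_profile(judged_docs, facet_feedback, boost=2.0):
    """Toy profile learner: accumulate Rocchio-style feature weights
    from (features, relevant) document judgments, then directly boost
    faceted features the user has declared important."""
    profile = defaultdict(float)
    n = max(len(judged_docs), 1)
    for features, relevant in judged_docs:
        sign = 1.0 if relevant else -0.5  # down-weight negative evidence
        for f in features:
            profile[f] += sign / n
    for facet in facet_feedback:          # e.g. "language:spanish"
        profile[facet] += boost
    return dict(profile)

def score(profile, doc_features):
    """Filter score of an incoming document: sum of matched weights."""
    return sum(profile.get(f, 0.0) for f in doc_features)

judged = [({"language:english", "topic:hiv"}, True),
          ({"language:english", "topic:flu"}, False)]
profile = learn_profile(judged, facet_feedback={"mesh:aids"})
```

The point of the sketch is that facet feedback injects strong prior weights immediately, before enough judged documents have accumulated for the Rocchio term to learn them, which is where feature-level feedback helps most in early-stage filtering.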

Bio: Lanbo Zhang is a Ph.D. candidate at the IRKM lab at UC Santa Cruz. He has been working on several topics in the field of personalized information filtering and recommendation, including how to learn user profiles from new types of user feedback and how to learn a single user's multiple interests. His general research interests lie in applying machine learning and data mining techniques to filtering and recommendation tasks. Lanbo has published several papers in top IR conferences, including SIGIR and CIKM. He was a summer intern at the IBM Almaden Research Center, working on mining adverse drug effects from electronic health records. Lanbo received his MS and BE degrees in computer science from the Chinese Academy of Sciences (2008) and Tsinghua University (2005), respectively.