Mining Intent from Search Results
Jan Pedersen
Abstract: The current generation of Web search services derives
most of its quality signals from two sources: the Web graph and query
session logs. For example, PageRank is mined from the Web graph, while
query understanding features, such as spelling correction, rely heavily on
query log analysis. The next generation of Web search services will
be distinguished more by presentation than by conventional matching and
ranking. Sophisticated presentations require understanding the intent
behind a query: for example, knowing that the query [Rancho San Antonio]
names a particular place (an open space reserve in the Bay Area), not the class
of ranches near San Antonio. Interestingly, Web results typically contain
enough information to infer search intent in many cases. I will
outline how this can be used, through post-result processing, to produce both
improved results and improved presentations.
Bio: Jan O. Pedersen is currently
Chief Scientist for Core Search at Microsoft. Pedersen began his career at
Xerox's Palo Alto Research Center (PARC) where he managed a research program
on information access technologies. In 1996 he joined Verity (recently
purchased by Autonomy) to manage their advanced technology program. In
1998 Dr. Pedersen joined Infoseek/Go
Network, a first-generation Internet search engine, as Director of
Search and Spidering. In 2002 he joined AltaVista
as Chief Scientist. AltaVista was later acquired by Yahoo!,
where Dr. Pedersen served as Chief Scientist for the Search and Advertising
Technology Group. Prior to joining Microsoft, Dr. Pedersen was Chief Scientist
at A9,
an Amazon company. Dr. Pedersen holds a Ph.D. in Statistics from Stanford
University and an A.B. in Statistics from Princeton
University. He is credited with more than ten issued
patents and has authored more than twenty refereed
publications on information access topics, seven of which are in the
Special Interest Group on Information Retrieval (SIGIR) proceedings.
Viral genomics and
the semantic web
Carla Kuiken
The “HIV
database and analysis platform” has been maintained in Los Alamos for 22 years,
and has grown to be an internationally renowned resource for HIV data analysis.
It is in the process of expanding to include hepatitis C virus and hemorrhagic
fever viruses; the eventual goal is to make it a universal viral resource. This
expansion necessitates much greater reliance on external data and information
sources. These resources rarely use the same identifiers and frequently contain
annotator- and submitter-specific language. While efforts have been underway
for some time to standardize and cross-link biological information on the web,
there is still a long way to go. I will describe the current status of the “Viral
data analysis platform”, the (semantic) problems we have grappled with, and the
local and global efforts at amelioration.
Bio: Carla Kuiken is a staff
scientist in the Theoretical Biology and Biophysics Group at Los Alamos National
Laboratory. She received a PhD in Medicine from the University of Amsterdam in
1995. For several years, she split her time between the Department of
Microbiology at the University of Amsterdam and the Theoretical Biology and
Biophysics Group at the Los Alamos National Laboratory. In addition, she spent
some time at the Australian National University (ANU) with Prof. Adrian Gibbs.
In 1996 she became a full-time postdoc at LANL, then a staff scientist at the
HIV Database and Analysis Project in 1999. In 2002 NIAID agreed to fund a new
sequence and immunology database for hepatitis C virus, which was subsumed into
the Viral Bioinformatics Resource Center in 2007 – only to be moved again into
the Viral Pathogen Research (ViPR) database when the number of BRCs was reduced
to five in 2009. In 2009 a database for Hemorrhagic Fever Viruses was started,
funded by the Department of Defense.
Patterns of Spam in Twitter
Aleksander
Kolcz
The
growing popularity of Twitter has been attracting significant attention from
spammers. The 140-character constraint and other characteristics of the
Twitter service affect legitimate users and spammers alike, forcing
spammers to adopt certain unique tactics. In this talk we will offer a glimpse
of the various techniques employed by service abusers, contrast them with other
types of spam and discuss the challenges they pose to automatic detection
systems.
Bio: Aleksander Kolcz is a Software
Engineer at Twitter, focusing on applying Machine Learning and Data Mining to
modeling user interests and preventing service abuse. He has 12 years of R&D
experience at Microsoft, AOL and Personalogy. He received his PhD in 1996 from
the University of Manchester Institute of Science and Technology.
Highly dimensional problems in
Computational Advertising
Andrei Broder
The central problem of Computational Advertising is to find the "best
match" between a given user in a given context and a suitable
advertisement. The context could be a user entering a query in a search engine
("sponsored search"), a user reading a web page ("content
match" and "display ads"), a user interacting with a portable
device, and so on. The information about the user can vary from scarily
detailed to practically nil. The number of potential advertisements might be in
the billions. The number of contexts is unbounded. Thus, depending on the
definition of "best match" this problem leads to a variety of massive
optimization and search problems, with complicated constraints. The solution to
these problems provides the scientific and technical underpinnings of the
online advertising industry, an industry estimated to surpass 28 billion
dollars in the US alone in 2011.
An essential aspect of this problem is predicting the impact of an ad on users’
behavior, whether immediate and easily quantifiable (e.g. clicking on an ad or
buying a product online) or delayed and harder to measure (e.g. off-line
buying or changes in brand perception). To this end, the three components of
the problem -- users, contexts, and ads -- are represented as high dimensional
objects and terabytes of data documenting the interactions among them are
collected every day. Nevertheless, considering the representation difficulty,
the dimensionality of the problem and the rarity of the events of interest, the
prediction problem remains a huge challenge. The goal of this talk is twofold:
to present a short introduction to Computational Advertising and to survey several
high-dimensional problems at the core of this emerging scientific discipline.
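The "best match" formulation above can be illustrated with a toy sketch (not from the talk; the feature names and weights are invented). Users, contexts, and ads are represented as sparse high-dimensional vectors, and candidate ads are ranked by similarity to the combined user/context representation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts of feature -> weight)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(user_ctx, ads):
    """Return the ad id whose feature vector is most similar to the profile."""
    return max(ads, key=lambda ad_id: cosine(user_ctx, ads[ad_id]))

# Hypothetical sparse features for a user issuing a travel query near SFO.
profile = {"query:flights": 1.0, "geo:SFO": 0.5}
ads = {
    "ad1": {"query:flights": 0.9, "geo:SFO": 0.8},
    "ad2": {"query:hotels": 1.0},
}
print(best_match(profile, ads))  # ad1
```

In practice the candidate set is in the billions and the similarity function is learned rather than fixed, which is exactly what makes this a massive optimization and search problem.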
Bio: Andrei Broder is a Yahoo! Fellow
and Vice President for Computational Advertising. Previously he was an IBM
Distinguished Engineer and the CTO of the Institute for Search and Text
Analysis in IBM Research. From 1999 until 2002 he was Vice President for
Research and Chief Scientist at the AltaVista Company. He graduated summa
cum laude from the Technion, Israel Institute of Technology, and obtained his
M.Sc. and Ph.D. in Computer Science at Stanford University under Don Knuth. His
current research interests are centered on computational advertising, web
search, context-driven information supply, and randomized algorithms. He has
authored more than a hundred papers and was awarded over thirty patents. He is
a member of the US National Academy of Engineering, a fellow of ACM and of
IEEE, and past chair of the IEEE Technical Committee on Mathematical
Foundations of Computing.
Machine Learning on Big Data for
Personalized Internet Advertising
Michael
Recce
Marketers
have long sought more effective ways to reach their audience to show the right
ad to the right person at the right time.
Huge volumes of internet activity data, advances in machine learning
methods, new hardware and software for large scale distributed computing, and
developments in real-time decisioning have made this finally possible.
Increasingly, the particular advertisement that is seen on a web page is decided
in an auction that takes place in a fraction of a second, while the page is
loading. In this presentation I will
discuss how we, at Quantcast, meet the challenges in personalizing
advertising. This process involves
multiple machine learning methods to evaluate about 15 billion individual
media events daily and to leverage this data to make precise bids in almost
100,000 auctions every second.
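A generic expected-value bidding scheme gives a feel for the decision made in each auction (this is an illustration, not Quantcast's actual method; all numbers and parameters are hypothetical):

```python
def compute_bid(p_response, value_per_response, max_bid_cpm):
    """Bid the expected value of the impression, expressed as CPM
    (cost per thousand impressions), capped by the campaign's max bid."""
    expected_value_cpm = p_response * value_per_response * 1000
    return min(expected_value_cpm, max_bid_cpm)

# e.g. a 0.2% predicted response rate on a $2.50-per-response campaign,
# capped at a $4.00 CPM: 0.002 * 2.50 * 1000 = 5.00, capped to 4.00.
print(compute_bid(0.002, 2.50, max_bid_cpm=4.0))  # 4.0
```

The hard part, of course, is producing a well-calibrated `p_response` from billions of daily media events, fast enough to answer while the page is still loading.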
Bio: Dr. Michael Recce has been
managing the Modeling team at Quantcast for the past year and a half. Prior to
Quantcast, he led Fortent's transaction monitoring and risk assessment
systems. For seven years, Michael worked extensively with financial
institutions devising improved methods for detecting unusual activity in
financial transaction data. Early in his
career, Michael was a product engineering manager at Intel Corporation, where
he led the development of new memory products for the company. Other projects
he has worked on include the design of a control system for a space-based robot
for Daimler-Benz, which was developed to run scientific and engineering
experiments in the space station. Michael holds six patents, including one
for a behavioral biometric called dynamic grip recognition, and was a recipient
of the Thomas A. Edison Award in 2005. He has been a lecturer at University
College, London and a professor of information systems at New Jersey Institute
of Technology. He received his bachelor's degree from the University of
California-Santa Cruz and his doctorate from University College, London.
Privacy and Effectiveness in
Internet Advertising
Qi Zhao
With the
proliferation of diverse internet services and applications, individuals are
confronted with the risk of losing their privacy by providing personal
information in exchange for those services. In this talk, we will first describe
typical scenarios where privacy breaches occur and then provide a brief overview
of existing approaches to handling privacy issues. Finally, we focus on
privacy preservation for advertising data-sharing platforms. Such a
data setting distinguishes itself from previous settings by its
much larger number of records and much higher-dimensional attribute vectors,
which pose great challenges to existing approaches and motivate the idea of
reducing the certainty of individuals' profiles via noise injection. The
feasibility and effectiveness of the proposed method are demonstrated by
applying it to the simulated campaigns for Expedia.
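A minimal sketch of the noise-injection idea described above (the mechanism and scale parameter here are generic assumptions, not the speaker's specific method): each numeric attribute of a user profile is perturbed with Laplace noise, so an individual's exact values cannot be recovered from the shared data.

```python
import math
import random

def laplace_noise(scale):
    """Draw a sample from a Laplace(0, scale) distribution via the inverse CDF."""
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    # max() guards against log(0) in the (measure-zero) edge case u == -0.5.
    return -scale * sign * math.log(max(1 - 2 * abs(u), 1e-12))

def perturb_profile(profile, scale=1.0):
    """Add independent Laplace noise to every numeric attribute of a profile."""
    return {attr: value + laplace_noise(scale) for attr, value in profile.items()}
```

A larger `scale` means more uncertainty about any individual's true profile, at the cost of less accurate aggregate statistics for the advertising platform.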
Bio: Qi Zhao is currently a second-year
Ph.D. student in the Information Retrieval and Knowledge Management Lab at
UCSC. His research interests lie in applying statistical knowledge and
machine learning techniques to problems involving large-scale data.
Specifically, Qi is now developing algorithms for internet
privacy preservation. Prior to coming to UCSC, he obtained both his M.S.
and B.S. degrees at Fudan University, China.
Utilizing Marginal Net Utility
for Recommendation in E-commerce
Jian
Wang
The main
goal of a recommender system in e-commerce is to help potential consumers find
products to purchase. In order to achieve this goal, the system needs to learn how
the consumer makes a purchase
decision. Earlier research in economics and marketing can be utilized to better
understand the consumer's purchase intention, and helps us
design the recommender system accordingly. The system can learn from the
consumers' history and make better predictions during the recommendation stage.
Here we present our recent work in this direction. Traditional recommendation
algorithms often select products with the highest predicted ratings to
recommend. However, earlier research in economics and marketing indicates that
a consumer usually makes purchase decision(s) based on the product's marginal
net utility (i.e., the marginal utility minus the product price). Utility is
defined as the satisfaction or pleasure a user gets when purchasing the
corresponding product. A rational consumer chooses the product to purchase in
order to maximize the total net utility. In contrast to the predicted
rating, the marginal utility of a product depends on the user's purchase history
and changes over time.
To better match users' purchase decisions in the real world, we explore how to
recommend products with the highest marginal net utility in e-commerce sites.
Inspired by the Cobb-Douglas utility function in consumer behavior theory, we
propose a novel utility-based recommendation framework. The framework can be
utilized to revamp a family of existing recommendation algorithms. To
demonstrate the idea, we use Singular Value Decomposition (SVD) as an example
and revamp it with the framework. We evaluate the proposed algorithm on an
e-commerce (shop.com) data set. The
new algorithm significantly improves the base algorithm, largely due to its
ability to recommend both products that are new to the user and products that
the user is likely to re-purchase.
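An illustrative sketch of ranking by marginal net utility rather than predicted rating (the diminishing-returns decay used here is a simplification, not the paper's exact Cobb-Douglas formulation; all parameters are hypothetical): marginal utility shrinks with the number of times the user has already bought the product, and the price is subtracted to obtain net utility.

```python
def marginal_net_utility(predicted_rating, price, prior_purchases, decay=0.5):
    """Marginal utility decays with repeated purchases; net = utility - price."""
    marginal_utility = predicted_rating * (decay ** prior_purchases)
    return marginal_utility - price

def recommend(candidates, history, top_k=2):
    """candidates: {product: (predicted_rating, price)};
    history: {product: prior purchase count}. Rank by marginal net utility."""
    scored = {
        p: marginal_net_utility(rating, price, history.get(p, 0))
        for p, (rating, price) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

cands = {"soap": (5.0, 1.0), "tv": (9.0, 8.0)}
print(recommend(cands, {}))            # ['soap', 'tv']
print(recommend(cands, {"soap": 2}))   # ['tv', 'soap']
```

Note how the ranking flips once the user has bought soap twice: the same predicted rating yields a lower marginal utility, which is exactly the history-dependence that a static rating-based recommender misses.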
Bio: Jian Wang is a third-year Ph.D.
student at the University of California, Santa Cruz. She works with Prof. Yi Zhang
in the Information Retrieval and Knowledge Management Lab at UCSC. Her research
interests include recommender systems, information retrieval, and data mining.
She has published in ACM SIGIR, ACM RecSys, and ACM CIKM, among others. Jian received a
master's degree from Lehigh University in Pennsylvania in 2009 and a bachelor's degree
from Fudan University in 2007. She has previously worked in the eBay research lab, helping
build the post-purchase recommendation engine, and on the IBM WebSphere team.
Recommendation
System for the Facebook Open Graph
Wei Xu
The Open
Graph at Facebook contains very rich connections between hundreds of millions
of users and billions of objects. Recommendation technology is important for
finding the most interesting objects for users among the vast number of
objects in the Graph. In this talk, I will present: 1) an
overview of the different recommendation tasks we are facing at Facebook; 2)
the tools we provide to the developers for accessing object recommendations
from the Graph; and 3) the challenges and solutions for building such a
recommendation system.
Bio: Wei Xu is a research scientist
at Facebook. He initiated and is a leading member of the Facebook
recommendation platform “Taste”, a key technology behind Facebook Open
Graph. Before joining Facebook, he was a senior research staff member at NEC
Laboratories America, where he was the lead architect of the video event detection
system, a top performer at the TRECVID’08/09 event detection evaluation. He has
received the Technology Commercialization Award and the Technology Impact Award of NEC
Laboratories America. He has written the book “Machine Learning for Multimedia
Content Analysis” and has published 40+ research papers at venues such
as ICML, SIGIR, NIPS, CVPR, and ICCV. He received his B.S. from Tsinghua University
and M.S. from Carnegie Mellon University.
Recommender Systems at the Long
Tail
Neel
Sundaresan
Online
recommender systems are essential to eCommerce. A complex marketplace
like eBay poses unique challenges and opportunities. The large diversity
in the item, buyer, and seller spaces introduces super-sparsity at
scale. However, the elaborate transaction flow offers opportunities for a wide
class of recommender applications. In this talk we will discuss these
challenges, opportunities, and systems for recommendations.
Bio: Neel Sundaresan is a Senior
Director of eBay Research Labs, which he has led for the past five years.
He directs research at eBay in areas including Search, Recommender Systems,
Social Networks, Vision, Economics, and Large Data Science. His new mission
is Science for Empowerment. He has over 50 research publications and over 65
patents to his credit. He is a frequent speaker at national and international
conferences. He has bachelor's and master's degrees in Mathematics and Computer
Science from IIT Mumbai. His PhD dissertation is in the area of compilers and
runtime systems for modeling data and control parallelism in object oriented
languages.
Filtering
Semi-Structured Documents Based on Faceted Feedback
Lanbo Zhang
Existing
adaptive filtering systems learn user profiles based on users' relevance
judgments on documents. In some cases, users have some prior knowledge about
what features are important for a document to be relevant. For example, a
Spanish speaker may only want news written in Spanish, and thus a relevant
document should contain the feature "Language: Spanish"; a researcher
working on HIV knows an article with the medical subject "MeSH: AIDS"
is very likely to be interesting to him/her.
Semi-structured
documents with rich faceted metadata are increasingly prevalent over the
Internet. Motivated by the commonly used faceted search interface in
e-commerce, we study whether users' prior knowledge about faceted features
could be exploited for filtering semi-structured documents. We envision two
faceted feedback solicitation mechanisms, and propose a novel user
profile-learning algorithm that can incorporate user feedback on features. To
evaluate the proposed work, we use two data sets from the TREC filtering track,
and conduct a user study on Amazon Mechanical Turk. Our experimental results
show that user feedback on faceted features is useful for filtering. The new
user profile learning algorithm can effectively learn from user feedback on
faceted features and performs better than several other methods adapted from
the feature-based feedback techniques proposed for retrieval and text
classification tasks in previous work.
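A simplified sketch of incorporating faceted feedback into a linear filtering profile (the multiplicative boost scheme here is an assumption for illustration, not the paper's actual learning algorithm): facets the user endorses, such as "Language: Spanish", get their profile weights raised so documents carrying them score higher.

```python
def update_profile(profile, endorsed_facets, boost=2.0):
    """Raise (or initialize) the weight of each facet feature the user endorsed."""
    updated = dict(profile)
    for facet in endorsed_facets:
        updated[facet] = updated.get(facet, 1.0) * boost
    return updated

def score(document_features, profile):
    """Linear relevance score: sum of profile weights over document features."""
    return sum(profile.get(f, 0.0) for f in document_features)

prof = update_profile({}, ["Language:Spanish"])
print(score(["Language:Spanish", "topic:news"], prof))  # 2.0
print(score(["topic:news"], prof))                      # 0.0
```

The point of the sketch is the interface, not the update rule: facet endorsements act directly on named features, unlike document-level relevance judgments, which only reach features indirectly through the documents that contain them.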
Bio: Lanbo Zhang is a Ph.D.
candidate at the IRKM lab at UC Santa Cruz. He has been working on several
topics in the field of personalized information filtering and recommendation,
including how to learn user profiles based on new types of user feedback, how
to learn a single user's multiple interests, etc. His general research
interests lie in applying machine learning and data mining techniques in
filtering/recommendation tasks. Lanbo has published several papers in top IR
conferences, including SIGIR and CIKM. He was a summer intern at the IBM
Almaden research center, working on mining adverse drug effects from electronic
health records. Lanbo received his M.S. and B.E. degrees in computer science
from the Chinese Academy of Sciences (2008) and Tsinghua University (2005),
respectively.