training example, an ideal function $f$ would output values that induce the same ranking of documents as given by labels. Each example $\mathcal{X}_i$, $i = 1, \ldots, N$, is a collection of feature vectors with labels: $\mathcal{X}_i = \{(\mathbf{x}_{i,j}, y_{i,j})\}_{j=1}^{r_i}$. Features in a feature vector $\mathbf{x}_{i,j}$ represent the document $j = 1, \ldots, r_i$.
For example, $x^{(1)}_{i,j}$ could represent how recent the document is, $x^{(2)}_{i,j}$ could reflect whether the words of the query can be found in the document title, $x^{(3)}_{i,j}$ could represent the size of the document, and so on. The label $y_{i,j}$ could be the rank ($1, 2, \ldots, r_i$) or a score. For example, the lower the score, the higher the document should be ranked.
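To make this representation concrete, here is a minimal sketch (not from the book) of how one training example could be stored in Python; the feature names are hypothetical and chosen only to match the examples above:

```python
import numpy as np

# One training example X_i: a query with r_i = 3 candidate documents,
# each described by a feature vector x_{i,j} and a relevance label y_{i,j}.
# Hypothetical features per document: [recency_days, query_in_title, size_kb]
X_i = np.array([
    [ 3.0, 1.0, 120.0],   # document j = 1
    [45.0, 0.0,  80.0],   # document j = 2
    [ 7.0, 1.0, 300.0],   # document j = 3
])
# Labels y_{i,j}: here, ranks (1 is the best position)
y_i = np.array([1, 3, 2])
```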
There are three principal approaches to solve such a learning problem: pointwise, pairwise,
and listwise.
The pointwise approach transforms each training example into multiple examples: one example per document. The learning problem becomes a standard supervised learning problem, either regression or logistic regression. In each example $(\mathbf{x}, y)$ of the pointwise learning problem, $\mathbf{x}$ is the feature vector of some document, and $y$ is the original score (if $y_{i,j}$ is a score) or a synthetic score obtained from the ranking (the higher the rank, the lower the synthetic score). Any supervised learning algorithm can be used in this case. The solution is usually far from perfect. Principally, this is because each document is considered in isolation, while the original ranking (given by the labels $y_{i,j}$ of the original training set) could optimize the positions of the whole set of documents. For example, if we have already given a high rank to a Wikipedia page in some collection of documents, we would not give a high rank to another Wikipedia page for the same query.
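As an illustration of the pointwise transformation, the following hedged Python sketch (my own, not the book's code) flattens ranking examples into ordinary regression examples; the function name and the synthetic-score rule are illustrative choices, assuming labels are ranks with 1 being the best position:

```python
import numpy as np

def to_pointwise(examples):
    """Flatten ranking examples into (features, target) regression pairs.

    examples: list of (X_i, y_i) pairs, where X_i is an array of feature
    vectors and y_i contains ranks (1 = best position).
    """
    X_rows, targets = [], []
    for X_i, y_i in examples:
        r_i = len(y_i)
        for j in range(r_i):
            X_rows.append(X_i[j])
            # Synthetic score: the worse the position (larger rank value),
            # the lower the target value.
            targets.append(float(r_i - y_i[j]))
    return np.array(X_rows), np.array(targets)

# Any regressor could then be fit on the flattened data, e.g.:
# from sklearn.linear_model import LinearRegression
# model = LinearRegression().fit(*to_pointwise(examples))
```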
In the pairwise approach, documents are also considered in isolation from the complete ranking; however, in this case, a pair of documents is considered at once. Given a pair of documents $(\mathbf{x}_i, \mathbf{x}_k)$, we want to build a model $f$ that, given $(\mathbf{x}_i, \mathbf{x}_k)$ as input, outputs a value close to 1 if $\mathbf{x}_i$ has to be put higher than $\mathbf{x}_k$ in the ranking. Otherwise, $f$ outputs a value close to 0. At test time, given a model, the final ranking for an unlabeled example $\mathcal{X}$ is obtained by aggregating the predictions for all pairs of documents in $\mathcal{X}$. Such an approach works better than pointwise, but is still far from perfect.
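The pairwise idea can be sketched as follows. This is an illustrative Python snippet (not the book's code), assuming labels are ranks where a smaller value means a higher position, and a generic binary classifier with a standard predict method:

```python
import numpy as np

def to_pairwise(X_i, y_i):
    """Build binary examples from document pairs.

    The label is 1 if the first document of the pair should be ranked
    higher (smaller rank value) than the second, and 0 otherwise.
    """
    pairs, labels = [], []
    r_i = len(y_i)
    for a in range(r_i):
        for b in range(r_i):
            if a == b:
                continue
            pairs.append(np.concatenate([X_i[a], X_i[b]]))
            labels.append(1 if y_i[a] < y_i[b] else 0)
    return np.array(pairs), np.array(labels)

def rank_by_aggregation(model, X_new):
    """Aggregate pairwise predictions at test time: score each document by
    summing the model's outputs over all pairs in which it appears first."""
    n = len(X_new)
    scores = np.zeros(n)
    for a in range(n):
        for b in range(n):
            if a != b:
                pair = np.concatenate([X_new[a], X_new[b]]).reshape(1, -1)
                scores[a] += model.predict(pair)[0]
    # Documents with higher aggregated scores are placed higher in the ranking.
    return np.argsort(-scores)
```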
The state-of-the-art rank learning algorithms, such as LambdaMART, implement the listwise approach. In the listwise approach, we try to optimize the model directly on some metric that reflects the quality of ranking. There are various metrics for assessing search engine result ranking, including precision and recall. One popular metric that combines both precision and recall is called mean average precision (MAP).
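As a preview of the definitions that follow, here is a hedged Python sketch (my illustration, not the book's formula) of average precision for a single query with binary relevance labels; MAP is simply the mean of this quantity over all test queries:

```python
def average_precision(relevance_in_ranked_order):
    """relevance_in_ranked_order: 0/1 labels, ordered by the model's ranking.

    Simplification: we average precision over the relevant documents that
    appear in the returned list.
    """
    hits, precisions = 0, []
    for k, rel in enumerate(relevance_in_ranked_order, start=1):
        if rel == 1:
            hits += 1
            precisions.append(hits / k)  # precision at position k
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example: relevant documents returned at positions 1 and 3
print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.83
```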
To define MAP, let us ask judges (Google calls these people rankers) to examine a collection of search results for a query and assign a relevancy label to each search result. Labels could be binary (1 for “relevant” and 0 for “irrelevant”) or on some scale, say from 1 to 5: the higher the value, the more relevant the document is to the search query. Let our judges build such a relevancy labeling for a collection of 100 queries. Now, let us test our ranking model on this collection. The precision of our model for some query is given by: