There are two fundamental classes of measures for evaluating the effectiveness of a retrieval system, depending on whether the result set returned in response to a query is *unranked* or *ranked*.

**Evaluation of Unranked Result Sets**

In this scenario, given a query $q$, the IR system considers each document as either *relevant* or *not relevant* to $q$ (i.e., **binary relevance**).

Specifically, let $R_q$ be the set of all the documents that are *actually* relevant to $q$, which is known in advance. Then, $A_q$ is the set of documents that the system retrieves and considers as relevant to $q$.

The two measures that are used most frequently to evaluate these IR systems are *Precision* and *Recall*, which are computed as follows:

*Precision* ($P$) is the fraction of results retrieved by the IR system that are actually relevant:

$$P = \frac{|R_q \cap A_q|}{|A_q|}.$$

On the other hand, *Recall* ($R$) is the fraction of the actually relevant results that the IR system was able to retrieve:

$$R = \frac{|R_q \cap A_q|}{|R_q|}.$$

The above quantities can be better explained using the so-called **confusion matrix** (or contingency table), as described below:

| | Relevant | Not relevant |
|---|---|---|
| **Retrieved** | true positives ($TP$) | false positives ($FP$) |
| **Not retrieved** | false negatives ($FN$) | true negatives ($TN$) |

The main diagonal of this matrix specifies the two cases where the IR system performs correctly, whereas the anti-diagonal indicates the two situations where the system behaves wrongly.

Concretely, *true positives* ($TP$) indicate the number of relevant documents that the IR system correctly retrieves (i.e., *hits*). Similarly, *true negatives* ($TN$) represent the number of not-relevant documents, which are not retrieved at all by the system (i.e., *correct rejections*).

On the other hand, *false positives* ($FP$) refer to those documents that the IR system retrieves as relevant but which in fact are not (a.k.a. *false alarms* or *Type I errors*). Finally, *false negatives* ($FN$) represent actually relevant documents that the system failed to retrieve (a.k.a. *misses* or *Type II errors*).

Precision, which is also called **positive predictive value** (PPV), and Recall, which is also referred to as **sensitivity** (as well as *true positive rate* (TPR) or *hit rate*), can thus be defined in terms of the above indicators, as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.$$

In addition, the following measures can be derived from the confusion matrix above:

– **specificity** (or *true negative rate* (TNR)) computes the fraction of not-relevant documents that were correctly not retrieved among all the documents that are actually not relevant: $TNR = \frac{TN}{TN + FP}$

– **negative predictive value** (NPV) measures the fraction of not-retrieved documents that are actually not relevant among all the documents that were not retrieved: $NPV = \frac{TN}{TN + FN}$

– **false positive rate** (FPR) computes the ratio of documents that the IR system mistakenly retrieved as relevant among all the documents that are actually not relevant: $FPR = \frac{FP}{FP + TN}$

– **false discovery rate** (FDR) computes the ratio of documents that the IR system mistakenly retrieved as relevant among all the documents that were retrieved: $FDR = \frac{FP}{FP + TP}$

– **accuracy** (ACC) measures the overall ability of the IR system to correctly retrieve relevant documents and reject not-relevant ones: $ACC = \frac{TP + TN}{TP + TN + FP + FN}$

One might be tempted to use accuracy as the right indicator of the effectiveness of an IR system. However, most of the time the distribution of relevance across documents is very skewed, meaning that the vast majority of documents turn out to be not relevant and very few are relevant. Let us see this with an example.

Suppose we have a collection of $10{,}000$ documents. Assume also that, for a given query $q$, $9{,}990$ documents are not relevant and only $10$ documents are relevant to $q$. In such a case, an IR system which simply does not retrieve any document at all in response to $q$ (i.e., because the system considers all the documents as not relevant) has an accuracy of $\frac{0 + 9{,}990}{10{,}000} = 99.9\%$, which is a very remarkable value.
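A quick sketch of this effect, with illustrative numbers (a collection of 10,000 documents of which only 10 are relevant):

```python
# Illustrative skewed collection: 10,000 documents, only 10 relevant to q.
# A system that retrieves nothing at all scores TP = 0, FP = 0, FN = 10, TN = 9990.
tp, fp, fn, tn = 0, 0, 10, 9990

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 0.999 -- looks excellent, yet no relevant document was returned
```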

Unfortunately, this high accuracy value is obtained without retrieving any result document in response to the query, which is generally not what the user interacting with an IR system expects! In fact, users typically tolerate seeing some not-relevant results (i.e., false positives), provided that they eventually find interesting information (i.e., true positives). That is why it is useful to have precision and recall, which both focus on true positives.

The advantage of having the two numbers for precision and recall is that in many circumstances one is more important than the other. For instance, in some cases one might prefer high precision (e.g., web search results should contain relevant documents in the first page), whereas in other cases it is preferable to get as high recall as possible (e.g., hard disk searches). Anyway, the two indicators clearly trade off against one another: because recall is a non-decreasing function of the number of documents retrieved (i.e., $|A_q|$), it is always possible to design an IR system with recall equal to $1$ (but with very low precision!) by simply returning the document collection *as a whole* in response to *any* query. Of course, this solution is infeasible as the number of documents grows increasingly large.

Finally, in general we want to get some amount of recall while tolerating only a certain percentage of false positives. A single measure that trades off precision versus recall is the *F-measure*, which is the weighted harmonic mean of precision and recall:

$$F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}, \quad \text{where } \beta^2 = \frac{1-\alpha}{\alpha}, \; \alpha \in [0, 1], \; \beta^2 \in [0, \infty).$$

The default *balanced* F-measure equally weights precision and recall, which means making $\alpha = 1/2$ or $\beta = 1$. It is commonly written as $F_1$ (or $F_{\beta=1}$), which is the shorthand for $F = \frac{2PR}{P+R}$.
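As a minimal sketch, all of the above measures can be computed directly from the four confusion-matrix counts; the counts below are made up for illustration:

```python
def precision(tp: int, fp: int) -> float:
    """PPV: fraction of retrieved documents that are relevant."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """TPR/sensitivity: fraction of relevant documents that are retrieved."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta = 1 gives F1."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

# Made-up counts: 8 documents retrieved (5 of them relevant), 10 relevant overall.
p = precision(tp=5, fp=3)    # 5 / 8  = 0.625
r = recall(tp=5, fn=5)       # 5 / 10 = 0.5
f1 = f_measure(p, r)         # 2PR / (P + R) = 5/9
```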

**Evaluation of Ranked Result Sets**

When the result set returned by an IR system is a ranked list of documents, it is often the case that we want to measure the performance of two retrieval systems $S_1$ and $S_2$ by comparing the ranked lists they retrieve in response to the same query $q$, i.e., $L_1$ and $L_2$, respectively.

To this end, there exist distinct measures depending on the grade of relevance, i.e., **binary** vs. **multi-graded**.

Binary relevance, which we already introduced before, simply states whether a document is *relevant* or *not relevant* to $q$. This simply means that each document is assigned either $1$ to indicate it is relevant or $0$ (sometimes $-1$) if it is not.

On the other hand, multi-graded relevance may assign each document a discrete value in the range $[0, k]$ or, equivalently, a value in $\{0, 1, \ldots, k\}$.

In the following, we examine those two situations separately.

**Binary Relevance**

If documents are simply considered either as relevant or not, the following measures can be used to compare any two ranked result lists retrieved in response to the query $q$:

**Precision@$k$ (P@$k$)**

Precision@$k$ is defined as the fraction of relevant documents among the top-$k$ results of the ranked list, i.e., $P@k = \frac{\#\{\text{relevant documents in the top-}k\}}{k}$.

Consider a ranked result list containing 8 documents where, say, the results at rank positions 1, 2, 4, and 7 are relevant and the remaining ones are not.

According to the definition, we can compute P@$k$ for $k = 1, \ldots, 8$, as follows:

– P@$1$ = How many relevant documents out of top-$1$ retrieved documents? = $1/1 = 1$

– P@$2$ = How many relevant documents out of top-$2$ retrieved documents? = $2/2 = 1$

– P@$3$ = How many relevant documents out of top-$3$ retrieved documents? = $2/3$

– …

– P@$8$ = How many relevant documents out of top-$8$ retrieved documents? = $4/8 = 1/2$
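A minimal P@$k$ helper (the 0/1 list below encodes a hypothetical ranking with relevant results at positions 1, 2, 4, and 7):

```python
def precision_at_k(rels: list[int], k: int) -> float:
    """P@k: fraction of relevant documents among the top-k results."""
    assert 1 <= k <= len(rels)
    return sum(rels[:k]) / k

ranking = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = relevant, 0 = not relevant
print(precision_at_k(ranking, 1))   # 1.0
print(precision_at_k(ranking, 8))   # 0.5
```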

**Mean Average Precision (MAP)**

Consider the ranked result list above, and focus only on the rank positions of the relevant documents, i.e., $R = \{r_1, \ldots, r_m\} \subseteq \{1, \ldots, n\}$ (where $n$ is the total number of retrieved documents for $q$). For instance, in the example above we have $R = \{1, 2, 4, 7\}$.

Then, we compute the Precision@$r$ for each $r \in R$, and we take the average:

$$AP(q) = \frac{1}{|R|}\sum_{r \in R} P@r.$$

$AP(q)$ is the average precision, which in the example above is:

$$AP(q) = \frac{1}{4}\left(\frac{1}{1} + \frac{2}{2} + \frac{3}{4} + \frac{4}{7}\right) \approx 0.83.$$

Now, assume we have to measure the Mean Average Precision (MAP) of an IR system for multiple queries and/or for multiple rankings given the same query. Let this total set of queries/rankings be $Q$; then we simply compute the following:

$$MAP = \frac{1}{|Q|}\sum_{q \in Q} AP(q).$$
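A sketch of AP and MAP under the same 0/1 relevance encoding (function names are illustrative):

```python
def average_precision(rels: list[int]) -> float:
    """AP: average of P@r over the rank positions r of the relevant results."""
    hits, precisions = 0, []
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)   # P@i at each relevant position i
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings: list[list[int]]) -> float:
    """MAP: AP averaged over the rankings of all queries in Q."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Relevant results at ranks 1, 2, 4, 7 -> AP = (1/1 + 2/2 + 3/4 + 4/7) / 4
print(average_precision([1, 1, 0, 1, 0, 0, 1, 0]))
```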

**Mean Reciprocal Rank (MRR)**

Given a query $q$ and a list of results, we can define $rank_q$ as the rank of the (first) result that is relevant to $q$, i.e., the position in the list indicating the (first) relevant result. The reciprocal of this number is called the **reciprocal rank** for the query $q$, and is computed as $\frac{1}{rank_q}$.

The Mean Reciprocal Rank (MRR) is thus the average of the reciprocal ranks of results for a sample of queries $Q$:

$$MRR = \frac{1}{|Q|}\sum_{q \in Q} \frac{1}{rank_q}.$$

Some issues arise either when no relevant result appears in the returned list (in that case, the reciprocal rank for that query is set to $0$) or when multiple relevant results are returned (in that case, MAP should be used instead).
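A corresponding sketch for MRR (again with 0/1 relevance lists):

```python
def reciprocal_rank(rels: list[int]) -> float:
    """1 / rank of the first relevant result; 0.0 if nothing relevant is returned."""
    for i, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def mean_reciprocal_rank(rankings: list[list[int]]) -> float:
    """MRR: reciprocal ranks averaged over the queries in the sample."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

# First relevant hit at rank 1, at rank 3, and never: (1 + 1/3 + 0) / 3
print(mean_reciprocal_rank([[1, 0], [0, 0, 1], [0, 0]]))
```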

**Multi-graded Relevance**

Instead of using a binary graded relevance score, where a document is considered either as relevant or not, we use multi-graded relevance.

The rationale of this choice is that a document can be *marginally* relevant in some circumstances. For instance, even a highly relevant document can be redundant if there exist other documents in the result list that contain the same information.

We still assume $A_q$ is the set of the retrieved results (i.e., documents) that are supposed to be relevant to the query $q$. In addition, we denote by $rel$ the relevance function, which maps each document $d \in A_q$ to the corresponding relevance score $rel(d)$. Note that the binary graded relevance score seen above is a special case of this definition of relevance function, where $rel(d) \in \{0, 1\}$.

In the following, we present 3 measures which apply in the case of multi-graded relevance.

**Cumulative Gain (CG)**

Cumulative Gain (CG) is the predecessor of DCG (see below) and does not take the position of a result into account when assessing the usefulness of a result set. Concretely, it is the sum of the graded relevance scores of all results in a search result list.

The CG with respect to a query $q$ at a particular rank position $p$ is defined as:

$$CG_p = \sum_{i=1}^{p} rel_i,$$

where $rel_i$ is the graded relevance of the result at position $i$.

It turns out that the value computed with the CG function is unaffected by changes in the ordering of search results. That is, moving a highly relevant document above a higher ranked, less relevant, document does not change the computed value for CG.

However, we might want the following:

– Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks)

– Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than irrelevant documents.

In order to deal with the two assumptions above, Discounted Cumulative Gain (DCG) is used instead.

**Discounted Cumulative Gain (DCG)**

The rationale of DCG is that highly relevant documents appearing lower in a search result list should be *penalized*, as the graded relevance value is reduced *logarithmically* proportionally to the position of the result. Typically, this means that the relevance score at rank position $i$ is multiplied by a discount factor equal to $\frac{1}{\log_2(i+1)}$.

Overall, the DCG with respect to a query $q$ accumulated at a particular rank position $p$, $DCG_p$, is thus defined as:

$$DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}.$$

Alternatively, if we want to put more emphasis on relevant results retrieved, we can compute the $DCG_p$ as follows:

$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}.$$

Note that there is no restriction on the base of the logarithm to be used; choosing the base affects the weight of the discount. For instance, using the typical base $2$, the discount at rank position $i = 3$ would be $\frac{1}{\log_2(3+1)} = \frac{1}{2}$. Of course, using another logarithmic base would yield a stronger or a weaker penalization.
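Both DCG variants can be sketched as follows, using base-2 logarithms:

```python
import math

def dcg(rels: list[float], p: int) -> float:
    """DCG_p = sum_{i=1..p} rel_i / log2(i + 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:p], start=1))

def dcg_emphasized(rels: list[float], p: int) -> float:
    """Variant boosting highly relevant results: (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:p], start=1))
```

Note that with binary $0/1$ relevance scores the two variants coincide, since $2^1 - 1 = 1$ and $2^0 - 1 = 0$.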

**Normalized Discounted Cumulative Gain (nDCG)**

Search result lists vary in length depending on the query $q$. Comparing a search engine's performance from one query to the next cannot be consistently achieved using DCG alone, so the cumulative gain at each position for a chosen value of $p$ should be normalized across queries. This is done by sorting the documents of a result list by relevance, producing the maximum possible DCG up to position $p$, also called the Ideal DCG (IDCG) up to that position.

For a query $q$, the *normalized* discounted cumulative gain (nDCG) at rank position $p$ is therefore computed as:

$$nDCG_p = \frac{DCG_p}{IDCG_p}.$$

If we want a measure of the average performance of a search engine's ranking algorithm on a set of queries $Q$, the nDCG values for all queries can be averaged.

Note also that for a perfect ranking algorithm, the $DCG_p$ will be the same as the $IDCG_p$, producing an nDCG of $1.0$. It turns out that $0 \leq nDCG_p \leq 1$, and so nDCG values are cross-query comparable.

The main difficulty encountered in using nDCG is the unavailability of an ideal ordering of results when only partial relevance feedback is available.
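A sketch of nDCG, obtaining the ideal DCG by sorting the scores in decreasing order:

```python
import math

def dcg(rels: list[float], p: int) -> float:
    """DCG_p with the standard 1 / log2(i + 1) discount."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:p], start=1))

def ndcg(rels: list[float], p: int) -> float:
    """nDCG_p = DCG_p / IDCG_p; IDCG uses the relevance-sorted (ideal) ranking."""
    idcg = dcg(sorted(rels, reverse=True), p)
    return dcg(rels, p) / idcg if idcg > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2], 6))  # ~0.961
```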

**Example**

Suppose we have the following ranked list of documents $D_1, D_2, \ldots, D_6$, each of those scored on the scale $0$–$3$, where $0$ means totally not relevant, $3$ completely relevant, and $1$ and $2$ meaning "somewhere in between":

$$rel = (3, 2, 3, 0, 1, 2).$$

This means that the first-ranked document has relevance score $3$, the second-ranked document has relevance score $2$, etc.

Let's compute the CG for this list, namely $CG_6$. By the definition above it follows that:

$$CG_6 = \sum_{i=1}^{6} rel_i = 3 + 2 + 3 + 0 + 1 + 2 = 11.$$

As we said before, CG is not affected by the ordering of the results in the ranked list. For instance, if we swap the third document (scored with $3$) with the fourth (scored with $0$), we still obtain the same value $CG_6 = 11$, though the ranked list appears different.

DCG is used to emphasize highly relevant documents appearing early in the result list. Using the logarithmic scale for reduction, the DCG of each result in order is:

| $i$ | $rel_i$ | $\log_2(i+1)$ | $\frac{rel_i}{\log_2(i+1)}$ |
|---|---|---|---|
| 1 | 3 | 1 | 3 |
| 2 | 2 | 1.585 | 1.262 |
| 3 | 3 | 2 | 1.5 |
| 4 | 0 | 2.322 | 0 |
| 5 | 1 | 2.585 | 0.387 |
| 6 | 2 | 2.807 | 0.712 |

Therefore, $DCG_6$ can be computed as follows:

$$DCG_6 = \sum_{i=1}^{6} \frac{rel_i}{\log_2(i+1)} = 3 + 1.262 + 1.5 + 0 + 0.387 + 0.712 = 6.861.$$

Now a switch of the third and fourth documents results in a reduced $DCG_6 \approx 6.653$, because a less relevant document is placed higher in the ranking; i.e., a more relevant document is discounted more by being placed at a lower rank.

The performance on this query cannot be directly compared to the performance on another query in this form, since the other query may have more results, yielding a larger overall DCG that is not necessarily better. In order to compare, the DCG values must be normalized.

To normalize DCG values, an ideal ordering for the given query is needed. For this example, that ordering would be the monotonically decreasing sort of the relevance judgments provided by the experiment participant, which is:

$$rel^{ideal} = (3, 3, 2, 2, 1, 0).$$

The DCG of the *ideal* ordering above, or IDCG, is then:

$$IDCG_6 = 3 + 1.893 + 1 + 0.861 + 0.387 + 0 = 7.141.$$

Therefore, the nDCG for this query is given as:

$$nDCG_6 = \frac{DCG_6}{IDCG_6} = \frac{6.861}{7.141} \approx 0.961.$$
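The whole worked example can be checked in a few lines, with the relevance scores $(3, 2, 3, 0, 1, 2)$ as above:

```python
import math

rels = [3, 2, 3, 0, 1, 2]  # relevance scores in ranked order

cg = sum(rels)
dcg6 = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
idcg6 = sum(r / math.log2(i + 1)
            for i, r in enumerate(sorted(rels, reverse=True), start=1))

print(cg, round(dcg6, 3), round(idcg6, 3), round(dcg6 / idcg6, 3))
# 11 6.861 7.141 0.961
```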