Some IR Evaluation Metrics
· PRECISION
It is the fraction of relevant documents retrieved out of the total number of documents retrieved.
Precision takes all the retrieved documents into account, but it can also be evaluated at a given number of retrieved results, considering only the topmost results. For example, we may calculate precision only on the top 100 results out of all the results. This is called precision at n, or P@n.
· RECALL
It is the fraction of relevant documents retrieved out of the total number of relevant documents in the database.
Precision and recall alone are not good measures because they are inversely related to each other and may give different information about the quality of the search results for the same query.
· F-SCORE
If we combine precision and recall, we get a better evaluation metric known as the F-Score. It is the harmonic mean of precision and recall and incorporates the complementary results of both into a single value. Below is the equation of the balanced F-Score:

F = 2 · (precision · recall) / (precision + recall)
In the above formula, precision and recall are equally weighted. Sometimes, a user may wish to assign more weight to either precision or recall. For that, they may use the general equation of the F-Score given below, where β is a non-negative real number:

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)
In general terms, the balanced F-Score is known as F1. Two other commonly used F-Scores are F2 (which gives twice the weight to recall as to precision) and F0.5 (which gives twice the weight to precision as to recall).
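The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, assuming the retrieved and relevant documents are given as sets of document IDs (the example sets at the bottom are made up):

```python
# A minimal sketch of precision, recall and the F-Score, assuming the
# retrieved and relevant documents are represented as sets of doc IDs.

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

def f_score(retrieved, relevant, beta=1.0):
    """General F-Score; beta=1 gives the balanced F1,
    beta=2 gives F2, beta=0.5 gives F0.5."""
    p = precision(retrieved, relevant)
    r = recall(retrieved, relevant)
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical example: 5 results retrieved, 2 of the 4 relevant docs found.
retrieved = {1, 2, 3, 4, 5}
relevant = {1, 4, 6, 7}
# precision = 2/5 = 0.4, recall = 2/4 = 0.5
```

Note how raising `beta` above 1 pulls the score toward recall and lowering it below 1 pulls it toward precision, which is exactly the F2/F0.5 distinction described above.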
· MEAN AVERAGE PRECISION
Precision, Recall and F-Score are based on the entire list of documents in the database, which is not very useful when we are dealing with query searching over the Internet, both because of the size of the database and because Internet queries return ranked results.
So, we use another metric which overcomes the above problem, known as Mean Average Precision. But before understanding Mean Average Precision, you need to understand Average Precision.
§ AVERAGE PRECISION
It is the average of the precision scores for a single query, calculated after each relevant result is retrieved. The formula for average precision is given below:

AP = ( Σ P(k) · rel(k), for k = 1 to n ) / (total number of relevant documents)
Where P(k) is the precision at rank k (P@k) and rel(k) is the change in recall from rank k−1 to k multiplied by the total number of relevant documents; in other words, rel(k) is 1 if the result at rank k is relevant and 0 otherwise. For a clearer understanding, consider the example below:
Let’s suppose the query returned 7 results out of which only the 1st, 4th, 5th and 6th are relevant.
Now, the values of rel(k) at the different ranks will be:

rel(1) = 1, rel(2) = 0, rel(3) = 0, rel(4) = 1, rel(5) = 1, rel(6) = 1, rel(7) = 0
Notice that the recall at rank 1 will be 1/(total relevant documents) and the recall at rank 2 will also be 1/(total relevant documents), since recall only counts relevant docs, and so rel(2) will be 0 · (total relevant docs), which is 0.
The precision at the different ranks will be:

P(1) = 1/1, P(2) = 1/2, P(3) = 1/3, P(4) = 2/4, P(5) = 3/5, P(6) = 4/6, P(7) = 4/7
Finally, Average Precision = (1·1 + (1/2)·0 + (1/3)·0 + (2/4)·1 + (3/5)·1 + (4/6)·1 + (4/7)·0) / 4 = (1 + 0 + 0 + 0.5 + 0.6 + 0.67 + 0) / 4 ≈ 0.69
Please notice that we divide the entire sum by 4 (the total number of relevant documents) and not by 7 (the total number of retrieved documents).
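The worked example above can be checked with a short Python sketch, assuming the ranked results are given as a 0/1 list of relevance judgements:

```python
# A sketch of Average Precision from a ranked list of 0/1 relevance
# judgements. total_relevant is the number of relevant documents in
# the whole collection (the AP denominator), not the list length.

def average_precision(relevances, total_relevant):
    """relevances[k-1] is 1 if the result at rank k is relevant, else 0."""
    score = 0.0
    hits = 0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            score += hits / k      # P@k, counted only at relevant ranks
    return score / total_relevant if total_relevant else 0.0

# The example above: 7 results, ranks 1, 4, 5 and 6 relevant.
ap = average_precision([1, 0, 0, 1, 1, 1, 0], total_relevant=4)
# (1/1 + 2/4 + 3/5 + 4/6) / 4 ≈ 0.69
```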
Mean Average Precision (MAP) is an extension of Average Precision. In MAP, we take the mean of the Average Precisions for different queries in a set of queries (commonly known as a query batch). Note that in MAP, the ranking of relevant documents is very important because of the P@k term in the formula of Average Precision. Below is the formula for MAP:

MAP = ( Σ AP(q), for each query q in the batch ) / Q
Where Q is the number of queries in the query batch.
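To make the query-batch idea concrete, here is a small sketch of MAP in the same spirit; the relevance lists in `batch` are hypothetical and only serve as illustration:

```python
# A sketch of Mean Average Precision over a query batch: each query
# contributes one Average Precision value and MAP is their mean.

def mean_average_precision(per_query):
    """per_query: list of (relevances, total_relevant) pairs, one per query,
    where relevances is a 0/1 list in rank order."""
    def ap(relevances, total_relevant):
        score, hits = 0.0, 0
        for k, rel in enumerate(relevances, start=1):
            if rel:
                hits += 1
                score += hits / k
        return score / total_relevant if total_relevant else 0.0
    return sum(ap(r, n) for r, n in per_query) / len(per_query)

batch = [
    ([1, 0, 0, 1, 1, 1, 0], 4),   # the worked example, AP ≈ 0.69
    ([0, 1, 1, 0, 0, 0, 0], 2),   # hypothetical query: AP = (1/2 + 2/3) / 2
]
```

Because each AP term is built from P@k at the relevant ranks, pushing a relevant document higher in any single query's ranking raises the whole batch's MAP.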
· MEAN RECIPROCAL RANK
Mean Reciprocal Rank is another metric which is useful when we don't know the entire dataset and the results are ranked. Let's first start with understanding reciprocal rank.
§ RECIPROCAL RANK
It is a very simple concept. The reciprocal rank of a query is the multiplicative inverse of the rank of the first relevant result in that query's output; the ranks of later relevant results are ignored.
For example: if the position of the first correct result is 2, then the Reciprocal Rank (RR) for that query is 1/2. If none of the results is correct, then the RR for that query is 0.
The Mean Reciprocal Rank is the mean of the RRs for a set of queries (a query batch). The formula is given below:

MRR = ( Σ 1/rank_i, for i = 1 to Q ) / Q
Where Q is the number of queries in the query batch and rank_i is the position of the first relevant result for the i-th query.
Please note that MRR only considers the rank of the first relevant result; it does not take into account the ranks of subsequent relevant results, whereas Mean Average Precision takes into account the ranks of all the relevant results up to a particular rank. Both MAP and MRR find their applications in certain areas.
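Both RR and MRR fit in a few lines of Python. This sketch assumes, as before, that each query's ranked results are given as a 0/1 relevance list; the batch data is made up:

```python
# A sketch of Reciprocal Rank and Mean Reciprocal Rank, assuming each
# query's results are a 0/1 relevance list in rank order.

def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for k, rel in enumerate(relevances, start=1):
        if rel:
            return 1 / k
    return 0.0

def mean_reciprocal_rank(query_batch):
    return sum(reciprocal_rank(r) for r in query_batch) / len(query_batch)

batch = [
    [0, 1, 0],      # first relevant result at rank 2 -> RR = 1/2
    [1, 0, 0],      # first relevant result at rank 1 -> RR = 1
    [0, 0, 0],      # no relevant result              -> RR = 0
]
mrr = mean_reciprocal_rank(batch)   # (1/2 + 1 + 0) / 3 = 0.5
```

Notice that adding extra relevant results lower down any of these lists would change MAP but leave MRR untouched, which is exactly the contrast made above.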
There are many more IR evaluation metrics which I have not covered in this blog, but I think these are some of the most commonly used. If you are interested in knowing them too, you might wanna take a look at Wikipedia's article: "Evaluation measures (information retrieval)".