Evaluation Metrics for Recommendation Systems

This video explores how one can evaluate recommender systems.

Evaluating a recommender system involves checking (1) whether the right results are being recommended, and (2) whether the more relevant results are being recommended at the top, ahead of the less relevant ones.

There are two popular types of recommender systems: explicit feedback recommender systems and implicit feedback recommender systems. The metrics used for these differ slightly.

Explicit Feedback Recommender Systems

These are systems where the user gives explicit feedback, usually in the form of a numeric rating for each recommendation.

An example is a movie recommendation system where a user could give a numeric rating in the form of the number of stars for each recommended movie.

Metrics used in Explicit Recommender Systems

For such a system, the metrics used can be quite similar to those used in a standard regression problem, since the target is a numeric score you are predicting, and the actual score is available to measure how good the prediction is.

  1. Mean Absolute Error (MAE): Mean over all data points of the absolute difference between the actual rating and the predicted rating.
  2. Root Mean Square Error (RMSE): Square root of the mean over all data points of the squared difference between the actual rating and the predicted rating.
  3. R² score: One minus the ratio of the residual sum of squares (the squared prediction errors) to the total sum of squares (the squared deviations of the actual ratings from their mean); a score of 1 means perfect predictions. All three metrics are sketched in code below.
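
As a quick illustration, here is a minimal sketch of all three metrics in plain NumPy. The ratings below are made-up numbers, and libraries such as scikit-learn provide equivalent functions (mean_absolute_error, mean_squared_error, r2_score).

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: mean of |actual - predicted|."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    """Root Mean Square Error: sqrt of the mean of (actual - predicted)^2."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

def r2_score(actual, predicted):
    """R²: 1 - (residual sum of squares / total sum of squares)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1 - ss_res / ss_tot

# Example: 5-star ratings vs. model predictions (made-up numbers)
actual = [4, 3, 5, 2, 4]
predicted = [3.8, 2.5, 4.9, 2.4, 3.5]
print(mae(actual, predicted), rmse(actual, predicted), r2_score(actual, predicted))
```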

Implicit Feedback Recommender Systems

These are the more common form of recommendation systems, where the user is shown a set of recommendations and there is no explicit feedback. Instead, there is implicit feedback in the form of the user clicking on an item, adding it to the cart, and so on.

For instance, when you see a set of related products on an ecommerce website, you are unlikely to be asked to rate the quality of the recommendation! Rather, you would probably click on one or more of the products if you like them. And if you click on one of the first few products, that shows that the recommender also managed to get its order right, in addition to showing a relevant product.

Metrics used in Implicit Recommender Systems

Since the implicit feedback available is usually binary (“yes, the item is relevant” or “no, the item is not relevant”), the metrics are similar to those used for classification algorithms.

  • Precision: Of all recommended items, how many are relevant?
  • Recall: Of all relevant items, how many made it into the recommendations?
  • Precision@K: Of the top K recommended items, how many are relevant?
  • Recall@K: Of all the relevant items, how many made it into the top K recommendations?
  • Mean Average Precision (MAP): For each user, precision@k is computed at each rank k (from 1 to K) at which a relevant item appears, and these values are averaged to give the average precision; MAP is this value averaged over all users (see the sketch after this list).
  • Mean Average Recall (MAR): If there are K items recommended, recall@k averaged from k=1…K, then averaged over all users.
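
Below is a minimal sketch of precision@K, recall@K, and MAP, assuming each user's recommendations come as a ranked list of item ids and the set of truly relevant items is known. The item ids in the example are hypothetical.

```python
def precision_at_k(recommended, relevant, k):
    """Of the top-k recommended items, what fraction are relevant?"""
    top_k = recommended[:k]
    return len(set(top_k) & relevant) / k

def recall_at_k(recommended, relevant, k):
    """Of all relevant items, what fraction appear in the top-k recommendations?"""
    top_k = recommended[:k]
    return len(set(top_k) & relevant) / len(relevant)

def average_precision_at_k(recommended, relevant, k):
    """Precision@i averaged over the ranks i where a relevant item appears."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(all_recommended, all_relevant, k):
    """MAP: each user's average precision, averaged over all users."""
    return sum(
        average_precision_at_k(rec, rel, k)
        for rec, rel in zip(all_recommended, all_relevant)
    ) / len(all_recommended)

# Example with hypothetical item ids
recommended = ["a", "b", "c", "d", "e"]   # ranked list from the recommender
relevant = {"a", "c", "f"}                # items the user actually engaged with
print(precision_at_k(recommended, relevant, 3))        # 2/3
print(recall_at_k(recommended, relevant, 3))           # 2/3
print(average_precision_at_k(recommended, relevant, 5))
```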

For Recommenders with a Single Correct Response

There are situations where a single correct response makes sense. For instance, if you are scanning a product to identify it in a catalogue for billing, there is only one correct match among the items recommended based on the scan.

Mean Reciprocal Rank or Mean Percentile Rank

In such a case, a popular metric is the mean reciprocal rank (or the mean percentile rank). It involves computing the reciprocal of the rank of the correct item in the list of recommendations, averaged over all queries. If the correct item is at the top, the reciprocal rank is 1, and the value degrades as the correct item appears lower in the list.
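
Here is a minimal sketch of mean reciprocal rank, assuming each query (e.g., a product scan) has exactly one correct item; the SKU ids are hypothetical.

```python
def reciprocal_rank(recommended, correct_item):
    """1/rank of the correct item; 0 if it is missing from the list."""
    for rank, item in enumerate(recommended, start=1):
        if item == correct_item:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(all_recommended, correct_items):
    """Average reciprocal rank across all queries."""
    return sum(
        reciprocal_rank(rec, correct)
        for rec, correct in zip(all_recommended, correct_items)
    ) / len(correct_items)

# Example: two scans, each with one correct catalogue match (hypothetical ids)
queries = [["sku1", "sku7", "sku3"], ["sku5", "sku2"]]
answers = ["sku7", "sku5"]
print(mean_reciprocal_rank(queries, answers))  # (1/2 + 1/1) / 2 = 0.75
```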
