Ranked Probability Score in Python

Ranked Probability Score is an unusual error metric that was designed for meteorology but has since more often been used for assessing models that predict football (soccer) matches. It was forced upon me by its use in the M6 forecasting competition.

The purpose of the ranked probability score is to provide an accuracy metric for a probabilistic prediction of a categorical outcome where the categories are ranked/ordered. In practical terms: if you predict something to come in first place and it actually comes in last place, you are penalized more than if you had predicted first place and the actual outcome was second place. Being right is best, but being close is okay with this metric.
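To make that concrete, here is a minimal sketch (my own toy numbers, not from any source data) scoring a single three-category forecast against a near miss and a distant miss:

```python
import numpy as np

def rps_single(forecast, outcome):
    """RPS for one prediction: squared differences between the
    cumulative forecast and the cumulative observed outcome,
    averaged over (number of categories - 1)."""
    cum_diff = np.cumsum(forecast) - np.cumsum(outcome)
    return np.sum(cum_diff ** 2) / (len(forecast) - 1)

# Forecast over three ordered places: 1st, 2nd, 3rd
forecast = np.array([0.7, 0.2, 0.1])

came_second = np.array([0, 1, 0])  # near miss: predicted 1st, came 2nd
came_last = np.array([0, 0, 1])    # distant miss: predicted 1st, came 3rd

print(round(rps_single(forecast, came_second), 5))  # 0.25
print(round(rps_single(forecast, came_last), 5))    # 0.65
```

The same forecast is punished more when the outcome lands further from the predicted category, which is exactly the "close counts" property described above.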

With the M6 competition, for example, the outcome to predict is the ranked order of stock price performances. Personally, I think it would make much more sense to measure the error of the forecast prices directly (MAE, RMSE, etc.) and skip the ranking altogether. Even in football, winning often has a value of 3 points, a draw a value of 1 point, and a loss a value of 0, which maps outcomes onto a numeric scale and bypasses the need for the ranked probability score.

As you can see, I am rather skeptical of the value of this metric. However, the paper that is the source of the sample data below does a decent job of selling the value of the metric. Constantinou, Anthony Costa, and Norman Elliott Fenton. “Solving the problem of inadequate scoring rules for assessing probabilistic football forecast models.” Journal of Quantitative Analysis in Sports 8.1 (2012).

And here is the code in vectorized numpy/pandas form:

import numpy as np
import pandas as pd

def rps(predictions, observed):
    """Vectorized version of Ranked Probability Score.
    A lower value is a better score.
    From: Colin Catlin, https://syllepsis.live/2022/01/22/ranked-probability-score-in-python/

    Args:
        predictions (pd.DataFrame): each column is an outcome category
            with values as the 0 to 1 probability of that category
        observed (pd.DataFrame): each column is an outcome category
            with values of 0 OR 1 with 1 being that category occurred
    """
    assert (
        predictions.shape == observed.shape
    ), "prediction and observed array shapes must match"
    ncat = predictions.shape[1] - 1
    return np.sum(
        (np.cumsum(predictions, axis=1) - np.cumsum(observed, axis=1)) ** 2, axis=1
    ) / ncat

The sample data is of soccer/football matches: 5 matches, with 2 different models predicting ('a' being the better model), and 3 categories, where H = home team wins, D = draw, and A = away team wins. The order of the columns (H, D, A) matters, which is the whole point of this metric.
predictions_df = pd.DataFrame({
    'H': [1, 0.9, 0.8, 0.5, 0.35, 0.6, 0.6, 0.6, 0.5, 0.55],
    'D': [0, 0.1, 0.1, 0.25, 0.3, 0.3, 0.3, 0.1, 0.45, 0.1],
    'A': [0, 0, 0.1, 0.25, 0.35, 0.1, 0.1, 0.3, 0.05, 0.35],
    "model": np.tile(["a", "b"], 5),
    "match": np.repeat(np.arange(1, 6), 2),
    "outcome": np.repeat(["H", "H", "D", "H", "H"], 2),
})
expected_result = [0, 0.005, 0.025, 0.15625, 0.1225, 0.185, 0.085, 0.125, 0.12625, 0.1625]
predictions = predictions_df[["H", "D", "A"]]
observed = pd.get_dummies(predictions_df["outcome"]).reindex(
    columns=predictions.columns, fill_value=0
)
rps_result = rps(predictions, observed).round(5)
print(f"do results match expected: {all(expected_result == rps_result)}")
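Once each row has a score, a natural next step is comparing models by their average RPS. This grouping step is my own extension of the example, not part of the original, so treat it as a sketch:

```python
import numpy as np
import pandas as pd

def rps(predictions, observed):
    # Vectorized RPS: mean squared difference of cumulative probabilities
    ncat = predictions.shape[1] - 1
    return np.sum(
        (np.cumsum(predictions, axis=1) - np.cumsum(observed, axis=1)) ** 2, axis=1
    ) / ncat

predictions_df = pd.DataFrame({
    'H': [1, 0.9, 0.8, 0.5, 0.35, 0.6, 0.6, 0.6, 0.5, 0.55],
    'D': [0, 0.1, 0.1, 0.25, 0.3, 0.3, 0.3, 0.1, 0.45, 0.1],
    'A': [0, 0, 0.1, 0.25, 0.35, 0.1, 0.1, 0.3, 0.05, 0.35],
    "model": np.tile(["a", "b"], 5),
    "outcome": np.repeat(["H", "H", "D", "H", "H"], 2),
})
predictions = predictions_df[["H", "D", "A"]]
# astype(float) guards against newer pandas returning bool dummies
observed = pd.get_dummies(predictions_df["outcome"]).reindex(
    columns=predictions.columns, fill_value=0
).astype(float)

predictions_df["rps"] = rps(predictions, observed)
mean_rps = predictions_df.groupby("model")["rps"].mean()
print(mean_rps)  # model 'a' should come out lower (better)
```

On the sample data, model 'a' earns the lower average score, matching the setup where 'a' is the better model.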


And here is the starting version, a straightforward loop:

def rps_old(predictions, observed):
    """Loop-based Ranked Probability Score.
    Lower rps is better.
    Inspired by:
    which itself looks to have been inspired by:
    Calculate the ranked probability score in R
    all referencing: https://qmro.qmul.ac.uk/xmlui/bitstream/handle/123456789/10783/Constantinou%20Solving%20the%20Problem%20of%20Inadequate%202012%20Accepted.pdf
    """
    assert (
        predictions.shape == observed.shape
    ), "prediction and observed shapes must match"
    ncat = predictions.shape[1]
    npred = predictions.shape[0]
    rps = np.zeros(npred)
    for x in range(0, npred):
        cumulative = 0
        for i in range(1, ncat):
            cumulative += (
                sum(predictions.iloc[x, 0:i]) - sum(observed.iloc[x, 0:i])
            ) ** 2
        rps[x] = cumulative / (ncat - 1)
    return rps

rps_result_old = rps_old(predictions, observed).round(5)
print(f"do results match expected: {all(expected_result == rps_result_old)}")
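As a quick sanity check (my own addition, with the docstrings and asserts trimmed for brevity), the two versions can be compared on random data to confirm they agree. Note the vectorized version also includes the final cumulative term, which is ~0 whenever each row's probabilities sum to 1:

```python
import numpy as np
import pandas as pd

def rps(predictions, observed):
    # Vectorized version
    ncat = predictions.shape[1] - 1
    return np.sum(
        (np.cumsum(predictions, axis=1) - np.cumsum(observed, axis=1)) ** 2, axis=1
    ) / ncat

def rps_old(predictions, observed):
    # Loop version
    ncat = predictions.shape[1]
    npred = predictions.shape[0]
    out = np.zeros(npred)
    for x in range(npred):
        cumulative = 0
        for i in range(1, ncat):
            cumulative += (
                sum(predictions.iloc[x, 0:i]) - sum(observed.iloc[x, 0:i])
            ) ** 2
        out[x] = cumulative / (ncat - 1)
    return out

# Random forecasts normalized to sum to 1, and one-hot observed outcomes
rng = np.random.default_rng(0)
raw = rng.random((1000, 3))
predictions = pd.DataFrame(raw / raw.sum(axis=1, keepdims=True))
observed = pd.DataFrame(np.eye(3)[rng.integers(0, 3, 1000)])

assert np.allclose(rps(predictions, observed), rps_old(predictions, observed))
```

The vectorized version avoids the per-row Python loop and the repeated `.iloc` indexing, which is where the loop version spends most of its time.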

Murphy, A. H. (1969). "On the 'Ranked Probability Score'." Journal of Applied Meteorology and Climatology, 8(6).
