How to Use Betfair's Historical Data

July 27, 2020 by SportsBettingQuant

The betting exchange Betfair offers historical data for download including time-stamped price data for various sports. Customers can select and download the data at https://historicdata.betfair.com either in the basic, advanced or pro category. Whilst basic data is currently for free, it only covers 1 timestamp every minute. For advanced or pro you would need to pay some money to get higher resolution data (every second for advanced or 50ms frequency for pro).

Extracting the tar Archive

Once you selected the data of interest you can download a tar archive and save it as data.tar on your system. In this post I will use the soccer data for June 2020 which has around 184MB and includes around 78.000 markets - according to Betfair's site. After downloading the data was actually larger than indicated on their website.

You might wonder how to navigate the folder structure to actually find the data. If you are familiar with Betfair's data model you might identify that the folder represent the year, month, date, event id and market id. For every market there is a archive (bz2) file, that actually contains the market data.

Historical Betfair Exchange Data Folder Structure

Looking at the Market Data Files (bz2 Files)

Next we will address the question how we can actually view the data. After extracting a bz2 file and opening it with the editor you will see something like the following:

{"op":"mcm","clk":"1234","pt":1234,"mc":[{"id":"1.1234","marketDefinition":{"bspMarket":false,"turnInPlayEnabled":true,"persistenceEnabled":true,"marketBaseRate":1.2,"eventId":"1234","eventTypeId":"1","numberOfWinners":1}}]}
{"op":"mcm","clk":"12345","pt":12345,"mc":[{"id":"1.1234","rc":[{"ltp":1.9,"id":1}]}]}
{"op":"mcm","clk":"12345","pt":12345,"mc":[{"id":"1.1234","rc":[{"ltp":1.6,"id":2}]}]}
{"op":"mcm","clk":"12345","pt":12345,"mc":[{"id":"1.1234","rc":[{"ltp":4.3,"id":1}]}]}
{"op":"mcm","clk":"12345","pt":12345,"mc":[{"id":"1.1234","rc":[{"ltp":6.9,"id":1}]}]}
{"op":"mcm","clk":"12345","pt":12345,"mc":[{"id":"1.1234","rc":[{"ltp":4.1,"id":1}]}]}

You might be surprised by the data format and processing the data for some useful analysis might impose a challenge. Betfair provides a pdf with an explanation of the data format.

Processing Betfair's Historical Exchange Data with Python

Since the raw data might be very handy for analysis I would like to develop a Python script that can extract the last price traded for selections in a betting market.

import pandas as pd
import json
from bz2 import BZ2File

file_name = "1.170604916.bz2"

with BZ2File(file_name, "r") as f:
    lines = f.readlines()

price_changes = []

for line in lines:
    content = json.loads(line)
    market_changes = content.get("mc", [])
    for market_change in market_changes:
        market_id = market_change.get("id")
        market_definition = market_change.get("marketDefinition")

        runner_changes = market_change.get("rc")
        if runner_changes:
            for runner_change in runner_changes:
                print(runner_change)
                price_changes.append([content.get("pt"), runner_change.get("id"), runner_change.get("ltp")])

df = pd.DataFrame(price_changes, columns=["timestamp", "selection_id", "price"])
df.to_csv("prices.csv", index=False)

With the Python script I basically extract a bz2 file into memory and iterate over all the lines. For every line the JSON is parsed and then I check for market changes and filter further for runner changes. Every runner change is added to a list and then a Pandas dataframe object is created from that list. In the end a csv file is created containing the timestamp, the selection id and the last price traded. The csv file looks like the following extract and is in a format that can be easily worked with in Excel for instance:

  timestamp selection_id price
0 123456789 1 3.75
1 123456789 2 3.65
2 123456789 3 1.94

Saving Betfair's Historical Data for Analysis

If you would like to save the raw data or some processed data that is still close to the raw data and if you would like to keep the JSON format then a NoSQL JSON storage such as MongoDB might be a good option. PostgreSQL now also has some JSON support and might be a viable option to store the data. Again, it depends a lot on how the extracted data looks like and what kind of analysis you'd like to do.

If your data processing pipeline generates some relational data then again PostgreSQL might be a solid option to save the processed data. If you prefer to analyse and view the data in software like Excel, then saving the data as csv file might be the best option. It seems that Betfair also published a Excel file on their github that includes a VBA macro to process the raw historical data and draws a chart with the last price traded per selection.

Making Use of Historical Betfair Data

After you have a basic understanding of how the data looks like you might ask yourself what you can really do with it and how you can benefit from historical data.

Extracting Starting Price Information

One of the things you can do with Betfair's historical data is extracting starting price information. For this i can create a Python script that iterates through the folder structure, opens the bz2 files and checks for messages that contain the bsp information. The messages with starting prices are collected and saved to a csv file.

First i Import required Python packages and define a message that checks if it contains relevant starting price information. Such message should contain market change information and have bspReconciled set to true. I am also limiting analysis to win markets with country code GB.

import json
from bz2 import BZ2File
from pathlib import Path
import pandas as pd


def filter_message(message) -> bool:
    """Check if message is relevant."""
    try:
        md = message.get("mc")[0].get("marketDefinition")
        if md["bspReconciled"] and md["marketType"] == "WIN" and md["countryCode"] == "GB":
            return True
        else:
            return False
    except:
        return False

Next I define function that converts the message into a list. All the runners are appended to the list so that I can later create a csv file from it. The information that I keep is the timestamp of the message, market ID, market start, venue, name of the selection, starting price and the outcome (winner / loser / removed).

def parse_sp_info(message: dict):
    """Extract relevant information from message."""
    market_change_message = message.get("mc")[0]
    market_definition = market_change_message["marketDefinition"]
    return [
        (
            message.get("pt"),
            market_change_message["id"],
            market_definition["marketTime"],
            market_definition["venue"],
            runner.get("name"),
            runner.get("bsp"),
            runner.get("status"),
        )
        for runner in market_definition["runners"] if runner.get("bsp")
    ]

The rest of the Python script is straight forward: Just some for loops that go through the folder structure to find relevant bz2 files, open them and process the files line by line. In the main part you can define which data should be used. In the end the data is saved to a csv file.

def process_event_dir(event_dir: Path):
    """Process a directory with bz2 files."""
    print(f"Processing dir {event_dir}")
    runners = []
    market_files = [f for f in event_dir.iterdir() if f.name.startswith("1.")]
    for market_file in [f for f in market_files if f.suffix == ".bz2"]:
        with BZ2File(market_file, "r") as f:
            lines = f.readlines()
            sp_info = []
            for line in lines:
                content = json.loads(line)
                if filter_message(content):
                    sp_info = parse_sp_info(content)
            runners.extend(sp_info)
    return runners


def extract_bsp(plans, years, months):
    """Loop through folder structure and process directories with bz2files."""
    runners = []
    for plan in plans:
        for year in years:
            for month in months:
                day_paths = [f for f in (Path(plan) / year / month).iterdir() if f.is_dir()]
                for day_path in day_paths:
                    for event_dir in day_path.iterdir():
                        runners.extend(process_event_dir(event_dir))
                break
    return runners


if __name__ == "__main__":
    plans = ("BASIC",)
    years = ("2020",)
    months = ("Dec",)
    runners = extract_bsp(plans, years, months)
    df = pd.DataFrame(
        runners,
        columns=["message_ts", "market_id", "event_dt", "venue", "selection", "bsp", "status"]
    )
    df.sort_values(by="event_dt", inplace=True)
    df.to_csv("runners.csv", index=False)

The csv file has the following structure:

Timestamp MarketId MarketStart Venue Selection SP Result
1606844 1.17615 2020-12-01T17:30:00.000Z Wolverhampton Horse 1 11.21 WINNER
1606844 1.17615 2020-12-01T17:30:00.000Z Wolverhampton Horse 2 66.48 LOSER

Now you can use the csv file for some simple analysis which might work well for a simple data set. Of course it would also be possible to save the data to a database and use SQL to query it.

When looking at the data in the csv file it exactly matched with the extract that I obtained following the procedure described in "Working with Historical Starting Price Data". This is a good sign and most likely the script processes the data accordingly.

Summary

For many customers Betfair's Historical Exchange data might not be very intuitive to use due to the data format and the complexity of the data structure. However, the great detail of the data allows all kind of backtests and with some code the data can be transformed into simple csv-like extracts that can be easily analysed in Excel or similar software to successfully test betting systems on historical data.

Do you like our content? Please share with your friends!

Share on Facebook Share on Twitter

Comments

No comments published yet.

Please log in to leave a comment.