Some betting exchanges, such as Betfair, let you place bets at the starting price. In an earlier blog post we warned against using starting price bets. One reason is that they can reveal a successful betting strategy, which is exactly what this article shows.
In this article I will cover how reverse engineering can be applied to betting markets. I will focus on the collection of starting price data on the Betfair betting exchange and explain how a simple cluster analysis revealed a successful betting strategy.
Reverse Engineering Betting Strategies
The idea behind this blog post is that a bot places bets on a betting exchange at characteristic odds limits, at typical times (e.g. x minutes before the off) or using a characteristic staking plan. I deploy a simple script that collects this data from the exchange and saves it in a database for analysis. Different analysis techniques can then be used to identify patterns in the data. In this example I focus on principal component analysis (PCA) to visualize and explore the data. Recurring patterns should then point to the betting strategies that are being applied in the market. It might also be possible to train a machine learning model on the data, but that is something for a future blog post.
The following picture illustrates the approach:
Bets at Betfair Starting Price
For UK horse races, Betfair allows bets to be placed at the starting price, which means that the bets are matched at the start of the race. For both back and lay bets a limit can be set, and users can view the amounts and limits on the ladder interface, as shown in the following picture.
The problem is that this data is publicly visible and also accessible via the Betfair API.
Scraping the Data from Betfair
With a simple Python script it is possible to connect to the Betfair API and request the price data for starting price bets. A cron job runs every minute and polls the API for races starting within the next 2 hours, so the sampling rate is one snapshot per minute. The raw data is saved to a PostgreSQL database table with the following structure:
| ID | Polled at | Market ID | Race start | Event | Selection ID | Selection | Side | Odds limit | Amount |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 26781748 | Sporting John | back | 1.01 | 309.95 |
| 2 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 26781748 | Sporting John | back | 1.80 | 18.07 |
| 3 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 26781748 | Sporting John | back | 2.00 | 12.05 |
| 4 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 26781748 | Sporting John | back | 2.66 | 7.16 |
| 5 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 26781748 | Sporting John | lay | 1000.00 | 114.97 |
| 6 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 18286520 | Master Debonair | back | 1.01 | 1117.47 |
| 7 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 18286520 | Master Debonair | back | 1.73 | 12.05 |
| 8 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 18286520 | Master Debonair | back | 1.80 | 15.66 |
| 9 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 18286520 | Master Debonair | back | 2.66 | 15.54 |
| 10 | 2020-02-15 12:15:35 | 1.168759002 | 2020-02-15 13:15:00 | Ascot 15th Feb | 18286520 | Master Debonair | lay | 1000.00 | 12.05 |
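To make the step from API response to table rows more concrete, here is a minimal sketch of how one polled market could be flattened into rows like the ones above. It is not the original script. It assumes the response follows the shape documented for Betfair's `listMarketBook` operation when `SP_AVAILABLE` price data is requested, i.e. each runner carries an `sp` object with `backStakeTaken` and `layLiabilityTaken` ladders. The function name `flatten_sp_ladder`, the row keys and the `runner_names` lookup are hypothetical (runner names are not part of the market book and would have to come from a separate catalogue call). The sample input is written by hand, so no network access or login is needed to run it.

```python
from datetime import datetime


def flatten_sp_ladder(market_book, runner_names, polled_at):
    """Turn one listMarketBook-style market dict into flat rows.

    Assumption: runners carry an "sp" object with "backStakeTaken" and
    "layLiabilityTaken" lists of {"price", "size"} entries.
    """
    rows = []
    for runner in market_book.get("runners", []):
        sp = runner.get("sp") or {}
        ladders = (
            ("back", sp.get("backStakeTaken", [])),
            ("lay", sp.get("layLiabilityTaken", [])),
        )
        for side, ladder in ladders:
            for level in ladder:
                rows.append({
                    "polled_at": polled_at,
                    "market_id": market_book["marketId"],
                    "selection_id": runner["selectionId"],
                    "selection_name": runner_names.get(runner["selectionId"]),
                    "side": side,
                    "price_limit": level["price"],
                    "amount": level["size"],
                })
    return rows


# Hand-written sample in the assumed response shape, using values from the table.
sample_book = {
    "marketId": "1.168759002",
    "runners": [
        {
            "selectionId": 26781748,
            "sp": {
                "backStakeTaken": [
                    {"price": 1.01, "size": 309.95},
                    {"price": 1.80, "size": 18.07},
                ],
                "layLiabilityTaken": [{"price": 1000.0, "size": 114.97}],
            },
        }
    ],
}
names = {26781748: "Sporting John"}
rows = flatten_sp_ladder(sample_book, names, datetime(2020, 2, 15, 12, 15, 35))
for row in rows:
    print(row["selection_name"], row["side"], row["price_limit"], row["amount"])
```

Each returned dictionary corresponds to one table row and could be written to the database with any PostgreSQL client.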
Analyzing the Betting Data
Before the actual analysis, another transformation converts the raw data into a format that is more suitable for analysis. The script polls the accumulated amount placed at each odds limit. For the analysis I am more interested in the change from one poll to the next, i.e. the first difference of that series, so I convert the raw data into the following structure with a Python script:
|269547||1.168759002||Ascot 15th Feb||2020-02-15 13:15:00||26781748||Sporting John||7107||0||1.01||24.09|
|269548||1.168759002||Ascot 15th Feb||2020-02-15 13:15:00||26781748||Sporting John||6686||0||1.01||2.89|
|269549||1.168759002||Ascot 15th Feb||2020-02-15 13:15:00||26781748||Sporting John||6627||0||1.01||2.41|
|269550||1.168759002||Ascot 15th Feb||2020-02-15 13:15:00||26781748||Sporting John||5006||0||1.01||23.50|
|269551||1.168759002||Ascot 15th Feb||2020-02-15 13:15:00||26781748||Sporting John||4765||0||1.01||3.62|
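The conversion from accumulated amounts to differences can be sketched as follows. This is an illustration under stated assumptions, not the script used for the analysis: the function name `cumulative_to_increments` and the tuple layout are hypothetical, time is assumed to be expressed as seconds before the off, and the first observed amount of each series is treated as an increment of its own.

```python
from collections import defaultdict


def cumulative_to_increments(snapshots):
    """Convert cumulative SP amounts into per-poll increments.

    `snapshots` holds (seconds_to_off, selection_id, side, price_limit,
    cumulative_amount) tuples, one per poll and ladder level.
    """
    series = defaultdict(list)
    for seconds_to_off, selection_id, side, limit, amount in snapshots:
        series[(selection_id, side, limit)].append((seconds_to_off, amount))

    increments = []
    for key, points in series.items():
        # A larger seconds_to_off is earlier, so sort descending for time order.
        points.sort(key=lambda p: p[0], reverse=True)
        previous = 0.0
        for seconds_to_off, amount in points:
            delta = round(amount - previous, 2)
            if delta != 0:  # keep only polls where the amount changed
                increments.append((*key, seconds_to_off, delta))
            previous = amount
    return increments


# Three consecutive polls of one ladder level (illustrative values).
snapshots = [
    (7107, 26781748, "back", 1.01, 24.09),
    (7047, 26781748, "back", 1.01, 24.09),  # nothing new was placed
    (6686, 26781748, "back", 1.01, 26.98),
]
increments = cumulative_to_increments(snapshots)
for row in increments:
    print(row)
```

A negative increment would indicate that money was withdrawn from that level between two polls, so such rows are kept rather than filtered out.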
Identifying Clusters with Principal Component Analysis
The assumption is that betting bots place bets at characteristic times before the off and use characteristic stakes and odds limits. The data is high-dimensional: for a range of 120 minutes before the off I have the amount placed at every price increment for both the back and the lay side. My goal is to plot the data in two dimensions as a scatter plot in which each point represents a selection (horse). Principal component analysis (PCA) is a technique for reducing high-dimensional data to a few dimensions. With the scikit-learn implementation of PCA I get the following chart:
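A projection of this kind can be sketched in a few lines. The example uses synthetic data in place of the scraped amounts, and the feature layout (one row per selection, one column per combination of time, odds limit and side) as well as the standardisation step are assumptions for illustration, not necessarily what was done for the chart.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one row per selection, one column per
# (time before the off, odds limit, side) cell holding the amount placed.
n_selections, n_features = 90, 60
features = rng.gamma(shape=1.0, scale=2.0, size=(n_selections, n_features))
# Give a third of the selections a shared betting signature so a group can form.
features[:30, 20:35] += 25.0

# Standardise columns so cells with large stakes do not dominate the projection.
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)  # shape (n_selections, 2): one point per horse

print(coords.shape)
print(pca.explained_variance_ratio_)
```

The two columns of `coords` are the x and y values of the scatter plot. Selections that attract money at the same times and limits end up close together, which is what makes clusters visible.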
The next step was to look at the clusters that formed in a bit more detail. One cluster had a positive return, so I looked through the selections in that cluster, trying to figure out what they had in common and why they showed similar behaviour in the way Betfair starting price (BSP) bets were placed. It quickly became obvious that the cluster was formed by horses making their debut. A separate backtest of such a simple strategy revealed impressive returns over the past couple of years. Please have a look at the "Back the Newcomer in Horse Racing" betting strategy for more details.
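One way to check which cluster is profitable is to compute a level-stake return per cluster. The sketch below assumes that every selection is backed with one unit at its BSP and that commission is ignored. The function name `return_by_cluster`, the cluster labels and the sample values are made up for illustration and are not results from the analysis.

```python
from collections import defaultdict


def return_by_cluster(selections):
    """Average profit per unit staked for each cluster.

    `selections` holds (cluster_label, bsp, won) tuples. Each selection is
    assumed to be backed with a 1-unit stake at its starting price.
    """
    profit = defaultdict(float)
    count = defaultdict(int)
    for label, bsp, won in selections:
        profit[label] += (bsp - 1.0) if won else -1.0
        count[label] += 1
    return {label: profit[label] / count[label] for label in profit}


# Made-up example: two clusters with two selections each.
selections = [
    ("A", 3.0, True),
    ("A", 5.0, False),
    ("B", 2.0, False),
    ("B", 4.0, False),
]
returns = return_by_cluster(selections)
print(returns)
```

A cluster with a clearly positive value over a large enough sample is the one worth inspecting by hand, as described above.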
Try to avoid deploying betting bots that use the starting price of a betting exchange in a deterministic manner. Some obfuscation, such as placing bets at random times or splitting them into random stakes, might help. Alternatively, simply drip-feed your bets into the market prior to the off and do not use starting price bets at all.
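As an illustration of the obfuscation idea, the following sketch splits a total stake into random parts and assigns each part a random placement time before the off. The function name `randomised_schedule` and its parameters are hypothetical, and the sketch only produces a schedule; it does not place any bets.

```python
import random


def randomised_schedule(total_stake, window_seconds, n_bets, rng=None):
    """Split a stake into random parts placed at random times before the off."""
    rng = rng or random.Random()
    # Random cut points on [0, 1] give n_bets random proportions that sum to 1.
    cuts = sorted(rng.random() for _ in range(n_bets - 1))
    edges = [0.0] + cuts + [1.0]
    stakes = [round(total_stake * (b - a), 2) for a, b in zip(edges, edges[1:])]
    # Put any rounding remainder into the last part so the total stays exact.
    stakes[-1] = round(total_stake - sum(stakes[:-1]), 2)
    # Seconds before the off, earliest placement first.
    times = sorted(
        (rng.uniform(0, window_seconds) for _ in range(n_bets)), reverse=True
    )
    return list(zip(times, stakes))


# Seeded here only to make the example repeatable; a live bot would not seed.
schedule = randomised_schedule(100.0, 1800, 5, random.Random(42))
for seconds_to_off, stake in schedule:
    print(f"{seconds_to_off:7.1f}s before the off: stake {stake:.2f}")
```

In practice very small parts would have to be merged or dropped to respect the exchange's minimum stake.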
Scraping data from betting exchange markets and analysing it can be a viable approach to developing a betting strategy through reverse engineering. In the example above, a strategy that was successful over the past couple of years was revealed. Simply copying the strategy would surely reduce the edge, and profits would diminish over time. However, the approach can lead to new ideas, refinements and new developments. One example of further development is training a machine learning model on the BSP data that we continue to collect, which is something for another blog post.