Published on June 10, 2020, 3:21 p.m. - Sports: Football
Machine learning and artificial intelligence are powerful tools for learning from large amounts of data and supporting better decisions. In this article I would like to train a machine learning model that is capable of predicting the outcome of football matches.
The basic idea behind machine learning is that relations in historical data can be learned by a computer program to make predictions on fresh data in the future. Nowadays machine learning is used in many applications: some widely known examples are self-driving cars, handwriting recognition and credit-card fraud detection. I would like to use these examples to quickly explain how machine learning is used and applied in a broad context.
All machine learning applications have in common that they require data as input. For self-driving cars, video streams from cameras as well as sensor data (LIDAR) are recorded during test rides. For credit-card fraud detection, data around payment transactions such as time, IP address, etc. is used. For handwriting recognition, image files of handwritten text are used. I refer to this data as input. However, the input alone is often not very helpful: you also need the desired output, or label, that you would like to predict. In the process of training a machine learning model, features are generated from the input data, and the goal is then to link them to the label or output. The following table gives an overview of the data for some example applications:
| Application | Input Data | Label / Output |
| --- | --- | --- |
| Self-driving cars | Video streams, LIDAR data | Steering wheel position, speed |
| Handwriting detection | Images with handwritten text | Actual text written on the image |
| Credit card fraud detection | Time of transaction, IP address, credit card holder details, etc. | Fraud yes / no |
A scenario where you have both input and output data is called supervised learning. There are also algorithms that work without output data or labels; this is called unsupervised learning, with clustering as an important example. In a previous blog post I used a clustering technique to identify patterns in Betfair starting price bets.
You might also have noticed that the format of the output varies and depends on the application: it could be a continuous variable in a certain range, such as steering wheel position or speed in the self-driving car example. In this case we talk about a regression problem. If the output is restricted to certain classes, such as fraud yes or no, then we talk about a classification problem.
When training a machine learning model the data is typically split into a train and test set. The training set is used to train the machine learning model. The test set is then used to evaluate the performance of the machine learning model.
There are many different types of machine learning models, some examples are artificial neural networks, decision trees and regression models.
Our task is now to apply this machine learning framework to football data and derive a betting or trading strategy from it. If the machine learning model is better at predicting the outcome of football matches than the market, then we have a competitive edge that we can exploit on a betting exchange through value betting.
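The edge mentioned here can be made concrete with a little expected-value arithmetic. The sketch below is my own illustration (the function name and numbers are hypothetical, not from the article): a back bet has positive expected value exactly when the model's probability exceeds the probability implied by the decimal odds (1 / odds).

```python
def expected_value(p_model, decimal_odds, stake=1.0):
    """Expected profit of a back bet: win stake * (odds - 1), else lose the stake."""
    return p_model * stake * (decimal_odds - 1) - (1 - p_model) * stake

# Odds of 2.50 imply a probability of 1 / 2.50 = 40%.
# If our model says 45%, the bet has positive expected value:
print(expected_value(0.45, 2.50))  # ~= 0.125 points per 1-point stake
print(expected_value(0.40, 2.50))  # ~= 0.0 -> fair odds, no edge
```

Whether this edge is realised in practice depends, of course, on how well calibrated the model's probabilities are.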
My goal is to predict the outcome of a football match, and with outcome I mean home team wins, draw or away team wins (1x2). I will treat it as a classification problem with three classes: home, draw, away. Obviously I could also pick different targets, such as over/under 2.5 goals (a classification problem) or the number of goals scored (which would be a regression problem).
The input data is historical football match data, including goals, shots on target, cards and other statistics. It is also possible to add other data sources such as video of the matches, weather data or the odds from bookmakers or betting exchange markets. The goal is to have data at hand that correlates with the outcome of the football match: the better your input data describes the outcome, the better the performance of your machine learning model.
Instead of simply adding the raw data as input for the machine learning model it might make sense to handcraft certain features, which is called feature engineering. Such features could be the expected goals, average goals in the past x matches, etc.
Now let's get into some coding. I will be using the Python programming stack: pandas to handle the tabular data and the well-known scikit-learn package to train the machine learning model. I believe scikit-learn is a great tool for getting into machine learning as it offers fantastic documentation.
First step, we download relevant data from football-data.co.uk and load it into a pandas dataframe:
```python
import pandas as pd
from datetime import datetime

df = pd.read_csv(
    "http://www.football-data.co.uk/mmz4281/1819/E0.csv",
    parse_dates=["Date"],
    date_parser=lambda x: datetime.strptime(x, "%d/%m/%Y"),
    usecols=["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "PSH", "PSD", "PSA"],
)
df["Target"] = df.apply(
    lambda x: 0 if x["FTHG"] > x["FTAG"] else 1 if x["FTHG"] == x["FTAG"] else 2,
    axis=1,
)
```
The dataframe should look like the following. The target column contains 0 for home win, 1 for draw and 2 for away win which are the classes that I would like my model to predict (framing it as classification problem):
Next, I add four columns to the table: for both the home and the away team, the average goals scored in their last five home matches and in their last five away matches. These columns are the features that I will feed into the machine learning algorithm:
```python
def get_average_goals(df, team, side, date):
    # Average goals scored by `team` in its last five matches on the given
    # side ("HomeTeam" or "AwayTeam") before `date`.
    temp_df = df[df[side] == team]
    temp_df = temp_df[temp_df["Date"] < date]
    temp_df = temp_df.sort_values("Date")
    if side == "HomeTeam":
        return temp_df.FTHG.iloc[-5:].mean()
    else:
        return temp_df.FTAG.iloc[-5:].mean()

df["Home_AvgHG"] = df.apply(lambda x: get_average_goals(df, x["HomeTeam"], "HomeTeam", x["Date"]), axis=1)
df["Home_AvgAG"] = df.apply(lambda x: get_average_goals(df, x["HomeTeam"], "AwayTeam", x["Date"]), axis=1)
df["Away_AvgHG"] = df.apply(lambda x: get_average_goals(df, x["AwayTeam"], "HomeTeam", x["Date"]), axis=1)
df["Away_AvgAG"] = df.apply(lambda x: get_average_goals(df, x["AwayTeam"], "AwayTeam", x["Date"]), axis=1)
```
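As a side note, the row-wise apply rescans the whole frame for every match, which gets slow on larger datasets. A vectorized sketch of the same-side features (my own variant, not the article's code, illustrated on a tiny toy frame with made-up scores) uses groupby with shift and rolling; shift(1) excludes the current match, mirroring the strict "before this date" filter above:

```python
import pandas as pd

# Toy stand-in for the season data (hypothetical scores).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2018-08-01", "2018-08-08", "2018-08-15"]),
    "HomeTeam": ["Arsenal", "Arsenal", "Arsenal"],
    "AwayTeam": ["Chelsea", "Chelsea", "Chelsea"],
    "FTHG": [2, 0, 4],
    "FTAG": [1, 1, 1],
})

df = df.sort_values("Date").reset_index(drop=True)

def last5_mean(goals):
    # Mean of the previous (up to) five values; shift(1) drops the current row.
    return goals.shift(1).rolling(5, min_periods=1).mean()

df["Home_AvgHG"] = df.groupby("HomeTeam")["FTHG"].transform(last5_mean)
df["Away_AvgAG"] = df.groupby("AwayTeam")["FTAG"].transform(last5_mean)
print(df["Home_AvgHG"].tolist())  # [nan, 2.0, 1.0]
```

The cross-side features (e.g. the home team's average goals in its away matches) need an extra merge of each per-team series back onto the match rows, so the row-wise version above stays easier to read, if slower.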
Next is splitting the data. The first 100 matches of the 2018/19 season are ignored and serve only as a warm-up period to calculate the teams' average goals. The first 80% of the remaining matches are used to train the machine learning model, and the last 20% are used as a test set to evaluate its performance at the end. The sklearn library has a handy function called train_test_split which splits the data for us:
```python
from sklearn.model_selection import train_test_split

df_after_warmup = df[100:]
X_train, X_test, y_train, y_test = train_test_split(
    df_after_warmup[["Home_AvgHG", "Home_AvgAG", "Away_AvgHG", "Away_AvgAG"]].values,
    df_after_warmup["Target"].values,
    test_size=0.2,
    random_state=42,
    shuffle=False,
)
```
Typically I would start off with a dummy classifier to get a baseline for the problem. The scikit-learn DummyClassifier can use different strategies; here I am using the "prior" strategy, which simply predicts the most frequent class (always a home win). The classifiers in scikit-learn share a very simple interface: fit is used to train a model and predict for inference. The predict_proba method returns the actual prior, i.e. the ratio of home wins, draws and away wins in the training set:
```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

dummy_classifier = DummyClassifier(strategy="prior", random_state=42)
dummy_classifier.fit(X_train, y_train)
proba = dummy_classifier.predict_proba([[1, 1, 1, 1]])
predictions = dummy_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
```
When looking at the performance of the model we can first inspect the confusion matrix, a simple table that compares the actual class with the predicted class. The dummy classifier predicts a home win for all matches, so all 56 matches in the test set are predicted as home wins. However, only 23 of those matches actually ended in a home win, which means the accuracy of the model is around 41% (correctly predicted samples / all samples).
| | 0 - Predicted Home | 1 - Predicted Draw | 2 - Predicted Away |
| --- | --- | --- | --- |
| 0 - Home | 23 | 0 | 0 |
| 1 - Draw | 13 | 0 | 0 |
| 2 - Away | 20 | 0 | 0 |
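The accuracy figure can be read straight off the confusion matrix: the correctly predicted samples sit on the diagonal, so accuracy is the trace divided by the total. A quick check with the numbers from the table above:

```python
import numpy as np

# Confusion matrix of the dummy classifier (rows: actual, columns: predicted).
cm = np.array([
    [23, 0, 0],
    [13, 0, 0],
    [20, 0, 0],
])
accuracy = np.trace(cm) / cm.sum()  # diagonal (correct) / all samples
print(accuracy)  # 23 / 56 ~= 0.41
```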
Next, I will use an actual machine learning model: a simple random forest classifier, which combines multiple decision trees to make a prediction. The interface is exactly the same as for the dummy classifier:
```python
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)
predictions = rf_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
```
Now the confusion matrix is a bit more interesting, as draws and away wins also occur among the predictions. The accuracy on the test set has improved to around 50%. But is this sufficient to make a profit with sports betting?
| | 0 - Predicted Home | 1 - Predicted Draw | 2 - Predicted Away |
| --- | --- | --- | --- |
| 0 - Home | 19 | 1 | 3 |
| 1 - Draw | 10 | 0 | 3 |
| 2 - Away | 11 | 0 | 9 |
To translate the accuracy of the machine learning model into actual profit I will append the predictions of the machine learning model to the dataframe with the test data:
```python
df_test = df[-len(X_test):]
predictions = pd.DataFrame(
    rf_classifier.predict_proba(df_test[["Home_AvgHG", "Home_AvgAG", "Away_AvgHG", "Away_AvgAG"]].values),
    columns=["HomeProba", "DrawProba", "AwayProba"],
)
predictions.set_index(df_test.index, inplace=True)
df_with_predictions = pd.concat([df_test, predictions], axis=1)
df_with_predictions
```
Next I will apply a simple value betting strategy: if my machine learning model estimates the probability of a home win, draw or away win to be larger than the probability implied by the odds, I place a back bet. For simplicity I chose a flat staking plan with 1 point per selection:
```python
def bet(x):
    # Back the first outcome whose model probability exceeds the probability
    # implied by the Pinnacle closing odds (1 / odds); flat 1-point stakes.
    if x["HomeProba"] > 1 / x["PSH"]:
        return x["PSH"] - 1 if x["FTHG"] > x["FTAG"] else -1
    elif x["DrawProba"] > 1 / x["PSD"]:
        return x["PSD"] - 1 if x["FTHG"] == x["FTAG"] else -1
    elif x["AwayProba"] > 1 / x["PSA"]:
        return x["PSA"] - 1 if x["FTHG"] < x["FTAG"] else -1
    else:
        return 0

df_with_predictions["profit"] = df_with_predictions.apply(bet, axis=1)
df_with_predictions["profit_cumsum"] = df_with_predictions["profit"].cumsum()
df_with_predictions
```
As a result I obtained a small loss of around 1.7 points over the 56 matches in the test set. There are multiple ways to follow up on these initial steps: one focus might be adding more data, not just a single season. More sophisticated feature engineering might also improve the results; one could, for instance, include the attacking and defending strength from the Dixon-Coles model, or even the odds from bookmakers or betting exchanges. The model itself could also be fine-tuned (there are plenty of parameters) or other models could be trialled. Still lots of room for experiments!
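For the "more data" follow-up, the football-data.co.uk URL used earlier contains a four-digit season code (1819 for 2018/19), so several seasons can be concatenated into one frame. The sketch below assumes the same URL pattern holds for earlier seasons and uses dayfirst=True instead of a fixed date_parser, since the date format is not guaranteed to be identical across seasons:

```python
import pandas as pd

def load_seasons(season_codes, division="E0"):
    """Download and concatenate several seasons from football-data.co.uk."""
    frames = [
        pd.read_csv(
            f"http://www.football-data.co.uk/mmz4281/{code}/{division}.csv",
            parse_dates=["Date"],
            dayfirst=True,  # dates are day-first; exact format may vary by season
            usecols=["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "PSH", "PSD", "PSA"],
        )
        for code in season_codes
    ]
    return pd.concat(frames, ignore_index=True).sort_values("Date")

# Usage (triggers downloads): df = load_seasons(["1617", "1718", "1819"])
```

With multiple seasons loaded, the warm-up, feature engineering and train/test split would proceed exactly as before, just on a longer history per team.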