Using Machine Learning to Predict Football Matches

Published on June 10, 2020, 3:21 p.m. - 9 Comments - Sports: Football

Machine Learning and Artificial Intelligence are powerful tools to learn from large amounts of data and help to make better decisions. In this article I would like to train a machine learning model that is capable of predicting the outcome of football matches.

How does Machine Learning work?

The basic idea behind machine learning is that a computer program learns relations in historical data and uses them to make predictions on fresh data. Machine learning is now used in many applications; widely known examples are self-driving cars, handwriting recognition and credit-card fraud detection. I will use these examples to briefly explain how machine learning is applied in a broad context.

Data for Machine Learning

All machine learning applications have in common that they require data as input. For self-driving cars, video streams from cameras as well as sensor data (LIDAR) are recorded during test rides. For credit-card fraud detection, data around payment transactions such as time, IP address, etc. are used. For handwriting recognition, image files of handwritten text are used. I refer to this data as input. However, the input alone is often not very helpful. You also need the desired output, or label, that you would like to predict. When training a machine learning model, features are generated from the input data and the goal is then to link them to the label or output. The following table gives an overview of the data for some example applications:

Application | Input Data | Label / Output
Self-driving cars | Video streams, LIDAR data | Steering wheel position, speed
Handwriting recognition | Images with handwritten text | Actual text written on the image
Credit-card fraud detection | Time of transaction, IP address, credit card holder details, etc. | Fraud yes / no

A scenario where you have both input and output data is called supervised learning. There are also algorithms which work without output data or labels; this is called unsupervised learning, with clustering as an important example. In a previous blog post I used a clustering technique to identify patterns in Betfair starting price bets.

You might also have noticed that the format of the output depends on the application. It could be a continuous variable in a certain range, such as steering wheel position or speed in the self-driving car example; in this case we talk about regression. If the output is restricted to certain classes, such as fraud yes or no, we talk about a classification problem.

When training a machine learning model the data is typically split into a train and test set. The training set is used to train the machine learning model. The test set is then used to evaluate the performance of the machine learning model.

Machine Learning Models

There are many different types of machine learning models. Some examples are artificial neural networks, decision trees and regression models.

How can machine learning be applied to football matches?

Our task is now to apply the machine learning framework to football data in order to derive a betting or trading strategy. If the machine learning model is better than the market at predicting the outcome of football matches, then we should have a competitive edge that we can exploit on a betting exchange market through value betting.
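To make "value" concrete: decimal odds o imply a probability of 1/o, and a 1-point back bet has a positive expected profit only if the true win probability is higher than that. The following is a minimal sketch; the helper names, the odds of 2.50 and the model probability of 45% are illustrative assumptions and not part of the original code:

```python
def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds, ignoring the bookmaker margin."""
    return 1.0 / decimal_odds


def expected_profit(model_probability: float, decimal_odds: float) -> float:
    """Expected profit of a 1-point back bet if model_probability were correct."""
    # Win (odds - 1) with probability p, lose the 1-point stake otherwise.
    return model_probability * (decimal_odds - 1.0) - (1.0 - model_probability)


odds = 2.50               # hypothetical market odds
model_probability = 0.45  # hypothetical model estimate

print(implied_probability(odds))                 # 0.4
print(expected_profit(model_probability, odds))  # about 0.125 points per bet
```

If the model probability equals the implied probability, the expected profit is zero, which is why the model has to beat the market rather than merely be accurate.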

What is the output / target?

My goal is to predict the outcome of a football match, where outcome means home team wins, draw or away team wins (1x2). I will treat it as a classification problem with three classes: home, draw, away. I could also pick different targets, such as over or under 2.5 goals (a classification problem) or the number of goals (a regression problem).
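The three kinds of target mentioned above can all be derived from the final score. A small sketch, assuming a hypothetical helper match_targets; the 0/1/2 encoding follows the Target column used later in this article:

```python
def match_targets(home_goals: int, away_goals: int) -> dict:
    """Derive three possible prediction targets from a final score."""
    if home_goals > away_goals:
        result = 0  # home win
    elif home_goals == away_goals:
        result = 1  # draw
    else:
        result = 2  # away win
    total_goals = home_goals + away_goals
    return {
        "result_1x2": result,           # three-class classification target
        "over_2_5": total_goals > 2.5,  # binary classification target
        "total_goals": total_goals,     # regression target
    }


print(match_targets(2, 1))
```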

What is my input data?

The input data is historical data of football matches, including goals, shots on target, cards and other statistics. It is also possible to add other data sources such as video of football matches, weather data or the odds from bookmaker or betting exchange markets. The goal is to have data at hand which correlates with the outcome of the football match. The better your input data describes the outcome, the better the performance of your machine learning model.

Instead of simply adding the raw data as input for the machine learning model it might make sense to handcraft certain features, which is called feature engineering. Such features could be the expected goals, average goals in the past x matches, etc.
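As an illustration of such a handcrafted feature, the sketch below computes the average goals a team scored in its previous two home matches. The toy data, the window of 2 and the column name HomeAvgLast2 are assumptions made for this example only. The shift(1) matters: it makes sure a match never sees its own result.

```python
import pandas as pd

# Toy data: four home matches of a single hypothetical team "A".
matches = pd.DataFrame({
    "Date": pd.to_datetime(["2018-08-11", "2018-08-18", "2018-08-25", "2018-09-01"]),
    "HomeTeam": ["A", "A", "A", "A"],
    "FTHG": [2, 0, 1, 3],
})
matches = matches.sort_values("Date")

# Per team: drop the current match (shift), then average the previous 2 matches.
matches["HomeAvgLast2"] = matches.groupby("HomeTeam")["FTHG"].transform(
    lambda goals: goals.shift(1).rolling(2, min_periods=1).mean()
)
print(matches)
```

The first match has no history, so its feature value is missing (NaN); this is the reason a warm-up period is needed before training.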

Training a Random Forest Classifier to Predict Football Matches

Now let's get into some coding. I will be using the Python programming stack: pandas to handle tabular data and the well-known scikit-learn package to train a machine learning model. I believe that scikit-learn is a great tool for getting into machine learning as it offers fantastic documentation.

First, we download the relevant data and load it into a pandas dataframe:

import pandas as pd

# Placeholder: path or URL of the CSV file with the 2018/19 match results.
DATA_SOURCE = "matches.csv"

df = pd.read_csv(
    DATA_SOURCE,
    usecols=["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "PSH", "PSD", "PSA"],
)
# Dates are stored as day/month/year strings.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
# Encode the result: 0 = home win, 1 = draw, 2 = away win.
df["Target"] = df.apply(lambda x: 0 if x["FTHG"] > x["FTAG"] else 1 if x["FTHG"] == x["FTAG"] else 2, axis=1)

The dataframe should look like the following. The target column contains 0 for home win, 1 for draw and 2 for away win which are the classes that I would like my model to predict (framing it as classification problem):

         Date      HomeTeam        AwayTeam  FTHG  FTAG   PSH   PSD   PSA  Target
0  2018-08-10    Man United       Leicester     2     1  1.58  3.93  7.50       0
1  2018-08-11   Bournemouth         Cardiff     2     0  1.89  3.63  4.58       0
2  2018-08-11        Fulham  Crystal Palace     0     2  2.50  3.46  3.00       2
3  2018-08-11  Huddersfield         Chelsea     0     3  6.41  4.02  1.62       2
4  2018-08-11     Newcastle       Tottenham     1     2  3.83  3.57  2.08       2

Next, I add four columns to the table which contain the average goals each team scored in its last 5 home matches and in its last 5 away matches. These columns are the features that I will put into the machine learning algorithm:

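def get_average_goals(df, team, side, date):
    # Matches the team played on the given side ("HomeTeam" or "AwayTeam") before `date`.
    temp_df = df[df[side] == team]
    temp_df = temp_df[temp_df["Date"] < date]
    temp_df = temp_df.sort_values("Date")
    # Average goals the team scored in its last 5 of those matches.
    if side == "HomeTeam":
        return temp_df.FTHG[-5:].mean()
    else:
        return temp_df.FTAG[-5:].mean()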

df["Home_AvgHG"] = df.apply(lambda x: get_average_goals(df, x["HomeTeam"], "HomeTeam", x["Date"]), axis=1)
df["Home_AvgAG"] = df.apply(lambda x: get_average_goals(df, x["HomeTeam"], "AwayTeam", x["Date"]), axis=1)
df["Away_AvgHG"] = df.apply(lambda x: get_average_goals(df, x["AwayTeam"], "HomeTeam", x["Date"]), axis=1)
df["Away_AvgAG"] = df.apply(lambda x: get_average_goals(df, x["AwayTeam"], "AwayTeam", x["Date"]), axis=1)

Next is splitting the data. The first 100 matches of the 2018/19 season are ignored and only serve as a warm-up period to calculate the average goals of the teams. Of the remaining matches, the first 80% are used to train the machine learning model and the last 20% are held out as a test set to evaluate its performance at the end. The sklearn library has a handy function called train_test_split which splits the data for us:

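from sklearn.model_selection import train_test_split

df_after_warmup = df[100:]

# shuffle=False keeps the chronological order: first 80% train, last 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    df_after_warmup[["Home_AvgHG", "Home_AvgAG", "Away_AvgHG", "Away_AvgAG"]].values,
    df_after_warmup["Target"].values,
    test_size=0.2,
    shuffle=False,
)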

Typically I would start off with a dummy classifier to have a baseline for the problem. The scikit-learn dummy classifier can use different strategies; here I am using the "prior" strategy, which simply predicts the most frequent class (always a home win). The classifiers in scikit-learn share a very simple interface: fit is used to train a model and predict is used for inference. The predict_proba method returns the actual prior, i.e. the ratio of home wins, draws and away wins in the train set:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

dummy_classifier = DummyClassifier(strategy="prior", random_state=42).fit(X_train, y_train)
proba = dummy_classifier.predict_proba([[1, 1, 1, 1]])
predictions = dummy_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

When looking at the performance of the model we can first look at the confusion matrix. The confusion matrix is a simple table that shows the actual class versus the predicted class. The dummy classifier predicts a home win for all matches, which means all 56 matches in the test set are predicted as a home win. However, only 23 of those matches actually ended in a home win, which means that the accuracy of the model is around 41% (correctly predicted samples / all samples).

Actual | 0 - Predicted Home | 1 - Predicted Draw | 2 - Predicted Away
0 - Home | 23 | 0 | 0
1 - Draw | 13 | 0 | 0
2 - Away | 20 | 0 | 0
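The accuracy can be read directly off the confusion matrix: the correct predictions sit on the diagonal, and the accuracy is their sum divided by the sum of all cells. A short sketch with the numbers from the table above, using plain Python lists instead of the scikit-learn helper:

```python
# Rows are actual classes, columns are predicted classes (home, draw, away).
confusion = [
    [23, 0, 0],
    [13, 0, 0],
    [20, 0, 0],
]

correct = sum(confusion[i][i] for i in range(len(confusion)))  # diagonal
total = sum(sum(row) for row in confusion)                     # all samples
accuracy = correct / total

print(f"{correct} / {total} = {accuracy:.3f}")  # 23 / 56 = 0.411
```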

Next, I will use an actual machine learning model, a simple random forest classifier which uses multiple decision trees to make a prediction. The interface is exactly the same as for the dummy classifier:

from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(random_state=42).fit(X_train, y_train)
predictions = rf_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

Now the confusion matrix is a bit more interesting, as draws and away wins also occur in the predictions. The accuracy on the test set has improved to around 50%. But is this sufficient to make a profit with sports betting?

Actual | 0 - Predicted Home | 1 - Predicted Draw | 2 - Predicted Away
0 - Home | 19 | 1 | 3
1 - Draw | 10 | 0 | 3
2 - Away | 11 | 0 | 9
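Accuracy alone hides how the errors are distributed across the classes. As a rough sketch, per-class recall (share of actual outcomes that were predicted) and precision (share of predictions that were right) can be computed from the matrix above. The variable names are chosen for this example only:

```python
labels = ["home", "draw", "away"]
# Rows are actual classes, columns are predicted classes.
confusion = [
    [19, 1, 3],
    [10, 0, 3],
    [11, 0, 9],
]

recall = {}
precision = {}
for i, label in enumerate(labels):
    actual_count = sum(confusion[i])                     # row sum
    predicted_count = sum(row[i] for row in confusion)   # column sum
    hits = confusion[i][i]
    recall[label] = hits / actual_count if actual_count else float("nan")
    precision[label] = hits / predicted_count if predicted_count else float("nan")
    print(f"{label}: recall={recall[label]:.2f} precision={precision[label]:.2f}")
```

None of the 13 actual draws is predicted as a draw, so the gain in accuracy over the baseline comes entirely from home and away wins.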

To translate the accuracy of the machine learning model into actual profit I will append the predictions of the machine learning model to the dataframe with the test data:

# The test set consists of the last len(X_test) matches of the season.
df_test = df[-len(X_test):]
# predict_proba returns one column per class in the order 0, 1, 2 (home, draw, away).
predictions = pd.DataFrame(
    rf_classifier.predict_proba(df_test[["Home_AvgHG", "Home_AvgAG", "Away_AvgHG", "Away_AvgAG"]].values),
    columns=["HomeProba", "DrawProba", "AwayProba"],
)
predictions.set_index(df_test.index, inplace=True)
df_with_predictions = pd.concat([df_test, predictions], axis=1)

Next I will apply a simple value betting strategy: If my machine learning model assumes that the probability of a home win, draw or away win is larger than the probability implied by the odds, I will place a back bet. For simplicity I chose a flat staking plan with 1 point per selection:

def bet(x):
    # Back the first selection whose model probability exceeds the implied probability (1 / odds).
    if x["HomeProba"] > 1/x["PSH"]:
        if x["FTHG"] > x["FTAG"]:
            return x["PSH"] - 1
        else:
            return -1
    elif x["DrawProba"] > 1/x["PSD"]:
        if x["FTHG"] == x["FTAG"]:
            return x["PSD"] - 1
        else:
            return -1
    elif x["AwayProba"] > 1/x["PSA"]:
        if x["FTHG"] < x["FTAG"]:
            return x["PSA"] - 1
        else:
            return -1
    else:
        # No value found: no bet is placed.
        return 0

df_with_predictions["profit"] = df_with_predictions.apply(bet, axis=1)
df_with_predictions["profit_cumsum"] = df_with_predictions["profit"].cumsum()

As a result I obtained a small loss of around 1.7 points for the 56 matches in the test set. There are several ways to follow up on these initial steps. One could add more data rather than use a single season. More sophisticated feature engineering might also improve the results: one could, for instance, include the attacking and defending strength from the Dixon-Coles model or even the odds from bookmakers or betting exchanges. The model itself could also be fine-tuned (there are plenty of parameters), or other models could be trialed. Still lots of room for experiments!
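If odds are used as features, one possible preprocessing step is to convert them into probabilities that sum to 1, because the raw inverse odds also contain the bookmaker margin. The sketch below uses the PSH, PSD and PSA odds of the first match in the table above. Proportional normalisation is only one simple way to remove the margin and is an assumption of this example, not something done in the backtest:

```python
def normalised_probabilities(home_odds, draw_odds, away_odds):
    """Return margin-free probabilities and the bookmaker margin."""
    inverse = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    booksum = sum(inverse)  # exceeds 1 by the bookmaker margin
    probabilities = [p / booksum for p in inverse]
    return probabilities, booksum - 1.0


# Odds of Man United vs Leicester (home, draw, away).
probabilities, margin = normalised_probabilities(1.58, 3.93, 7.50)
print([round(p, 3) for p in probabilities], round(margin, 3))
```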



Aug. 10, 2020, 6:15 p.m.

Cool article! I'm trying this out, but there are a few things I don't understand from the above example:
Why is the goal average computed for each team? It is mentioned later that the first 100 samples are used for calculating the goal average, but I can't see that from the code samples?
Why is a goal average computed at all? Shouldn't you let the ML get all the data and then you'll supply just the team names to get a prediction of who wins?

Aug. 11, 2020, 7:03 p.m.

Thanks a lot for your comment.

For the training of the random forest classifier a matrix with samples (matches) x features is required. The goal average is just an example of a feature that is used to train the ML model. Ideally some time is spent on feature engineering to figure out which features work best and have more predictive power.

The "get_average_goals" function calculates the avg goals over the past 5 matches. Slicing happens later with "df_after_warmup = df[100:]". Hence df_after_warmup -which is used for training- contains all the data except for the first 100 matches. Will try to add line numbers to the code snippets to be able to reference or maybe move some of the code to github.

Re your last question - what do you mean with "all the data"? At the moment I am not aware of such ML algorithms that only take a list of matches as input and still yield good predictions without any manual feature engineering. Certainly an interesting topic - please let me know in case you have more details on such algorithms.

Aug. 12, 2020, 5:55 p.m.

Oh I have learned so much since my comment, haha! I'm trying to implement this myself, using a bunch of this code as inspiration.

My comment about "all the data" doesn't make sense, I didn't understand the features.

I tried experimenting with the mean goals, and also making sure to only remove the absolutely necessary warmup data.

It's awesome to think about more complex features, you could, as you say, use the betting numbers. But if you want to do this for real then you'd have to use the betting numbers from the site you will bet on?

I also tried adding more means, you could have an array of means for the past 3, 5 AND 10 games, but the warmup would be greater.

Aug. 13, 2020, 5:26 p.m.

Happy to hear about your progress.
Re odds as feature: Maybe it is worth checking the correlation between odds and the actual outcome of the match for different betting sites and then using the one with the highest correlation (the site with the most accurate odds). Odds between sites should be almost perfectly positively correlated anyway, so using odds of multiple sites as input probably won't boost model performance much. This is just my intuition - I might look into it and maybe publish an article in the blog if I find some time.
You can always train multiple ML models with different features on a training set and then check the performance of the models on a test set to compare.

Oct. 6, 2020, 9:35 p.m.

This is quite interesting. How do I apply this model to live fixtures?

Also, you mentioned using a regression model to predict goals. Would you mind sharing it, please?

Oct. 9, 2020, 3:43 p.m.

Thanks for your comment, wemustwin.

I am not entirely sure what you mean by live fixtures. In order to use the random forest model to predict upcoming matches, you would simply need to calculate the features the model was trained with and then call the predict or predict_proba method. If you would like to develop a machine learning model for inplay betting, then inplay data is required to train the model on.

We are planning to publish additional Machine-Learning based strategies in the future and also want to cover regression models but some more time is required to publish.

Oct. 12, 2020, 3:30 p.m.

Interesting. I also thought of using random forest to predict matches. Glad I found someone who has done it already :)

Anyway I think it is good to test your model accuracy against "picking the favourite" accuracy. I think last time I checked for EPL/LALIGA (not sure which), by just picking the favourite team (according to bookies odds) you'd have a chance of somewhere around 50% , can't recall the exact number.

Nov. 6, 2020, 11:56 p.m.

I guess your code in the end was written wrong:

def bet(x):
if x["HomeProba"] > 1/x["PSA"]:

Instead of PSA I would expect PSH, right?

Nov. 7, 2020, 6:43 a.m.

Yes, you are right. Thanks for catching that! I just corrected the code and updated the backtest result.
