Predicting Football Using FIFA Rankings

Published on Dec. 7, 2020, 10:53 a.m. - Sports: Football

In this article we describe how FIFA rankings can be used to predict football matches. FIFA rankings are publicly available and can be used to train a Neural Network. This type of machine learning model is used to predict odds of football matches and derive a sports betting strategy.

Recently Yarden Gur, one of our users, contacted us to let us know that he had developed a Neural Network model that uses FIFA rankings to predict football matches. He trained the model using the tensorflow library. The code and a short description can be found in his github repo.

In this article we would like to cover the idea behind the FIFA rankings but also look at the machine learning development cycle to train a Neural Network model and derive a betting strategy.

EA FIFA Rankings

When talking about FIFA rankings I am referring to the ratings that are used in the EA FIFA football simulation video game. I personally never owned the video game but remember playing (or more accurately getting beaten up) when visiting friends. In the video game it is possible to select a football team to play with. Inside the software, ratings are associated with every team. Every team has an attack, midfield, defense and overall rating in the form of an integer value, as illustrated in the following table. The higher the value, the better the team.

Team              Attack Rating   Midfield Rating   Defense Rating   Overall Rating
Chelsea           81              82                82               83
Manchester City   84              82                81               82
Arsenal           82              79                80               82

There are various websites that publish FIFA ratings. One example is fifaindex.com. This website has not only the team ratings but also details and rankings for individual football players.

You might ask yourself where the EA FIFA rankings come from. Unfortunately, I could not find much information on how the FIFA team ratings are calculated. My guess is that they are derived from historical data, as a kind of aggregation of the teams' past results. Please let me know in the comments below if you have any information on how FIFA rankings are calculated. Since there are only four rankings associated with every team (attack, midfield, defense and overall strength), I was a bit concerned that the models would not perform as well as models that use additional data points. We will find out in the following sections.

Creating a Scraper to Collect FIFA Rankings

In order to work with the FIFA rankings we need to get hold of the data in a format we can work with. Yarden has added a Python script to his github repo that uses BeautifulSoup, a Python package for scraping, to parse the fifaindex.com website and load the FIFA team rankings into a pandas dataframe.

The code is similar to the following snippet, which scrapes the rankings for the Premier League and the EFL Championship for various seasons. Later I want to backtest the model on historical data of the Premier League. I also included the EFL Championship to have data points for the teams that are promoted to the Premier League.

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_rankings(league_id: int, season: int) -> pd.DataFrame:
    """Get data frame with rankings."""
    url = f"https://www.fifaindex.com/teams/fifa{season}/?league={league_id}"
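    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(response.content, "html.parser")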
    date = soup.find("a", attrs={"class": "dropdown-toggle", "data-placement": "bottom"}).text
    datetime = date_to_datetime(date)
    teams_table = soup.find("table", attrs={"class": "table-teams"})
    row_with_team = [row for row in teams_table.find_all("tr") if row.find("a", attrs={"class": "link-team"})]
    teams = []
    for row in row_with_team:
        team = row.find("td", attrs={"data-title": "Name"}).find("a", attrs={"class": "link-team"}).text
        rankings = []
        team_rankings = ("ATT", "MID", "DEF", "OVR")
        for ranking in team_rankings:
            rankings.append(int(row.find("td", attrs={"data-title": ranking}).find("span").text))
        teams.append((league_id, season, datetime, team, *rankings))
    return pd.DataFrame(teams, columns=("league_id", "season", "datetime", "team", *team_rankings))


if __name__ == "__main__":
    leagues = (13, 14)  # Premier League and EFL Championship
    seasons = (16, 17, 18, 19, 20)
    dfs = []
    for league in leagues:
        for season in seasons:
            dfs.append(get_rankings(league, season))
    df = pd.concat(dfs, ignore_index=True)
    df.to_csv("fifa_rankings.csv", sep=";", index=False)

In the main routine I define the FIFA versions and the leagues for which I want to scrape the data. The function get_rankings uses the requests package to get the rankings from fifaindex.com, and BeautifulSoup is used to parse the table with the results. The date format used on fifaindex is a bit odd, which is why I use a function called date_to_datetime to parse the string and convert it into a Python datetime object. The FIFA version, league ID, date of ranking and the teams with their rankings are then saved to a csv file:

league_id season datetime team ATT MID DEF OVR
13 16 2016-09-22 Chelsea 81 82 82 83
13 16 2016-09-22 Manchester City 84 82 81 82
13 16 2016-09-22 Arsenal 82 79 80 82
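
The helper date_to_datetime is not shown in the scraper above. The following is only a minimal sketch of what such a helper could look like. It assumes that the date string looks like "Sept. 22, 2016" or "March 3, 2017" (an abbreviated or full English month name, followed by day and year), which may not match what fifaindex.com actually returns.

```python
from datetime import datetime

# Assumption: dates look like "Sept. 22, 2016" or "March 3, 2017".
# Only the first three letters of the month name are used for the lookup.
_MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def date_to_datetime(date: str) -> datetime:
    """Parse a string such as 'Sept. 22, 2016' into a datetime (hypothetical helper)."""
    # Dropping the comma and splitting on whitespace also removes stray blanks
    # and newlines around the scraped text.
    month_token, day_token, year_token = date.replace(",", " ").split()
    month = _MONTHS[month_token[:3].lower()]
    return datetime(int(year_token), month, int(day_token))


print(date_to_datetime("Sept. 22, 2016"))
```

If the website uses a different format, only the splitting and the month lookup would need to change.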

Training the Neural Network Model

Now that we have the FIFA team ratings in a workable format, we can move on to the next step: developing a model that predicts the outcome of football matches given the team ratings. Many different statistical or machine learning models could be used here. However, I would like to take the opportunity to look at a particular type of machine learning model called an artificial neural network. Neural networks belong to a class of algorithms that try to mimic how a human brain works. A neural network has an input and an output that are connected through a set of operations represented by artificial neurons.

Preprocessing

In order to train the model I will use the football data from football-data.co.uk. This data contains football matches for major European leagues along with the results and the odds that will be used in the backtest later. When training the model it is important to use only FIFA rankings that were available on the date of the match. Otherwise we would introduce a look-ahead bias.

The preprocessing script iterates over the football matches, which are kept in a pandas DataFrame. For every match, the rankings of both the home and the away team are written into the DataFrame. The result is then saved to another csv file that contains all the data required to train and test a neural network model.

from datetime import datetime
import pandas as pd

team_rankings = ["ATT", "MID", "DEF", "OVR"]
team_mapping = {
    "Man United": "Manchester United",
    "Man City": "Manchester City",
    "Wolves": "Wolverhampton Wanderers",
}


def get_rankings(date: datetime, team: str) -> pd.Series:
    """Return the latest rankings of a team that were published before the given date."""
    team = team_mapping.get(team, team)
    # Parse the dates so that the comparison below is between timestamps, not strings.
    rankings = pd.read_csv("fifa_rankings.csv", sep=";", parse_dates=["datetime"])
    rankings = rankings[
        (rankings["team"].str.startswith(team)) & (rankings["datetime"] < date)
    ].sort_values("datetime")
    if rankings.empty:
        raise RuntimeError(f"No rankings found for {team} before {date}")
    # Rows are sorted in ascending order, so the last row is the most recent ranking.
    return rankings.iloc[-1][team_rankings]


if __name__ == "__main__":
    football_data = pd.read_csv("football-data.csv", sep=";")
    # Assumption: dates are day-first as in the raw football-data.co.uk files.
    football_data["Date"] = pd.to_datetime(football_data["Date"], dayfirst=True)
    # Encode the result: 0 = home win, 1 = draw, 2 = away win.
    football_data["Target"] = football_data.apply(
        lambda x: 0 if x["FTHG"] > x["FTAG"] else 1 if x["FTHG"] == x["FTAG"] else 2, axis=1)
    football_data[[f"Home_{ranking}" for ranking in team_rankings]] = 0
    football_data[[f"Away_{ranking}" for ranking in team_rankings]] = 0
    for index, row in football_data.iterrows():
        home_rankings = get_rankings(row["Date"], row["HomeTeam"])
        away_rankings = get_rankings(row["Date"], row["AwayTeam"])
        football_data.loc[index, [f"Home_{ranking}" for ranking in team_rankings]] = home_rankings.values
        football_data.loc[index, [f"Away_{ranking}" for ranking in team_rankings]] = away_rankings.values
    football_data.to_csv("prepared-data.csv", sep=";", index=False)
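
The lookup of the latest ranking before a match can also be written as an "as-of" join. The following sketch uses pandas merge_asof on made-up toy data to illustrate how look-ahead bias is avoided. The numbers and the second Chelsea ranking are invented for illustration, and this is an alternative to the script above rather than the approach used for the results in this article.

```python
import pandas as pd

# Toy data; the values are made up purely for illustration.
rankings = pd.DataFrame({
    "team": ["Chelsea", "Chelsea", "Arsenal"],
    "datetime": pd.to_datetime(["2016-09-22", "2017-09-18", "2016-09-22"]),
    "OVR": [83, 84, 82],
})
matches = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-04", "2017-10-01"]),
    "HomeTeam": ["Chelsea", "Chelsea"],
})

# merge_asof requires both frames to be sorted by their time keys.
merged = pd.merge_asof(
    matches.sort_values("Date"),
    rankings.sort_values("datetime"),
    left_on="Date",
    right_on="datetime",
    left_by="HomeTeam",
    right_by="team",
    direction="backward",       # only consider rankings from the past
    allow_exact_matches=False,  # strictly earlier than the match date
)
print(merged[["Date", "HomeTeam", "OVR"]])
```

The match in January 2017 picks up the ranking from September 2016, whereas the match in October 2017 picks up the newer ranking from September 2017. A ranking published after a match is never used.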

Model Training

In our case the input consists of the FIFA team ratings of the home and away team. Each team has 4 ratings, so we have 8 inputs in total. The outcome of a football match is a home win, a draw or an away win, so we have 3 output nodes. The question is how to connect the input to the output, that is, how to find a suitable neural network architecture. Connecting the input directly to the output without any additional layers would be too simple: the model would have few parameters to learn and would likely under-fit and predict poorly. A very deep neural network, on the other hand, would have many parameters and therefore carry the risk of over-fitting. For a start I will use 2 dense (fully connected) layers with a dropout operation in between, which is the same model that Yarden used. Whilst Yarden used tensorflow to train the model, I will use pytorch. Both are great Python packages for training neural network models.
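
To get a feeling for the size of such a model, the number of trainable parameters can be counted by hand. The small helper below is only an illustration (the function name is made up). It assumes a hidden layer with 6 units, as in the training script in this article, and uses the fact that a dropout operation has no parameters of its own.

```python
def count_dense_parameters(layer_sizes):
    """Count the weights and biases in a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


direct = count_dense_parameters([8, 3])          # input connected directly to output
two_layers = count_dense_parameters([8, 6, 3])   # hidden layer with 6 units (assumed)
print(direct, two_layers)
```

With 8 inputs and 3 outputs the direct connection has 8 * 3 + 3 = 27 parameters, and the version with a hidden layer of 6 units has (8 * 6 + 6) + (6 * 3 + 3) = 75 parameters.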

The data set is split into a training set and a test (or validation) set. The training set is used to train the neural network model and derive its parameters. The test set is then used to evaluate the performance of the model. The following pytorch script trains a very simple neural network model, on a GPU if one is available:

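import pandas as pd
import torch
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

team_rankings = ["ATT", "MID", "DEF", "OVR"]


if __name__ == "__main__":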
    df = pd.read_csv("prepared-data.csv", sep=";")
    X_train, X_test, y_train, y_test = train_test_split(
        df[[f"{side}_{ranking}" for ranking in team_rankings for side in ("Home", "Away")]].values / 100.,
        df["Target"].values,
        test_size=0.2,
        random_state=42,
        shuffle=True,
    )

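    model = nn.Sequential(
        nn.Linear(8, 6),
        nn.Dropout(0.5),
        nn.Linear(6, 3),
        # No Softmax layer: nn.CrossEntropyLoss expects raw logits and applies
        # log-softmax internally. Apply torch.softmax to the outputs for probabilities.
    )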
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.0001)

    trainloader = DataLoader(
        TensorDataset(
            torch.Tensor(X_train),
            torch.Tensor(y_train).type(torch.LongTensor),
        ),
        batch_size=100,
    )

    for e in range(1000):  # loop over the dataset multiple times
        train_epoch_loss = 0
        total = 0
        correct = 0
        for X_train_batch, y_train_batch in trainloader:
            X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
            optimizer.zero_grad()  # reset gradients, otherwise they accumulate across batches
            y_train_pred = model(X_train_batch)
            train_loss = criterion(y_train_pred, y_train_batch)
            total += y_train_batch.size(0)
            correct += (torch.max(y_train_pred.data, 1)[1] == y_train_batch).sum().item()
            train_loss.backward()
            optimizer.step()
            train_epoch_loss += train_loss.item()
        print(f"Epoch {e}, Loss: {train_epoch_loss/len(trainloader):.3f}, Accuracy: {correct/total:.3f}")

    PATH = './fifa-rankings.pth'
    torch.save(model.state_dict(), PATH)

    testloader = DataLoader(
        TensorDataset(
            torch.Tensor(X_test),
            torch.Tensor(y_test).type(torch.LongTensor),
        )
    )

    correct = 0
    total = 0
    predictions = []

    model.eval()  # switch off dropout for evaluation
    with torch.no_grad():
        for data in testloader:
            inputs, labels = data[0].to(device), data[1].to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            predictions.append(predicted.cpu())
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print(f"Accuracy on test set: {100 * correct / total:.3f}")
    # The test loader is not shuffled, so the predictions line up with y_test.
    print(confusion_matrix(y_test, torch.cat(predictions).numpy()))

The training process takes around 30s on my GPU. If you don't have a GPU with CUDA support you can use your CPU instead, which will probably take a bit longer. On the test set I got an accuracy of 52.2%, which is below the value that Yarden reported (60% for La Liga). The confusion matrix also showed that the model did not predict a single draw (draws are the class with the fewest examples in the test set). There are various approaches to tackling imbalanced data sets that might help to improve accuracy. Changing the neural network architecture could help as well. Of course one could also use a completely different machine learning model, such as a random forest or an SVM. In another strategy we covered how to use machine learning models to predict football matches and showed in more detail how to train a random forest classifier.
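
One common way to deal with an under-represented class such as draws is to weight the classes in the loss function. The sketch below computes inverse-frequency weights on a made-up list of labels. The helper name is hypothetical, and I have not tested whether this actually improves the results reported above.

```python
from collections import Counter


def inverse_frequency_weights(labels, n_classes=3):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumes that every class occurs at least once in labels.
    """
    counts = Counter(labels)
    total = len(labels)
    return [total / (n_classes * counts[c]) for c in range(n_classes)]


# Made-up labels: 0 = home win, 1 = draw, 2 = away win.
labels = [0, 0, 0, 0, 1, 2, 2, 2]
weights = inverse_frequency_weights(labels)
print(weights)
```

The rarest class receives the largest weight. In pytorch such weights could be passed to the loss via nn.CrossEntropyLoss(weight=torch.tensor(weights)), so that a misclassified draw is penalised more than a misclassified home win.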

Evaluating the Model Performance on Historical Data

Now that we have a neural network model that can predict the match outcome of a football game we want to know whether the model is good enough to make some money with sports betting. The data from football-data.co.uk contains odds for major bookmakers. These odds will be used to test the performance of the model on historical data. This will then give us some idea of how profitable this model would have been in the past.
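
To make the idea of such a backtest more concrete, here is a minimal sketch with made-up numbers. It assumes decimal odds and a flat stake on the outcome predicted by the model. The function names are hypothetical, and the sketch is not the backtest used for this strategy.

```python
def implied_probabilities(odds):
    """Turn decimal odds into probabilities that sum to 1 (bookmaker margin removed)."""
    inverse = [1.0 / o for o in odds]
    total = sum(inverse)
    return [p / total for p in inverse]


def flat_stake_profit(bets, stake=1.0):
    """Sum the profit of staking a fixed amount on every predicted outcome.

    Each bet is a tuple (predicted, actual, odds), where odds holds the decimal
    odds for home win, draw and away win in that order.
    """
    profit = 0.0
    for predicted, actual, odds in bets:
        if predicted == actual:
            profit += stake * (odds[predicted] - 1.0)
        else:
            profit -= stake
    return profit


# Made-up matches: (model prediction, actual result, decimal odds).
bets = [
    (0, 0, (2.0, 3.5, 4.0)),  # predicted home win, correct
    (2, 1, (2.5, 3.2, 3.0)),  # predicted away win, match was a draw
    (0, 0, (1.5, 4.0, 7.0)),  # predicted home win, correct
]
print(implied_probabilities((2.0, 3.5, 4.0)))
print(flat_stake_profit(bets))
```

Comparing the model's probabilities with the implied probabilities of the bookmaker gives a first indication of whether the model sees value that the odds do not reflect.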

We are currently performing the backtest on the test set mentioned above and will publish the results shortly. We will also check whether we can boost model performance. Please bookmark this page and check back later.
