- Data Science PyData
- Peadar Coyle: Probabilistic Programming in Sports Analytics
I am going to talk about rugby analytics. The agenda is not rugby the sport but probabilistic programming and how you can use this to build a predictive model for an interesting problem.
Who am I?
I am a data analytics professional based in Luxembourg.
Contents:
- Probabilistic programming applied to rugby
- Standard xkcd cartoon
- Sports commentary
How can statistics help with sports?
- Fundamentally, rugby is a simulatable event.
- How do we generate a model to predict the outcome of a tournament?
- How do we quantify the uncertainty in our model?
What influenced me on this?
A Quantopian talk.
What's wrong with statistics?
Models should not be built for mathematical convenience, but to accurately model the data.
What is Bayesian statistics?
It implies that we have a prior belief about the world. Bayesian statistics gives us a formula (Bayes' rule) to update our beliefs after having observed data.
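The update step can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the two hypotheses ("strong" and "weak") and the win probabilities are made-up assumptions.

```python
def bayes_update(prior, likelihood):
    """Apply Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalised.values())
    return {h: p / evidence for h, p in unnormalised.items()}

# Hypothetical prior belief: we have no idea whether the team is strong or weak.
prior = {"strong": 0.5, "weak": 0.5}
# Assumed probability of winning one match under each hypothesis.
p_win = {"strong": 0.8, "weak": 0.4}

# Observe three wins in a row, updating the belief after each match.
posterior = prior
for _ in range(3):
    posterior = bayes_update(posterior, p_win)

print(posterior)
```

After each observed win the belief shifts towards "strong"; yesterday's posterior becomes today's prior.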
Bayesian rugby
Based on an original paper by Baio and Blangiardo.
What Zalando did
They used it for automatic weight estimation of items.
So why Bayesian?
Probabilistic programming is a new paradigm. I will be comparing it with blackbox machine learning using scikit-learn.
Blackbox machine learning
Predictions based on a blackbox.
Limitations of machine learning
The models being a blackbox is in itself a limitation: they are hard to explain to customers.
Probabilistic programming
Open-box models. Blackbox inference engine.
Probabilistic programming - what's the big deal?
We are able to use data and our prior beliefs to generate a model. Generating a model is extremely powerful, and it lets us tell a story.
Six Nations rugby
Motivation
Your estimate of the strength of a team depends on your estimates of the other teams' strengths.
Results from previous years
Preparing the model for PyMC
What do we want to infer?
We want to infer the team strengths, which are latent parameters. Probabilistic programming allows us to get at these latent parameters.
MCMC samples
What do we want?
We want to quantify the uncertainty, to use this to generate a model, and we want answers as distributions and not point estimates.
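To make "distributions, not point estimates" concrete, here is a minimal sketch of summarising posterior draws. The draws are simulated with `random.gauss` as a stand-in for real MCMC output, and the `summarise` helper is hypothetical.

```python
import random
import statistics

random.seed(0)
# Stand-in for MCMC draws of one team's attack strength (made-up numbers).
samples = [random.gauss(0.3, 0.1) for _ in range(5000)]

def summarise(draws, mass=0.95):
    """Return the mean and a central credible interval from posterior draws."""
    ordered = sorted(draws)
    n = len(ordered)
    lower = ordered[int((1 - mass) / 2 * n)]
    upper = ordered[int((1 + mass) / 2 * n) - 1]
    return statistics.mean(ordered), (lower, upper)

mean, (lo, hi) = summarise(samples)
print(f"mean={mean:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
```

The interval, rather than the mean alone, is what tells us how sure the model is about a team's strength.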
What assumptions do we know?
There is a finite number of teams. We have data from last year, and sports scoring is modelled as a Poisson distribution.
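As a quick illustration of the Poisson assumption, the sketch below computes the probability of each score for one team in one match. The expected score of 20 points is a made-up value, not an estimate from the talk.

```python
import math

def poisson_pmf(k, rate):
    """Probability of scoring exactly k points when the expected score is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

rate = 20.0  # hypothetical expected points for one team in one match
probs = [poisson_pmf(k, rate) for k in range(100)]

# The probabilities sum to (almost exactly) one and the mean equals the rate.
total = sum(probs)
mean_score = sum(k * p for k, p in enumerate(probs))
print(total, mean_score)
```

A single rate parameter per team and match is what the model has to learn; everything else about the score distribution follows from it.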
The model
Home advantage is taken into account.
Key assumption: home effect is an advantage in sports. Bayesian models allow you to incorporate these beliefs into your model.
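One way the home effect can enter the model is through a log-linear scoring rate, as in the Baio and Blangiardo paper: the home team's rate gets an extra additive term on the log scale. The sketch below assumes that form; the `intercept` argument and all numbers are illustrative assumptions, not values from the talk.

```python
import math

def scoring_rates(home_adv, att_home, def_home, att_away, def_away, intercept=0.0):
    """Return (home_rate, away_rate), the Poisson rates for one match.

    On the log scale, a team's rate is its own attack strength plus the
    opponent's defence effect; only the home team gets the home advantage.
    """
    home_rate = math.exp(intercept + home_adv + att_home + def_away)
    away_rate = math.exp(intercept + att_away + def_home)
    return home_rate, away_rate

# Two equally strong teams: the only difference is who plays at home.
home_rate, away_rate = scoring_rates(
    home_adv=0.2, att_home=0.1, def_home=-0.1,
    att_away=0.1, def_away=-0.1, intercept=3.0,
)
print(home_rate, away_rate)
```

Because the effect is additive on the log scale, it acts as a multiplier on the expected score: here the home side is expected to score `exp(0.2)` times as many points as the visitors.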
Digression: why the flat priors were picked
It made no statistical difference.
A prior distribution is non-informative if the prior is flat relative to the likelihood function.
Often in Bayesian modelling it doesn't matter much what your priors are. Even bad guesses will give enough information to find an interesting answer.
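A small conjugate example shows why the prior often matters little once there is enough data. It uses a Beta prior on a win probability, which is a simpler setting than the rugby model; the priors and the win/loss counts are made up for illustration.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a Beta(prior_a, prior_b) prior after binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

# Two very different priors: flat Beta(1, 1) and a pessimistic Beta(2, 8).
# With only 10 observations the priors still disagree noticeably...
small_flat = posterior_mean(1, 1, 7, 3)
small_skewed = posterior_mean(2, 8, 7, 3)

# ...but with 200 observations in the same proportion they nearly coincide.
large_flat = posterior_mean(1, 1, 140, 60)
large_skewed = posterior_mean(2, 8, 140, 60)

print(abs(small_flat - small_skewed), abs(large_flat - large_skewed))
```

The caveat is visible in the first pair of numbers: with little data, the choice of prior can still move the answer.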
Let us run the model
Diagnostics
The plot indicates that the model converges.
The home advantage is estimated at about 0.55 points.
Simulating a season
We are going to simulate 1000 seasons. The model predicted that Ireland would most often win 4 games. We can also see how many points they score.
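The simulation idea can be sketched with the standard library alone. This is not the talk's PyMC code. The team strengths, intercept and home effect below are made-up values. Each pair of teams meets once, with the first-listed team at home. The "champion" is simply the team with the most wins. A fuller version would draw the parameters from the posterior samples for every simulated season instead of fixing them.

```python
import math
import random
from collections import Counter
from itertools import combinations

def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

# name: (attack, defence) on the log scale; hypothetical values.
TEAMS = {
    "England": (0.15, -0.10), "France": (0.00, 0.00),
    "Ireland": (0.20, -0.20), "Italy": (-0.30, 0.30),
    "Scotland": (-0.10, 0.05), "Wales": (0.10, -0.05),
}
INTERCEPT, HOME = 3.0, 0.1  # assumed log-scale baseline and home effect

def simulate_season(rng):
    """Play every fixture once and return the number of wins per team."""
    wins = Counter({team: 0 for team in TEAMS})
    for home, away in combinations(TEAMS, 2):
        att_h, def_h = TEAMS[home]
        att_a, def_a = TEAMS[away]
        home_pts = sample_poisson(math.exp(INTERCEPT + HOME + att_h + def_a), rng)
        away_pts = sample_poisson(math.exp(INTERCEPT + att_a + def_h), rng)
        if home_pts > away_pts:
            wins[home] += 1
        elif away_pts > home_pts:
            wins[away] += 1
    return wins

rng = random.Random(42)
champions = Counter()
for _ in range(1000):
    wins = simulate_season(rng)
    # Ties on wins are broken by dictionary order, a simplification.
    champions[max(wins, key=wins.get)] += 1

print(champions.most_common())
```

Counting outcomes over many simulated seasons turns the fitted parameters into statements such as "team X wins the tournament in N out of 1000 seasons".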
What happened in reality?
Shrinkage: Fundamentally all models are wrong, but some are useful.
What are the predictions of the model?
Let us look at the winning team on average.
What actually happened
We need to investigate like scientists.
Ireland won the Six Nations
Conclusion
Learn more
- Probabilistic programming for hackers
- Doing Bayesian data analysis
Questions
Can we sell this tool as a compliance tool for FIFA?
Never going to happen. Maybe.
Why don't you use nested sampling instead of MCMC?
The PyMC3 project has a few more samplers. You can submit a pull request to add it yourself.
PyMC2 vs PyMC3: can you tell us why you chose PyMC2?
At the time I did it, the documentation for PyMC3 was not that good. I will probably port it to PyMC3 at some point.