Bayesian Statistics

Documentation reimplemented from this series of posts, with one major difference: the original posts use the dplyr package for data manipulation, whereas this one uses data.table. Code accompanying the documentation can be found here.

Which of these two proportions is higher: 4 out of 10, or 300 out of 1000? This sounds like a silly question. Obviously \(4/10=.4\), which is greater than \(300/1000=.3\)!

But suppose you were a baseball recruiter, trying to decide which of two potential players is a better batter based on how many hits they get. One has achieved 4 hits in 10 chances, the other 300 hits in 1000 chances. While the first player has a higher proportion of hits, it’s not a lot of evidence: a typical player tends to achieve a hit around 27% of the time, and this player’s \(4/10\) could be due to luck. The second player, on the other hand, has a lot of evidence that he’s an above-average batter.

A lot of data takes the form of these success/total counts, where you want to estimate a “proportion of success” for each instance. When you work with pairs of successes and totals like this, you tend to get tripped up by the uncertainty in low counts: \(1/2\) does not mean the same thing as \(50/100\), nor does \(0/1\) mean the same thing as \(0/1000\). One approach is to filter out all cases that don’t meet some minimum, but this isn’t always an option, and it throws away useful information. Another approach, which helps make your estimates more accurate and practical, is to use the beta distribution to represent your prior expectations, and update them based on new evidence.

Beta Distribution

The beta distribution can be understood as representing a probability distribution of probabilities. That is, it represents all the possible values of a probability when we don’t know what that probability is. Let’s consider the following example:

Prior & Posterior

Anyone who follows baseball is familiar with batting averages: simply the number of times a player gets a base hit divided by the number of times he goes up at bat (so it’s just a number between 0 and 1). .266 is generally considered an average batting average, while .300 is considered an excellent one.

Imagine we have a baseball player, and we want to predict what his season-long batting average will be. You might say we can just use his batting average so far, but this will be a very poor measure at the start of a season! If a player goes up to bat once and gets a single, his batting average is briefly 1.000, while if he strikes out, his batting average is 0.000. It doesn’t get much better if you go up to bat another five or six times: you could get a lucky streak and end up with an average of 1.000, or an unlucky streak and end up with an average of 0, neither of which is a remotely good predictor of how you will bat that season.

Why is your batting average over the first few at-bats not a good predictor of your eventual batting average? When a player’s first at-bat is a strikeout, why don’t we predict that he’ll never get a hit all season? Because we’re going in with prior expectations. We know that throughout history, most season batting averages have hovered between something like .215 and .360, with some extremely rare exceptions on either side. We know that if a player gets a few strikeouts in a row at the start, that might indicate he’ll end up a bit worse than average, but we also know he probably won’t end up far outside that range.

Given that our batting average problem can be represented with a binomial distribution (a series of successes and failures), the best way to represent this prior expectation is with the beta distribution. It’s saying, before we’ve seen the player take his first swing, what we roughly expect his batting average to be. The domain of the beta distribution is \((0, 1)\), just like a probability, so we already know we’re on the right track, but the appropriateness of the beta for this task goes far beyond that.

We expect that the player’s season-long batting average will be most likely around .27, but that it could reasonably range from .21 to .35. This can be represented with a beta distribution with parameters \(\alpha=81\) and \(\beta=219\). The two parameters were chosen based upon the fact that:

  • The mean of the beta distribution is \(\frac{\alpha}{\alpha+\beta}=\frac{81}{81+219}=.270\)
  • As you can see in the plot below, this distribution lies almost entirely within \((.2, .35)\), the reasonable range for a batting average (a quick numeric check follows).
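As a quick numeric check of that second point (a small sketch, not part of the original code), the central 99% interval of a Beta(81, 219) prior indeed falls roughly between .21 and .34:

# 0.5% and 99.5% quantiles of the Beta(81, 219) prior
qbeta( c( 0.005, 0.995 ), 81, 219 )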
library(ggplot2)
library(data.table)

# set float numbers to print only three digits after the decimal point
options( digits = 3 )


# simulated data,
# generate a sequence of numbers for each combination of a and b
# to plot the probability density function.
# "\u03B1" unicode for the greek letter alpha
sim <- data.table( a = c( 81, 82, 81 + 100 ),
                   b = c( 219, 219, 219 + 200 ) )
sim <- sim[ , .( x = seq( 0, 1, by = 0.002 ) ), by = .( a, b ) ]

sim[ , `:=`( y = dbeta( x, a, b ),
             parameters = paste0( "\u03B1 = ", a, ", \u03B2 = ", b ) ) ]
sim[ , parameters := factor( parameters, levels = unique(parameters) ) ]
sim
##         a   b     x         y       parameters
##    1:  81 219 0.000  0.00e+00  α = 81, β = 219
##    2:  81 219 0.002 2.35e-140  α = 81, β = 219
##    3:  81 219 0.004 1.83e-116  α = 81, β = 219
##    4:  81 219 0.006 1.45e-102  α = 81, β = 219
##    5:  81 219 0.008  9.22e-93  α = 81, β = 219
##   ---                                         
## 1499: 181 419 0.992  0.00e+00 α = 181, β = 419
## 1500: 181 419 0.994  0.00e+00 α = 181, β = 419
## 1501: 181 419 0.996  0.00e+00 α = 181, β = 419
## 1502: 181 419 0.998  0.00e+00 α = 181, β = 419
## 1503: 181 419 1.000  0.00e+00 α = 181, β = 419
# plot of the distribution
PlotBeta <- function(sim)
{
    ggplot( sim, aes( x, y, color = parameters ) ) + geom_line() +
    xlim( 0, .5 ) + ylab("Density of beta") + theme_bw()
}
PlotBeta( sim = sim[ a == 81, ] )

In the preceding plot, the x-axis represents a player’s batting average. Notice that in this case, not only is the y-axis a probability (or more precisely a probability density), the x-axis is a probability as well (a batting average is just the probability of a hit, after all)! The beta distribution is representing a probability distribution of probabilities.

But here’s why the beta distribution is so appropriate. Imagine the player gets a single hit. His record for the season is now “1 hit; 1 at bat.” We then have to update our probabilities: we want to shift this entire curve over just a bit to reflect our new information. The new beta distribution will be:

\[\mbox{Beta}(\alpha_0 + \mbox{hits}, \beta_0 + \mbox{misses})\]

where \(\alpha_0\) and \(\beta_0\) are the parameters we started with, that is, 81 and 219. Thus, in this case, \(\alpha\) has increased by 1 (his one hit), while \(\beta\) has not increased at all (no misses yet). That means our new distribution is \(\mbox{Beta}(81+1, 219)\). Let’s compare that to the original:

# update 1 hit of 1 bat
PlotBeta( sim = sim[ a %in% c( 81, 82 ), ] )

Notice that it has barely changed at all; that’s because one extra hit doesn’t really mean anything yet. However, the more the player hits over the course of the season, the more the curve will shift to accommodate the new evidence, and the more it will narrow, reflecting the fact that we have more evidence. Let’s say halfway through the season he has been up to bat 300 times, hitting 100 out of those times. The new distribution would be \(\mbox{Beta}(81 + 100, 219 + 200)\):

# update 100 hit of 300 bat
PlotBeta( sim = sim )

Notice the curve is now both thinner and shifted to the right (higher batting average) than it used to be, since we now have a better sense of what the player’s batting average is most likely to be.

One of the most interesting outputs of this formula is the expected value of the resulting beta distribution, which is basically your new estimate. Recall that the expected value of the beta distribution is \(\frac{\alpha}{\alpha+\beta}\). Thus, after 100 hits of 300 real at-bats, the expected value of the new beta distribution is \(\frac{82+100}{82+100+219+200}=.303\). Notice that it is lower than the naive estimate of \(\frac{100}{100+200}=.333\), but higher than the estimate you started the season with (\(\frac{81}{81+219}=.270\)). Thus you can think of this formula as equivalent to adding a “head start” to the number of hits and non-hits of a player: you’re saying “start him off in the season with 81 hits and 219 non-hits on his record”.

Section Takeaways

  • The beta distribution is defined only on the interval \([0, 1]\), which makes it well suited for representing a probability distribution of probabilities: cases where we don’t know what a probability is in advance, but have some reasonable guesses.
  • A quick side note on the parameters of the beta distribution: as \(\alpha\) gets bigger, the bulk of the distribution moves rightward over higher values of x, while as \(\beta\) gets bigger, the bulk moves leftward over lower values of x. And as \(\alpha\) and \(\beta\) get bigger together, the beta distribution gets narrower (see the short sketch after this list).
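A quick illustrative sketch (not part of the original code; the parameter values are chosen arbitrarily): Beta(2, 8) and Beta(8, 2) show the bulk of the mass sitting to the left and to the right, while Beta(20, 80) and Beta(80, 20) have the same means but are much narrower.

# compare four beta densities to see how alpha and beta shape the distribution
demo <- data.table( a = c( 2, 8, 20, 80 ),
                    b = c( 8, 2, 80, 20 ) )
demo <- demo[ , .( x = seq( 0, 1, by = 0.002 ) ), by = .( a, b ) ]
demo[ , `:=`( y = dbeta( x, a, b ),
              parameters = paste0( "\u03B1 = ", a, ", \u03B2 = ", b ) ) ]
ggplot( demo, aes( x, y, color = parameters ) ) + geom_line() +
ylab("Density of beta") + theme_bw()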

Empirical Bayes Estimation

After introducing the beta distribution, we’ll next see how it can be combined with a very useful statistical method for estimating a large number of proportions: empirical Bayes estimation.

In the last section, we made some vague guesses about the distribution of batting averages across history. Here we’ll apply empirical Bayes estimation to a real baseball dataset from the Lahman package, with the goal of improving our estimate of each player’s batting average.

Working with Batting Averages

In the dataset, we first filtered out pitchers (generally the weakest batters, who should be analyzed separately) using anti joins. We then summarized each player across multiple years to get their career Hits (H) and At Bats (AB), and batting average (H/AB). Finally, we added the player’s first and last names to the dataset, so we could work with them rather than an identifier.

# load the Batting and Pitching data from the Lahman package
# the Master is used to get further details e.g. corresponding 
# player name for the player id column in the Batting and Pitching data
library(Lahman)
data(Master)
data(Batting)
data(Pitching)

master   <- data.table( Master  , key = "playerID" )
pitching <- data.table( Pitching, key = "playerID" )
batting  <- data.table( Batting , key = "playerID" )

# ! stands for not join,
# return all rows from x, where there're no matching values in y
career <- batting[ AB > 0, ][!pitching]
career <- career[ , .( H = sum(H), AB = sum(AB) ), by = playerID ]
career[ , average := H / AB ]

# map the player name to player id
master <- master[ , .( playerID, nameFirst, nameLast ) ]
career <- master[ , name := paste( nameFirst, nameLast ) ][career]
career[ , `:=`( nameFirst = NULL, nameLast = NULL ) ]
career
##        playerID           name    H    AB average
##    1: aaronha01     Hank Aaron 3771 12364  0.3050
##    2: aaronto01   Tommie Aaron  216   944  0.2288
##    3:  abadan01      Andy Abad    2    21  0.0952
##    4: abadijo01    John Abadie   11    49  0.2245
##    5: abbated01 Ed Abbaticchio  772  3044  0.2536
##   ---                                            
## 9338: zuninmi01    Mike Zunino  124   611  0.2029
## 9339: zupcibo01     Bob Zupcic  199   795  0.2503
## 9340:  zupofr01     Frank Zupo    3    18  0.1667
## 9341: zuvelpa01   Paul Zuvella  109   491  0.2220
## 9342: zwilldu01 Dutch Zwilling  364  1280  0.2844

Let’s list out the best batters in history. Well, here are the ones with the highest batting average:

career[ head( order(-average) ), ]
##     playerID             name H AB average
## 1: banisje01    Jeff Banister 1  1       1
## 2:  bassdo01         Doc Bass 1  1       1
## 3: birasst01      Steve Biras 2  2       1
## 4: burnscb01      C. B. Burns 1  1       1
## 5: gallaja01 Jackie Gallagher 1  1       1
## 6: gleasro01      Roy Gleason 1  1       1

Err, that’s not really what we were looking for. These aren’t the best batters, they’re just the batters who went up once or twice and got lucky. How about the worst batters?

career[ head( order(average) ), ]
##     playerID              name H AB average
## 1: abercda01 Frank Abercrombie 0  4       0
## 2: adamsla01        Lane Adams 0  3       0
## 3: allenho01      Horace Allen 0  7       0
## 4: allenpe01        Pete Allen 0  4       0
## 5: alstowa01     Walter Alston 0  1       0
## 6: altheaa01     Aaron Altherr 0  5       0

Also not what we were looking for. That “average” is a really crummy estimate: these players have batted fewer than 10 times each. Let’s make a better one.

Estimate a Prior

Let’s look at the distribution of batting averages across players. For the sake of estimating the prior distribution, we’ll filter out all players with fewer than 500 at-bats, since we’ll get a better estimate from the less noisy cases.

career_filtered <- career[ AB > 500, ]
ggplot( career_filtered, aes(average) ) +
geom_histogram( binwidth = .005 ) 

The first step of empirical Bayes estimation is to estimate a beta prior using this data. Estimating priors from the data you’re currently analyzing is not the typical Bayesian approach; usually you decide on your priors ahead of time. There’s a lot of debate and discussion about when and where it’s appropriate to use empirical Bayesian methods, but it mostly comes down to how many observations we have: if we have a lot, we can get a good estimate that doesn’t depend much on any one individual. Empirical Bayes is an approximation to more exact Bayesian methods, and with the amount of data we have, it’s a very good approximation.

So we know we want to fit the following model:

\[X \sim \mbox{Beta}(\alpha_0,\beta_0)\]

We just need to pick \(\alpha_0\) and \(\beta_0\), the so-called “hyper-parameters” of our model. There are many methods in R for fitting a probability distribution to data; you don’t even have to use maximum likelihood, since matching the observed mean and variance of the distribution (the method of moments) also works. Here we’ll use the fitdistr function from MASS.

# fit a beta distribution; you can specify any reasonable
# starting parameters for shape1 and shape2 (alpha and beta),
# and any warning produced here can be safely ignored
m <- MASS::fitdistr( career_filtered$average, densfun = dbeta,
                     start = list( shape1 = 70, shape2 = 200 ) )

alpha0 <- m$estimate[1]
beta0  <- m$estimate[2]

This comes up with \(\alpha_0 = 79.173\) and \(\beta_0 = 226.422\). How well does this fit the data?

ggplot(career_filtered) +
geom_histogram( aes( x = average, y = ..density.. ), binwidth = .005 ) +
xlab("Batting average") + 
stat_function( fun = function(x) dbeta( x, alpha0, beta0 ), 
               color = "red", size = 1 )

Not bad!
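As a side note on the earlier remark that maximum likelihood isn’t strictly required: here’s a hedged method-of-moments sketch (mom_alpha and mom_beta are hypothetical names, not part of the original code) that estimates the same hyper-parameters from the sample mean and variance. It should land close to the fitdistr fit above.

# method of moments for the beta distribution:
# alpha = mu * ( mu * (1 - mu) / v - 1 ), beta = (1 - mu) * ( same factor )
mu <- mean(career_filtered$average)
v  <- var(career_filtered$average)

common    <- mu * ( 1 - mu ) / v - 1
mom_alpha <- mu * common
mom_beta  <- ( 1 - mu ) * common
c( mom_alpha, mom_beta )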

Using the Prior

Now when we look at any individual to estimate their batting average, we’ll start with our overall prior, and update based on the individual evidence. This is as simple as adding \(\alpha_0\) to the number of hits, and \(\alpha_0 + \beta_0\) to the total number of at-bats.

For example, consider our hypothetical batter from the introduction that went up 1000 times, and got 300 hits. We would estimate his batting average as:

\[\frac{\alpha_0 + 300}{\alpha_0 + \beta_0 + 1000} = \frac{79.2 + 300}{79.2 + 226.4 + 1000} = 0.29\]

How about the batter who went up only 10 times, and got 4 hits. We would estimate his batting average as:

\[\frac{\alpha_0 + 4}{\alpha_0 + \beta_0 + 10} = \frac{79.2 + 4}{79.2 + 226.4 + 10} = 0.264\]

Thus, even though \(\frac{4}{10} > \frac{300}{1000}\), we would guess that the \(\frac{300}{1000}\) batter is better than the \(\frac{4}{10}\) batter!
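The two worked examples above can be reproduced with a couple of lines of R (a small sketch; EbEstimate is a hypothetical helper, not part of the original code):

# empirical Bayes estimate: shrink a raw H / AB record toward the prior
EbEstimate <- function( h, ab ) ( alpha0 + h ) / ( alpha0 + beta0 + ab )

EbEstimate( h = 300, ab = 1000 )  # roughly .290
EbEstimate( h = 4,   ab = 10 )    # roughly .264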

Performing this calculation for all the batters is simple enough:

career[ , estimate := ( alpha0 + H ) / ( alpha0 + beta0 + AB ) ]

Now we can ask: who are the best and worst batters by this improved estimate?

career[ head( order(-estimate) ), ]
##     playerID                 name    H   AB average estimate
## 1: hornsro01       Rogers Hornsby 2930 8173   0.358    0.355
## 2: jacksjo01 Shoeless Joe Jackson 1772 4981   0.356    0.350
## 3: delahed01         Ed Delahanty 2596 7505   0.346    0.343
## 4: hamilbi01       Billy Hamilton 2158 6268   0.344    0.340
## 5: heilmha01       Harry Heilmann 2660 7787   0.342    0.338
## 6: keelewi01        Willie Keeler 2932 8591   0.341    0.338
career[ head( order(estimate) ), ]
##     playerID            name   H   AB average estimate
## 1: bergebi01     Bill Bergen 516 3028   0.170    0.179
## 2: oylerra01       Ray Oyler 221 1265   0.175    0.191
## 3: vukovjo01   John Vukovich  90  559   0.161    0.196
## 4: humphjo01  John Humphries  52  364   0.143    0.196
## 5: bakerge01    George Baker  74  474   0.156    0.196
## 6: eastehe01 Henry Easterday 203 1129   0.180    0.197

Notice that in each of these cases, empirical Bayes didn’t simply pick the players who had 1 or 2 at-bats. It found players who batted well, or poorly, across a long career. What a load off our minds: we can start using these empirical Bayes estimates in downstream analyses and algorithms, and not worry that we’re accidentally letting cases like 0/1 or 1/1 ruin everything.

Let’s see how empirical Bayes estimates changed all of the batting average estimates overall:

ggplot( career, aes( average, estimate, color = AB ) ) + geom_point() + 
geom_hline( yintercept = alpha0 / ( alpha0 + beta0 ), color = "red", lty = 2 ) + 
labs( x = "Batting average", y = "Empirical Bayes batting average" ) +
geom_abline( intercept = 0, slope = 1, color = "red" ) + 
scale_color_gradient( trans = "log", breaks = 10^(1:4) )

The horizontal dashed red line marks \(y = \frac{\alpha_0}{\alpha_0 + \beta_0} = 0.259\) - that’s what we would guess someone’s batting average was if we had no evidence at all. Notice that points above that line tend to move down towards it, while points below it move up.

The diagonal red line marks \(x=y\). Points that lie close to it are the ones that didn’t get shrunk at all. Notice that they’re the ones with the highest number of at-bats (the brightest blue): they have enough evidence that we’re willing to believe the naive batting average estimate.

This is why this process is sometimes called shrinkage: we’ve moved all our estimates towards the average. How much it moves these estimates depends on how much evidence we have: if we have very little evidence (4 hits out of 10) we move it a lot, if we have a lot of evidence (300 hits out of 1000) we move it only a little.
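A short sketch (not part of the original code) makes the amount of shrinkage concrete: the 4/10 record is pulled a long way toward the prior mean of roughly .259, while the 300/1000 record barely moves.

shrink <- data.table( H = c( 4, 300 ), AB = c( 10, 1000 ) )
shrink[ , `:=`( raw = H / AB,
                eb  = ( alpha0 + H ) / ( alpha0 + beta0 + AB ) ) ]
# "moved" is how far each raw average gets pulled toward the prior mean
shrink[ , moved := raw - eb ]
shrink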

Section Takeaways

Recall that there were two major steps in empirical Bayes estimation:

  1. Estimate the overall distribution of your data.
  2. Use that distribution as your prior for updating each average.

Step 1 can be done once, “offline”: analyze all your data and come up with some estimates of your overall distribution. Step 2 is done for each new observation you’re considering; you could even apply this to estimate the success of a post or an ad.

And because we’re using the beta and the binomial, consider how easy that second step is. All we did was add one number to the successes, and add another number to the total. You can build that into your production system with a single line of code that takes nanoseconds to run.

# We hired a Data Scientist to analyze our Big Data
# and all we got was this lousy line of code.

float estimate = (successes + 79.2) / (total + 305.6);

Understanding Credible Intervals

Recall that in the last section, we walked through the method of empirical Bayes estimation, a way to calculate useful proportions out of many pairs of success/total counts (e.g. 0/1, 3/10, 235/1000). We used the example of estimating baseball batting averages based on \(x\) hits in \(n\) opportunities. If we run into a player with 0 hits in 2 chances, or 1 hit in 1 chance, we know we can’t trust it, and this method uses information from the overall distribution to improve our guess.

Empirical Bayes estimation gives us a single value for each player that can reliably be used as an estimate, but sometimes you want to know more than just our “best guess”; you also want to know how much uncertainty is present in our point estimate. We would normally use a binomial proportion confidence interval (like a margin of error in a political poll; for more on the calculation, refer to here and here), but this does not bring in information from our whole dataset. For example, the confidence interval for someone who hits 1 time out of 3 chances is:

binom.test( x = 1, n = 3 )$conf.int
## [1] 0.0084 0.9057
## attr(,"conf.level")
## [1] 0.95

We can indeed be quite confident that that interval contains the true batting average… but from our knowledge of batting averages, we could have drawn a much tighter interval than that! There’s no way that the player’s real batting average is .1 or .9: it probably lies in the .2-.3 region where most other players’ averages do.

career
##        playerID           name    H    AB average estimate
##    1: aaronha01     Hank Aaron 3771 12364  0.3050    0.304
##    2: aaronto01   Tommie Aaron  216   944  0.2288    0.236
##    3:  abadan01      Andy Abad    2    21  0.0952    0.249
##    4: abadijo01    John Abadie   11    49  0.2245    0.254
##    5: abbated01 Ed Abbaticchio  772  3044  0.2536    0.254
##   ---                                                     
## 9338: zuninmi01    Mike Zunino  124   611  0.2029    0.222
## 9339: zupcibo01     Bob Zupcic  199   795  0.2503    0.253
## 9340:  zupofr01     Frank Zupo    3    18  0.1667    0.254
## 9341: zuvelpa01   Paul Zuvella  109   491  0.2220    0.236
## 9342: zwilldu01 Dutch Zwilling  364  1280  0.2844    0.279

The end result of the last section was a point estimate for each proportion (the estimate column in the table above); these new values tend to be pushed towards the overall mean (hence the name “shrinkage”).

This shrunken value is generally more useful than the raw estimate (the average column): we can use it to sort our data, or feed it into another analysis, without worrying too much about the noise introduced by low counts. But there’s still uncertainty in the empirical Bayes estimate, and the uncertainty is very different for different players. We may want an interval of possible batting averages: one that will be wide for players we know very little about, and narrow for players with more information. Luckily, the Bayesian approach has a method to handle this.

Posterior distribution

Consider that what we’re really doing with empirical Bayes estimation is computing two new values, \(\alpha_1\) and \(\beta_1\), for each player. These are the posterior shape parameters for each distribution, after the prior (estimated on the whole dataset) has been updated.

# alpha0 and beta0 is calculated in the last section
career[ , `:=`( alpha1 = alpha0 + H, beta1 = beta0 + AB - H ) ]
career
##        playerID           name    H    AB average estimate alpha1 beta1
##    1: aaronha01     Hank Aaron 3771 12364  0.3050    0.304 3850.2  8819
##    2: aaronto01   Tommie Aaron  216   944  0.2288    0.236  295.2   954
##    3:  abadan01      Andy Abad    2    21  0.0952    0.249   81.2   245
##    4: abadijo01    John Abadie   11    49  0.2245    0.254   90.2   264
##    5: abbated01 Ed Abbaticchio  772  3044  0.2536    0.254  851.2  2498
##   ---                                                                  
## 9338: zuninmi01    Mike Zunino  124   611  0.2029    0.222  203.2   713
## 9339: zupcibo01     Bob Zupcic  199   795  0.2503    0.253  278.2   822
## 9340:  zupofr01     Frank Zupo    3    18  0.1667    0.254   82.2   241
## 9341: zuvelpa01   Paul Zuvella  109   491  0.2220    0.236  188.2   608
## 9342: zwilldu01 Dutch Zwilling  364  1280  0.2844    0.279  443.2  1142

Using this information, we can visualize this posterior distribution for each player. Here, we’ll just pick a few players from the 1998 Yankee lineup to illustrate the idea.

# filter sample
yankee_1998 <- c( "brosisc01", "jeterde01", "knoblch01", "martiti02", 
                  "posadjo01", "strawda01", "willibe02" )
career_yankee_1998 <- career[ playerID %in% yankee_1998, ]

# create the x axis for the beta distribution's probability density function
expand <- career_yankee_1998[ , .( x = seq( .18, .33, .0005 ) ), by = playerID ]
yankee_beta <- career_yankee_1998[expand]
yankee_beta[ , density := dbeta( x, alpha1, beta1 ) ]

# visualize posterior beta
ggplot( yankee_beta, aes( x, density, color = name ) ) + geom_line() +
stat_function( fun = function(x) dbeta( x, alpha0, beta0 ),
               lty = 2, color = "black" )

The prior is shown as a dashed curve. Each of the other curves is the posterior probability distribution of what that player’s batting average could be, after updating based on his performance. That’s what we’re really estimating with this method: those posterior beta distributions.

Credible Intervals

These density curves are nice, but they can be hard to interpret visually, especially as the number of players increases, and they can’t easily be summarized in a table or in text. We’d instead prefer to create a credible interval, which says that some percentage (e.g. 95%) of the posterior distribution lies within a particular region. Here’s Derek Jeter’s credible interval:

# visualize the credible interval for one player
jeter <- yankee_beta[ name == "Derek Jeter", ]

# calculate the cumulative probability and
# extract the ones that lies between the 95 percent credible interval
p <- 0.95
ci_low  <- ( 1 - p ) / 2
ci_high <- p + ci_low

jeter_pred <- jeter[ , cumulative := pbeta( x, alpha1, beta1 ) 
                  ][ cumulative > ci_low & cumulative < ci_high ]

# obtain the x coordinate of the 95 percent credible interval's 
# endpoint to visualize the error bar
jeter_low  <- qbeta( ci_low , jeter$alpha1[1], jeter$beta1[1] )
jeter_high <- qbeta( ci_high, jeter$alpha1[1], jeter$beta1[1] )

# credible interval plot
ggplot( jeter, aes( x, density ) ) + geom_line( color = "blue" ) +
xlim( .18, .34 ) + 
geom_ribbon( data = jeter_pred, aes( ymin = 0, ymax = density ),
             alpha = .25, fill = "red" ) +
stat_function( fun = function(x) dbeta( x, alpha0, beta0 ),
               lty = 2, color = "black" ) +
geom_errorbarh( aes( xmin = jeter_low, xmax = jeter_high, y = 0 ), 
                height = 3.5, color = "red" )

We can compute the edges of the interval quite easily for the rest of the players.

# all the player's credible interval
career_yankee_1998[ , `:=`( low  = qbeta( ci_low , alpha1, beta1 ),
                            high = qbeta( ci_high, alpha1, beta1 ) ) ]

col_names <- colnames(career_yankee_1998)
show <- col_names[ !col_names %in% c( "alpha1", "beta1", "estimate" ) ]

# use kable from knitr for a more nicely formatted printed table
knitr::kable( career_yankee_1998[ , show, with = FALSE ] )
|playerID  |name              |    H|    AB| average|   low|  high|
|:---------|:-----------------|----:|-----:|-------:|-----:|-----:|
|brosisc01 |Scott Brosius     | 1001|  3889|   0.257| 0.244| 0.271|
|jeterde01 |Derek Jeter       | 3465| 11195|   0.310| 0.300| 0.317|
|knoblch01 |Chuck Knoblauch   | 1839|  6366|   0.289| 0.277| 0.298|
|martiti02 |Tino Martinez     | 1925|  7111|   0.271| 0.260| 0.280|
|posadjo01 |Jorge Posada      | 1664|  6092|   0.273| 0.262| 0.283|
|strawda01 |Darryl Strawberry | 1401|  5418|   0.259| 0.247| 0.270|
|willibe02 |Bernie Williams   | 2336|  7869|   0.297| 0.286| 0.305|

And we can also view the intervals in a plot like this:

ggplot( career_yankee_1998, aes( average, reorder( name, average ) ) ) +
geom_point() +
geom_errorbarh( aes( xmin = low, xmax = high ), height = 0.8 ) +
geom_vline( xintercept = alpha0 / (alpha0 + beta0 ), color = "red", lty = 2 ) +
xlab("Estimated batting average (95% interval)") +
ylab("Player")

The vertical dashed red line is \(\frac{\alpha_0}{\alpha_0 + \beta_0}\): the mean batting average across history (based on our beta fit). The earlier plot showing each posterior beta distribution communicated more information, but this is far more readable (depending on your purpose).

Credible Intervals versus Confidence Intervals

Let’s start with Wikipedia’s definition of credible intervals. Credible intervals are analogous to confidence intervals in frequentist statistics, although they differ on a philosophical basis: Bayesian intervals treat their bounds as fixed and the estimated parameter as a random variable, whereas frequentist confidence intervals treat their bounds as random variables and the parameter as a fixed value. For a quick review of confidence intervals, refer to the beginning of this Stitchfix blog post.

But there’s also a very practical difference, in that credible intervals take prior information into account. If I take 20 random players and construct both confidence intervals (specifically a binomial proportion confidence interval) and posterior credible intervals for each, it could look something like this:

library(broom) # for tidying binom.test's output
library(ggthemes)

career[ , `:=`( low  = qbeta( ci_low , alpha1, beta1 ),
                high = qbeta( ci_high, alpha1, beta1 ) ) ]

# draw random 20 players
set.seed(2015)
some <- career[ sample.int( nrow(career), 20 ), ]
some[ , name := paste0( name, " (", H, "/", AB, ")" ) ]

# credible interval
bayesian <- some[ , .( name, AB, estimate, low, high ) ]
bayesian[ , method := "Credible" ]

# confidence interval
frequentist <- some[ , broom::tidy( binom.test( H, AB ) ), 
                       by = .( playerID, name, AB ) ]
frequentist <- frequentist[ , .( name, AB, estimate, 
                                 low = conf.low, high = conf.high ) ]
frequentist[ , method := "Confidence"]

combined <- rbind( bayesian, frequentist )
ggplot( combined, aes( estimate, reorder( name, -AB ), color = method ) ) +
geom_point() +
geom_errorbarh( aes( xmin = low, xmax = high ) ) +
geom_vline( xintercept = alpha0 / (alpha0 + beta0), color = "red", lty = 2 ) +
labs( x = "Estimated batting average", y = "Player", color = "" ) +
scale_color_tableau()

These are sorted in order of how many times a player went up to bat (thus, how much information we have about them). Notice that once there’s enough information, the credible intervals and confidence intervals are nearly identical. But for the 0/3 and 0/6 cases, the credible interval is much narrower. This is because the empirical Bayes estimate brings in our knowledge from the full data, just as it did for the point estimate.
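As a concrete check on that 0-out-of-3 case (a small sketch using the generic record rather than a specific player), compare the classical interval with the credible interval built from the empirical Bayes prior:

# classical 95% confidence interval for 0 hits in 3 at-bats: very wide
binom.test( x = 0, n = 3 )$conf.int

# 95% credible interval after updating the Beta(alpha0, beta0) prior
# with the same 0 / 3 record: much narrower, centered near the prior mean
qbeta( c( ci_low, ci_high ), alpha0 + 0, beta0 + 3 - 0 )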

Bayesian False Discovery Rates

We now have a credible interval for each player, including the lower and upper bounds for their batting average. But sometimes, rather than estimating a value or an interval, we’re looking to answer a yes or no question about each hypothesis, and thus classify them into two groups. For example, suppose we were constructing a Hall of Fame, where we wanted to include all players that have a batting average (chance of getting a hit) greater than .300 (note that this is just for illustration purposes; in the real world there are many other, better metrics to judge a player by). We want to include as many players as we can, but we need to be sure that each belongs.

The problem of hypothesis testing appears whenever we’re trying to identify candidates for future study. In this section we’ll use the posterior distributions we’ve created to apply a Bayesian approach to a method usually associated with frequentist statistics, namely false discovery rate control.

Posterior Error Probabilities

Consider the legendary player Hank Aaron.

( hank_aaron <- career[ name == "Hank Aaron", ] )
##     playerID       name    H    AB average estimate alpha1 beta1   low
## 1: aaronha01 Hank Aaron 3771 12364   0.305    0.304   3850  8819 0.296
##     high
## 1: 0.312

We can see from the table that his career batting average is 0.3050. But if we used that raw number to decide whether he should be admitted to our >.300 Hall of Fame, we would be ignoring the uncertainty in it: what we really want to base our decision on is his “true probability” of hitting.

On the other hand, when Aaron’s batting average is shrunken by the empirical Bayes estimate, we get an estimate of 0.3039. We thus suspect that his true probability of hitting is higher than .300, but we’re not necessarily certain because of the credible interval. Let’s take a look at his posterior beta distribution:

# hall of fame threshold
threshold <- .3

# posterior beta
# merge the original data.table with the generated sequence ones,
hank_aaron <- hank_aaron[ hank_aaron[ , .( x = seq( .27, .33, .0004 ) ), by = playerID ] ]
hank_aaron[ , density := dbeta( x, alpha1, beta1 ) ]

ggplot( hank_aaron, aes( x, density ) ) + geom_line() +
geom_ribbon( aes( ymin = 0, ymax = density * ( x < threshold ) ),
                  alpha = .1, fill = "red" ) +
geom_vline( color = "red", lty = 2, xintercept = threshold )

We can see that there is a nonzero probability (shaded area) that his true probability of hitting is less than .3. We can calculate this with the cumulative distribution function (CDF) of the beta distribution, which in R is computed by the pbeta function:

# posterior error probability (PEP)
pbeta( threshold, hank_aaron$alpha1[1], hank_aaron$beta1[1] )
## [1] 0.171

This probability that he doesn’t belong in the Hall of Fame is called the Posterior Error Probability, or PEP. We could easily have calculated the probability Aaron does belong, which we would call the Posterior Inclusion Probability, or PIP. (Note that \(\mbox{PIP} = 1-\mbox{PEP}\)) The reason we chose to measure the PEP rather than the PIP will become clear in the next section.
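For completeness, the PIP for Hank Aaron is just the complement of the PEP computed above (a one-line sketch, not part of the original code):

# posterior inclusion probability (PIP) = 1 - PEP
1 - pbeta( threshold, hank_aaron$alpha1[1], hank_aaron$beta1[1] )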

It’s equally straightforward to calculate the PEP for every player, just like we calculated the credible intervals for each player in the last section.

# PEP for every player
career[ , PEP := pbeta( threshold, alpha1, beta1 ) ]

What does the distribution of the PEP look like across players?

# histogram of PEP
ggplot( career, aes(PEP) ) + geom_histogram( binwidth = .02 ) +
xlab("Posterior Error Probability (PEP)") + xlim( 0, 1 )

Unsurprisingly, for most players, it’s almost certain that they don’t belong in the hall of fame: we know that their batting averages are way below .300. If they were included, it is almost certain that they would be an error. In the middle are the borderline players: the ones where we’re not sure. And down there close to 0 are the rare but proud players who we’re (almost) certain belong in the hall of fame.

The PEP is closely related to the estimated batting average:

# PEP and estimate batting average
ggplot( career, aes( estimate, PEP, color = AB ) ) + geom_point( size = 1 ) +
xlab("(Shrunken) batting average estimate") +
ylab("Posterior Error Probability (PEP)") +
geom_vline( color = "red", lty = 2, xintercept = threshold ) +
scale_colour_gradient( trans = "log", breaks = 10^(1:5) )

Notice that crossover point: to have a PEP less than 50%, you need to have a shrunken batting average greater than .3. That’s because the shrunken estimate is the center of our posterior beta distribution (the “over/under” point). If a player’s shrunken estimate is above .3, it’s more likely than not that their true average is as well. And the players we’re not sure about (PEP \(\approx\) .5) have batting averages very close to .300.

Notice also the relationship between the number of at-bats (the amount of evidence) and the PEP. If a player’s shrunken batting average is .28 but he hasn’t batted many times (low AB), it is still possible his true batting average is above .3, because his credible interval is wide. However, if the player with .28 has a high AB (light blue), the credible interval becomes narrower, we become confident that his true probability of hitting is under .3, and the PEP approaches 1.
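To make that concrete, here’s a small sketch with two hypothetical records, both chosen so that they shrink to an estimate of roughly .28 but with very different amounts of evidence. The low-AB record should still leave a real chance that the true average is above .3, while the high-AB record should give a PEP of essentially 1.

# hypothetical records, both with a shrunken estimate of about .28
low_H  <- 34;   low_AB  <- 100
high_H <- 2806; high_AB <- 10000

# PEP = P(true average < .3 | record)
pbeta( threshold, alpha0 + low_H,  beta0 + low_AB  - low_H  )
pbeta( threshold, alpha0 + high_H, beta0 + high_AB - high_H )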

False Discovery Rate

Now we want to set some threshold for inclusion in our Hall of Fame. This criterion is up to us: what kind of goal do we want to set? There are many options, and here’s a proposed one: let’s try to include as many players as possible, while ensuring that no more than 5% of the Hall of Fame was mistakenly included. Put another way, we want to ensure that if you’re in the Hall of Fame, the probability you belong there is at least 95%.

This criterion is called false discovery rate (FDR) control. It’s particularly relevant in scientific studies, where we might want to come up with a set of candidates (e.g. genes, countries, individuals) for future study. There’s nothing special about 5%: if we wanted to be more strict, we could choose the same policy, but change our desired FDR to 1% or .1%. Similarly, if we wanted a broader set of candidates to study, we could set an FDR of 10% or 20%.

Let’s start with the easy cases. Who are the players with the lowest posterior error probability?

# ranking PEP
career <- career[ order(PEP), ]
by_PEP <- career[ , rank := 1:nrow(career) 
               ][ , .( rank, name, H, AB, estimate, PEP ) ]

head( by_PEP, 10 )
##     rank                 name    H   AB estimate      PEP
##  1:    1       Rogers Hornsby 2930 8173    0.355 9.03e-28
##  2:    2         Ed Delahanty 2596 7505    0.343 2.91e-16
##  3:    3 Shoeless Joe Jackson 1772 4981    0.350 2.23e-15
##  4:    4        Willie Keeler 2932 8591    0.338 2.54e-15
##  5:    5           Nap Lajoie 3242 9589    0.336 9.63e-15
##  6:    6           Tony Gwynn 3141 9288    0.336 2.37e-14
##  7:    7       Harry Heilmann 2660 7787    0.338 4.15e-14
##  8:    8           Lou Gehrig 2721 8001    0.337 1.49e-13
##  9:    9       Billy Hamilton 2158 6268    0.340 9.10e-13
## 10:   10        Eddie Collins 3315 9949    0.331 5.91e-12

These players are a no-brainer for our Hall of Fame: there’s basically no risk in including them. But suppose we instead tried to include the top 100. What do the 90th-100th players look like?

by_PEP[ 90:100, ]
##     rank           name    H    AB estimate   PEP
##  1:   90   Rip Radcliff 1267  4074    0.307 0.145
##  2:   91    Mike Piazza 2127  6911    0.306 0.146
##  3:   92    Denny Lyons 1333  4294    0.307 0.151
##  4:   93  Don Mattingly 2153  7003    0.305 0.157
##  5:   94   Taffy Wright 1115  3583    0.307 0.169
##  6:   95     Hank Aaron 3771 12364    0.304 0.171
##  7:   96     John Stone 1391  4494    0.306 0.172
##  8:   97      Ed Morgan  879  2810    0.308 0.181
##  9:   98  Matt Holliday 1837  5972    0.305 0.184
## 10:   99   George Burns 2018  6573    0.305 0.190
## 11:  100 Home Run Baker 1838  5984    0.305 0.204

OK, so these players are borderline. We would guess that their career batting average is greater than .300, but we aren’t as certain.

So let’s say we chose to take the top 100 players for our Hall of Fame (thus, cut it off at Home Run Baker). What would we predict the false discovery rate to be? That is, what fraction of these 100 players would be falsely included? Well, we know the PEP of each of these 100 players, which is the probability that that individual player is a false positive. So we can simply add up these probabilities to get the expected value (the average) of the total number of false positives.

# top 100 players false positive rate
top_players <- career[ 1:100, ]
sum(top_players$PEP)
## [1] 4.7

This means that of these 100 players, we expect that about 4.697 of them are false discoveries. Now, we don’t know which four or five players we are mistaken about! (If we did, we could just kick them out of the hall). But we can make predictions about the players in aggregate. Here, we can see that taking the top 100 players would get pretty close to our goal of FDR = 5%.

Note that we’re calculating the FDR as 4.697 / 100 (number of players). Thus, we’re really computing the mean PEP: the average Posterior Error Probability.

mean(top_players$PEP)
## [1] 0.047

We can experiment with many thresholds to get our desired false discovery rate, but it’s even easier just to compute them all at once, by computing the cumulative mean of all the (sorted) posterior error probabilities.

Q-values

This cumulative mean of the sorted posterior error probabilities is called the q-value, which is what the code below names the column. The q-value is convenient because we can say “to control the FDR at X%, collect only hypotheses where \(q < X\)”.

# PEP's cumulative mean for meeting the false discovery rate
career[ , qvalue := cumsum(PEP) / 1:nrow(career) ]

# number of hall of famers for 5% false discovery rate
hall_of_fame <- career[ qvalue < .05, ]

This ends up with 101 players in the Hall of Fame. If we wanted to be more careful about letting players in, we’d simply set a stricter (smaller) q-value threshold.

It’s also useful to look at how many players would be included at various thresholds:

# numbers of players included at various q-value cutoff
ggplot( career[ qvalue < .25, ], aes( qvalue, rank ) ) + geom_line() +
labs( x = "q-value cutoff", y = "Number of players included" )

This shows that you could include 200 players in the Hall of Fame, but at that point you’d expect that about 25% of them would be incorrectly included. On the other hand, you could create a Hall of Fame with just 50 players and be very confident that all of them have a true batting probability above .300.
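Both of those claims are easy to verify from the table itself, since career is still sorted by PEP and carries the rank and qvalue columns (a quick sketch, not part of the original code):

# q-value of a 50-player versus a 200-player Hall of Fame:
# the former should be tiny, the latter close to the .25 cutoff
career[ rank %in% c( 50, 200 ), .( rank, name, qvalue ) ]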

It’s worth emphasizing the difference between measuring an individual’s posterior error probability and the q-value, which is the false discovery rate of a group including that player. Hank Aaron has a PEP of 0.171, but he can be included in the Hall of Fame while keeping the FDR below 5%. If this is surprising, imagine that you were instead trying to keep the average height above 6'0". You would start by including all players taller than 6'0", but could also include some players who were 5'10" or 5'11" while preserving your average. Similarly, we simply need to keep the average PEP of the players below 5%.

Takeaways

During this modeling process, we’ve made an enormous simplification by assuming that all batting averages are drawn from a single distribution. In reality, we’d expect that it depends on some known factors. For instance, the distribution of batting averages has changed over time, so ideally we’d want to estimate a different beta prior for each decade. Similarly, we could estimate separate priors for each team, a separate prior for pitchers, and so on. One useful approach to this is Bayesian hierarchical modeling, but that is another story.

R Session Information

sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggthemes_3.0.3   broom_0.4.0      Lahman_4.0-1     data.table_1.9.6
## [5] ggplot2_2.1.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.4      formatR_1.3      questionr_0.5    highr_0.5.1     
##  [5] plyr_1.8.3       rmdformats_0.2   tools_3.2.4      digest_0.6.9    
##  [9] lattice_0.20-33  nlme_3.1-125     evaluate_0.8.3   gtable_0.2.0    
## [13] psych_1.5.8      shiny_0.13.2     DBI_0.3.1        rstudioapi_0.5  
## [17] yaml_2.1.13      parallel_3.2.4   stringr_1.0.0    dplyr_0.4.3     
## [21] knitr_1.12.3     grid_3.2.4       R6_2.1.2         rmarkdown_0.9.6 
## [25] tidyr_0.4.1      reshape2_1.4.1   magrittr_1.5     scales_0.4.0    
## [29] htmltools_0.3.5  MASS_7.3-45      mnormt_1.5-4     assertthat_0.1  
## [33] mime_0.4         xtable_1.8-2     colorspace_1.2-6 httpuv_1.3.3    
## [37] labeling_0.3     stringi_1.0-1    miniUI_0.1.1     munsell_0.4.3   
## [41] chron_2.3-47

Ming-Yu Liu

2016-07-01