A PRIMER OF MAXIMUM LIKELIHOOD
PROGRAMMING IN R
Marco R. Steenbergen∗
2012
Abstract
R is an excellent platform for maximum likelihood programming. These
notes describe the maxLik package, a “wrapper” that gives access to the
most important hill-climbing algorithms and provides a convenient way of
displaying results. The notes also include a discussion of the likelihood
ratio and Wald tests in R.
∗ Correspondence: Chair of Political Methodology, Department of Political Science (IPZ), University of Zurich, Affolternstrasse 50, CH-8050, Switzerland. Email: [email protected].
Introduction
The programming language R is rapidly gaining ground among political methodologists. A major reason is that R is a flexible and versatile language, which
makes it easy to program new routines. In addition, R algorithms are generally
very precise.
R is well-suited for programming your own maximum likelihood routines.
Indeed, there are several procedures for optimizing likelihood functions. Here I
shall focus on the maxLik package, which implements a variety of algorithms
and offers a convenient way of displaying results.1 Among the implemented
hill-climbing algorithms are Newton-Raphson, BFGS, and BHHH. Our focus in
these notes is on using numerical derivatives.
[1] Full documentation can be downloaded from http://cran.r-project.org/. The package was written by Ott Toomet and Arne Henningsen. Other options for optimizing likelihood functions include nlm, optim, and optimx.
Syntactic Structure
Estimating likelihood functions entails a two-step process: first, one declares the log-likelihood function; next, one optimizes it. The
log-likelihood function and optimization command may be typed interactively
into the R command window or they may be contained in a text file. I would
recommend saving log-likelihood functions into a text file, especially if you plan
on using them frequently.
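A saved file can then be loaded at the start of any session with source(). A minimal sketch, assuming a hypothetical file name in the working directory:
source("poisson-loglik.R")   # loads the log-likelihood function(s) stored in the file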
Declaring the Log-Likelihood Function
The log-likelihood function is declared as an R function. In R, functions take
at least one argument, to wit a parameter or vector of parameters. It is often
useful, however, to also declare a data object. This serves as a placeholder for
the actual data that will be analyzed.
The generic syntax takes the following form:
name<-function(pars,object) {
declarations
}
Here pars is the name of a parameter or vector of parameters, while object is the name of a generic data object.2 The declarations section includes the log-likelihood function. Depending on the needs, it may also specify the contents of the parameter vector. The precise specification of the log-likelihood function depends on whether one uses the BHHH algorithm or BFGS/Newton-Raphson.
[2] Depending on the situation, more than one object may have to be specified (see Example 3).
Single Parameter When the log-likelihood function includes only one parameter and numerical derivatives are used in the estimation, the declaration typically requires only one line. What this line looks like depends on the specific hill-climbing algorithm that will be used. For the BHHH algorithm, the log-likelihood is specified in terms of a single observation, as the following example illustrates.3
[3] This is akin to specifying the log-likelihood in Stata under the linear form restriction.
Example 1(a) Consider the Poisson probability mass function with
unit log-likelihood values of
ℓi = −µ + yi ln µ − ln(yi !)
This may be programmed in the following way:
lnl<-function(mu,y) {
-1*mu + y*log(mu) - lfactorial(y)
}
Here y is the generic placeholder for the data and lfactorial(y) is the R function for ln(y!). Since the declaration of the function consists of only one line, the curly braces could have been omitted.
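Because BHHH works with observation-level contributions, this function should return one value per observation. A quick check, using a small hypothetical data vector:
lnl(2,c(1,0,3))   # returns three unit log-likelihood values, one per observation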
For the BFGS and Newton-Raphson algorithms, one needs to specify the full log-likelihood function, i.e., ℓ = Σi ℓi. The following example illustrates this for the Poisson distribution.
2
Depending on the situation, more than one object may have to be specified (see Example
3
This is akin to specifying the log-likelihood in Stata under the linear form restriction.
3).
3
Example 1(b) For the Poisson distribution, the full log-likelihood
is given by
ℓ = −nµ + (Σi yi) ln µ − Σi ln(yi!)
Assuming the data are stored in the object data, this may be programmed in the following manner:4
n<-length(data)
lnl<-function(mu,y) {
-n*mu + sum(y)*log(mu) - sum(lfactorial(y))
}
Here, sum is the sum operator.
[4] It is assumed there are no missing values. In general, in R it is best to remove missing values (or, much better, impute them) prior to performing an analysis.
Multiple Parameters When the log-likelihood function contains multiple parameters, the R commands change only in the sense that pars is now a vector whose contents need to be specified. This modification is generally quite easy to make, as the following example for the normal distribution shows.
Example 2 Imagine Y ∼ N(µ, σ²). We seek to estimate the parameters µ and σ using the Newton-Raphson algorithm. The kernel of the log-likelihood function is:5
ℓ = −n ln(σ) − .5 Σi (yi − µ)²/σ²
We can program this log-likelihood in R using the following syntax:
lnl<-function(pars,y) {
mu<-pars[1]
sigma<-pars[2]
n<-length(y)
-n*log(sigma)-.5*sum((y-mu)^2/sigma^2)
}
We have now declared pars to be a vector with two elements, the first of which is µ and the second of which is σ. The number of observations, n, is computed inside the function from the data vector. Because the declaration now contains multiple lines, the curly braces are mandatory.
[5] For the BHHH algorithm one would simply re-specify this for a single observation.
Linear Regression Model In the examples so far, no covariates were included. In most applied work, however, covariates are of central importance. Including them is not difficult but does require a bit of additional work. The following example illustrates the programming steps for the linear regression model with normal errors.
Example 3 Consider the linear regression model y = Xβ + ε or ε = y − Xβ, with ε ∼ N(0, σ²I). The kernel of the log-likelihood function is given by
ℓ = −.5n ln(σ²) − .5 εᵀε/σ²
The following syntax allows us to estimate the parameter vector β
and the scalar σ 2 :
reg.lik<-function(theta,y,X) {
n<-nrow(X)
k<-ncol(X)
beta<-theta[1:k]
sigma2<-theta[k+1]
e<-y-X%*%beta
-.5*n*log(sigma2)-.5*(t(e)%*%e)/sigma2
}
The first line starts the declaration of the log-likelihood function. The parameter vector is theta and there are two data objects, to wit the vector of outcomes (y) and the matrix of covariates (X). The second and third lines compute the number of observations (n) and the number of predictors (k), including the constant; computing these inside the function keeps the likelihood self-contained. The fourth and fifth lines partition theta into a k × 1 vector β and the scalar σ². The sixth line specifies the vector ε, which is then fed into the log-likelihood function in line seven.6
[6] Note that %*% is the matrix product operator in R. Also note that t(e) transposes the vector ε.
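Note that X must contain a column of ones if the model is to include a constant. A minimal sketch of how the data objects used in Example 6 below might be constructed, assuming a hypothetical outcome vector y and covariate X1:
ydata<-y             # vector of outcomes
xdata<-cbind(1,X1)   # covariate matrix: a constant plus X1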
Optimization
Once the log-likelihood function has been declared, the actual maxLik routine
is invoked to perform optimization. The generic syntax is
name<-maxLik(loglikelihood, method, starting values, object)
This causes R to maximize the function specified in loglikelihood with starting values specified in starting values and an algorithm specified in method.
The estimation results are stored in name. The whole process is applied to the
data in object.
Method The maxLik command contains a relatively large number of optimization algorithms. For our purposes, the most useful of these are BFGS,
BHHH, and Newton-Raphson. These algorithms can be specified as follows:
method="BFGS", method="BHHH", or method="NR"
One should select only one of the aliases, BFGS for BFGS, BHHH for BHHH, or
NR for Newton-Raphson. Since maxLik is a so-called “wrapper,” it will now
select the appropriate R routine to implement the desired algorithm.7
[7] In general, “wrappers” in R are programs that provide a single syntax to access a variety of built-in routines, each of which may have a unique syntax. The “wrapper” takes arguments specified by the user and translates them for the desired routine. After passing the arguments to the routine, the results are then displayed in a uniform format.
Starting Values When a single parameter is estimated, the declaration of starting values is extremely simple:
start=#
Here # is a particular value for the parameter, chosen in such a way that the log-likelihood function is defined for that value. The following example illustrates
this for the Poisson distribution.
Example 4 Consider Example 1(a). To optimize the log-likelihood
using the BHHH algorithm and a starting value µ0 = 1, one provides
the following syntax:
results<-maxLik(lnl,start=1,method="BHHH",y=data)
Here data is assumed to be the name of the data object. The
syntax y=data substitutes this for the placeholder y. The results
from the optimization are stored in results.
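For a complete run, one might simulate data and verify that the estimate recovers the true parameter. A minimal sketch, assuming the lnl function from Example 1(a) and hypothetical simulated data:
library(maxLik)
set.seed(123)                              # hypothetical seed for reproducibility
data<-rpois(500,lambda=3)                  # simulated Poisson counts with mu = 3
results<-maxLik(lnl,start=1,method="BHHH",y=data)
summary(results)                           # the estimate of mu should be close to 3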
When the log-likelihood contains several parameters, these will need to be
concatenated:
start=c(#1,#2,...)
Here #1, #2, etc. are successive starting values, which should be specified in
the same order as the parameters, and c is the concatenation operator. The
following example shows this process for the normal distribution.
Example 5 In Example 2, we programmed the normal log-likelihood
function. Imagine we want to optimize this using the Newton-Raphson algorithm with starting values taken from the standard
normal distribution. We could do this by issuing the following syntax:
results<-maxLik(lnl,method="NR",start=c(0,1),y=data)
Remember that the vector pars had µ as its first element and σ as
its second element. The command start=c(0,1) thus assigns 0
to µ and 1 to σ as starting values. Those are the parameter values
from the standard normal distribution.
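As with the Poisson example, a complete run might look as follows. This is a minimal sketch, assuming the normal lnl function from Example 2 and hypothetical simulated data; note that nothing in the code prevents sigma from wandering into negative territory during the search, so sensible starting values matter:
set.seed(456)                             # hypothetical seed for reproducibility
data<-rnorm(500,mean=2,sd=1.5)            # simulated normal data
results<-maxLik(lnl,method="NR",start=c(0,1),y=data)
coef(results)                             # estimates of mu and sigma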
It is possible to use previous estimation results as starting values, as the
following example shows.
Example 6 Consider again Example 3. Imagine that X consists
of a constant and the variable X1. One can now run the following
regression and save the regression coefficients:
ols<-lm(y ~ X1)
beta.start<-coefficients(ols)
Imagine we wish to use beta.start as the starting values for β and
assign the starting value 1 to σ 2 . The following syntax accomplishes
this task.
estimates<-maxLik(reg.lik,start=c(beta.start,1),
method="BFGS",y=ydata,X=xdata)
Here the BFGS algorithm was chosen for optimization. Note that
the maxLik command now specifies two data objects, which happen
to be named ydata (for y) and xdata (for X).
Output
The maxLik package produces a variety of output. Specifically, one can display the estimates, the value of the log-likelihood, the number of estimated parameters, the variance-covariance matrix of the estimates, and messages concerning
convergence. To display the estimates, all one needs to specify is
summary(name)
where name is the name of the object specified in the maxLik command. This
shows the optimization routine used, whether the routine exited normally, the
value of the log-likelihood at the maximum, and the parameter estimates, their
standard errors, t-values, and p-values. Figure 1 shows a sample output for the normal distribution.
[Figure 1: R MLE Commands and Output]
The value of the log-likelihood function at the maximum can also be obtained using
logLik(name)
This can be useful if one needs the log-likelihood for additional computations
such as the computation of a likelihood ratio test statistic. In this case, one
should assign a name to the log-likelihood (e.g., loglik<-logLik(name)). In
conjunction with the log-likelihood, it may be useful to obtain the number of
estimated parameters, which can be done using8
nParam(name,free=TRUE)
[8] The default is to set free=FALSE, but this gives the total number of parameters including those that have been constrained.
The variance-covariance matrix of the estimates may also be of considerable
interest and can be obtained using
vcov(name)
This can be used to compute a Wald test statistic, as will be shown below.
To see if the optimizer performed properly, one can issue the following commands:
returnCode(name)
returnMessage(name)
The first command yields a numeric code whose meaning depends on the algorithm: the code 0 indicates success for the BFGS algorithm, 1 indicates success for the Newton-Raphson algorithm, and 2 indicates success for the BHHH algorithm. The second command yields a
verbal statement. For Newton-Raphson, one would like to see: “gradient close
to zero.” For BFGS, the desired message is “successful convergence.” Finally,
a successful completion of the BHHH algorithm is marked by the message “successive function values within tolerance limit.”
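Putting the output commands together, a minimal sketch assuming the results object created in Example 4:
summary(results)               # estimates, standard errors, t-values, p-values
loglik<-logLik(results)        # log-likelihood at the maximum
npar<-nParam(results,free=TRUE)   # number of freely estimated parameters
V<-vcov(results)               # variance-covariance matrix of the estimates
returnCode(results)            # numeric convergence code
returnMessage(results)         # verbal convergence message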
Constrained Optimization
It is easy to impose linear constraints in R. All one has to do is to add a fixed
option to the maxLik command. This allows one to specify which parameters
should be estimated and which ones should be fixed. Fixed parameters will be
set equal to their starting value. The following example shows how this is done.
Example 7 Consider again the normal distribution from Example
2. Imagine we only wish to estimate the mean and set the standard
deviation to 1. This can be done as follows.
r<-maxLik(lnl,method="NR",y=data,
start=c(0,1),fixed=c(FALSE,TRUE))
The “action” is in the second line. Here we state that the first
parameter, µ, should not be fixed (fixed=FALSE), whereas the second parameter, σ, should be fixed (fixed=TRUE). Since the starting
value for that parameter is 1, the effect of the estimation command
is to set σ = 1.
Hypothesis Testing
The following syntax shows how to perform likelihood ratio and Wald tests
based on the results from maxLik. The code is a bit ugly but works.9
[9] For models estimated using lm or glm, I recommend the lmtest package, which is more elegant.
Likelihood Ratio Test
The likelihood ratio test assumes that two models are estimated, to wit an
unconstrained model and one that imposes certain constraints. Letting ℓU be
the log-likelihood of the unconstrained model and ℓR the log-likelihood of the
constrained model, then
LR = 2(ℓU − ℓR) ∼asy χ²q
where q is the number of constraints.
To perform the likelihood ratio test in R, one needs to store ℓU , ℓR , and the
number of estimated parameters in the constrained and unconstrained models.
One should then compute LR, q, and the p-value. Imagine that the objects
lnlu and lnlr are the log-likelihoods for the unconstrained and constrained
models, respectively. Imagine further that the number of free parameters in the
unconstrained model is stored in pu and that of the constrained model in pr.
Then the following syntax will perform the required computations and display
the results.
LR<-2*(lnlu-lnlr)
df<-pu-pr
p<-pchisq(LR,df,ncp=0,lower.tail=FALSE)
results<-cbind(LR,df,p)
colnames(results)<-c("LR","df","p")
rownames(results)<-c("Test")
print(results,digits=#)
where # is the desired precision with which the test statistic and p-value are
reported. The following example illustrates the use of this syntax.
Example 8 Consider again the normal distribution. We wish to
test the hypothesis that σ = 1 using a likelihood ratio test. We
begin by estimating the unconstrained model (in this case using the
Newton-Raphson algorithm).
res1<-maxLik(lnl,start=c(0,1),method="NR",y=data)
lnlu<-logLik(res1)
pu<-nParam(res1,free=TRUE)
Next, we estimate the constrained model (as shown in Example 7).
res2<-maxLik(lnl,start=c(0,1),method="NR",y=data,
fixed=c(FALSE,TRUE))
lnlr<-logLik(res2)
pr<-nParam(res2,free=TRUE)
Now proceed with the syntax to compute the LR test statistic,
degrees of freedom, and p-value.
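Continuing the example, the generic syntax above might be filled in as follows (a sketch; with one fixed parameter there is one degree of freedom, and three digits of precision are an arbitrary choice):
LR<-2*(lnlu-lnlr)
df<-pu-pr
p<-pchisq(LR,df,ncp=0,lower.tail=FALSE)
results<-cbind(LR,df,p)
colnames(results)<-c("LR","df","p")
rownames(results)<-c("Test")
print(results,digits=3)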
Wald Test
For the Wald test, hypotheses are formulated in the form of the linear function
Rθ = r
Here, the matrix R selects and combines parameters from the vector θ. The
resulting linear combinations are then set equal to the values specified in the
vector r. The following example illustrates how this is done.
Example 9 Consider again the normal distribution and the hypothesis σ = 1. Placing the parameters in the vector θᵀ = (µ σ), the hypothesis can be formulated using the following linear constraint:
Rθ = r
[0 1] (µ σ)ᵀ = 1
Here, the matrix R selects the element σ from θ and sets the result
equal to 1.
To perform a Wald test only the unconstrained model needs to be estimated.
The estimates are then compared to the values under constraint, using Rθ̂ − r.
If the null hypothesis is true, then this quantity should be close to zero, i.e.,
the estimates should be close to the hypothesized values. The complete test
statistic is given by
W = (Rθ̂ − r)ᵀ (RV Rᵀ)⁻¹ (Rθ̂ − r) ∼asy χ²q
Here V is the variance-covariance matrix of the estimates, which in R can be obtained using vcov(name). Further, q is the degrees of freedom, i.e., the number of constraints, which is equal to the number of rows in R. The following example shows how the test can be computed.10
[10] Here the symbol R has been replaced by Q, as t(R) is sometimes interpreted as the trademark symbol for R rather than as a transpose.
Example 10 Consider again the normal distribution with the null
hypothesis σ = 1. Imagine the results are stored in the object res.
We begin by storing the estimates in a vector and by obtaining the
variance-covariance matrix.
theta<-coef(res)
V<-vcov(res)
Next, the matrices for the constraints are defined.
Q<-matrix(c(0,1),nrow=1,ncol=2)
r<-1
Now we are ready to define the test statistic, degrees of freedom,
and p-value and to report these in the desired format.11
[11] Note that solve is the R command for obtaining the inverse of a matrix.
W<-t(Q%*%theta-r) %*% solve(Q%*%V%*%t(Q)) %*%
(Q%*%theta-r)
df<-nrow(Q)
p<-pchisq(W,df,ncp=0,lower.tail=FALSE)
results<-cbind(W,df,p)
colnames(results)<-c("Wald","df","p")
rownames(results)<-c("Test")
print(results,digits=#)
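To avoid retyping these steps for every hypothesis, the computations can be wrapped in a small helper function. This is not part of maxLik; it is a minimal sketch of a hypothetical function wald.test that takes a fitted maxLik object, a constraint matrix Q, and a vector of hypothesized values r:
wald.test<-function(fit,Q,r) {
theta<-coef(fit)                               # unconstrained estimates
V<-vcov(fit)                                   # their variance-covariance matrix
d<-Q%*%theta-r                                 # discrepancy from hypothesized values
W<-as.numeric(t(d)%*%solve(Q%*%V%*%t(Q))%*%d)  # Wald statistic
df<-nrow(Q)                                    # number of constraints
p<-pchisq(W,df,lower.tail=FALSE)
c(Wald=W,df=df,p=p)
}
For Example 10, wald.test(res,matrix(c(0,1),nrow=1),1) reproduces the test of σ = 1.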