\documentclass[11pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{mathpazo}
\usepackage{fancyvrb}
\usepackage{natbib}
\usepackage{hyperref}

\let\dq="
\DefineShortVerb{\"}

\def\pkg#1{\textsf{#1}}
\def\lsm{\pkg{lsmeans}}
\def\code{\texttt}
\def\proglang{\textsf}

% double-quoted text
\def\dqt#1{\code{\dq{}#1\dq{}}}

% The objects I want to talk about
\def\rg{\dqt{ref.grid}}
\def\lsmo{\dqt{lsmobj}}

\def\R{\proglang{R}}
\def\SAS{\proglang{SAS}}
\def\Fig#1{Figure~\ref{#1}}
\def\bottomfraction{.5}

\title{Extending \lsm{}}
\author{Russell V.~Lenth}

%\VignetteIndexEntry{Extending lsmeans}
%\VignetteDepends{lsmeans}
%\VignetteKeywords{least-squares means}
%\VignettePackage{lsmeans}

% Initialization
<<echo=FALSE, results=hide>>=
options(show.signif.stars=FALSE, prompt="R> ", continue="    ")
set.seed(271828)
@

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle{}

\section{Introduction}
Suppose you want to use \lsm{} for some type of model that it doesn't (yet) support. Or, suppose you have developed a new package with a fancy model-fitting function, and you'd like it to work with \lsm{}. What can you do? Well, there is hope because \lsm{} is designed to be extended.

The first thing to do is to look at the help page for extending the package:
<<eval=FALSE>>=
help("extending-lsmeans", package="lsmeans")
@
It explains that you need to write two S3 methods, "recover.data" and "lsm.basis", for the class of object that your model-fitting function returns. The "recover.data" method is needed to recreate the dataset so that the reference grid can be identified. The "lsm.basis" method then determines the linear functions needed to evaluate each point in the reference grid, and obtains associated information---such as the variance-covariance matrix---needed to do estimation and testing.

This vignette presents an example where suitable methods are developed, and discusses a few issues that arise.

\section{Data example}
The \pkg{MASS} package contains various functions that do robust or outlier-resistant model fitting. We will cobble together some \lsm{} support for these. But first, let's create a suitable dataset (a simulated two-factor experiment) for testing.\footnote{I unapologetically use \code{=} as the assignment operator. It is good enough for C and Java, and supported by \R{}.}
<<>>=
fake = expand.grid(rep = 1:5, A = c("a1","a2"), B = c("b1","b2","b3"))
fake$y = c(11.46,12.93,11.87,11.01,11.92,17.80,13.41,13.96,14.27,15.82,
           23.14,23.75,-2.09,28.43,23.01,24.11,25.51,24.11,23.95,30.37,
           17.75,18.28,17.82,18.52,16.33,20.58,20.55,20.77,21.21,20.10)
@
The $y$ values were generated using predetermined means and Cauchy-distributed errors. There are some serious outliers in these data.

\section{Supporting \code{rlm}}
The \pkg{MASS} package provides an "rlm" function that fits robust-regression models using $M$~estimation. We'll fit a model using the default settings for all tuning parameters:
<<>>=
library(MASS)
fake.rlm = rlm(y ~ A * B, data = fake)

library(lsmeans)
lsmeans(fake.rlm, ~ B | A)
@
The first lesson to learn about extending \lsm{} is that sometimes, it already works! It works here because "rlm" objects inherit from "lm", which is supported by the \lsm{} package, and "rlm" objects aren't different enough to create any problems.

\section{Supporting \code{lqs} objects}
The \pkg{MASS} resistant-regression functions "lqs", "lmsreg", and "ltsreg" are another story, however. They create "lqs" objects that are not extensions of any other class, and they have other issues, including not even having a "vcov" method.
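A quick comparison of the classes shows the difference. (This check is a small sketch of my own, not needed for anything that follows; it assumes \pkg{MASS} is loaded, as above.)
<<eval=FALSE>>=
class(fake.rlm)                         # c("rlm", "lm"): inherits from "lm"
class(ltsreg(y ~ A * B, data = fake))   # just "lqs": no inheritance to lean on
@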
So for these, we really do need to write new methods for "lqs" objects. First, let's fit a model.
<<>>=
fake.lts = ltsreg(y ~ A * B, data = fake)
@

\subsection{The \code{recover.data} method}
It is usually an easy matter to write a "recover.data" method. Look at the one for "lm" objects:
<<>>=
lsmeans:::recover.data.lm
@
Note that all it does is obtain the "call" component and call the method for class \dqt{call}, with additional arguments for its "terms" component and "na.action". It happens that we can access these attributes in exactly the same way as for "lm" objects; so, \ldots
<<>>=
recover.data.lqs = lsmeans:::recover.data.lm
@
Let's test it. (The "data" argument is part of the required signature; passing "data = NULL" signals that no dataset is being supplied explicitly, as discussed below.)
<<>>=
rec.fake = recover.data(fake.lts, data = NULL)
head(rec.fake)
@
Our recovered data excludes the response variable "y" (owing to the "delete.response" call), and this is fine. By the way, the "data" argument is handed to "recover.data" if it is specified in the "ref.grid" or "lsmeans" call. It is needed to cover a desperate situation that occurs with certain kinds of models that are fitted by iteratively modifying the data. In those cases, the only way to recover the data is for the user to give it explicitly, and "recover.data" just adds a few needed attributes to it.

\subsection{The \code{lsm.basis} method}
The "lsm.basis" method has four required arguments:
<<>>=
args(lsmeans:::lsm.basis.lm)
@
These are, respectively, the model object, its "terms" component (at least for the right-hand side of the model), a "list" of levels of the factors, and the grid of predictor combinations that specify the reference grid.

The function must obtain six things and return them in a named "list". They are the matrix "X" of linear functions for each point in the reference grid; the regression coefficients "bhat"; the variance-covariance matrix "V"; a matrix "nbasis" for non-estimable functions; a function "dffun(k, dfargs)" for computing degrees of freedom for the linear function "sum(k*bhat)"; and a list "dfargs" of arguments to pass to "dffun".

To write your own "lsm.basis" method, examining some of the existing methods can help; but the best resource is the "predict" method for the object in question. Look carefully to see what it does to predict values for a new set of predictors (e.g., "newdata" in "predict.lm"). Following this advice, let's take a look at "predict.lqs":
<<>>=
MASS:::predict.lqs
@

\RecustomVerbatimEnvironment{Sinput}{Verbatim}{numbers=left}
Based on this, here is a listing of an "lsm.basis" method for "lqs" objects:
<<>>=
lsm.basis.lqs = function(object, trms, xlev, grid, ...) {
    m = model.frame(trms, grid, na.action = na.pass, xlev = xlev)
    X = model.matrix(trms, m, contrasts.arg = object$contrasts)
    bhat = coef(object)
    Xmat = model.matrix(trms, data=object$model)
    V = rev(object$scale)[1]^2 * solve(t(Xmat) %*% Xmat)
    nbasis = matrix(NA)
    dfargs = list(df = nrow(Xmat) - ncol(Xmat))
    dffun = function(k, dfargs) dfargs$df
    list(X=X, bhat=bhat, nbasis=nbasis, V=V, dffun=dffun, dfargs=dfargs)
}
@
\RecustomVerbatimEnvironment{Sinput}{Verbatim}{numbers=none}
Before explaining it, let's verify that it works:
<<>>=
lsmeans(fake.lts, ~ B | A)
@
Hooray! Note that the results are comparable to those we had for "fake.rlm", albeit the standard errors are quite a bit smaller.
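It may help to see how \lsm{} puts these six pieces to work. The following sketch is my own illustration (the names "trms", "grd", and "bas" are mine, and the computation is a simplification of what the package actually does); it reproduces the estimates and standard errors by hand:
<<eval=FALSE>>=
trms = delete.response(terms(fake.lts))
grd = expand.grid(A = c("a1","a2"), B = c("b1","b2","b3"))
bas = lsm.basis.lqs(fake.lts, trms,
                    xlev = list(A = c("a1","a2"), B = c("b1","b2","b3")),
                    grid = grd)
cbind(grd,
      estimate = as.numeric(bas$X %*% bas$bhat),       # X %*% bhat
      SE = sqrt(diag(bas$X %*% bas$V %*% t(bas$X))),   # sqrt(diag(X V X'))
      df = bas$dffun(NULL, bas$dfargs))                # same df for every k here
@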
\subsection{Dissecting \code{lsm.basis.lqs}}
Let's go through the listing of this method, by line numbers.
\begin{itemize}
\item[2--3:] Construct the linear functions, "X". This is a pretty standard two-step process: First obtain a model frame, "m", for the grid of predictors, then pass it as data to "model.matrix" to create the associated design matrix. As promised, this code is essentially identical to what you find in "predict.lqs".
\item[4:] Obtain the coefficients, "bhat". Most model objects have a "coef" method.
\item[5--6:] Obtain the covariance matrix, "V", of "bhat". In many models, this can be obtained using the object's "vcov" method. But not in this case. Instead, I cobbled one together using what it would be for ordinary regression: $\hat\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, where $\mathbf{X}$ is the design matrix for the whole dataset (not the reference grid). Here, $\hat\sigma$ is obtained from the last element of the "scale" component of the object (depending on the method, there are one or two scale estimates). This probably under-estimates the variances and distorts the covariances, because robust estimators have some efficiency loss.
\item[7:] Compute the basis for non-estimable functions. This applies only when there is a possibility of rank deficiency in the model, and "lqs" methods cannot handle that. All linear functions are estimable, and we signal that by setting "nbasis" equal to a $1\times1$ matrix of "NA". If rank deficiency were possible, \lsm{} provides a fairly pain-free way to handle this---I would have coded:
<<eval=FALSE>>=
nbasis = nonest.basis(Xmat)
@
There is a subtlety you need to know, though. Suppose the model is rank-deficient, so that the design matrix $\mathbf{X}$ has $p$ columns but rank $r < p$. Then "bhat" should still be of length $p$ (not $r$), with "NA" values in the positions corresponding to the columns of $\mathbf{X}$ that were excluded from the fit; and "X" should have all $p$ columns. That is, do not throw out columns of "X" or the corresponding elements of "bhat", because their positions are needed for assessing estimability.
\item[8--9:] Obtain "dffun" and "dfargs". As described earlier, "dffun(k, dfargs)" must return the degrees of freedom for the linear function "sum(k*bhat)". In this model, the degrees of freedom do not depend on "k", so we store the error degrees of freedom for an ordinary regression, $n - p$, in "dfargs", and write "dffun" so that it simply returns that value.
\end{itemize}

The covariance matrix we cobbled together in lines 5--6 is crude at best. If we are not comfortable reporting such questionable standard errors, a cautious alternative is to provide only the estimates, filling in "NA"s for the statistics we cannot trust:
<<>>=
lsm.basis.lqs = function(object, trms, xlev, grid, ...) {
    m = model.frame(trms, grid, na.action = na.pass, xlev = xlev)
    X = model.matrix(trms, m, contrasts.arg = object$contrasts)
    bhat = coef(object)
    V = diag(rep(NA, length(bhat)))
    nbasis = matrix(NA)
    dffun = function(k, dfargs) NA
    list(X=X, bhat=bhat, nbasis=nbasis, V=V, dffun=dffun, dfargs=list())
}
@
And here is a test:
<<>>=
lsmeans(fake.lts, pairwise ~ B)
@

\section{Conclusions}
It is relatively simple to write appropriate methods that work with \lsm{} for model objects it does not support. I hope this vignette is helpful for understanding how. Furthermore, if you are the developer of a package that fits linear models, I encourage you to include "recover.data" and "lsm.basis" methods for those classes of objects, and to remember to export them in your "NAMESPACE" file as follows:
\begin{verbatim}
S3method(recover.data, myobject)
S3method(lsm.basis, myobject)
\end{verbatim}
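To make that advice concrete, here is a hypothetical skeleton for a package whose fitted objects have class \dqt{myobject} and behave much like "lm" fits. It is only a sketch under those assumptions: the components it relies on ("call", "contrasts", "df.residual", and a working "vcov" method) belong to the imagined class, and a real method will likely need adjustments.
<<eval=FALSE>>=
recover.data.myobject = function(object, ...) {
    # delegate to the method for class "call", as recover.data.lm does
    recover.data(object$call, delete.response(terms(object)),
                 object$na.action, ...)
}

lsm.basis.myobject = function(object, trms, xlev, grid, ...) {
    m = model.frame(trms, grid, na.action = na.pass, xlev = xlev)
    X = model.matrix(trms, m, contrasts.arg = object$contrasts)
    bhat = coef(object)
    V = vcov(object)    # assumes the class provides a vcov method
    list(X = X, bhat = bhat, nbasis = matrix(NA), V = V,
         dffun = function(k, dfargs) dfargs$df,
         dfargs = list(df = object$df.residual))
}
@

\end{document}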