\documentclass[12pt]{article} %\VignettePackage{doBy} %\VignetteIndexEntry{doBy: Groupwise computations and other utilities} \usepackage{hyperref,color,boxedminipage,Sweave} \usepackage[utf8]{inputenc} \def\proglang#1{{#1}} \def\pkg#1{{#1}} \def\doby{\pkg{doBy}} \def\code#1{\texttt{#1}} \def\shd#1{\footnote{SHD: #1}} \def\summaryby{\code{summaryBy}} \def\R{\proglang{R}} \special{html: } <>= require( doBy ) prettyVersion <- packageDescription("doBy")$Version prettyDate <- format(Sys.Date()) @ \SweaveOpts{keep.source=T,prefix.string=figures/doBy} %\setkeys{Gin}{height=3in} \title{Groupwise computations and other utilities in the \texttt{doBy} package} \author{S{\o}ren H{\o}jsgaard} \date{\pkg{doBy} version \Sexpr{prettyVersion} as of \Sexpr{prettyDate}} \begin{document} \maketitle %\SweaveInput{Rmarkup.STY} \definecolor{darkred}{rgb}{.7,0,0} \definecolor{midnightblue}{rgb}{0.098,0.098,0.439} \DefineVerbatimEnvironment{Sinput}{Verbatim}{ fontfamily=tt, %%fontseries=b, %% xleftmargin=2em, formatcom={\color{midnightblue}} } \DefineVerbatimEnvironment{Soutput}{Verbatim}{ fontfamily=tt, %%fontseries=b, %% xleftmargin=2em, formatcom={\color{darkred}} } \DefineVerbatimEnvironment{Scode}{Verbatim}{ fontfamily=tt, %%fontseries=b, %% xleftmargin=2em, formatcom={\color{blue}} } \fvset{listparameters={\setlength{\topsep}{-2pt}}} \renewenvironment{Schunk}{\linespread{.90}}{} \tableofcontents % \renewenvironment{Schunk}{\begin{center} % \scriptsize % \begin{boxedminipage}{1.0\textwidth}}{ % \end{boxedminipage}\end{center}} @ <>= dir.create("figures") oopt <- options() options("digits"=4, "width"=80, "prompt"=" ", "continue"=" ") options(useFancyQuotes="UTF-8") @ %def \parindent0pt\parskip5pt \section{Introduction} \label{sec:introduction} The \doby{} package contains a variety of utility functions. This working document describes some of these functions. The package originally grew out of a need to calculate groupwise summary statistics (much in the spirit of \code{PROC SUMMARY} of the \proglang{SAS} system), but today the package contains many different utilities. % The \doby\ package (and this document as a .pdf file) is available % from % \url{http://cran.r-project.org/web/packages/doBy/index.html} %The package is loaded with: @ <>= library(doBy) @ %def \section{Data used for illustration} \label{sec:co2data} The description of the \code{doBy} package is based on the following datasets. \paragraph{CO2 data} The \code{CO2} data frame comes from an experiment on the cold tolerance of the grass species {\em Echinochloa crus-galli}. To limit the amount of output we modify names and levels of variables as follows @ <<>>= data(CO2) CO2 <- transform(CO2, Treat=Treatment, Treatment=NULL) levels(CO2$Treat) <- c("nchil","chil") levels(CO2$Type) <- c("Que","Mis") CO2 <- subset(CO2, Plant %in% c("Qn1", "Qc1", "Mn1", "Mc1")) @ %def %Data is shown in Section~\ref{sec:appdata}. \paragraph{Airquality data} The \code{airquality} dataset contains air quality measurements in New York, May to September 1973. The months are coded as $5,\dots,9$. To limit the output we only consider data for two months: @ <<>>= airquality <- subset(airquality, Month %in% c(5,6)) @ %def %Data is shown in Section~\ref{sec:appdata}. \paragraph{Dietox data} The \code{dietox} data are provided in the \code{doBy} package and result from a study of the effect of adding vitamin E and/or copper to the feed of slaughter pigs. \section{Working with groupwise data} \subsection{The \code{summaryBy} function} \label{sec:summaryBy} The \summaryby{} function is used for calculating quantities like ``the mean and variance of $x$ and $y$ for each combination of two factors $A$ and $B$''. Examples are based on the \code{CO2} data. \subsubsection{Basic usage} \label{sec:xxx} The mean and variance of \code{uptake} and \code{conc} for each value of \code{Plant} is obtained by: @ <<>>= myfun1 <- function(x){c(m=mean(x), v=var(x))} summaryBy( conc + uptake ~ Plant, data=CO2, FUN=myfun1) @ %def Above \code{myfun1()} is a function that returns a vector of named values. Note that the values returned by the function has been named as \code{m} and \code{v}. An alternative specification is: @ <<>>= summaryBy( list(c("conc","uptake"), "Plant"), data=CO2, FUN=myfun1) @ %def If the result of the function(s) are not named, then the names in the output data in general become less intuitive: @ <<>>= myfun2 <- function(x){c(mean(x), var(x))} summaryBy( conc + uptake ~ Plant, data=CO2, FUN=myfun2) @ %def Another usage is to specify a list of functions each of which returns a single value: @ <<>>= summaryBy( conc + uptake ~ Plant, data=CO2, FUN=list( mean, var ) ) @ %def Notice that if we specify a list of functions of which some returns a vector with more than one element, then the proper names are not retrieved: @ <<>>= summaryBy(uptake~Plant, data=CO2, FUN=list( mean, var, myfun1 )) @ %def One can ``hard code'' the function names into the output as @ <<>>= summaryBy(uptake~Plant, data=CO2, FUN=list( mean, var, myfun1 ), fun.names=c("mean","var","mm","vv")) @ %def \subsubsection{Statistics on functions of data} \label{sec:xxx} We may want to calculate the mean and variance for the logarithm of \code{uptake}, for \code{uptake}+\code{conc} (not likely to be a useful statistic) as well as for \code{uptake} and \code{conc}. This can be achieved as: @ <<>>= summaryBy(log(uptake) + I(conc+uptake) + conc+uptake ~ Plant, data=CO2, FUN=myfun1) @ %def The names of the variables become involved with this. The user may control the names of the variables directly: @ <<>>= summaryBy(log(uptake) + I(conc+uptake) + conc + uptake ~ Plant, data=CO2, FUN=myfun1, var.names=c("log.upt", "conc+upt", "conc", "upt")) @ %def If one does not want output variables to contain parentheses then setting \code{p2d=TRUE} causes the parentheses to be replaced by dots (``.''). @ <<>>= summaryBy(log(uptake)+I(conc+uptake)~Plant, data=CO2, p2d=TRUE, FUN=myfun1) @ %def \subsubsection{Copying variables out with the \code{id} argument} \label{sec:xxx} To get the value of the \code{Type} and \code{Treat} in the first row of the groups (defined by the values of \code{Plant}) copied to the output dataframe we use the \code{id} argument in one of the following forms: @ <<>>= summaryBy(conc+uptake~Plant, data=CO2, FUN=myfun1, id=~Type+Treat) summaryBy(conc+uptake~Plant, data=CO2, FUN=myfun1, id=c("Type","Treat")) @ %def \subsubsection{Using '.' on the left hand side of a formula} \label{sec:xxx} It is possible to use the dot (".") on the left hand side of the formula. The dot means "all numerical variables which do not appear elsewhere" (i.e.\ on the right hand side of the formula and in the \code{id} statement): @ <<>>= summaryBy(log(uptake)+I(conc+uptake)+. ~Plant, data=CO2, FUN=myfun1) @ %def \subsubsection{Using '.' on the right hand side of a formula} \label{sec:xxx} The dot (".") can also be used on the right hand side of the formula where it refers to "all non--numerical variables which are not specified elsewhere": @ <<>>= summaryBy(log(uptake) ~Plant+., data=CO2, FUN=myfun1) @ %def \subsubsection{Using '1' on the right hand side of the formula} \label{sec:xxx} Using 1 on the right hand side means no grouping: @ <<>>= summaryBy(log(uptake) ~ 1, data=CO2, FUN=myfun1) @ %def \subsubsection{Preserving names of variables using \code{keep.names}} \label{sec:xxx} If the function applied to data only returns one value, it is possible to force that the summary variables retain the original names by setting \code{keep.names=TRUE}. A typical use of this could be @ <<>>= summaryBy(conc+uptake+log(uptake)~Plant, data=CO2, FUN=mean, id=~Type+Treat, keep.names=TRUE) @ %def \subsection{The \code{orderBy} function} \label{orderBy} Ordering (or sorting) a data frame is possible with the \code{orderBy} function. Suppose we want to order the rows of the the \code{airquality} data by \code{Temp} and by \code{Month} (within \code{Temp}). This can be achieved by: @ <>= x<-orderBy(~Temp+Month, data=airquality) @ %def The first lines of the result are: @ <>= head(x) @ %def If we want the ordering to be by decreasing values of one of the variables, we change the sign, e.g. @ <<>>= x<-orderBy(~-Temp+Month, data=airquality) head(x) @ %def \subsection{The \code{splitBy} function} \label{splitBy} Suppose we want to split the \code{airquality} data into a list of dataframes, e.g.\ one dataframe for each month. This can be achieved by: @ <<>>= x<-splitBy(~Month, data=airquality) x @ %def Hence for month 5, the relevant entry-name in the list is '5' and this part of data can be extracted as @ <>= x[['5']] @ %def Information about the grouping is stored as a dataframe in an attribute called \code{groupid} and can be retrieved with: <<>>= attr(x,"groupid") @ %def \subsection{The \code{sampleBy} function} \label{sampleBy} Suppose we want a random sample of 50 \% of the observations from a dataframe. This can be achieved with: @ <>= sampleBy(~1, frac=0.5, data=airquality) @ %def Suppose instead that we want a systematic sample of every fifth observation within each month. This is achieved with: @ <>= sampleBy(~Month, frac=0.2, data=airquality,systematic=T) @ %def \subsection{The \code{subsetBy} function} \label{subsetBy} Suppose we want to select those rows within each month for which the the wind speed is larger than the mean wind speed (within the month). This is achieved by: @ <>= subsetBy(~Month, subset=Wind>mean(Wind), data=airquality) @ %def Note that the statement \code{Wind>mean(Wind)} is evaluated within each month. \subsection{The \code{transformBy} function} \label{sec:transformby} The \code{transformBy} function is analogous to the \code{transform} function except that it works within groups. For example: @ <>= transformBy(~Month, data=airquality, minW=min(Wind), maxW=max(Wind), chg=sum(range(Wind)*c(-1,1))) @ %def \subsection{The \code{lapplyBy} function} \label{sec:transformby} This \code{lapplyBy} function is a wrapper for first splitting data into a list according to the formula (using splitBy) and then applying a function to each element of the list (using apply). Suppose we want to calculate the weekwise feed efficiency of the pigs in the \code{dietox} data, i.e. weight gain divided by feed intake. @ <<>>= data(dietox) dietox <- orderBy(~Pig+Time, data=dietox) FEfun <- function(d){c(NA, diff(d$Weight)/diff(d$Feed))} v <- lapplyBy(~Pig, data=dietox, FEfun) dietox$FE <- unlist(v) @ %def Technically, the above is the same as @ <<>>= dietox <- orderBy(~Pig+Time, data=dietox) wdata <- splitBy(~Pig, data=dietox) v <- lapply(wdata, FEfun) dietox$FE <- unlist(v) @ %def \subsection{The \code{scaleBy} function} Standardize the \code{iris} data within each value of \code{"Species"}: @ <<>>= x<-scaleBy( list(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"), "Species"), data=iris) head(x) head(iris) @ %def \section{Create By--functions on the fly} \label{sec:create-functions-fly} Create a function for creating groupwise t-tests @ <<>>= mydata <- data.frame(y=rnorm(32), x=rnorm(32), g1=factor(rep(c(1,2),each=16)), g2=factor(rep(c(1,2), each=8)), g3=factor(rep(c(1,2),each=4))) head(mydata) @ %def @ <<>>= ## Based on the formula interface to t.test t.testBy1 <- function(formula, group, data, ...){ formulaFunBy(formula, group, data, FUN=t.test, class="t.testBy1", ...) } ## Based on the default interface to t.test t.testBy2 <- function(formula, group, data, ...){ xyFunBy(formula, group, data, FUN=t.test, class="t.testBy1", ...) } @ %def Notice: The optional \code{class} argument will facilitate that you create your own print / summary methods etc. @ <<>>= t.testBy1(y~g1, ~g2, data=mydata) t.testBy2(y~x, ~g2, data=mydata) @ %def \section{Miscellaneous} \label{sec:xxx} \subsection{Specialize} \label{sec:specialize} @ <<>>= ff <- function(a,b=2,c=4){a+b+c} ff1 <- specialize(ff, arglist=list(a=1, b=7, yy=123)) ff1 gg <- rnorm gg1 <- specialize(gg, list(n=10)) gg1 @ %def Notice that this result is absurd: @ <<>>= f <- function(a) {a <- a + 1; a} f1 <- specialize(f, list(a = 10)) f1 @ %def \subsection{The \code{firstobs()} / \code{lastobs()} function} \label{firstlast} To obtain the indices of the first/last occurences of an item in a vector do: @ <<>>= x <- c(1,1,1,2,2,2,1,1,1,3) firstobs(x) lastobs(x) @ %def The same can be done on a data frame, e.g. @ <<>>= firstobs(~Plant, data=CO2) lastobs(~Plant, data=CO2) @ %def \subsection{The \code{which.maxn()} and \code{which.minn()} functions} \label{sec:whichmaxn} The location of the $n$ largest / smallest entries in a numeric vector can be obtained with @ <<>>= x <- c(1:4,0:5,11,NA,NA) which.maxn(x,3) which.minn(x,5) @ %def \subsection{Subsequences - \code{subSeq()}} \label{sec:xxx} Find (sub) sequences in a vector: @ <<>>= x <- c(1,1,2,2,2,1,1,3,3,3,3,1,1,1) subSeq(x) subSeq(x, item=1) subSeq(letters[x]) subSeq(letters[x],item="a") @ %def \subsection{Recoding values of a vector - \code{recodeVar()}} \label{sec:xxx} @ <<>>= x <- c("dec","jan","feb","mar","apr","may") src1 <- list(c("dec","jan","feb"), c("mar","apr","may")) tgt1 <- list("winter","spring") recodeVar(x,src=src1,tgt=tgt1) @ %def \subsection{Renaming columns of a dataframe or matrix -- \code{renameCol()}} \label{sec:xxx} @ <<>>= head(renameCol(CO2, 1:2, c("kk","ll"))) head(renameCol(CO2, c("Plant","Type"), c("kk","ll"))) @ %def \subsection{Time since an event - \code{timeSinceEvent()}} \label{sec:xxx} Consider the vector @ <<>>= yvar <- c(0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0) @ %def Imagine that "1" indicates an event of some kind which takes place at a certain time point. By default time points are assumed equidistant but for illustration we define time time variable @ <<>>= tvar <- seq_along(yvar) + c(0.1,0.2) @ %def Now we find time since event as @ <>= tse<- timeSinceEvent(yvar,tvar) @ %def The output reads as follows: \begin{itemize} \item \verb'abs.tse': Absolute time since (nearest) event. \item \verb'sign.tse': Signed time since (nearest) event. \item \verb'ewin': Event window: Gives a symmetric window around each event. \item \verb'run': The value of \verb'run' is set to $1$ when the first event occurs and is increased by $1$ at each subsequent event. \item \verb'tae': Time after event. \item \verb'tbe': Time before event. \end{itemize} @ <>= plot(sign.tse~tvar, data=tse, type="b") grid() rug(tse$tvar[tse$yvar==1], col='blue',lwd=4) points(scale(tse$run), col=tse$run, lwd=2) lines(abs.tse+.2~tvar, data=tse, type="b",col=3) @ %def @ <>= plot(tae~tvar, data=tse, ylim=c(-6,6),type="b") grid() lines(tbe~tvar, data=tse, type="b", col='red') rug(tse$tvar[tse$yvar==1], col='blue',lwd=4) lines(run~tvar, data=tse, col='cyan',lwd=2) @ %def @ <>= plot(ewin~tvar, data=tse,ylim=c(1,4)) rug(tse$tvar[tse$yvar==1], col='blue',lwd=4) grid() lines(run~tvar, data=tse,col='red') @ %def We may now find times for which time since an event is at most 1 as @ <<>>= tse$tvar[tse$abs<=1] @ %def \subsection{Example: Using \code{subSeq()} and \code{timeSinceEvent()}} Consider the \verb|lynx| data: @ <>= lynx <- as.numeric(lynx) tvar <- 1821:1934 plot(tvar,lynx,type='l') @ %def Suppose we want to estimate the cycle lengths. One way of doing this is as follows: @ <<>>= yyy <- lynx>mean(lynx) head(yyy) sss <- subSeq(yyy,TRUE) sss @ %def @ <>= plot(tvar,lynx,type='l') rug(tvar[sss$midpoint],col='blue',lwd=4) @ %def Create the 'event vector' @ <<>>= yvar <- rep(0,length(lynx)) yvar[sss$midpoint] <- 1 str(yvar) @ %def @ <<>>= tse <- timeSinceEvent(yvar,tvar) head(tse,20) @ %def We get two different (not that different) estimates of period lengths: @ <>= len1 <- tapply(tse$ewin, tse$ewin, length) len2 <- tapply(tse$run, tse$run, length) c(median(len1),median(len2),mean(len1),mean(len2)) @ %def We can overlay the cycles as: @ <>= tse$lynx <- lynx tse2 <- na.omit(tse) plot(lynx~tae, data=tse2) @ %def @ <>= plot(tvar,lynx,type='l',lty=2) mm <- lm(lynx~tae+I(tae^2)+I(tae^3), data=tse2) lines(fitted(mm)~tvar, data=tse2, col='red') @ %def \section{Acknowledgements} \label{discussion} Credit is due to Dennis Chabot, Gabor Grothendieck, Paul Murrell, Jim Robison-Cox and Erik J{\o}rgensen for reporting various bugs and making various suggestions to the functionality in the \doby{} package. @ <>= options(oopt) @ %def \end{document} \appendix \section{The data} \label{sec:appdata} The reduced \code{C02} are: @ <<>>= CO2 @ %def The reduced \code{airquality} data are: @ <<>>= head(airquality, n=20) @ %def