\documentclass[a4paper]{article} \usepackage[margin=3cm]{geometry} %%\usepackage[round]{natbib} \usepackage[colorlinks=true,urlcolor=blue]{hyperref} %%\newcommand{\acronym}[1]{\textsc{#1}} %%\newcommand{\class}[1]{\mbox{\textsf{#1}}} \newcommand{\code}[1]{\mbox{\texttt{#1}}} \newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}} \newcommand{\proglang}[1]{\textsf{#1}} \SweaveOpts{keep.source=TRUE} %% \VignetteIndexEntry{Frequently asked questions} <>= if (!exists("data.table",.GlobalEnv)) library(data.table) # see Intro.Rnw for comments on these two lines rm(list=as.character(tables()$NAME),envir=.GlobalEnv) options(width=70) # so lines wrap round @ \begin{document} \title{FAQs about the \pkg{data.table} package in \proglang{R}} \author{Matthew Dowle} \date{Revised: \today\\(A later revision may be available on the \href{http://datatable.r-forge.r-project.org/}{homepage})} \maketitle The first section, Beginner FAQs, is intended to be read in order, from start to finish. It may be read before reading the \href{http://datatable.r-forge.r-project.org/datatable-intro.pdf}{10 minute introduction to data.table} vignette. \tableofcontents \section{Beginner FAQs} \subsection{Why does \code{DT[,5]} return \code{5}?} Because, by default, unlike a \code{data.frame}, the 2nd argument is an \emph{expression} which is evaluated within the scope of \code{DT}. 5 evaluates to 5. It is generally bad practice to refer to columns by number rather than name. If someone else comes along and reads your code later, they may have to hunt around to find out which column is number 5. Furthermore, if you or someone else changes the column ordering of \code{DT} higher up in your \proglang{R} program, you might get bugs if you forget to change all the places in your code which refer to column number 5. Say column 5 is called ''region'', just do \code{DT[,region]} instead. Notice there are no quotes around the column name. 
This is what we mean by \code{j} being evaluated within the scope of the \code{data.table}. That scope consists of an environment where the column names are variables. You can write \emph{any} \proglang{R} expression in the \code{j}; e.g., \code{DT[,colA*colB/2]}. Further, \code{j} may be a \code{list()} of many \proglang{R} expressions, including calls to any \proglang{R} package; e.g., \code{DT[,fitdistr(d1-d2,"normal")]}.

Having said this, there are some circumstances where referring to a column by number is ok, such as a sequence of columns. In these situations just do \code{DT[,5:10,with=FALSE]} or \newline
\code{DT[,c(1,4,10),with=FALSE]}. See \code{?data.table} for an explanation of the \code{with} argument.

Note that \code{with()} has been a base function for a long time. That's one reason we say \code{data.table} builds upon base functionality. There is little new here really; \code{data.table} is just making use of \code{with()} and building it into the syntax.

\subsection{Why does \code{DT[,"region"]} return \code{"region"}?}
See the answer to FAQ 1.1 above. Try \code{DT[,region]} instead. Or \code{DT[,"region",with=FALSE]}.

\subsection{Why does \code{DT[,region]} return a vector? I'd like a 1-column \code{data.table}. There is no \code{drop} argument like I'm used to in \code{data.frame}.}
Try \code{DT[,list(region)]} instead.

\subsection{Why does \code{DT[,x,y,z]} not work? I wanted the 3 columns \code{x},\code{y} and \code{z}.}
The \code{j} expression is the 2nd argument, so \code{y} and \code{z} are taken as further arguments to \code{[.data.table} rather than as columns. The correct way to do this is \code{DT[,list(x,y,z)]}.

\subsection{I assigned a variable \code{mycol="x"} but then \code{DT[,mycol]} returns \code{"x"}. How do I get it to look up the column name contained in the \code{mycol} variable?}
This is what we mean when we say the \code{j} expression 'sees' objects in the calling scope.
The variable \code{mycol} does not exist as a column name of \code{DT} so \proglang{R} then looked in the calling scope and found \code{mycol} there, and returned its value \code{"x"}. This is correct behaviour. Had \code{mycol} been a column name, then that column's data would have been returned. What you probably meant was \code{DT[,mycol,with=FALSE]}, which will return the \code{x} column's data as you wanted. Alternatively, since a \code{data.table} \emph{is} a \code{list}, too, you can write \code{DT[["x"]]} or \code{DT[[mycol]]}.

\subsection{Ok but I don't know the expressions in advance. How do I programmatically pass them in?}
To create expressions use the \code{quote()} function. We refer to these as \emph{quote()-ed} expressions to avoid confusion with the double quotes used to create a character vector such as \code{c("x")}. The simplest quote()-ed expression is just one column name :

\code{q = quote(x)}

\code{DT[,eval(q)] \# returns the column x as a vector}

\code{q = quote(list(x))}

\code{DT[,eval(q)] \# returns the column x as a 1-column data.table}
\newline
Since these are \emph{expressions}, we are not restricted to column names only :

\code{q = quote(mean(x))}

\code{DT[,eval(q)] \# identical to DT[,mean(x)]}

\code{q = quote(list(x,sd(y),mean(y*z)))}

\code{DT[,eval(q)] \# identical to DT[,list(x,sd(y),mean(y*z))]}
\newline
However, if it's just simply a vector of column names you need, it may be simpler to pass a character vector to \code{j} and use \code{with=FALSE}.

To pass an expression into your own function, one idiom is as follows :
<<>>=
DT = as.data.table(iris)
setkey(DT,Species)
myfunction = function(dt, expr) {
    e = substitute(expr)
    dt[,eval(e),by=Species]
}
myfunction(DT,sum(Sepal.Width))
@

\subsection{This is really hard. What's the point?}
\code{j} doesn't have to be just column names. You can write any \proglang{R} \emph{expression} of column names directly as the \code{j}; e.g., \code{DT[,mean(x*y/z)]}.
The same applies to \code{i}; e.g., \code{DT[x>1000, sum(y*z)]}. This runs the \code{j} expression on the set of rows where the \code{i} expression is true. You don't even need to return data; e.g., \code{DT[x>1000, plot(y,z)]}. When we get to compound table joins we will see how \code{i} and \code{j} can themselves be other \code{data.table} queries. We are going to stretch \code{i} and \code{j} much further than this, but to get there we need you on board first with FAQs 1.1-1.6. \subsection{OK, I'm starting to see what \code{data.table} is about, but why didn't you enhance \code{data.frame} in \proglang{R}? Why does it have to be a new package?} As FAQ 1.1 highlights, \code{j} in \code{[.data.table} is fundamentally different from \code{j} in \code{[.data.frame}. Even something as simple as \code{DF[,1]} would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ \ref{faq:SmallerDiffs}). Furthermore, \code{data.table} \emph{inherits} from \code{data.frame}. It \emph{is} a \code{data.frame}, too. A \code{data.table} can be passed to any package that only accepts \code{data.frame} and that package can use \code{[.data.frame} syntax on the \code{data.table}. We \emph{have} proposed enhancements to \proglang{R} wherever possible, too. One of these was accepted as a new feature in \proglang{R} 2.12.0 : \begin{quotation}unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.\end{quotation} A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the \emph{way} that \proglang{R} copies data internally (on some measures by 13 times). 
The thread on r-devel is here : \url{http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html}.

\subsection{Why are the defaults the way they are? Why does it work the way it does?}
The simple answer is that the author designed it for his own use, and he wanted it that way. He finds it a more natural, faster way to write code, which also executes more quickly.

\subsection{Isn't this already done by \code{with()} and \code{subset()} in base?}
Some of the features discussed so far are, yes. The package builds upon base functionality. It does the same sorts of things but with less code required, and executes many times faster if used correctly.

\subsection{Why does \code{X[Y]} return all the columns from \code{Y} too? Shouldn't it return a subset of \code{X}?}
This was changed in v1.5.3. \code{X[Y]} now includes \code{Y}'s non-join columns. We refer to this feature as \emph{join inherited scope} because not only are \code{X} columns available to the \code{j} expression, so are \code{Y} columns. The downside is that \code{X[Y]} is less efficient since every item of \code{Y}'s non-join columns is duplicated to match the (likely large) number of rows in \code{X} that match. We therefore strongly encourage \code{X[Y,j]} instead of \code{X[Y]}. See the next FAQ.

\subsection{What is the difference between \code{X[Y]} and \code{merge(X,Y)}?}
\code{X[Y]} is a join, looking up \code{X}'s rows using \code{Y} (or \code{Y}'s key if it has one) as an index.\newline
\code{Y[X]} is a join, looking up \code{Y}'s rows using \code{X} (or \code{X}'s key if it has one) as an index.\newline
\code{merge(X,Y)}\footnote{Here we mean either the \code{merge} \emph{method} for \code{data.table} or the \code{merge} method for \code{data.frame} since both methods work in the same way in this respect. See \code{?merge.data.table} and FAQ 2.24 for more information about method dispatch.} does both ways at the same time.
The number of rows of \code{X[Y]} and \code{Y[X]} usually differ, whereas the number of rows returned by \code{merge(X,Y)} and \code{merge(Y,X)} is the same.

\emph{BUT} that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest \code{merge(X[,ColsNeeded1],Y[,ColsNeeded2])}, but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. \code{X[Y,j]} in \code{data.table} does all that in one step for you. When you write \code{X[Y,sum(foo*bar)]}, \code{data.table} automatically inspects the \code{j} expression to see which columns it uses. It will subset those columns only; the others are ignored. Memory is only created for the columns the \code{j} uses, and \code{Y} columns enjoy standard \proglang{R} recycling rules within the context of each group. Let's say \code{foo} is in \code{X}, and \code{bar} is in \code{Y} (along with 20 other columns in \code{Y}). Isn't \code{X[Y,sum(foo*bar)]} quicker to program and quicker to run than a \code{merge} followed by a \code{subset}?

\subsection{Anything else about \code{X[Y,sum(foo*bar)]}?}
Remember that \code{j} (in this example \code{sum(foo*bar)}) is run for each \emph{group} of \code{X} that each row of \code{Y} matches to. This feature is called \emph{grouping by i} or \emph{by without by}. For example, and making it complicated by using \emph{join inherited scope}, too :
<<>>=
X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
setkey(X,grp)
Y = data.table(c("b","c"), bar=c(4,2))
X
Y
X[Y,sum(foo*bar)]
@

\subsection{That's nice but what if I really do want to evaluate \code{j} for all rows once, not by group?}
If you really want \code{j} to run once for the whole subset of \code{X} then try \code{X[Y][,sum(foo*bar)]}.
If that needs to be efficient (recall that \code{X[Y]} joins all columns) then you will have to work a little harder since this is outside the common use-case: \code{X[Y,list(foo,bar)][,sum(foo*bar)]}. \section{General syntax} \subsection{How can I avoid writing a really long \code{j} expression? You've said I should use the column \emph{names}, but I've got a lot of columns.} When grouping, the \code{j} expression can use column names as variables, as you know, but it can also use a reserved symbol \code{.SD} which refers to the {\bf S}ubset of the \code{{\bf D}ata.table} for each group (excluding the grouping columns). So to sum up all your columns it's just \code{DT[,lapply(.SD,sum),by=grp]}. It might seem tricky, but it's fast to write and fast to run. Notice you don't have to create an anonymous \code{function}. See the timing vignette and wiki for comparison to other methods. The \code{.SD} object is efficiently implemented internally and more efficient than passing an argument to a function. But if the \code{.SD} symbol appears in \code{j} then \code{data.table} has to populate \code{.SD} fully for each group even if \code{j} doesn't use all of it. So please don't do this, for example, \code{DT[,sum(.SD[["sales"]]),by=grp]}. That works but is inefficient and inelegant. This is what was intended: \code{DT[,sum(sales),by=grp]} and could be 100's of times faster. If you do use all the data in \code{.SD} for each group (such as in \code{DT[,lapply(.SD,sum),by=grp]}) then that's very good usage of \code{.SD}. Also see \code{?data.table} for the \code{.SDcols} argument. \subsection{Why is the default for \code{mult} now \code{"all"}?} In v1.5.3 the default was changed to \code{"all"}. When \code{i} (or \code{i}'s key if it has one) has fewer columns than \code{x}'s key, \code{mult} was already set to \code{"all"} automatically. Changing the default makes this clearer and easier for users as it came up quite often. In versions up to v1.3, \code{"all"} was slower. 
Internally, \code{"all"} was implemented by joining using \code{"first"}, then again from scratch using \code{"last"}, after which a diff between them was performed to work out the span of the matches in \code{x} for each row in \code{i}. Most often we join to single rows, though, where \code{"first"},\code{"last"} and \code{"all"} return the same result. We preferred maximum performance for the majority of situations so the default chosen was \code{"first"}. When working with a non-unique key (generally a single column containing a grouping variable), \code{DT["A"]} returned the first row of that group so \code{DT["A",mult="all"]} was needed to return all the rows in that group. In v1.4 the binary search in C was changed to branch at the deepest level to find first and last. That branch will likely occur within the same final pages of RAM so there should no longer be a speed disadvantage in defaulting \code{mult} to \code{"all"}. We warned that the default might change, and made the change in v1.5.3. A future version of \code{data.table} may allow a distinction between a key and a \emph{unique key}. Internally \code{mult="all"} would perform more like \code{mult="first"} when all \code{x}'s key columns were joined to and \code{x}'s key was a unique key. \code{data.table} would need checks on insert and update to make sure a unique key is maintained. An advantage of specifying a unique key would be that \code{data.table} would ensure no duplicates could be inserted, in addition to performance. \subsection{I'm using \code{c()} in the \code{j} and getting strange results.} This is a common source of confusion. In \code{data.frame} you are used to, for example: <<>>= DF = data.frame(x=1:3,y=4:6,z=7:9) DF DF[,c("y","z")] @ which returns the two columns. In \code{data.table} you know you can use the column names directly and might try : <<>>= DT = data.table(DF) DT[,c(y,z)] @ but this returns one vector. 
Remember that the \code{j} expression is evaluated within the environment of \code{DT}, and \code{c()} returns a vector. If 2 or more columns are required, use list() instead: <<>>= DT[,list(y,z)] @ \code{c()} can be useful in a \code{data.table} too, but its behaviour is different from that in \code{[.data.frame}. \subsection{I have built up a complex table with many columns. I want to use it as a template for a new table; i.e., create a new table with no rows, but with the column names and types copied from my table. Can I do that easily?} Yes. If your complex table is called \code{DT}, try \code{NEWDT = DT[0]}. \subsection{Is a null data.table the same as \code{DT[0]}?} No. By "null data.table" we mean the result of \code{data.table(NULL)} or \code{as.data.table(NULL)}; i.e., <<>>= data.table(NULL) data.frame(NULL) as.data.table(NULL) as.data.frame(NULL) is.null(data.table(NULL)) is.null(data.frame(NULL)) @ The null \code{data.table|frame} is \code{NULL} with some attributes attached, making it not NULL anymore. In R only pure \code{NULL} is \code{NULL} as tested by \code{is.null()}. When referring to the "null data.table" we use lower case null to help distinguish from upper case \code{NULL}. To test for the null data.table, use \code{length(DT)==0} or \code{ncol(DT)==0} (\code{length} is slightly faster as it's a primitive function). An \emph{empty} data.table (\code{DT[0]}) has one or more columns, all of which are empty. Those empty columns still have names and types. <<>>= DT = data.table(a=1:3,b=c(4,5,6),d=c(7L,8L,9L)) DT[0] sapply(DT[0],class) @ \subsection{Why has the \code{DT()} alias been removed?}\label{faq:DTremove1} \code{DT} was introduced originally as a wrapper for a list of \code{j} expressions. Since \code{DT} was an alias for \code{data.table}, this was a convenient way to take care of silent recycling in cases where each item of the \code{j} list evaluated to different lengths. The alias was one reason grouping was slow, though. 
As of v1.3, \code{list()} should be passed instead to the \code{j} argument. \code{list()} is a primitive and is much faster, especially when there are many groups. Internally, this was a nontrivial change. Vector recycling is now done internally, along with several other speed enhancements for grouping. \subsection{But my code uses \code{j=DT(...)} and it works. The previous FAQ says that \code{DT()} has been removed.}\label{faq:DTremove2} Then you are using a version prior to 1.5.3. Prior to 1.5.3 \code{[.data.table} detected use of \code{DT()} in the \code{j} and automatically replaced it with a call to \code{list()}. This was to help the transition for existing users. \subsection{What are the scoping rules for \code{j} expressions?} Think of the subset as an environment where all the column names are variables. When a variable \code{foo} is used in the \code{j} of a query such as \code{X[Y,sum(foo)]}, \code{foo} is looked for in the following order : \begin{enumerate} \item The scope of \code{X}'s subset; i.e., \code{X}'s column names. \item The scope of each row of \code{Y}; i.e., \code{Y}'s column names (\emph{join inherited scope}) \item The scope of the calling frame; e.g., the line that appears before the \code{data.table} query. \item Exercise for reader: does it then ripple up the calling frames, or go straight to \code{globalenv()}? \item The global environment \end{enumerate} This is \emph{lexical scoping} as explained in \href{http://cran.r-project.org/doc/FAQ/R-FAQ.html#Lexical-scoping}{R FAQ 3.3.1}. The environment in which the function was created is not relevant, though, because there is \emph{no function}. No anonymous \emph{function} is passed to the \code{j}. Instead, an anonymous \emph{body} is passed to the \code{j}; for example, <<>>= DT = data.table(x=rep(c("a","b"),c(2,3)),y=1:5) DT DT[,{z=sum(y);z+3},by=x] @ Some programming languages call this a \emph{lambda}. 
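
The lookup order listed above can be imitated in base \proglang{R}. The following is only a minimal sketch and not \code{data.table}'s internal code: \code{evalJ} is a hypothetical helper that captures an anonymous body with \code{substitute()} and evaluates it with the columns visible first and the calling frame second. The names \code{DF} and \code{bonus} are likewise made up for the illustration.

```r
# Hypothetical helper (an assumption for illustration, not data.table code):
# evaluate an unevaluated j body with the columns of `cols` visible as
# variables, falling back to the frame that called evalJ().
evalJ <- function(cols, j) {
  jbody  <- substitute(j)      # capture the body without evaluating it
  caller <- parent.frame()     # the calling frame
  eval(jbody, envir = as.list(cols), enclos = caller)
}

DF    <- data.frame(x = c("a", "a", "b"), y = 1:3)
y     <- c(1000, 2000)         # same name as a column, but in the calling frame
bonus <- 100                   # exists only in the calling frame

evalJ(DF, {z = sum(y); z + 3})  # y is the column: 6 + 3
evalJ(DF, sum(y) + bonus)       # y is the column, bonus comes from the caller
```

Because the column \code{y} is found before the variable \code{y} of the calling frame, the first call returns 9 and the second returns 106. No function is created at any point; only the body is evaluated.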
\subsection{Can I trace the \code{j} expression as it runs through the groups?}
Try something like this:
<<>>=
DT[,{
  cat("Objects:",paste(objects(),collapse=","),"\n")
  cat("Trace: x=",as.character(x)," y=",y,"\n")
  sum(y)
},by=x]
@

\subsection{Inside each group, why is the group variable a long vector containing the same value repeated?}
Please upgrade to v1.6.1, or later; this is no longer true. In the previous FAQ, \code{x} is a grouping variable and now has length 1 for efficiency and convenience. Prior to v1.6.1, \code{x} repeated the group value to match the number of rows in that group. There is no longer any difference between the following two statements.
<<>>=
DT[,list(g=1,h=2,i=3,j=4,repeatgroupname=x,sum(y)),by=x]
DT[,list(g=1,h=2,i=3,j=4,repeatgroupname=x[1],sum(y)),by=x]
@
Code written prior to v1.6.1 that uses \code{[1]} will still work, but the \code{[1]} is no longer necessary.

\subsection{Only the first 10 rows are printed, how do I print more?}
There are two things happening here. First, if the number of rows in a \code{data.table} is large (\code{> 100} by default), then a summary of the \code{data.table} is printed to the console by default. Second, the summary of a large \code{data.table} is printed by taking the top and bottom \code{n} rows of the \code{data.table} and only printing those. Both of these parameters (when to trigger a summary, and how much of a table to use as a summary) are configurable by \proglang{R}'s \code{options} mechanism, or by calling the \code{print} function directly.

For instance, to make the summary of a \code{data.table} happen only when it has more than 50 rows, you could set \code{options(datatable.print.nrows=50)}. To disable the summary-by-default completely, you could set \code{options(datatable.print.nrows=Inf)}. You could also call \code{print} directly, as in \code{print(your.data.table, nrows=Inf)}.
If you want to show more than just the top (and bottom) 10 rows of a \code{data.table} summary (say you would like 20), set \code{options(datatable.print.topn=20)}, for example. Again, you could also just call \code{print} directly, as in \code{print(your.data.table, topn=20)}.

\subsection{With an \code{X[Y]} join, what if \code{X} contains a column called \code{"Y"}?}
When \code{i} is a single name such as \code{Y} it is evaluated in the calling frame. In all other cases such as calls to \code{J()} or other expressions, \code{i} is evaluated within the scope of \code{X}. This facilitates easy \emph{self joins} such as \code{X[J(unique(colA)),mult="first"]}.

\subsection{\code{X[Z[Y]]} is failing because \code{X} contains a column \code{"Y"}. I'd like it to use the table \code{Y} in calling scope.}
The \code{Z[Y]} part is not a single name so that is evaluated within the frame of \code{X} and the problem occurs. Try \code{tmp=Z[Y];X[tmp]}. This is robust to \code{X} containing a column \code{"tmp"} because \code{tmp} is a single name. If you often encounter conflicts of this type, one simple solution may be to name all tables in uppercase and all column names in lowercase, or some similar scheme.

\subsection{Can you explain further why \code{data.table} is inspired by \code{A[B]} syntax in base?}
Consider \code{A[B]} syntax using an example matrix \code{A} :
<<>>=
A = matrix(1:12,nrow=4)
A
@
To obtain cells (1,2)=5 and (3,3)=11 many users (we believe) may try this first :
<<>>=
A[c(1,3),c(2,3)]
@
That returns the union of those rows and columns, though. To reference the cells, a 2-column matrix is required. \code{?Extract} says :
\begin{quotation}
When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x; the result is then a vector with elements corresponding to the sets of indices in each row of i.
\end{quotation}
Let's try again.
<<>>= B = cbind(c(1,3),c(2,3)) B A[B] @ A matrix is a 2-dimension structure with row names and column names. Can we do the same with names? <<>>= rownames(A) = letters[1:4] colnames(A) = LETTERS[1:3] A B = cbind(c("a","c"),c("B","C")) A[B] @ So, yes we can. Can we do the same with \code{data.frame}? <<>>= A = data.frame(A=1:4,B=letters[11:14],C=pi*1:4) rownames(A) = letters[1:4] A B A[B] @ But, notice that the result was coerced to character. \proglang{R} coerced \code{A} to matrix first so that the syntax could work, but the result isn't ideal. Let's try making \code{B} a \code{data.frame}. <<>>= B = data.frame(c("a","c"),c("B","C")) cat(try(A[B],silent=TRUE)) @ So we can't subset a \code{data.frame} by a \code{data.frame} in base R. What if we want row names and column names that aren't character but integer or float? What if we want more than 2 dimensions of mixed types? Enter \code{data.table}. Furthermore, matrices, especially sparse matrices, are often stored in a 3 column tuple: (i,j,value). This can be thought of as a key-value pair where \code{i} and \code{j} form a 2-column key. If we have more than one value, perhaps of different types it might look like (i,j,val1,val2,val3,...). This looks very much like a \code{data.frame}. Hence \code{data.table} extends \code{data.frame} so that a \code{data.frame X} can be subset by a \code{data.frame Y}, leading to the \code{X[Y]} syntax. \subsection{Can base be changed to do this then, rather than a new package?} \code{data.frame} is used \emph{everywhere} and so it is very difficult to make \emph{any} changes to it. \code{data.table} \emph{inherits} from \code{data.frame}. It \emph{is} a \code{data.frame}, too. A \code{data.table} \emph{can} be passed to any package that \emph{only} accepts \code{data.frame}. When that package uses \code{[.data.frame} syntax on the \code{data.table}, it works. It works because \code{[.data.table} looks to see where it was called from. 
If it was called from such a package, \code{[.data.table} diverts to \code{[.data.frame}. \subsection{I've heard that \code{data.table} syntax is analogous to SQL.} Yes : \begin{itemize} \item{\code{i} <==> where} \item{\code{j} <==> select} \item{\code{:=} <==> update} \item{\code{by} <==> group by} \item{\code{i} <==> order by (in compound syntax)} \item{\code{i} <==> having (in compound syntax)} \item{\code{nomatch=NA} <==> outer join} \item{\code{nomatch=0} <==> inner join} \item{\code{mult="first"|"last"} <==> N/A because SQL is inherently unordered} \item{\code{roll=TRUE} <==> N/A because SQL is inherently unordered} \end{itemize} The general form is : \newline \code{\hspace*{2cm}DT[where,select|update,group by][having][order by][ ]...[ ]} \newline\newline A key advantage of column vectors in \proglang{R} is that they are \emph{ordered}, unlike SQL\footnote{It may be a surprise to learn that \code{select top 10 * from ...} does \emph{not} reliably return the same rows over time in SQL. You do need to include an \code{order by} clause, or use a clustered index to guarantee row order; i.e., SQL is inherently unordered.}. We can use ordered functions in \code{data.table} queries, such as \code{diff()}, and we can use \emph{any} \proglang{R} function from any package, not just the functions that are defined in SQL. A disadvantage is that \proglang{R} objects must fit in memory, but with several \proglang{R} packages such as ff, bigmemory, mmap and indexing, this is changing. 
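
As a hedged illustration of the mapping above, the sketch below writes a few SQL statements next to their \code{data.table} counterparts. The table \code{DT} and its columns \code{region} and \code{sales} are hypothetical and exist only for this example; the SQL in the comments is schematic.

```r
library(data.table)

# Hypothetical sales table, used only for illustration
DT <- data.table(region = c("east", "east", "west", "west", "west"),
                 sales  = c(10, 20, 30, 40, 50))

# SQL: select region, sum(sales) as total from DT
#      where sales > 10 group by region
res <- DT[sales > 10, list(total = sum(sales)), by = region]
res

# SQL: ... having sum(sales) > 50 order by total desc
# Compound syntax: each [ ] works on the result of the previous one.
top <- res[total > 50][order(-total)]
top

# SQL: update DT set sales = sales * 2 where region = 'east'
DT[region == "east", sales := sales * 2]
DT
```

Here \code{i} plays the role of \code{where} in the first query and of \code{having} and \code{order by} in the compound query, while \code{:=} updates the selected rows by reference.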
\subsection{What are the smaller syntax differences between \code{data.frame} and \code{data.table}?}\label{faq:SmallerDiffs} \begin{itemize} \item{\code{DT[3]} refers to the 3rd row, but \code{DF[3]} refers to the 3rd column} \item{\code{DT[3,]} == \code{DT[3]}, but \code{DF[,3]} == \code{DF[3]} (somewhat confusingly)} \item{For this reason we say the comma is \emph{optional} in \code{DT}, but not optional in \code{DF}} \item{\code{DT[[3]]} == \code{DF[3]} == \code{DF[[3]]}} \item{\code{DT[i,]} where \code{i} is a single integer returns a single row, just like \code{DF[i,]}, but unlike a matrix single row subset which returns a vector.} \item{\code{DT[,j,with=FALSE]} where \code{j} is a single integer returns a one column \code{data.table}, unlike \code{DF[,j]} which returns a vector by default} \item{\code{DT[,"colA",with=FALSE][[1]]} == \code{DF[,"colA"]}.} \item{\code{DT[,colA]} == \code{DF[,"colA"]}} \item{\code{DT[,list(colA)]} == \code{DF[,"colA",drop=FALSE]}} \item{\code{DT[NA]} returns 1 row of \code{NA}, but \code{DF[NA]} returns a copy of \code{DF} containing \code{NA} throughout. The symbol \code{NA} is type logical in \proglang{R}, and is therefore recycled by \code{[.data.frame}. Intention was probably \code{DF[NA\_integer\_]}. \code{[.data.table} does this automatically for convenience.} \item{\code{DT[c(TRUE,NA,FALSE)]} treats the \code{NA} as \code{FALSE}, but \code{DF[c(TRUE,NA,FALSE)]} returns \code{NA} rows for each \code{NA}} \item{\code{DT[ColA==ColB]} is simpler than \code{DF[!is.na(ColA) \& !is.na(ColB) \& ColA==ColB,]}} \item{\code{data.frame(list(1:2,"k",1:4))} creates 3 columns, \code{data.table} creates one \code{list} column.} \item{\code{check.names} is by default \code{TRUE} in \code{data.frame} but \code{FALSE} in \code{data.table}, for convenience.} \item{\code{stringsAsFactors} is by default \code{TRUE} in \code{data.frame} but \code{FALSE} in \code{data.table}, for efficiency. 
Since a global string cache was added to R, character items are pointers to the single cached string and there is no longer a performance benefit of converting to factor.}
\item{Atomic vectors in \code{list} columns are collapsed when printed using ", " in \code{data.frame}, but "," in \code{data.table} with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.}
\end{itemize}
In \code{[.data.frame} we very often set \code{drop=FALSE}. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column \code{data.frame}. In \code{[.data.table} we took the opportunity to make it consistent and drop \code{drop}.
\newline\newline
When a \code{data.table} is passed to a \code{data.table}-unaware package, that package is not concerned with any of these differences; it just works.

\subsection{I'm using \code{j} for its side effect only, but I'm still getting data returned. How do I stop that?}
In this case \code{j} can be wrapped with \code{invisible()}; e.g., \code{DT[,invisible(hist(colB)),by=colA]}\footnote{\code{hist()} returns the breakpoints in addition to plotting to the graphics device}.

\subsection{Why does \code{[.data.table} now have a \code{drop} argument from v1.5?}
So that \code{data.table} can inherit from \code{data.frame} without using \code{\dots}. If we used \code{\dots} then invalid argument names would not be caught.

The \code{drop} argument is never used by \code{[.data.table}. It is a placeholder for non \code{data.table} aware packages when they use the \code{[.data.frame} syntax directly on a \code{data.table}.

\subsection{Rolling joins are cool, and very fast! Was that hard to program?}
The prevailing row on or before the \code{i} row is the final row the binary search tests anyway. So \code{roll=TRUE} is essentially just a switch in the binary search C code to return that row.
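
The idea can be sketched in base \proglang{R}. This is a hypothetical illustration rather than \code{data.table}'s C code: \code{rollSearch} is an assumed name, and the sorted integer vector \code{times} stands in for a key column. The search narrows down to the last key value on or before the target, so returning that position when there is no exact match costs nothing extra.

```r
# Hypothetical helper, not data.table's implementation: binary search over
# a sorted key. Returns the position of an exact match or, when roll = TRUE,
# the position of the last key value <= target (the prevailing row).
rollSearch <- function(key, target, roll = FALSE) {
  lo <- 0L                    # invariant: key[lo] <= target (0 means none yet)
  hi <- length(key) + 1L      # invariant: key[hi] >  target (past the end)
  while (hi - lo > 1L) {
    mid <- (lo + hi) %/% 2L
    if (key[mid] <= target) lo <- mid else hi <- mid
  }
  if (lo >= 1L && key[lo] == target) return(lo)  # exact match
  if (roll && lo >= 1L) return(lo)               # prevailing row
  NA_integer_                                    # no match
}

times <- c(10L, 20L, 30L, 40L)
rollSearch(times, 30L)                # exact match at position 3
rollSearch(times, 35L)                # NA: no exact match
rollSearch(times, 35L, roll = TRUE)   # position 3 prevails at 35
rollSearch(times, 5L,  roll = TRUE)   # NA: nothing on or before 5
```

The only difference between the exact and rolling cases is the single \code{roll} test after the loop has finished.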
\subsection{Why does \code{DT[i,col:=value]} return the whole of \code{DT}? I expected either no visible value (consistent with \code{<-}), or a message or return value containing how many rows were updated. It isn't obvious that the data has indeed been updated by reference.} This has changed in v1.8.3 to meet your expectations. Please upgrade. The whole of \code{DT} is returned (now invisibly) so that compound syntax can work; e.g., \code{DT[i,done:=TRUE][,sum(done)]}. The number of rows updated is returned when verbosity is on, either on a per query basis or globally using \code{options(datatable.verbose=TRUE)}. \subsection{Ok, thanks. What was so difficult about the result of \code{DT[i,col:=value]} being returned invisibly?} \proglang{R} internally forces visibility on for \code{[}. The value of FunTab's eval column (see src/main/names.c) for \code{[} is 0 meaning force \code{R\_Visible} on (see R-Internals section 1.6). Therefore, when we tried \code{invisible()} or setting \code{R\_Visible} to 0 directly ourselves, \code{eval} in src/main/eval.c would force it on again. To solve this problem, the key was to stop trying to stop the print method running after a \code{:=}. Instead, inside \code{:=} we now (from v1.8.3) set a global flag which the print method uses to know whether to actually print or not. \subsection{I've noticed that \code{base::cbind.data.frame} (and \code{base::rbind.data.frame}) appear to be changed by \code{data.table}. How is this possible? Why?} It is a temporary, last resort solution until we discover a better way to solve the problems listed below. Essentially, the issue is that \code{data.table} inherits from \code{data.frame}, \emph{and}, \code{base::cbind} and \code{base::rbind} (uniquely) do their own S3 dispatch internally as documented by \code{?cbind}. 
The change is adding one \code{for} loop to the start of each function directly in base; e.g.,
<<>>=
base::cbind.data.frame
@
That modification is made dynamically; i.e., the base definition of \code{cbind.data.frame} is fetched, the \code{for} loop added to the beginning and then assigned back to base. This solution is intended to be robust to different definitions of \code{base::cbind.data.frame} in different versions of \proglang{R}, including unknown future changes. Again, it is a last resort until a better solution is known or made available. The competing requirements are : \begin{itemize} \item \code{cbind(DT,DF)} needs to work. Defining \code{cbind.data.table} doesn't work because \code{base::cbind} does its own S3 dispatch and requires that the \emph{first} \code{cbind} method for each object it is passed is \emph{identical}. This is not true in \code{cbind(DT,DF)} because the first method for \code{DT} is \code{cbind.data.table} but the first method for \code{DF} is \code{cbind.data.frame}. \code{base::cbind} then falls through to its internal \code{bind} code which appears to treat \code{DT} as a regular list and returns very odd looking and unusable \code{matrix} output. See FAQ \ref{faq:cbinderror}. We cannot just advise users not to call \code{cbind(DT,DF)} because packages such as ggplot2 make such a call (test 168.5). \item This naturally leads to trying to mask \code{cbind.data.frame} instead. Since a \code{data.table} is a \code{data.frame}, \code{cbind} would find the same method for both \code{DT} and \code{DF}. However, this doesn't work either because \code{base::cbind} appears to find methods in \code{base} first; i.e., \code{base::cbind.data.frame} isn't maskable. This is reproducible as follows : \end{itemize}
<<>>=
foo = data.frame(a=1:3)
cbind.data.frame = function(...)cat("Not printed\n")
cbind(foo)
@
<>=
rm("cbind.data.frame")
@
\begin{itemize} \item Finally, we tried masking \code{cbind} itself (v1.6.5 and v1.6.6).
This allowed \code{cbind(DT,DF)} to work, but introduced compatibility issues with package IRanges, since IRanges also masks \code{cbind}. It worked if IRanges was lower on the search() path than \code{data.table}, but if IRanges was higher then \code{data.table}'s \code{cbind} would never be called and the strange looking matrix output occurs again (FAQ \ref{faq:cbinderror}). \end{itemize} If you know of a better solution, that still solves all the issues above, then please let us know and we'll gladly change it. \subsection{I've read about method dispatch (e.g. \code{merge} may or may not dispatch to \code{merge.data.table}) but \emph{how} does R know how to dispatch? Are dots significant or special? How on earth does R know which function to dispatch, and when?} This comes up quite a lot, but it's really earth shatteringly simple. A function such as \code{merge} is \emph{generic} if it consists of a call to \code{UseMethod}. When you see people talking about whether or not functions are \emph{generic} functions they are merely typing the function, without \code{()} afterwards, looking at the program code inside it and if they see a call to \code{UseMethod} then it is \emph{generic}. What does \code{UseMethod} do? It literally slaps the function name together with the class of the first argument, separated by period (\code{.}) and then calls that function, passing along the same arguments. It's that simple. For example, \code{merge(X,Y)} contains a \code{UseMethod} call which means it then \emph{dispatches} (i.e. calls) \code{paste("merge",class(X),sep=".")}. Functions with dots in may or may not be methods. The dot is irrelevant really. Other than dot being the separator that \code{UseMethod} uses. Knowing this background should now highlight why, for example, it is obvious to R folk that \code{as.data.table.data.frame} is the \code{data.frame} method for the \code{as.data.table} generic function. 
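
The pasting rule can be seen directly with a small sketch. The generic \code{describe}, its two methods and the class \code{myclass} are hypothetical names invented for this illustration; they are not part of \proglang{R} or \code{data.table} :

<<>>=
# a generic is a function whose body is a call to UseMethod
describe = function(x, ...) UseMethod("describe")
describe.data.frame = function(x, ...) "data.frame method"
describe.default = function(x, ...) "default method"
obj = data.frame(a=1:3)
r1 = describe(obj)   # class(obj) is "data.frame" so describe.data.frame is called
r2 = describe(1:3)   # no describe.integer or describe.numeric, so describe.default
class(obj) = c("myclass", "data.frame")
r3 = describe(obj)   # no describe.myclass, so the next class in the vector is tried
c(r1, r2, r3)
@

No registration step was needed: naming the functions \code{describe.data.frame} and \code{describe.default} was enough for \code{UseMethod} to find them.
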
Further, it may help to elucidate that, yes you are correct, it is not obvious from its name alone that \code{ls.fit} is not the fit method of the \code{ls} generic function. You only know that by typing \code{ls} (not \code{ls()}) and observing it isn't a single call to \code{UseMethod}. You might now ask: where is this documented in R? Answer: it's quite clear, but, you need to first know to look in \code{?UseMethod}, and \emph{that} help file contains : "When a function calling UseMethod('fun') is applied to an object with class attribute c('first', 'second'), the system searches for a function called fun.first and, if it finds it, applies it to the object. If no such function is found a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used, if it exists, or an error results." Happily, an internet search for "How does R method dispatch work" (at the time of writing) returns the \code{?UseMethod} help page as the top link. Admittedly, other links rapidly descend into the intricacies of S3 vs S4, internal generics and so on. However, features like basic S3 dispatch (pasting the function name together with the class name) are why some R folk love R. It's so simple. No complicated registration or signature is required. There isn't much to learn. To create the \code{merge} method for \code{data.table} all that was required, literally, was to merely create a function called \code{merge.data.table}. \section{Questions relating to compute time} \subsection{I have 20 columns and a large number of rows. Why is an expression of one column so quick?} Several reasons: \begin{itemize} \item Only that column is grouped, the other 19 are ignored because \code{data.table} inspects the \code{j} expression and realises it doesn't use the other columns. \item One memory allocation is made for the largest group only, then that memory is re-used for the other groups. There is very little garbage to collect.
\item \proglang{R} is an in-memory column store; i.e., the columns are contiguous in RAM. Page fetches from RAM into L2 cache are minimised. \end{itemize} \subsection{I don't have a key on a large table, but grouping is still really quick. Why is that?} \code{data.table} uses radix sorting. This is significantly faster than other sort algorithms. Radix is specifically for integers only, see \code{?base::sort.list(x,method="radix")}. This is also one reason why \code{setkey()} is quick. When no key is set, or we group in a different order from that of the key, we call it an \emph{ad hoc by}. \subsection{Why is grouping by columns in the key faster than an ad hoc by?} Because each group is contiguous in RAM, thereby minimising page fetches, and memory can be copied in bulk (memcpy in C) rather than looping in C. \section{Error messages} \subsection{\code{Could not find function "DT"}} See FAQ \ref{faq:DTremove1} and FAQ \ref{faq:DTremove2}. \subsection{\code{unused argument(s) (MySum = sum(v))}} This error is generated by \code{DT[,MySum=sum(v)]}. \code{DT[,list(MySum=sum(v))]} was intended, or \code{DT[,j=list(MySum=sum(v))]}. \subsection{\code{'translateCharUTF8' must be called on a CHARSXP}} This error (and similar; e.g., \code{'getCharCE' must be called on a CHARSXP}) may be nothing to do with character data or locale. Instead, this can be a symptom of an earlier memory corruption. To date these have been reproducible and fixed (quickly). Please report it to datatable-help. \subsection{\code{cbind(DT,DF) returns a strange format e.g. 'Integer,5'}} \label{faq:cbinderror} This occurs prior to v1.6.5, for \code{rbind(DT,DF)} too. Please upgrade to v1.6.7 or later. \subsection{\code{cannot change value of locked binding for '.SD'}} \code{.SD} is locked by design. See \code{?data.table}.
If you'd like to manipulate \code{.SD} before using it, or returning it, and don't wish to modify \code{DT} using \code{:=}, then take a copy first (see \code{?copy}); e.g.,
<<>>=
DT = data.table(a=rep(1:3,1:3),b=1:6,c=7:12)
DT
DT[,{ mySD = copy(.SD)
      mySD[1,b:=99L]
      mySD },
    by=a]
@
\subsection{\code{cannot change value of locked binding for '.N'}} Please upgrade to v1.8.1 or later. From this version, if \code{.N} is returned by \code{j} it is renamed to \code{N} to avoid any ambiguity in any subsequent grouping between the \code{.N} special variable and a column called \code{".N"}. The old behaviour can be reproduced by forcing \code{.N} to be called \code{.N}, like this :
<<>>=
DT = data.table(a=c(1,1,2,2,2),b=c(1,2,2,2,1))
DT
DT[,list(.N=.N),list(a,b)]   # show intermediate result for exposition
cat(try(
    DT[,list(.N=.N),by=list(a,b)][,unique(.N),by=a]   # compound query more typical
,silent=TRUE))
@
If you are already running v1.8.1 or later then the error message is now more helpful than the \code{cannot change value of locked binding} error, as you can see above, since this vignette was produced using v1.8.1 or later. The more natural syntax now works :
<<>>=
if (packageVersion("data.table") >= "1.8.1") {
    DT[,.N,by=list(a,b)][,unique(N),by=a]
}
@
\section{Warning messages} \subsection{\code{The following object(s) are masked from 'package:base': cbind, rbind}} This warning was present in v1.6.5 and v1.6.6 only, when loading the package. The motivation was to allow \code{cbind(DT,DF)} to work but, as it transpired, this broke (full) compatibility with package IRanges. Please upgrade to v1.6.7 or later. \subsection{\code{Coerced numeric RHS to integer to match the column's type}} Hopefully, this is self explanatory. The full message is :\newline \code{Coerced numeric RHS to integer to match the column's type; may have truncated}\newline \code{precision. 
Either change the column to numeric first by creating a new numeric}\newline \code{vector length 5 (nrows of entire table) yourself and assigning that (i.e. }\newline \code{'replace' column), or coerce RHS to integer yourself (e.g. 1L or as.integer)}\newline \code{to make your intent clear (and for speed). Or, set the column type correctly}\newline \code{up front when you create the table and stick to it, please.}\newline To generate it, try :
<<>>=
DT = data.table(a=1:5,b=1:5)
suppressWarnings(
DT[2,b:=6]        # works (slower) with warning
)
class(6)          # numeric not integer
DT[2,b:=7L]       # works (faster) without warning
class(7L)         # L makes it an integer
DT[,b:=rnorm(5)]  # 'replace' integer column with a numeric column
@
\section{General questions about the package} \subsection{v1.3 appears to be missing from the CRAN archive?} That is correct. v1.3 was available on R-Forge only. There were several large changes internally and these took some time to test in development. \subsection{Is \code{data.table} compatible with S-plus?} Not currently. \begin{itemize} \item A few core parts of the package are written in C and use internal \proglang{R} functions and \proglang{R} structures. \item The package uses lexical scoping which is one of the differences between \proglang{R} and \proglang{S-plus} explained by \href{http://cran.r-project.org/doc/FAQ/R-FAQ.html#Lexical-scoping}{R FAQ 3.3.1}. \end{itemize} \subsection{Is it available for Linux, Mac and Windows?} Yes, for both 32-bit and 64-bit on all platforms. Thanks to CRAN and R-Forge. There are no special or OS-specific libraries used. \subsection{I think it's great. What can I do?} Please send suggestions, bug reports and enhancement requests to \href{mailto:datatable-help@lists.r-forge.r-project.org}{datatable-help}. This helps make the package better. The list is public and archived. Please do vote for the package on \href{http://crantastic.org/packages/data-table}{Crantastic}.
This helps encourage the developers, and helps other \proglang{R} users find the package. If you have time to write a comment too, that can help others in the community. Just simply clicking that you use the package, though, is much appreciated. You can join the project and change the code and/or documentation yourself. \subsection{I think it's not great. How do I warn others about my experience?} Please put your vote and comments on \href{http://crantastic.org/packages/data-table}{Crantastic}. Please make it constructive so we have a chance to improve. \subsection{I have a question. I know the r-help posting guide tells me to contact the maintainer (not r-help), but is there a larger group of people I can ask?} Yes, there are two options. You can post to \href{mailto:datatable-help@lists.r-forge.r-project.org}{datatable-help}. It's like r-help, but just for this package. Or the \href{http://stackoverflow.com/questions/tagged/data.table}{\code{data.table} tag on Stack Overflow}. Feel free to answer questions in those places, too. \subsection{Where are the datatable-help archives?} The \href{http://datatable.r-forge.r-project.org/}{homepage} contains links to the archives in several formats. \subsection{I'd prefer not to contact datatable-help, can I mail just one or two people privately?} Sure. You're more likely to get a faster answer from datatable-help or Stack Overflow, though. Asking publicly in those places helps build the knowledge base. \subsection{I have created a package that depends on \code{data.table}. How do I ensure my package is \code{data.table}-aware so that inheritance from \code{data.frame} works?} You don't need to do anything special. Just include \code{data.table} in either the Imports or Depends field of your package's DESCRIPTION file. \subsection{Why is this FAQ in pdf format? Can it be moved to a website?} This FAQ (and the intro and timing documents) are \emph{vignettes} written using Sweave.
The benefits of Sweave include the following: \begin{itemize} \item We include \proglang{R} code in the vignettes. This code is \emph{actually run} when the file is created, not copied and pasted. \item This document is \emph{reproducible}. Grab the .Rnw and you can run it yourself. \item CRAN checks the package (including running vignettes) every night on Linux, Mac and Windows, both 32-bit and 64-bit. Results are posted to \url{http://cran.r-project.org/web/checks/check_results_data.table.html}. Included there are results from r-devel; i.e., not yet released R. That serves as a very useful early warning system for any potential future issues as \proglang{R} itself develops. \item This file is bound into each version of the package. The package is not accepted on CRAN unless this file passes checks. Each version of the package will have its own FAQ file which will be relevant for that version. Contrast this to a single website, which can be ambiguous if the answer depends on the version. \item You can open it offline at your \proglang{R} prompt using \code{vignette()}. \item You can extract the code from the document and play with it using\newline \code{edit(vignette("datatable-faq"))} or \code{edit(vignette("datatable-timings"))}. \item It prints out easily. \item It's quicker and easier for us to write and maintain the FAQ in .Rnw form. \end{itemize} Having said all that, a wiki format may be quicker and easier for users to contribute documentation and examples. Therefore a wiki has now been created; see link on the \href{http://datatable.r-forge.r-project.org/}{homepage}. \end{document}