haotu : an open lab notebook


data.table and data.frame differences

Filed under: Manipulate Data in R, R, R Stats — S @ 16:13

from here

  • DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
  • DT[3,] == DT[3], but DF[,3] == DF[3] (somewhat confusingly)
  • For this reason we say the comma is optional in DT, but not optional in DF
  • DT[[3]] == DF[,3] == DF[[3]]
  • DT[i,] where i is a single integer returns a single row, just like DF[i,], but unlike a matrix single row subset which returns a vector.
  • DT[,j,with=FALSE] where j is a single integer returns a one column data.table, unlike DF[,j] which returns a vector by default
  • DT[,"colA",with=FALSE][[1]] == DF[,"colA"].
  • DT[,colA] == DF[,"colA"]
  • DT[,list(colA)] == DF[,"colA",drop=FALSE]
  • DT[NA] returns 1 row of NA, but DF[NA,] returns a copy of DF containing NA throughout.
  • The symbol NA is type logical in R, and is therefore recycled by [.data.frame. The intention was probably DF[NA_integer_,]. [.data.table does this automatically for convenience.
  • DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE),] returns NA rows for each NA
  • DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]
  • data.frame(list(1:2,"k",1:4)) creates 3 columns, data.table creates one list column.
  • check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
  • stringsAsFactors is by default TRUE in data.frame (before R 4.0.0, where the default became FALSE) but FALSE in data.table, for efficiency.
  • Since a global string cache was added to R, character items are pointers to the single cached string and there is no longer a performance benefit in converting to factor.
  • Atomic vectors in list columns are collapsed when printed using “, ” in data.frame, but “,” in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
  • In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and drop drop.
  • When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
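
Several of the points above can be checked directly in an R session. The following is a minimal sketch, assuming the data.table package is installed; the tables DF and DT and their columns colA, colB and colC are made-up illustration data, and the behaviour shown is what I would expect from a reasonably recent data.table release rather than a guarantee for every version.

```r
library(data.table)  # assumed to be installed; not part of base R

# Hypothetical example data. stringsAsFactors is set explicitly so the
# result does not depend on the default of the running R version.
DF <- data.frame(colA = c(1, NA, 3), colB = c(1, 2, 4),
                 colC = c("x", "y", "z"), stringsAsFactors = FALSE)
DT <- as.data.table(DF)

# A single index selects a row in data.table but a column in data.frame.
dim_DT3 <- dim(DT[3])   # one row, all columns
dim_DF3 <- dim(DF[3])   # all rows, one column

# Extracting one column as a plain vector.
same_vec  <- identical(DT[[3]], DF[, 3]) && identical(DT[[3]], DF[[3]])
same_colA <- identical(DT[, colA], DF[, "colA"])

# Keeping a single column as a table rather than a vector.
dt_one <- DT[, "colA", with = FALSE]   # one-column data.table
df_one <- DF[, "colA", drop = FALSE]   # one-column data.frame
df_vec <- DF[, "colA"]                 # drops to a vector by default

# NA in the row index.
n_DT_NA  <- nrow(DT[NA])                     # treated as NA_integer_
n_DF_NA  <- nrow(DF[NA, ])                   # logical NA is recycled
n_DT_lgl <- nrow(DT[c(TRUE, NA, FALSE)])     # NA treated as FALSE
n_DF_lgl <- nrow(DF[c(TRUE, NA, FALSE), ])   # NA yields an NA row

# Filtering on a comparison that produces NA.
dt_eq <- DT[colA == colB]
df_eq <- DF[!is.na(DF$colA) & !is.na(DF$colB) & DF$colA == DF$colB, ]

# Constructor differences: list input and check.names.
n_df_cols <- ncol(data.frame(list(1:2, "k", 1:4)))
n_dt_cols <- ncol(data.table(list(1:2, "k", 1:4)))
df_names  <- names(data.frame(`a b` = 1))
dt_names  <- names(data.table(`a b` = 1))
```

Note that the data.frame filter has to spell out DF$colA and DF$colB, whereas inside DT[...] the column names are looked up in the table itself, which is a large part of why the data.table form is shorter.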
