Misc. R tips & tricks

1. Proportion table:

table(survivalDF$survival,survivalDF$statusName)
prop.table(table(survivalDF$survival,survivalDF$statusName),margin=1)
prop.table(table(survivalDF$survival,survivalDF$statusName),margin=2)

Generate the Proportion table with percentage:

library(gmodels) #CRAN
CrossTable(survivalDF$survival,survivalDF$statusName,prop.t=F,prop.chisq=F)
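
If gmodels is not installed, percentages can also be derived from prop.table() in base R. The sketch below uses a made-up stand-in for survivalDF; only the column names survival and statusName come from the snippet above, and the values are hypothetical.

```r
# toy stand-in for survivalDF (values are hypothetical)
survivalDF <- data.frame(
  survival   = c("short", "short", "long", "long", "long", "short"),
  statusName = c("dead", "dead", "alive", "dead", "alive", "alive")
)

tab <- table(survivalDF$survival, survivalDF$statusName)

# row percentages: each row sums to (about) 100
rowPct <- round(100 * prop.table(tab, margin = 1), 1)
# column percentages: each column sums to (about) 100
colPct <- round(100 * prop.table(tab, margin = 2), 1)

print(rowPct)
print(colPct)
```

margin = 1 normalises within rows and margin = 2 within columns, matching the two prop.table() calls above.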

2. rm() : exclude some variables or functions:

Suppose you have a busy workspace and want to start clean, keeping only a few important datasets, vectors, or functions that you created. The following command does this:

rm(list=ls()[!ls() %in% c("Var1","Var2","Fun1")])

It is also possible to keep some variables/functions out of the list passed to rm() if their names share a common pattern.

You can then use grep to exclude those from the list of objects passed to rm().

rm(list=ls()[!ls() %in% ls()[grep("TEXT",ls())]])

This is especially useful when a dataset is slow to read, or when it is generated by a long series of processing steps (i.e. took a lot of time to generate 😄).
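
Both patterns can be tried safely in a scratch environment instead of the global workspace. In this sketch the environment ws and all object names are hypothetical, and setdiff() and grep(..., invert = TRUE) are used as slightly more readable alternatives to the %in% filters above.

```r
# scratch environment standing in for a busy workspace (names are hypothetical)
ws <- new.env()
assign("Var1", 1:3, envir = ws)
assign("Var2", letters, envir = ws)
assign("Fun1", function(x) x + 1, envir = ws)
assign("tmpA", 42, envir = ws)
assign("TEXT_big", runif(10), envir = ws)

# keep a fixed set of names, remove everything else
keep <- c("Var1", "Var2", "Fun1", "TEXT_big")
rm(list = setdiff(ls(ws), keep), envir = ws)
afterKeep <- ls(ws)

# keep only the objects whose name matches a pattern, remove the rest
rm(list = grep("TEXT", ls(ws), value = TRUE, invert = TRUE), envir = ws)
afterGrep <- ls(ws)
```

Dropping the envir argument applies the same calls to the current workspace.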

3. Remove column from a dataframe by its name in R:

parsed.df <- parsed.df[,-which(colnames(parsed.df)%in%c("AD"))]

If you are using data.table it is also possible to use the := operator to remove columns as:

parsed.DT[,c("AD") := NULL] # parsed.DT is a data.table

In addition to these 2 popular methods, there are a few more options and I benchmarked them to see which one is the most efficient:

# :=NULL operator from the package data.table
testDT[,c("ID_tumour","Accession.Number","Histology.subtype.1","Mutation.CDS","Mutation.genome.position","Mutation.somatic.status") := NULL]

# base::within()
within(testDF,rm(ID_tumour,Accession.Number,Histology.subtype.1,Mutation.CDS,Mutation.genome.position,Mutation.somatic.status))

# using base::subset and specifying the "select"
subset(testDF, select = -c(ID_tumour, Accession.Number, Histology.subtype.1, Mutation.CDS, Mutation.genome.position, Mutation.somatic.status))

# the simplest (and fastest) of all: %in%
# removeCols is a character vector holding the six column names listed above
testDF[,!names(testDF) %in% removeCols]
testDF[,!colnames(testDF) %in% removeCols]

# using base::which, also uses %in%
testDF[,-which(colnames(testDF) %in% removeCols)]

[Figure: bar plot of the benchmark for removing columns by name]

Sometimes the simplest methods are the fastest: in this benchmark the %in% approach was the quickest for data frames. If, however, you are using a data.table, then the := NULL operator is the optimal choice.
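
A quick sanity check of the base-R variants on a toy data frame (the data and the removeCols value are made up for illustration). It also shows a pitfall of the -which() form: when no name matches, which() returns an empty vector and every column is dropped, whereas the logical %in% form leaves the data frame untouched.

```r
# toy data frame; column names and values are hypothetical
testDF <- data.frame(ID = 1:3, AD = c(0.1, 0.2, 0.3), DP = c(10L, 20L, 30L))
removeCols <- c("AD")

byIn    <- testDF[, !names(testDF) %in% removeCols, drop = FALSE]
byWhich <- testDF[, -which(colnames(testDF) %in% removeCols), drop = FALSE]
bySub   <- subset(testDF, select = -AD)

# all three keep the same columns: ID and DP
print(names(byIn))

# pitfall: none of the requested names exists in the data frame
missingCols <- c("not_a_column")
safe   <- testDF[, !names(testDF) %in% missingCols, drop = FALSE]        # keeps all columns
unsafe <- testDF[, -which(names(testDF) %in% missingCols), drop = FALSE] # keeps no columns
```

drop = FALSE is added so the result stays a data frame even when a single column remains.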

4. Dynamic variable creation:

Sometimes you want to create variables on the fly with programmatically generated names. You can do so with assign():

# simple case
assign(paste("Var", "1", sep=""),5)
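
A slightly longer sketch: assign() in a loop, get() to read a value back by name, and a named list as an alternative that keeps the values together. The variable names and values are only illustrative.

```r
# create Var1, Var2, Var3 on the fly
for (i in 1:3) {
  assign(paste0("Var", i), i * 10)
}

# read a dynamically named variable back
valueOfVar2 <- get("Var2")

# alternative: store the values in a named list instead of separate variables
vars <- setNames(lapply(1:3, function(i) i * 10), paste0("Var", 1:3))
valueFromList <- vars[["Var2"]]
```

The list version is often easier to loop over later, since everything lives in one object.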

5. Select first item from all the lists in a list of lists.

lapply(ListOfLists, `[[`, 1)
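
A small worked example, with a made-up ListOfLists. The same `[[` trick works with vapply() or sapply() when a plain vector is preferred over a list, and with any position, not only the first.

```r
# hypothetical input: three inner lists of the same shape
ListOfLists <- list(
  list("a", 1, TRUE),
  list("b", 2, FALSE),
  list("c", 3, TRUE)
)

firstItems <- lapply(ListOfLists, `[[`, 1)                # list of "a", "b", "c"
firstVec   <- vapply(ListOfLists, `[[`, character(1), 1)  # character vector
secondVec  <- sapply(ListOfLists, `[[`, 2)                # numeric vector
```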

6. Read table with empty columns / unequal columns in rows / unequal lengths of rows:

For this we can use the fill=TRUE option in the base::read.table()

# named tbl rather than t, so that base::t() is not masked
tbl <- read.table("file.txt",sep="\t",fill=TRUE,as.is=TRUE)
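
The effect of fill=TRUE can be seen without a file by passing the content through the text argument. The input below is made up: its second data row is missing the last field.

```r
# ragged, tab-separated input (hypothetical content)
raw <- "id\tx\ty\n1\t10\t20\n2\t30\n3\t40\t50"

ragged <- read.table(text = raw, sep = "\t", header = TRUE,
                     fill = TRUE, as.is = TRUE)
print(ragged)
```

The missing field is read as NA in the numeric column y. Note that read.table() decides the number of columns from the first five lines of input, so if longer rows appear only later in the file, col.names may need to be supplied explicitly.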

7. Gene name Annotation from annotation database while resolving 1:many mapping:

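library(hgu133a.db) # Bioconductor annotation package; loads AnnotationDbi, which provides select()
z <- AnnotationDbi::select(hgu133a.db, keys = rownames(mat), columns = c("SYMBOL","GENENAME"))
# collapse 1:many mappings into one comma-separated entry per probe (first column of z)
temp <- aggregate(z[-1], by = z[1], function(X) paste(unique(X), collapse=", "))
# aggregate() sorts by the probe ID, so re-order the rows of mat to match temp
annotated <- data.frame(symbol=temp$SYMBOL, description=temp$GENENAME, mat[temp[[1]], , drop=FALSE])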
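
The collapsing step can be tried without Bioconductor on a toy table. The data frame below is a made-up stand-in for the output of select(); the probe IDs, symbols and descriptions are hypothetical, with probe p2 mapping to two genes.

```r
# toy stand-in for the select() result (all values are hypothetical)
z <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  SYMBOL   = c("GENE_X", "GENE_A", "GENE_B", "GENE_Y"),
  GENENAME = c("gene X", "gene A", "gene B", "gene Y"),
  stringsAsFactors = FALSE
)

# one row per probe; multiple hits are pasted into a single string
collapsed <- aggregate(z[-1], by = z[1],
                       FUN = function(X) paste(unique(X), collapse = ", "))
print(collapsed)
```

The result has one row per probe, sorted by the grouping column, which is why any matrix joined to it should be indexed by that column rather than assumed to be in the same order.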

8. Un-melt or Make LongDF a WideDF

```r
library(reshape2)
temp <- dcast(data = var, formula = col1 + col2 + col3 ~ GroupingColumn, value.var = "ValueColumn")
```

9. Change the directory for installation of R packages in macOS

Create a file called “.Renviron” in the home directory (~)
Write the following line in the file:

R_LIBS="~/path_to/new/R_lib"

10. Generate combinations

To generate all pairwise combinations of the elements in a vector. combn() returns unique combinations only (the order within a pair does not matter).

groups <- paste("GR0",seq(1,6),sep="")
# 2 x 15 matrix: one pair per column
combos <- combn(x = groups,m = 2)
# paste the two members of each pair together, e.g. "GR01GR02"
vec <- paste(combos[1,],combos[2,],sep="")

11. Example of parallel processing in R

Using the foreach and %dopar% construct.

# FORK clusters are only available on POSIX systems (not Windows)
library(doParallel)
myCluster <- makeCluster(parallel::detectCores(), type = "FORK", useXDR = FALSE)
registerDoParallel(myCluster)
print(myCluster)

# foreach construct
results <- foreach(mut = mutations[1:10000],.combine = cbind) %dopar% {
  sub <- tmpdf[tmpdf$mutID==mut,]
  var1 <-  plyr::count(sub,"Primary.site")
  # writing to a log file; all the workers append to this file
  cat(paste(mut,"\n"),file="log.txt", append=T)
  # return 3 values as a list; with .combine = cbind, results becomes a 3-row list matrix with one column per mutation
  return(list(mut,sum(var1$freq),stringi::stri_paste(var1$Primary.site,":",var1$freq,collapse=',')))
}

# Save the raw output
saveRDS(object = results,file = "raw.results.RDS")

# Release the workers once the loop is done
stopCluster(myCluster)

# Converting the list into a data frame
var <- unlist(results)
mutID <- var[seq(1,length(var),by = 3)]
counts <- var[seq(2,length(var),by = 3)]
Frequency <- var[seq(3,length(var),by = 3)]
DF <- cbind.data.frame(mutID,counts,Frequency)

Using mcmapply

findKeyword <- function(x,keyword){
  x <- unlist(x,recursive = F,use.names = F)
  idx <- stringi::stri_locate_first_fixed(str = x,pattern = keyword)
  # keep only the strings in which the keyword was found
  value <- x[which(complete.cases(idx))]
  if (length(value) == 0) return(NA_character_)
  # ifelse() would return only the first match, so collapse all matches into one string
  stringi::stri_c(substring(text = value,first = na.omit(idx)[,"end"]), collapse = ",")
}

gene_ids <- parallel::mcmapply(FUN = function(X) findKeyword(X,"gene_id "), var, mc.cores = parallel::detectCores())
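
For comparison, a minimal mclapply() sketch, the list-returning sibling of mcmapply(). The task (squaring numbers) and the core count are placeholders chosen for illustration. Forking is not available on Windows, where mc.cores must be 1.

```r
library(parallel)

# use 2 workers on POSIX systems, fall back to 1 on Windows
nCores <- if (.Platform$OS.type == "windows") 1L else 2L

# returns a list with one element per input, in input order
squares <- mclapply(1:8, function(i) i^2, mc.cores = nCores)
squaresVec <- unlist(squares)
```

mclapply() is a drop-in replacement for lapply(), so existing lapply() code can usually be parallelised by changing the function name and adding mc.cores.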

Efficient installation of packages in R (multiple CPUs)

install.packages(c("SCAN.UPC"), type = "source", INSTALL_opts = "--byte-compile",Ncpus = 4)

# OR set the option once; install.packages() uses Ncpus = getOption("Ncpus", 1L) by default
options(Ncpus = 4)
install.packages(...params... , Ncpus = getOption("Ncpus", 1L))

12. Count NA observations in selected columns of a data.frame

apply(X = DF[,3:6], MARGIN = 1, FUN = function(x) sum(is.na(x)))
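
The apply() call counts NAs per row across the selected columns. A vectorised sketch of the same count with rowSums(), plus the per-column count with colSums(), is shown below on a made-up DF (column names and values are hypothetical).

```r
# toy data frame; columns 3 to 6 are the ones being checked
DF <- data.frame(
  id  = 1:3,
  grp = c("a", "b", "c"),
  v1  = c(1, NA, 3),
  v2  = c(NA, NA, 6),
  v3  = c(7, 8, 9),
  v4  = c(NA, 11, NA)
)

# per-row NA counts, two equivalent ways
perRowApply <- apply(X = DF[, 3:6], MARGIN = 1, FUN = function(x) sum(is.na(x)))
perRow      <- rowSums(is.na(DF[, 3:6]))

# per-column NA counts
perCol <- colSums(is.na(DF[, 3:6]))
```

rowSums(is.na(...)) avoids the per-row function call, which tends to matter on large data frames.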