1. Combining strings from different rows into one string
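For instance, suppose we want to get the gene name annotations from a database, e.g. the hgu133a.db Affymetrix microarray annotation database, while accounting for 1-to-many and many-to-1 probe mapping schemes (which are quite common in microarrays). We can do so by using
library(hgu133a.db)
# one row per probe-to-gene mapping, so a probe ID can appear on several rows
probeIDs <- select(hgu133a.db, keys = rownames(mat), columns = c("SYMBOL","GENENAME"), keytype = "PROBEID")
# collapse all symbols and gene names that share a probe ID into one string
DF <- aggregate(probeIDs[, c("SYMBOL","GENENAME")], by = list(PROBEID = probeIDs$PROBEID), FUN = function(X) paste(unique(X), collapse=", "))
DF will have one row per probe ID and, besides the PROBEID grouping column, 2 columns named SYMBOL and GENENAME, where the symbols and gene names will have been aggregated by the probe IDs.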
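The same aggregation idea can be tried without installing an annotation package. The sketch below uses a small made-up data frame as a stand-in for the output of select(); the probe IDs, symbols and gene names are hypothetical and only illustrate the 1-to-many case.

```r
# Toy stand-in for the output of select(): one row per probe-to-gene mapping.
# All IDs and names here are invented for illustration.
probeIDs <- data.frame(
  PROBEID  = c("p1", "p1", "p2", "p3", "p3"),
  SYMBOL   = c("GENEA", "GENEB", "GENEC", "GENED", "GENED"),
  GENENAME = c("gene A", "gene B", "gene C", "gene D", "gene D"),
  stringsAsFactors = FALSE
)

# Paste the unique values of a group into one comma-separated string.
collapse_unique <- function(x) paste(unique(x), collapse = ", ")

# Group the annotation columns by probe ID and collapse each group.
DF <- aggregate(
  probeIDs[, c("SYMBOL", "GENENAME")],
  by  = list(PROBEID = probeIDs$PROBEID),
  FUN = collapse_unique
)
print(DF)
```

Probe p1 maps to two genes, so its SYMBOL becomes one combined string, while the duplicated rows for p3 collapse to a single value because of unique().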
2. Replacing text
Simple text replace
The R equivalent of Python's str.replace() is gsub(pattern, replacement, x).
A point to keep in mind is that if it is a simple search-replace, i.e. no regular expressions, it can be sped up significantly by setting the parameter fixed = TRUE.
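A minimal sketch of a literal (non-regex) replacement, assuming a made-up character vector x. It also shows why the distinction matters: without fixed = TRUE the pattern "." is treated as a regular expression that matches any character.

```r
# Hypothetical input vector, for illustration only.
x <- c("chr1.100", "chr2.200")

# Literal search-and-replace: "." is matched as a plain dot.
fixed_out <- gsub(".", "_", x, fixed = TRUE)

# Regex search-and-replace: "." matches every character.
regex_out <- gsub(".", "_", x)

print(fixed_out)
print(regex_out)
```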
Replacing some text and removing some characters:
For example, to replace text enclosed in parentheses while also getting rid of the parentheses themselves, one can match the parentheses together with their contents using a regular expression.
sub() replaces the first occurrence and
gsub() replaces all occurrences.
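One possible way to handle the parentheses case is sketched below. The input string and the replacement word are made up, and the pattern assumes the parentheses are not nested.

```r
# Hypothetical input string, for illustration only.
x <- "TP53 (tumor protein) and EGFR (receptor)"

# "\\(" and "\\)" match literal parentheses; "[^)]*" matches everything
# up to the next closing parenthesis.
paren <- "\\([^)]*\\)"

first_only <- sub(paren, "gene", x)      # replaces the first match only
all_match  <- gsub(paren, "gene", x)     # replaces every match

# Drop the parenthesised text entirely, including the space before it.
dropped <- gsub(" *\\([^)]*\\)", "", x)

# Keep the enclosed text but remove the parentheses themselves.
unwrapped <- gsub("[()]", "", x)
```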
Replace characters in a string
substring(OGstring,start,stop) <- "replace"
N.B. You can also select a specific range of characters (a substring) with substring(OGstring, start, stop), or equivalently substr(OGstring, start, stop).
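A short sketch of both uses of substring(), extraction and in-place replacement, on a made-up string. When the replacement is shorter than the selected range, only as many characters as the replacement contains are overwritten.

```r
OGstring <- "abcdefgh"   # hypothetical example string

# Extract characters 3 to 5.
part <- substring(OGstring, 3, 5)

# Overwrite characters 3 to 5 in place.
substring(OGstring, 3, 5) <- "XYZ"

# A replacement shorter than the range only overwrites its own length.
s <- "abcdefgh"
substring(s, 1, 5) <- "ZZ"
```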
3. String splitting
Splitting one string vector into 2 vectors (based on a character, e.g. ‘,’)
You can split values in a column or a character vector into two columns/vectors based on a specific character/pattern using str_split_fixed() from the stringr package like this:
temp <- newPdata$characteristics_ch1.2; gender <- stringr::str_split_fixed(string = temp, pattern = ": ", n = 2)[,2]
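To see what the call returns, here is a small sketch on a made-up vector that mimics a "key: value" phenotype column. It assumes the stringr package is installed; the values are hypothetical.

```r
library(stringr)

# Hypothetical stand-in for newPdata$characteristics_ch1.2.
temp <- c("gender: male", "gender: female", "gender: female")

# str_split_fixed() returns a character matrix with n columns:
# column 1 holds the text before ": ", column 2 the text after it.
parts  <- str_split_fixed(string = temp, pattern = ": ", n = 2)
key    <- parts[, 1]
gender <- parts[, 2]
```

Because n = 2, any further ": " inside a value stays in the second column rather than producing extra pieces.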
In case you want to replace based on a regex, just use
Splitting Row names by a specific character:
y <- rownames(mat)
pos <- regexpr("|", y, fixed = TRUE)      # position of the first "|" in each row name
rownames(mat) <- substr(y, 1, pos - 1)    # keep the characters before the first "|"
temp <- substr(y, pos + 1, nchar(y))      # keep the characters after the first "|"
rm(y, pos)
temp is assigned the characters from just after the first “|” to the last character of each value of variable y, in this case the row names of object “mat”, while the row names themselves are shortened to the part before the “|”.
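A runnable sketch of this row-name split on a toy matrix. The "SYMBOL|probe" row names are invented, and the approach assumes every row name contains at least one "|" (regexpr() returns -1 when there is no match).

```r
# Toy matrix with made-up "SYMBOL|probe" row names.
mat <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("TP53|probe_1", "EGFR|probe_2", "MYC|probe_3"),
                  c("s1", "s2"))
)

y   <- rownames(mat)
pos <- regexpr("|", y, fixed = TRUE)      # first "|" in each name

before <- substr(y, 1, pos - 1)           # text before the first "|"
after  <- substr(y, pos + 1, nchar(y))    # text after the first "|"

rownames(mat) <- before
```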
Splitting one column into 2 columns based on a character, using str_split_fixed on a column
str_split_fixed <- stringr::str_split_fixed
temp <- newPdata$characteristics_ch1.2
gender <- str_split_fixed(string = temp, pattern = ": ", n = 2)[,2]
dat <- unlist(strsplit(tempvcf[i,"INFO"],";",fixed = T),use.names = F)
use.names=F drops the item names, which makes unlist() faster; the default use.names=T is useful when the list is a named list and item names need to be preserved. In the context above, the items are generally unnamed, or the names are not important to preserve.
fixed=T makes strsplit() match the split string exactly instead of treating it as a regular expression; in the case above, the split happens at a literal character. When using regular expressions, change this flag to F.
An even faster option is the stringi package of functions, e.g. stri_split_fixed().
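The base R strsplit() call above can be tried on a single made-up INFO string. The field names below are only illustrative, and the second call shows the regex variant for contrast.

```r
# Hypothetical VCF-style INFO field.
info <- "DP=100;AF=0.5;DB"

# Literal split at ";" and flatten the resulting list into a vector.
dat <- unlist(strsplit(info, ";", fixed = TRUE), use.names = FALSE)

# Regex split: break at any run of digits instead of a literal string.
pieces <- unlist(strsplit("a1b22c", "[0-9]+"), use.names = FALSE)
```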
Not %in% operator in R, or !%in% operator in R
Define a function that would negate %in% as:
'%nin%' <- Negate('%in%') # then use DF$UniqueID%nin%c("A1","B2")
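A small sketch of the negated operator used to filter rows, with a made-up data frame standing in for DF.

```r
# Negate() wraps %in% so that it returns the opposite logical value.
`%nin%` <- Negate(`%in%`)

# Hypothetical data frame, for illustration only.
DF <- data.frame(
  UniqueID = c("A1", "B2", "C3", "D4"),
  value    = 1:4,
  stringsAsFactors = FALSE
)

# Keep only the rows whose UniqueID is NOT in the given set.
kept <- DF[DF$UniqueID %nin% c("A1", "B2"), ]
```

This gives the same result as DF[!(DF$UniqueID %in% c("A1", "B2")), ], just in a form that reads more naturally.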