String Processing
1. Combining strings from different rows into one string
For instance, suppose we want the gene name annotations from a database such as hgu133a.db (the Affymetrix microarray annotation database) and need to account for 1-to-many and many-to-1 probe mappings, which are quite common in microarrays. We can do so by using aggregate():
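library(hgu133a.db)
# NOTE: the keys, columns and keytype arguments are assumed; the original call was cut off.
probeIDs <- select(hgu133a.db,
                   keys = keys(hgu133a.db, keytype = "PROBEID"),
                   columns = c("SYMBOL", "GENENAME"),
                   keytype = "PROBEID")
# Collapse the annotation columns to one comma-separated string per probe ID
DF <- aggregate(probeIDs[-1],
                by = probeIDs[1],
                FUN = function(X) paste(unique(X), collapse = ", "))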
DF will have 2 annotation columns, SYMBOL and GENENAME, in which the symbols and gene names have been aggregated by probe ID.
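The aggregation step can be tried without the annotation package. The sketch below uses a small made-up data frame in place of the select() result; the probe IDs and annotations are invented for illustration only.

```r
# Toy stand-in for the select() result; all values are made up.
probeIDs <- data.frame(
  PROBEID  = c("p1", "p1", "p2", "p3", "p3"),
  SYMBOL   = c("A", "B", "C", "D", "D"),
  GENENAME = c("gene a", "gene b", "gene c", "gene d", "gene d"),
  stringsAsFactors = FALSE
)

# Collapse every annotation column to one comma-separated string per probe.
# unique() drops repeated values so "D, D" becomes just "D".
DF <- aggregate(probeIDs[-1],
                by = probeIDs[1],
                FUN = function(X) paste(unique(X), collapse = ", "))
print(DF)
```

Probe p1 maps to two symbols, so its SYMBOL entry becomes "A, B", while the duplicated rows for p3 collapse to a single "D".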
2. Replacing text
Simple text replace
To do something like Python's str.replace() in R, use gsub(pattern, replacement, x).
Keep in mind that a simple search-and-replace, i.e. one without regular expressions, can be sped up significantly by setting the parameter fixed = TRUE.
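A minimal sketch of the difference, using made-up input strings. Besides the speed-up, a fixed pattern is taken literally, which matters for characters such as "." that are special in regular expressions.

```r
x <- c("a.b.c", "1.2.3")

# fixed = TRUE: "." is a literal dot.
literal <- gsub(".", "-", x, fixed = TRUE)

# Default (regex): "." matches any character, so everything is replaced.
as_regex <- gsub(".", "-", x)

print(literal)
print(as_regex)
```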
Replacing some text and removing some characters:
e.g. to replace text enclosed in parentheses while also getting rid of the parentheses themselves, match the parentheses together with their contents in a single regular expression.
Note: sub() replaces only the first occurrence, whereas gsub() replaces all occurrences.
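One possible way to do this is sketched below. The input string and the exact pattern are assumptions chosen for illustration; the pattern also swallows the whitespace in front of the opening parenthesis.

```r
x <- "TP53 (tumor protein) and EGFR (receptor)"

# "\\s*\\([^)]*\\)" reads: optional whitespace, "(", any run of
# characters that are not ")", then ")".
pattern <- "\\s*\\([^)]*\\)"

first_only  <- sub(pattern, "", x)   # only the first parenthesised part
all_of_them <- gsub(pattern, "", x)  # every parenthesised part

print(first_only)
print(all_of_them)
```

This also shows the sub() versus gsub() difference: the first call leaves "(receptor)" in place, the second removes both groups.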
Replace characters in a string
substring(OGstring,start,stop) <- "replace"
N.B. You can also select a specific range of characters (a substring) with substring(OGstring, start, stop).
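Both uses of substring() can be sketched on a made-up string. Note that the replacement form overwrites characters in place and does not change the length of the string.

```r
OGstring <- "abcdefgh"

# Extract characters 3 to 5.
piece <- substring(OGstring, 3, 5)

# Overwrite characters 3 to 5 with the replacement text.
substring(OGstring, 3, 5) <- "XYZ"

print(piece)
print(OGstring)
```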
3. String splitting
Splitting one string vector into 2 vectors (based on a character, e.g. ‘,’)
You can split the values in a column or a character vector into two columns/vectors at a specific character or pattern using str_split_fixed() from the stringr package, like this:
temp <- newPdata$characteristics_ch1.2
gender <- stringr::str_split_fixed(string = temp,
                                   pattern = ": ",
                                   n = 2)[, 2]
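In case you want to split based on a regex, no separate function is needed: stringr interprets the pattern as a regular expression by default. Wrap the pattern in stringr::fixed() to match it literally instead.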
Splitting Row names by a specific character:
y <- rownames(mat)
pos <- regexpr("|", y, fixed = TRUE)      # position of the first "|" in each name
rownames(mat) <- substr(y, 1, pos - 1)    # keep only the part before the "|"
temp <- substr(y, pos + 1, nchar(y))      # the part after the "|"
Here temp is assigned the characters after the first “|” up to the last character of each value in y, in this case the row names of the object “mat”.
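A worked example on a toy matrix, as a sketch only: the row names are made up, and regexpr() with fixed = TRUE is used here to locate the first "|" in each name.

```r
# Toy matrix whose row names have the form "<name>|<id>".
mat <- matrix(1:4, nrow = 2,
              dimnames = list(c("TP53|7157", "EGFR|1956"), c("s1", "s2")))

y <- rownames(mat)
pos <- regexpr("|", y, fixed = TRUE)        # first "|" in each row name

before_bar <- substr(y, 1, pos - 1)         # text before the "|"
after_bar  <- substr(y, pos + 1, nchar(y))  # text after the "|"

rownames(mat) <- before_bar
print(mat)
print(after_bar)
```

Row names without a "|" would give pos = -1 and an empty string for before_bar, so it is worth checking for that case on real data.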
Splitting one column into 2 columns based on a character (str_split for a column)
str_split_fixed <- stringr::str_split_fixed
temp <- newPdata$characteristics_ch1.2
gender <- str_split_fixed(string = temp, pattern = ": ", n = 2)[, 2]
Faster unlist(strsplit())
dat <- unlist(strsplit(tempvcf[i, "INFO"], ";", fixed = TRUE), use.names = FALSE)
use.names = TRUE is useful when the list is a named list and the item names need to be preserved. In the context above the items are generally unnamed, or their names are not important, so use.names = FALSE skips that work.
fixed = TRUE makes strsplit() look for exact matches, which is enough in the case above because the split happens at a single character. When splitting on a regular expression, change this flag to FALSE.
An even faster implementation is available in the stringi package.
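The following sketch runs the same split on a made-up INFO string. The stringi part uses stri_split_fixed() and only runs if the package is installed; the speed difference is not measured here.

```r
# Made-up INFO field in the style of a VCF record.
info <- "DP=14;AF=0.5;DB"

# Base R: literal split, names dropped.
dat <- unlist(strsplit(info, ";", fixed = TRUE), use.names = FALSE)
print(dat)

# Same split with stringi, if available.
if (requireNamespace("stringi", quietly = TRUE)) {
  dat_stringi <- stringi::stri_split_fixed(info, ";")[[1]]
  print(identical(dat, dat_stringi))
}
```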
Not %in% operator in R, or !%in% operator in R
Define a function that negates %in%:
'%nin%' <- Negate('%in%')
# then use it like %in%, e.g. x %nin% y
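A short usage sketch with made-up vectors, showing that the new operator gives the same result as negating %in% directly:

```r
`%nin%` <- Negate(`%in%`)

x <- c("a", "b", "c", "d")
drop <- c("b", "d")

# Keep the elements of x that are NOT in drop.
keep <- x[x %nin% drop]
print(keep)

# Equivalent without the helper operator.
keep_base <- x[!(x %in% drop)]
```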