String Processing
1. Combining strings from different rows into one string
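For instance, suppose we want gene-name annotations from a database such as hgu133a.db (an Affymetrix microarray annotation database) while accounting for 1-to-many and many-to-1 probe mappings, which are quite common in microarrays. We can do so with aggregate():
# select() here is AnnotationDbi::select(); mat is assumed to have probe IDs as row names
probeIDs <- select(hgu133a.db,
                   keys = rownames(mat),
                   columns = c("SYMBOL", "GENENAME"))
# collapse every value mapped to the same probe into one comma-separated string;
# probeIDs[-1] keeps the grouping column itself out of the aggregated columns
DF <- aggregate(probeIDs[-1],
                by = probeIDs[1],
                FUN = function(X) paste(unique(X), collapse = ", "))
DF will have the probe ID column followed by 2 columns named SYMBOL and GENENAME, where the symbols and gene names have been aggregated by probe ID.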
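The aggregation step can be tried without the annotation package. The following is a minimal sketch on a made-up data frame that stands in for the select() output; the probe IDs, symbols and gene names are hypothetical and only illustrate the 1-to-many case.

```r
# Toy stand-in for the select() output: one row per probe-to-gene mapping.
# All IDs and names below are invented for illustration.
probeIDs <- data.frame(
  PROBEID  = c("p1", "p1", "p2", "p3", "p3"),
  SYMBOL   = c("GENEA", "GENEB", "GENEC", "GENED", "GENED"),
  GENENAME = c("gene A", "gene B", "gene C", "gene D", "gene D"),
  stringsAsFactors = FALSE
)

# Collapse the distinct values of a column into one comma-separated string
collapse_unique <- function(x) paste(unique(x), collapse = ", ")

# Group by the first column (PROBEID) and aggregate the remaining columns
DF <- aggregate(probeIDs[-1], by = probeIDs[1], FUN = collapse_unique)
print(DF)
```

Probe p1 maps to two genes, so its SYMBOL becomes "GENEA, GENEB"; the duplicated rows for p3 collapse to a single "GENED" because of unique().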
2. Replacing text
Simple text replace
The R equivalent of Python's str.replace() is gsub():
gsub("Search","Replace",variable)
Keep in mind that for a plain search-and-replace, i.e. no regular expressions, setting fixed = TRUE can speed things up considerably. It also makes gsub() treat regex metacharacters in the search string literally.
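Besides speed, the fixed argument changes what counts as a match. A small sketch, using made-up strings, of how the same search string behaves as a regular expression and as literal text:

```r
x <- c("price: 3.50", "price: 3x50")

# Default: the pattern is a regular expression, so "." matches any character
as_regex <- gsub("3.5", "N", x)

# fixed = TRUE: the pattern is literal text, so only a real "." matches
as_fixed <- gsub("3.5", "N", x, fixed = TRUE)

print(as_regex)
print(as_fixed)
```

With the regex interpretation both strings are changed; with fixed = TRUE only the first one is.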
Replacing some text and removing some characters:
To replace text enclosed in parentheses while also getting rid of the parentheses themselves, escape both parentheses in the pattern:
gsub("\\(Text\\)","gone",var)
Note: sub() replaces only the first occurrence, whereas gsub() replaces all occurrences.
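A short sketch of the difference between sub() and gsub() on a made-up string, with both parentheses escaped so they are matched literally:

```r
var <- "keep (Text) and (Text) again"

# sub() stops after the first match
first_only <- sub("\\(Text\\)", "gone", var)

# gsub() replaces every match
all_hits <- gsub("\\(Text\\)", "gone", var)

# With fixed = TRUE no escaping is needed, since nothing is treated as regex
all_fixed <- gsub("(Text)", "gone", var, fixed = TRUE)

print(first_only)
print(all_hits)
```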
Replace characters in a string
substring(OGstring,start,stop) <- "replace"
N.B. You can select a specific range of characters (a substring) with:
substr(x=colnames(mat),start=1,stop=12)
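A minimal sketch of both operations on made-up strings. One detail worth knowing is that the replacement form never changes the length of the string: a replacement longer than the selected range is truncated.

```r
OGstring <- "sample_0001_tumor"

# Overwrite characters 8 to 11 in place
substring(OGstring, 8, 11) <- "9999"
print(OGstring)

# A replacement longer than the range is cut to fit
y <- "abcdef"
substring(y, 1, 2) <- "XYZ"
print(y)

# Extract the first 6 characters
prefix <- substr(OGstring, start = 1, stop = 6)
print(prefix)
```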
3. String splitting
Splitting one string vector into 2 vectors (based on a character, e.g. ‘,’)
You can split the values in a column or a character vector into two columns/vectors at a specific character/pattern using str_split_fixed() from the stringr package like this:
temp <- newPdata$characteristics_ch1.2
gender <- stringr::str_split_fixed(string = temp,
pattern = ": ",
n = 2)[,2]
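stringr treats the pattern as a regular expression by default, so a regex can be passed directly as the pattern argument; wrap the pattern in stringr::fixed() to match it literally instead.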
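A small sketch of str_split_fixed() on made-up values, assuming the stringr package is installed. In stringr the pattern is interpreted as a regular expression unless it is wrapped in fixed().

```r
library(stringr)

# Hypothetical values in the style of a "key: value" annotation column
temp <- c("gender: male", "gender: female")

# Split each string into exactly 2 pieces and keep the second one
parts  <- str_split_fixed(string = temp, pattern = ": ", n = 2)
gender <- parts[, 2]
print(gender)

# The pattern is a regex by default: split on runs of digits
by_regex <- str_split_fixed("a1b22c", pattern = "[0-9]+", n = 3)

# fixed() makes the "." literal; n = 2 leaves the rest in the last piece
by_literal <- str_split_fixed("a.b.c", pattern = fixed("."), n = 2)
```

str_split_fixed() always returns a character matrix with n columns, which is why a column can be picked with [, 2].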
Splitting row names at a specific character:
y <- rownames(mat)
pos <- regexpr("|", y, fixed = TRUE)      # position of the first "|" in each name
rownames(mat) <- substr(y, 1, pos - 1)    # part before the first "|"
temp <- substr(y, pos + 1, nchar(y))      # part after the first "|"
Here temp is assigned the characters from just after the first “|” to the last character of each value in y, in this case the row names of the object “mat”.
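One way to sketch this split end to end is with regexpr(), which returns the position of the first match in each string. The matrix and its "symbol|id" row names below are made up for illustration.

```r
# Hypothetical matrix whose row names look like "symbol|id"
mat <- matrix(1:4, nrow = 2,
              dimnames = list(c("TP53|7157", "BRCA1|672"), c("s1", "s2")))

y   <- rownames(mat)
pos <- regexpr("|", y, fixed = TRUE)   # first "|" in each name (-1 if absent)

before <- substr(y, 1, pos - 1)        # text before the first "|"
after  <- substr(y, pos + 1, nchar(y)) # text after the first "|"

rownames(mat) <- before
print(mat)
print(after)
```

This sketch assumes every row name contains a "|"; names without one would get an empty string in before, so they may need separate handling.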
Splitting one column into 2 columns based on a character (str_split for a column)
str_split_fixed <- stringr::str_split_fixed
temp <- newPdata$characteristics_ch1.2
gender <- str_split_fixed(string = temp, pattern = ": ", n = 2)[, 2]
Faster unlist(strsplit()) execution
dat <- unlist(strsplit(tempvcf[i,"INFO"],";",fixed = TRUE),use.names = FALSE)
use.names = FALSE tells unlist() not to build names for the result, which saves time. Set it to TRUE only when the list is named and the item names need to be preserved; in the context above the items are generally unnamed, or their names are not important.
fixed = TRUE makes strsplit() match the split string exactly rather than as a regular expression; in the case above the split happens at a single character. When splitting on a regular expression, change this flag to FALSE.
An even faster alternative is the stringi package, e.g. stringi::stri_split_fixed().
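A minimal base-R sketch of the pattern on a made-up VCF-style INFO string; the field names are only illustrative.

```r
# Hypothetical INFO field: ";"-separated entries, some of them key=value
info <- "DP=35;AF=0.5;DB"

# Literal split on ";" and flatten without building names
dat <- unlist(strsplit(info, ";", fixed = TRUE), use.names = FALSE)
print(dat)

# Split each entry once more on "=" and take the first piece as the key
kv   <- strsplit(dat, "=", fixed = TRUE)
keys <- vapply(kv, `[`, character(1), 1)
print(keys)
```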
Not %in% operator in R, or !%in% operator in R
Define a function that would negate %in% as:
'%nin%' <- Negate('%in%')
# then use
DF$UniqueID%nin%c("A1","B2")
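A short sketch of the negated operator on a made-up data frame, together with the base-R spelling it is equivalent to:

```r
# Negate() wraps %in% in a function that returns the logical opposite
`%nin%` <- Negate(`%in%`)

# Hypothetical data frame for illustration
DF <- data.frame(UniqueID = c("A1", "B2", "C3"), value = 1:3,
                 stringsAsFactors = FALSE)

keep <- DF$UniqueID %nin% c("A1", "B2")
print(DF[keep, ])

# The same test without a helper operator
keep_base <- !(DF$UniqueID %in% c("A1", "B2"))
```

The parentheses in the base-R version are there for readability; the helper operator mainly makes filters easier to read.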