String Processing

1. Combining strings from different rows into one string

For instance, suppose we want gene-name annotations from a database such as hgu133a.db (an Affymetrix microarray annotation package) while accounting for 1-to-many and many-to-1 probe mappings, which are quite common in microarrays. We can collapse the multiple matches for each probe into a single string with aggregate:

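library(hgu133a.db)  # also loads AnnotationDbi, which provides select()

# Look up the symbol and gene name for every probe ID in rownames(mat)
probeIDs <- AnnotationDbi::select(hgu133a.db,
                                  keys = rownames(mat),
                                  columns = c("SYMBOL", "GENENAME"),
                                  keytype = "PROBEID")
# Collapse multiple matches per probe into one comma-separated string
DF <- aggregate(probeIDs[c("SYMBOL", "GENENAME")],
                by = probeIDs["PROBEID"],
                FUN = function(x) paste(unique(x), collapse = ", "))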

DF has one row per probe ID; its SYMBOL and GENENAME columns hold the unique symbols and gene names for that probe, joined with ", ".
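The collapsing step can be tried without the annotation package. The sketch below uses a small made-up table (the probe IDs, symbols, and names are hypothetical) in which probe "p1" maps to two genes:

```r
# Hypothetical probe-to-gene table: probe "p1" maps to two genes
probe_map <- data.frame(
  PROBEID  = c("p1", "p1", "p2"),
  SYMBOL   = c("GENE1", "GENE2", "GENE3"),
  GENENAME = c("gene one", "gene two", "gene three"),
  stringsAsFactors = FALSE
)

# Collapse the annotation columns so that each probe ID occupies one row
collapsed <- aggregate(
  probe_map[c("SYMBOL", "GENENAME")],
  by  = probe_map["PROBEID"],
  FUN = function(x) paste(unique(x), collapse = ", ")
)
print(collapsed)
```

The row for "p1" should now read "GENE1, GENE2" in SYMBOL and "gene one, gene two" in GENENAME.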

2. Replacing text

Simple text replace

To do something like Python's str.replace() in R:

gsub("Search", "Replace", variable)

Keep in mind that for a simple search-and-replace with no regular expressions, setting fixed = TRUE can speed things up significantly. It also makes gsub() treat the pattern as a literal string.
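Besides speed, the fixed argument changes how the pattern is interpreted. A minimal sketch with made-up input strings, showing the difference when the pattern contains a regex metacharacter:

```r
x <- c("file.name.txt", "a.b")   # hypothetical inputs

# As a regular expression, "." matches any character, so every character is replaced
regex_result <- gsub(".", "_", x)

# With fixed = TRUE the pattern is a literal dot, so only the dots are replaced
literal_result <- gsub(".", "_", x, fixed = TRUE)

print(regex_result)
print(literal_result)
```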

Replacing some text and removing some characters:

For example, to replace text enclosed in parentheses and get rid of the parentheses themselves, escape both parentheses so they are matched literally:

gsub("\\(Text\\)", "gone", var)

Note: sub() replaces only the first occurrence, while gsub() replaces all occurrences.
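The difference between sub() and gsub() is easy to check on a string with two matches. The input below is made up for illustration:

```r
s <- "gene (Text) and (Text)"   # hypothetical input with two matches

# sub() stops after the first match
first_only <- sub("\\(Text\\)", "gone", s)

# gsub() replaces every match
all_matches <- gsub("\\(Text\\)", "gone", s)

print(first_only)   # "gene gone and (Text)"
print(all_matches)  # "gene gone and gone"
```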

Replace characters in a string

substring(OGstring,start,stop) <- "replace"

N.B. You can select a specific range of characters (a substring) with substr():

substr(x=colnames(mat),start=1,stop=12)
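A short sketch of both operations on made-up strings. One detail worth knowing is that the replacement form never changes the length of the string: a replacement that is too long is truncated, and a shorter one overwrites only as many characters as it contains.

```r
s1 <- "abcdefgh"
substring(s1, 3, 5) <- "XYZ"      # exact fit: characters 3 to 5 are overwritten

s2 <- "abcdefgh"
substring(s2, 3, 5) <- "WXYZ12"   # too long: only "WXY" is used

s3 <- "abcdefgh"
substring(s3, 3, 5) <- "Q"        # too short: only character 3 is overwritten

# Extracting the first 12 characters of some hypothetical column names
col_ids <- c("TCGA-AB-1234-01A", "TCGA-CD-5678-11B")
short_ids <- substr(col_ids, start = 1, stop = 12)

print(c(s1, s2, s3))
print(short_ids)
```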

3. String splitting

Splitting one string vector into 2 vectors (based on a character, e.g. ‘,’)

You can split the values of a column or a character vector into two columns/vectors at a specific character or pattern using str_split_fixed() from the stringr package, like this:

temp <- newPdata$characteristics_ch1.2
gender <- stringr::str_split_fixed(string = temp,
                                   pattern = ": ",
                                   n = 2)[,2]

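To split on a regular expression, no separate function is needed: stringr interprets pattern as a regular expression by default. To match the pattern literally instead, wrap it in stringr::fixed().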
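A small sketch of both pattern modes, assuming the stringr package is installed. The label strings are made up and deliberately spaced inconsistently:

```r
library(stringr)  # assumed to be installed

labels <- c("gender: male", "gender :female")   # hypothetical input

# The pattern is a regex by default: optional whitespace around the colon
values <- str_split_fixed(labels, "\\s*:\\s*", n = 2)[, 2]

# fixed() makes the pattern literal, so "." means a plain dot here
parts <- str_split_fixed("v1.2", fixed("."), n = 2)

print(values)
print(parts)
```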

Splitting Row names by a specific character:

y <- rownames(mat)
pos <- regexpr("|", y, fixed = TRUE)    # position of the first "|" in each name
rownames(mat) <- substr(y, 1, pos - 1)  # keep only the text before the first "|"
temp <- substr(y, pos + 1, nchar(y))    # text after the first "|"
rm(y, pos)

Here temp holds, for each row name of the object "mat", the characters from just after the first "|" to the end of the string. This assumes every row name contains a "|".
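The same idea on a toy matrix, as a sketch. The matrix and its "symbol|id" row names are invented for illustration, and regexpr() with fixed = TRUE is used to locate the first "|" in each name:

```r
# Hypothetical matrix whose row names look like "symbol|id"
mat <- matrix(1:4, nrow = 2,
              dimnames = list(c("TP53|7157", "BRCA1|672"), c("s1", "s2")))

ids <- rownames(mat)
bar <- as.integer(regexpr("|", ids, fixed = TRUE))  # position of the first "|"

before <- substr(ids, 1, bar - 1)           # text before the "|"
after  <- substr(ids, bar + 1, nchar(ids))  # text after the "|"

rownames(mat) <- before
print(mat)
print(after)
```

Names without a "|" give a position of -1 and would need separate handling.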

Splitting one column into 2 columns based on a character, using str_split_fixed on a column

str_split_fixed <- stringr::str_split_fixed

temp <- newPdata$characteristics_ch1.2
gender <- str_split_fixed(string = temp, pattern = ": ", n = 2)[, 2]

Faster unlist(strsplit()) execution

dat <- unlist(strsplit(tempvcf[i, "INFO"], ";", fixed = TRUE), use.names = FALSE)

use.names = FALSE tells unlist() not to build names for the result, which saves time. Leave it at TRUE only when the list is named and the item names need to be preserved; in the context above the items are generally unnamed, or the names are not important enough to keep.

fixed = TRUE makes strsplit() look for exact (literal) matches; in the case above, the split happens at a single character. When splitting on a regular expression, set this flag to FALSE.
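The effect of both arguments can be seen on small made-up inputs. This is a sketch; the INFO-style string and the named vector are hypothetical:

```r
info <- "DP=14;AF=0.5;DB"   # hypothetical INFO-style string

# Literal split at ";" and no names on the result
fields <- unlist(strsplit(info, ";", fixed = TRUE), use.names = FALSE)
print(fields)

# With a named input, strsplit() returns a named list
named <- strsplit(c(a = "x;y", b = "z"), ";", fixed = TRUE)

with_names    <- unlist(named)                     # names are built: a1, a2, b
without_names <- unlist(named, use.names = FALSE)  # names are skipped

print(names(with_names))
print(names(without_names))   # NULL
```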

Even faster execution is available from the stringi package, e.g. stringi::stri_split_fixed().

A "not in" operator in R (negating %in%)

Define a function that would negate %in% as:

'%nin%' <- Negate('%in%')

# then use
DF$UniqueID %nin% c("A1", "B2")
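A self-contained sketch of the negated operator on a made-up ID vector, including its typical use for filtering:

```r
`%nin%` <- Negate(`%in%`)

ids <- c("A1", "B2", "C3")   # hypothetical IDs

not_listed <- ids %nin% c("A1", "B2")   # FALSE FALSE TRUE
kept <- ids[not_listed]                 # "C3"

print(not_listed)
print(kept)
```

The result is the same as writing !(ids %in% c("A1", "B2")); the operator only saves the extra parentheses.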