Text Processing & Search

*nix systems have, IMHO, a significant advantage over their competition: an extremely powerful, fast and useful command-line interface (CLI). One of the “features” I use several times a day is searching, or “grepping”. Here I’m collecting some of my use cases.

1. grep & ripgrep tricks

In most of my workflows I use ripgrep, as it is very good at recursing through the directory tree. But grep has its own place, especially if you switch over to GNU grep on macOS, where it is blazing fast [link].

Search only in particular file types, e.g. R source files:

# grep 'TextToSearch' -R --include="*.R" . 
rg "pattern" *.R

# List only files with matches
rg -l "pattern"
# Don't print file names (-I) or line numbers (-N)
rg -IN "pattern"
# Search a single file
rg "pattern" file.txt

# Exclude files in a directory
rg "as\\.numeric\\(" -g '!EloGroup'
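
On a machine without ripgrep, roughly the same searches can be done with plain grep. This is only a sketch, assuming a grep that supports -R, --include and --exclude-dir (GNU grep and the BSD grep shipped with macOS both do); the demo directory and file names are made up for illustration.

```shell
# Throwaway demo tree (hypothetical file names)
demo=$(mktemp -d)
mkdir -p "$demo/EloGroup"
printf 'x <- as.numeric("1")\n' > "$demo/a.R"
printf 'y <- as.numeric("2")\n' > "$demo/EloGroup/b.R"
printf 'as.numeric( in notes\n' > "$demo/notes.txt"
cd "$demo" || exit 1

# Only R files, recursively (like: rg "pattern" -g '*.R')
grep -R --include='*.R' 'as\.numeric(' .

# Only the names of files with matches (like: rg -l)
files_with_matches=$(grep -R -l --include='*.R' 'as\.numeric(' .)
echo "$files_with_matches"

# No file names (-h) and skip one directory (like: rg -I -g '!EloGroup')
without_dir=$(grep -R -h --include='*.R' --exclude-dir='EloGroup' 'as\.numeric(' .)
echo "$without_dir"
```

In a basic regular expression an unescaped ( is a literal character, so only the dot needs a backslash here.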

Run the following command to have rg pipe to less with colored output:

rg -p "pattern" | less -RFX 

You can also turn the above into a general-purpose command; I made it into lrg. You can get more details here. The following script installs the command as /usr/local/bin/lrg.

# Write the wrapper script (the quoted 'EOL' keeps $(...) and "$@" literal)
cat > /usr/local/bin/lrg <<'EOL'
#!/bin/sh
"$(which rg)" -p "$@" | less -RFX
EOL
chmod +x /usr/local/bin/lrg

# example use
lrg -N "mutation" *.R
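
If writing to /usr/local/bin is not an option (it often needs sudo), the same wrapper can live in your shell startup file as a function instead. A minimal sketch, assuming bash or zsh and that rg and less are on your PATH; the function name lrg simply mirrors the script above.

```shell
# Put this in ~/.bashrc or ~/.zshrc
# lrg: run ripgrep with "pretty" output (-p) and page the result with less
lrg() {
  rg -p "$@" | less -RFX
}
```

After reloading the shell, lrg -N "mutation" *.R should behave like the installed script.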

You can install lrg directly by running:

# View the contents of the file
curl -sSL "https://gist.githubusercontent.com/dchakro/3e9792b6e47c3648e725fb518a2dbf68/raw/lrg.sh"
# Install
curl -sSL "https://gist.githubusercontent.com/dchakro/3e9792b6e47c3648e725fb518a2dbf68/raw/lrg.sh" | bash

File listing tricks: ls and find

# List files in the current directory (one file per row)
ls -1 *.bam

# ls: list matching files in the current directory with their full paths (either of these two works)
ls -d -1 "$PWD"/*.bam
ls -1 "$PWD"/*.bam

# find: list specific files under the current directory with their full paths
# (starting from "$PWD" instead of . makes find print absolute paths)
find "$PWD" -type f | rg "sorted\.bam"
find . -type f  # Find only files (skip directories etc.)
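
find can also do the name filtering itself, so the pipe to rg is optional. A small sketch, assuming a reasonably standard find; the .bam file names are invented for the demo.

```shell
# Throwaway directory with hypothetical file names
demo=$(mktemp -d)
mkdir -p "$demo/run1"
touch "$demo/sample.sorted.bam" "$demo/run1/other.sorted.bam" "$demo/sample.bam"
cd "$demo" || exit 1

# Files (not directories) whose name ends in sorted.bam, with full paths
sorted_bams=$(find "$PWD" -type f -name '*sorted.bam')
echo "$sorted_bams"

# Same, but only in the current directory (no recursion).
# -maxdepth is not in POSIX, but GNU and BSD find both support it.
top_level=$(find "$PWD" -maxdepth 1 -type f -name '*sorted.bam')
echo "$top_level"
```

Quoting the -name pattern matters: it stops the shell from expanding the * before find sees it.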

ripgrep tricks

# List only the names of files with matches
rg -l "text-to-search"

# List number of lines with matches (--count)
# --count-matches gives the exact number of matches (useful when a line has more than one match)
rg -c "text" file

# Match OR in regular expression
rg -IN "^[#]|localhost"

# List number of lines of R code (excluding comments) in a directory
rg -INv "^\s*#" -g '*.R' | wc -l

# List number of lines of R code (excluding comments) for each file in a directory
rg -Nvc "^\s*#" -g '*.R' 

# Fixed-string (literal) matches, no regex (high speed)
rg -F "text"

# List NUM lines after a match
rg -A5 "Starts from here" file.txt
# Print everything after the match by using the file's line count as NUM
rg --after-context $(wc -l black.txt | awk '{print $1}') "^# Start StevenBlack" black.txt

# Invert match
rg -v "Match will be excluded" file.txt

# Search recursively in specific filetypes
rg "cairo" -g '*.R'
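
The --after-context example above asks for as many context lines as the file has, which in effect prints everything from the first match to the end of the file. Roughly the same result can be had with a sed address range. A sketch using a made-up stand-in for black.txt:

```shell
# Hypothetical stand-in for black.txt
demo=$(mktemp -d)
printf '%s\n' \
  '127.0.0.1 localhost' \
  '# Start StevenBlack' \
  '0.0.0.0 ads.example' \
  '0.0.0.0 tracker.example' > "$demo/black.txt"

# -n suppresses default output; the range runs from the first line
# matching the pattern through the last line ($), and p prints it
tail_part=$(sed -n '/^# Start StevenBlack/,$p' "$demo/black.txt")
echo "$tail_part"
```

This avoids the extra wc and awk calls, at the cost of losing ripgrep's colored output.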

2. Using ‘awk’

# The following command selects first column from test.txt and writes it to out.txt
awk '{print $1}' test.txt > out.txt

# Divides the numbers in column 1 by 1000 and rounds the result to a whole number (one per line)
awk '{printf("%.0f\n", $1/1000)}' test.txt
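
To see what these awk commands do, here is a small worked example on made-up data. Note that %.0f rounds to the nearest whole number; if you would rather drop the decimals, awk's int() truncates instead.

```shell
# Hypothetical two-column input
demo=$(mktemp -d)
printf '%s\n' '1234 a' '5678 b' '999 c' > "$demo/test.txt"

# First column only
first_col=$(awk '{print $1}' "$demo/test.txt")
echo "$first_col"

# Column 1 divided by 1000, rounded with %.0f (one value per line)
rounded=$(awk '{printf("%.0f\n", $1/1000)}' "$demo/test.txt")
echo "$rounded"      # prints 1, 6, 1 on separate lines

# Column 1 divided by 1000, truncated with int()
truncated=$(awk '{print int($1/1000)}' "$demo/test.txt")
echo "$truncated"    # prints 1, 5, 0 on separate lines
```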