As another little side-project for learning Rust I wrote a small command-line utility, fastats
, to help with “reference genome archeology”. With this, I unfortunately don’t mean some cool paleogenetic research, but instead the more annoying problem of searching for subtle differences in reference genome files (usually stored as *.fasta
files), e.g. when troubleshooting alignment problems or changes in downstream results.
Preparing a genomic sequence as a reference genome works about as well as any standardization effort in technology, which means there are many reference genome files available by now, even for the same ‘build’ (like hg19 or hg38). The problem is bad enough that there are long support articles to help users understanding the differences between the files.
This problem is addressed by fastats
, as it tries to deliver a better description of a given FASTA file by generating overall statistics per sequence (such as the GC content, or a hash of the sequence) and defining which genomic regions are soft- or hard-masked. The masked (and non-masked) regions are stored in BED files, so that it is very easy to look them up.1
This should make it much easier so spot relevant differences between FASTA files. The tool is available on crates.io and GitHub.
–
Masking is an important technique in reference genome curation, as it allows to ‘rescue’ certain regions that might otherwise not be visible (e.g., because they are duplicates, and thus aligners may non find a non-ambiguous alignment). Soft-masking means changing the capitalization of the sequence (i.e., writing acgt
instead of ACGT
), so the sequence is still available. Hard-masking, on the other hand, replaces a specific region of the sequence with N
, i.e. there could be any nucleotide at this position (which means the aligner will not align any reads in that region). ↩