squab
squab copied to clipboard
Alignment-based gene expression quantification
noodles squab
squab performs gene expression quantification by counting the number of aligned records that intersects a set of features. Output can be the raw counts or normalized counts in TPM (transcripts per million) or FPKM (fragments per kilobase per million mapped reads).
The original goal of this project is to provide a faster alternative to htseq-count. It uses similar counting rules and outputs a compatible data table.
Installation
Install Rust and use cargo to install squab.
$ cargo install --locked --git https://github.com/zaeleus/squab.git
Usage
squab has two subcommands: quantify and normalize.
quantify
quantify performs gene expression quantification by counting the number of
times aligned records intersect known gene annotations.
Gene expression quantification
Usage: squab quantify [OPTIONS] --output <OUTPUT> --annotations <ANNOTATIONS> <BAM>
Arguments:
<BAM> Input alignment file
Options:
--with-secondary-records
Count secondary records (BAM flag 0x100)
--with-supplementary-records
Count supplementary records (BAM flag 0x800)
--with-nonunique-records
Count nonunique records (BAM data tag NH > 1)
--strand-specification <STRAND_SPECIFICATION>
Strand specification [default: auto] [possible values: none, forward, reverse, auto]
-t, --feature-type <FEATURE_TYPE>
Feature type to count [default: exon]
-i, --id <ID>
Feature attribute to use as the feature identity [default: gene_id]
--min-mapping-quality <MIN_MAPPING_QUALITY>
[default: 10]
-o, --output <OUTPUT>
Output destination for feature counts
-a, --annotations <ANNOTATIONS>
Input annotations file (GFF3)
--threads <THREADS>
Force a specific number of threads
-h, --help
Print help information
The default output is a tab-delimited text file with two columns: the feature identifier (string) and the number of reads (integer) from the input alignment that overlap it. This file is compatible as output from htseq-count, meaning it includes statistics in the trailer.
normalize
normalize takes raw counts and normalizes them by gene length, meaning the
annotations used for quantification must be the same given here. Two
normalization methods are available: FPKM for single sample normalization and
TPM for across samples normalization (default).
Typically, this is only used when a sample was previously quanitifed, e.g.,
using squab quantify or htseq-count.
Normalize features counts
Usage: squab normalize [OPTIONS] --annotations <ANNOTATIONS> <COUNTS>
Arguments:
<COUNTS>
Input counts file
Options:
-t, --feature-type <FEATURE_TYPE>
Feature type to count
[default: exon]
-i, --id <ID>
Feature attribute to use as the feature identity
[default: gene_id]
-a, --annotations <ANNOTATIONS>
Input annotations file (GFF3)
--method <METHOD>
Quantification normalization method
[default: tpm]
Possible values:
- fpkm: fragments per kilobase per million mapped reads
- tpm: transcripts per million
-h, --help
Print help information (use `-h` for a summary)
The output is a tab-delimited text file with two columns: the feature identifier (string) and the normalized value (double).
Examples
Count features (exons by gene ID)
$ squab \
quantify \
--annotations annotations.gff3.gz \
--output sample.counts.tsv \
sample.bam
Count featues (genes by gene name)
$ squab \
quantify \
--annotations annotations.gff3.gz \
--feature-type gene \
--id gene_name \
--output sample.counts.tsv \
sample.bam
Normalize counts in FPKM (exons by gene ID)
$ squab \
normalize \
--method fpkm \
--annotations annotations.gff3.gz \
sample.counts.tsv \
> sample.fpkm.tsv
Limitations
- Counts are taken only as the union of matched feature sets, i.e., reads that overlap any part of the feature is considered once.
- For paired end alignments, a read that matches itself before a mate is found replaces the previously known record.
References
-
S Anders, T P Pyl, W Huber. HTSeq – A Python framework to work with high-throughput sequencing data. bioRxiv 2014. https://doi.org/10.1101/002824
-
Wagner, G.P., Kin, K. & Lynch, V.J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285 (2012). https://doi.org/10.1007/s12064-012-0162-3