ngs-bits - Short-read sequencing tools

Obtaining ngs-bits

Binaries of ngs-bits are available via Bioconda:

  • Binaries for Linux/macOS

Alternatively, ngs-bits can be built from sources. Use git to clone the most recent release (the source code package provided by GitHub does not contain the required sub-modules):

> git clone --recursive https://github.com/imgag/ngs-bits.git
> cd ngs-bits
> git checkout 2024_06
> git submodule update --recursive --init

Depending on your operating system, building instructions vary slightly:

  • Building from sources for Linux
  • Building from sources for MacOS
  • Building from sources for Windows

Support

Please report any issues or questions to the ngs-bits issue tracker.

Documentation

Have a look at the ECCB'2018 poster.

The documentation of individual tools is linked in the tools list below.
For some tools the documentation pages contain only the command-line help, for other tools they contain more information.

License

ngs-bits is provided under the MIT license and is based on other open source software.

Tools list

ngs-bits contains a lot of tools that are used for NGS-based diagnostics in our institute.

Some of the tools need the NGSD, a database that contains for example gene, transcript and exon data.
Installation instructions for the NGSD can be found here.

Main tools

  • SeqPurge - A highly-sensitive adapter trimmer for paired-end short-read data.
  • SampleSimilarity - Calculates pairwise sample similarity metrics from VCF/BAM files.
  • SampleGender - Determines sample gender based on a BAM file.
  • SampleAncestry - Estimates the ancestry of a sample based on variants.
  • CnvHunter - CNV detection from targeted resequencing data using non-matched control samples.
  • RohHunter - ROH detection based on a variant list annotated with AF values.
  • UpdHunter - UPD detection from trio variant data.

QC tools

The default output format of the quality control tools is qcML, an XML-based format for -omics quality control. It consists of an XML schema, which defines the overall structure of the format, and an ontology, which defines the QC metrics that can be used.

  • ReadQC - Quality control tool for FASTQ files.
  • MappingQC - Quality control tool for a BAM file.
  • VariantQC - Quality control tool for a VCF file.
  • SomaticQC - Quality control tool for tumor-normal pairs (paper and example output data).
  • TrioMaternalContamination - Detects maternal contamination of a child using SNPs from parents.
  • RnaQC - Calculates QC metrics for RNA samples.

BAM tools

  • BamClipOverlap - (Soft-)Clips paired-end reads that overlap.
  • BamDownsample - Downsamples a BAM file to the given percentage of reads.
  • BamExtract - Extracts reads from BAM/CRAM by read name.
  • BamFilter - Filters a BAM file by multiple criteria.
  • BamHighCoverage - Determines high-coverage regions in a BAM file.
  • BamToFastq - Converts a coordinate-sorted BAM file to FASTQ files.

BED tools

  • BedAdd - Merges regions from several BED files.
  • BedAnnotateFromBed - Annotates BED file regions with information from a second BED file.
  • BedAnnotateGC - Annotates the regions in a BED file with GC content.
  • BedAnnotateGenes - Annotates BED file regions with gene names (needs NGSD).
  • BedChunk - Splits regions in a BED file to chunks of a desired size.
  • BedCoverage - Annotates the regions in a BED file with the average coverage in one or several BAM files.
  • BedExtend - Extends the regions in a BED file by n bases.
  • BedGeneOverlap - Calculates how much of each overlapping gene is covered (needs NGSD).
  • BedHighCoverage - Detects high-coverage regions from a BAM file.
  • BedInfo - Prints summary information about a BED file.
  • BedIntersect - Intersects two BED files.
  • BedLiftOver - Lift-over of regions in a BED file to a different genome build.
  • BedLowCoverage - Calculates regions of low coverage based on an input BED and BAM file.
  • BedMerge - Merges overlapping regions in a BED file.
  • BedReadCount - Annotates the regions in a BED file with the read count from a BAM file.
  • BedShrink - Shrinks the regions in a BED file by n bases.
  • BedSort - Sorts the regions in a BED file.
  • BedSubtract - Subtracts one BED file from another BED file.
  • BedToFasta - Converts BED file to a FASTA file (based on the reference genome).
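
To illustrate what merging overlapping regions means, here is a minimal Python sketch. It is not the BedMerge implementation: the function name `merge_regions` and the `merge_adjacent` switch are hypothetical, and regions are assumed to be BED-style tuples with 0-based, half-open coordinates.

```python
from typing import List, Tuple

# (chromosome, start, end) with 0-based, half-open coordinates as in BED
Region = Tuple[str, int, int]


def merge_regions(regions: List[Region], merge_adjacent: bool = False) -> List[Region]:
    """Merge overlapping regions; illustrative sketch only.

    merge_adjacent is a hypothetical switch: when True, regions that
    touch without overlapping (end == next start) are merged as well.
    """
    merged: List[Region] = []
    # Sorting by (chromosome, start, end) puts candidates for merging next to each other
    for chrom, start, end in sorted(regions):
        if merged:
            last_chrom, last_start, last_end = merged[-1]
            touches = start < last_end or (merge_adjacent and start == last_end)
            if chrom == last_chrom and touches:
                # Extend the previous region instead of starting a new one
                merged[-1] = (last_chrom, last_start, max(last_end, end))
                continue
        merged.append((chrom, start, end))
    return merged


regions = [("chr1", 150, 300), ("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 10, 20)]
print(merge_regions(regions))
print(merge_regions(regions, merge_adjacent=True))
```

With half-open coordinates, `chr1:100-300` and `chr1:300-400` share no base, so the sketch keeps them separate unless `merge_adjacent` is set.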

FASTQ tools

  • FastqAddBarcode - Adds sequences from separate FASTQ as barcodes to read IDs.
  • FastqConvert - Converts the quality scores from Illumina 1.5 offset to Sanger/Illumina 1.8 offset.
  • FastqConcat - Concatenates several FASTQ files into one output FASTQ file.
  • FastqDownsample - Downsamples paired-end FASTQ files.
  • FastqExtract - Extracts reads from a FASTQ file according to an ID list.
  • FastqExtractBarcode - Moves molecular barcodes of reads to a separate file.
  • FastqExtractUMI - Moves unique molecular identifiers from read sequence to read ID.
  • FastqFormat - Determines the quality score offset of a FASTQ file.
  • FastqList - Lists read IDs and base counts.
  • FastqMidParser - Counts the number of occurrences of each MID/index/barcode in a FASTQ file.
  • FastqToFasta - Converts FASTQ to FASTA format.
  • FastqTrim - Trims start/end bases from the reads in a FASTQ file.
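
The quality score offsets mentioned for FastqConvert and FastqFormat can be illustrated with a short Python sketch. It is not taken from the ngs-bits sources: the function names are hypothetical, and the offset guess is a simple heuristic assumed here for illustration. Illumina 1.5 encodes quality scores as ASCII code minus 64, Sanger/Illumina 1.8 as ASCII code minus 33.

```python
ILLUMINA_15_OFFSET = 64  # Phred+64
SANGER_OFFSET = 33       # Phred+33 (Sanger / Illumina 1.8)


def illumina15_to_sanger(qual: str) -> str:
    """Re-encode a quality string from Phred+64 to Phred+33."""
    converted = []
    for ch in qual:
        score = ord(ch) - ILLUMINA_15_OFFSET
        if score < 0:
            raise ValueError(f"character {ch!r} is not valid in Phred+64 encoding")
        converted.append(chr(score + SANGER_OFFSET))
    return "".join(converted)


def guess_offset(qual: str) -> int:
    """Heuristic (assumption): any character below ASCII 64 implies Phred+33."""
    return SANGER_OFFSET if min(map(ord, qual)) < ILLUMINA_15_OFFSET else ILLUMINA_15_OFFSET


old = "hhhB"                       # Phred+64: scores 40, 40, 40, 2
new = illumina15_to_sanger(old)    # same scores in Phred+33
print(new, guess_offset(old), guess_offset(new))
```

A single read of uniformly high quality can be ambiguous under this heuristic, so a real check would look at many reads before deciding.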

VCF tools (small variants)

  • VcfAdd - Appends variants from a VCF file to another VCF file.
  • VcfAnnotateConsequence - Adds transcript-specific consequence predictions to a VCF file (similar to Ensembl VEP).
  • VcfAnnotateFromBed - Annotates the INFO column of a VCF with data from a BED file.
  • VcfAnnotateFromBigWig - Annotates the INFO column of a VCF with data from a bigWig file.
  • VcfAnnotateFromVcf - Annotates a VCF file with data from one or more source VCF files.
  • VcfAnnotateHexplorer - Annotates a VCF with Hexplorer and HBond scores.
  • VcfAnnotateMaxEntScan - Annotates a VCF file with MaxEntScan scores.
  • VcfBreakMulti - Breaks multi-allelic variants into several lines, making sure that allele-specific INFO/SAMPLE fields are still valid.
  • VcfCalculatePRS - Calculates the Polygenic Risk Score(s) for a sample.
  • VcfCheck - Checks a VCF file for errors.
  • VcfExtractSamples - Extracts one or several samples from a VCF file.
  • VcfFilter - Filters a VCF based on the given criteria.
  • VcfLeftNormalize - Normalizes all variants and shifts indels to the left in a VCF file.
  • VcfMerge - Merges several VCF files into one VCF.
  • VcfSort - Sorts variant lists according to chromosomal position.
  • VcfSplit - Splits a VCF into several chunks.
  • VcfStreamSort - Sorts entries of a VCF file according to genomic position using a stream.
  • VcfSubstract - Subtracts the variants in a VCF from a second VCF.
  • VcfToBed - Converts a VCF file to a BED file.
  • VcfToBedpe - Converts a VCF file containing structural variants to BEDPE format.
  • VcfToTsv - Converts a VCF file to a tab-separated text file.
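
To show what shifting indels to the left means, the following Python sketch applies the commonly described trim-and-extend normalization procedure to a single variant. It is an illustration under stated assumptions, not the VcfLeftNormalize code: the function name `left_normalize` is hypothetical, the reference is passed as a plain string for one chromosome, and positions are 1-based as in VCF.

```python
from typing import Tuple


def left_normalize(pos: int, ref: str, alt: str, chrom_seq: str) -> Tuple[int, str, str]:
    """Left-align and trim one variant (illustrative sketch).

    pos is 1-based; chrom_seq is the reference sequence of the chromosome.
    """
    changed = True
    while changed:
        changed = False
        # Step 1: drop a shared base at the right end of both alleles
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # Step 2: if an allele became empty, prepend the preceding reference base
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend variant beyond the chromosome start")
            pos -= 1
            base = chrom_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Step 3: drop shared leading bases while both alleles keep at least one base
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Deletion of one "CA" unit inside a CA repeat, given at a right-shifted position
print(left_normalize(8, "CAC", "C", "GGGCACACACAGGG"))
```

In the repeat example the deletion can be written at several positions; the procedure moves it to the leftmost one, so equivalent representations of the same variant become identical.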

BEDPE tools (structural variants)

  • BedpeAnnotateFromBed - Annotates a BEDPE file with information from a BED file.
  • BedpeFilter - Filters a BEDPE file by region.
  • BedpeGeneAnnotation - Annotates a BEDPE file with gene information from the NGSD (needs NGSD).
  • BedpeSort - Sorts a BEDPE file according to chromosomal position.
  • BedpeToBed - Converts a BEDPE file into a BED file.
  • SvFilterAnnotations - Filters a structural variant list in BEDPE format based on variant annotations.

Gene handling tools

  • GenePrioritization - Performs gene prioritization based on a list of known disease genes and a PPI graph (see also GraphStringDb).
  • GraphStringDb - Creates a simple representation of the String-DB interaction graph.
  • GenesToApproved - Replaces gene symbols by approved symbols using the HGNC database (needs NGSD).
  • GenesToBed - Converts a text file with gene names to a BED file (needs NGSD).
  • GenesToTranscripts - Converts a text file with gene names to transcript names (needs NGSD).
  • NGSDExportGenes - Lists genes from NGSD (needs NGSD).
  • TranscriptsToBed - Converts a text file with transcript names to a BED file (needs NGSD).

Phenotype handling tools

  • PhenotypesToGenes - Converts a phenotype list to a list of matching genes (needs NGSD).
  • PhenotypeSubtree - Returns all sub-phenotypes of a given phenotype (needs NGSD).

Misc tools

  • PERsim - Paired-end read simulator for Illumina reads.
  • FastaInfo - Basic info on a FASTA file.
  • HgvsToVcf - Transforms a TSV file with transcript ID and HGVS.c change into a VCF file (needs NGSD).

ChangeLog

Changes in release 2024_06:

  • new tools: NGSDExportIgvGeneTrack
  • BamFilter: added parameter for maximum insert size
  • NGSDAddVariantsGermline: now imports REs as well
  • NGSDExportSamples: new parameters -only_with_small_variants and -add_lab_columns.
  • SampleGender: new parameter -include_single_end_reads for long-read data.
  • SampleSimilarity: new parameter -include_single_end_reads for long-read data.
  • SampleSimilarity: new parameter -roi_hg38_wes_wgs to make WES, WGS and lrGS results more comparable.
  • UpdHunter: new parameter -out_informative to write out an IGV track with informative variants.
  • VcfCalculatePRS: new parameter -min_depth and support for variants that are to be imputed independent of the sample genotype.
  • NGSD:
    • added tables for somatic SVs: somatic_somatic_sv_callset, somatic_sv_deletion, somatic_sv_duplication, somatic_sv_insertion, somatic_sv_inversion, somatic_sv_translocation, somatic_report_configuration_sv
    • added tables for repeat expansions: repeat_expansion, repeat_expansion_genotype, re_callset, report_configuration_re
    • processed_sample table: added boolean scheduled_for_resequencing to flag samples for resequencing to increase depth/coverage

For older changes see releases.