Addressing the dynamic nature of reference data: a new nucleotide database for robust metagenomic classification

Jose Manuel Martí*, Car Reen Kok, James B. Thissen, Nisha J. Mulakken, Aram Avila-Herrera, Crystal J. Jaing, Jonathan E. Allen*, Nicholas A. Be*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. While the NCBI BLAST Nucleotide (nt) database, encompassing a vast collection of sequences from all domains of life, represents an invaluable resource, its massive size (currently exceeding 10^12 nucleotides) and exponential growth pose significant challenges for researchers seeking to maintain current nt-based indices for metagenomic classification. Recognizing that no current nt-based indices exist for the widely used Centrifuge classifier, and that the last publicly available version was released in 2018, we addressed this critical gap by leveraging advanced high-performance computing resources. We present new Centrifuge-compatible nt databases, meticulously constructed using a novel pipeline incorporating different quality control measures, including reference decontamination and filtering. These measures demonstrably reduce spurious classifications, as shown through our reanalysis of published metagenomic data where Plasmodium annotations were dramatically reduced using our decontaminated database, highlighting how database quality can significantly impact research conclusions. Through temporal comparisons, we also reveal how our approach minimizes inconsistencies in taxonomic assignments stemming from asynchronous updates between public sequence and taxonomy databases. These discrepancies are particularly evident in taxa such as Listeria monocytogenes and Naegleria fowleri, where classification accuracy varied significantly across database versions. These new databases, made available as pre-built Centrifuge indexes, respond to the need for an open, robust, nt-based pipeline for taxonomic classification in metagenomics. Applications such as environmental metagenomics, forensics, and clinical metagenomics, which require comprehensive taxonomic coverage, will benefit from this resource.
Our work highlights the importance of treating reference databases as dynamic entities, subject to ongoing quality control and validation akin to software development best practices. This approach is crucial for ensuring accuracy and reliability of metagenomic analysis, especially as databases continue to expand in size and complexity.

Original language: English
Journal: mSystems
Volume: 10
Issue number: 4
State: Published - Apr 2025
Externally published: Yes

Keywords

  • Centrifuge
  • high-performance computing
  • metagenomics
  • NCBI BLAST nt
  • quality control
  • Recentrifuge
  • reference contamination
  • reference database
  • taxonomic classification
