There’s been a growing need for a deep, comprehensive source covering the myriad genomics resources and bioinformatics methods aimed at detecting, characterizing, and further understanding antimicrobial resistance. Fortunately, an excellent review of this area was published in Nature Reviews Genetics by Manish Boolchandani, Alaric W. D’Souza and Gautam Dantas (the BDD review), three shoguns of the field (hence the image above).
In this review, the authors make an obvious (yet often overlooked) point:
“Organizing sequencing data is an important pre-processing step before antimicrobial resistance gene analysis.” 
In short – “Garbage In, Garbage Out”. Poorly annotated reference databases lead to poorly analyzed results. This can have huge impacts on the power and precision of pipeline results. Importantly, it also affects the development of the bioinformatics pipeline. This is what I wanted to write about.
If you’re a bioinformatician reading this: Imagine for a moment the impact on your own development effort if the databases you were using were crap. How do you know you’re benchmarking with quality databases? Is it good enough to just say “everyone else uses that DB so it must be OK”? What does it mean to say a database is “good enough”? At a minimum, biofx developers need to consider this when designing algorithms or pipelines that rely on reference databases for analysis – because their performance is essentially tied to the quality of the db.
If you’re a researcher or clinician reading this: Imagine for a moment that you found the “best in class” pipeline recommended to you was developed using data that had not been verified or quality checked. Or maybe it was just plain crap. What if a paper was published showing there were widespread errors in one of the databases on which your pipeline depends – or worse: you had already published your results. It’s also possible you may never know, perhaps due to the natural silos present across sub-disciplines in genomics – it’s hard to stay up to date. For example, I’m wondering what the impact will be on metagenomics pipelines once NCBI is done correcting all the incorrect taxonomic assignments in GenBank. Who’s going to then update all the “pre-indexed databases” provided by all those metagenomics tools?
“Of the 141 000 prokaryotic genome assemblies in GenBank, approximately 66.8% can be confirmed as correctly identified by comparison with confirmed type strains using ANI methods. Approximately 3.6% can be confirmed as misidentified. The remainder (29.6 %) generally cannot be evaluated due to a lack of relevant type strain assemblies.”
Well, the good news is that someone is on it. Nonetheless, there’s likely a long list of OMFG scenarios we could imagine here, but the point is that the importance of Data Quality begins long BEFORE a pipeline ships and lands in the hands of researchers looking to make use of it.
In the case of AMR genomics – this is really important both from a development perspective (especially if you are building a “trained ML/AI model” or something fancy), and from an analysis perspective (how biased is the database I’m using? Do I even know how it’s biased? Do I care?)
With regard to database bias – folks no doubt often take the “I’ll use all of them and sort it out later” approach. This can be OK for lots of uses (like basic research, where you can hand-pick through edge cases and sort out curation errors across databases). But in cases where you need to report your results on a timetable, or you are going to publish something that is time sensitive, you may not have the bandwidth or time to mess around with sweeping up the differences between six different databases.
So – QUALITY and CONTROL are what come into play here.
By “quality” – I mean curation of the databases used. How was it created and curated? How do you know that what’s in it is what you think should be in it? How complete is it? What are its weaknesses? What is the strategy to benchmark its quality?
By “control” – I mean you need a controlled ontology so that you can segment your databases into partitions that matter for your analysis. For example, if I’m a clinical researcher I may not give a damn about resistance genes found in benthic bacterial populations in the Kardzhali Dam (Bulgaria), but my colleague who is a recovering marine microbiologist might not care about anything else.
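To make “control” a bit more concrete, here is a minimal sketch of partitioning database entries by a controlled vocabulary. Everything in it is an assumption made for illustration: the `Source` terms, the `source` field name, and the placeholder gene names are hypothetical and are not taken from any real AMR database or ontology.

```python
from enum import Enum


class Source(Enum):
    """Hypothetical controlled vocabulary for where an isolate came from."""
    CLINICAL = "clinical"
    ENVIRONMENTAL = "environmental"
    AGRICULTURAL = "agricultural"


def partition_by_source(entries):
    """Group entries by controlled term; set aside anything outside the vocabulary.

    Each entry is assumed to be a dict with a 'source' key (hypothetical schema).
    Returns (partitions, rejected), where rejected entries need human review.
    """
    partitions = {term: [] for term in Source}
    rejected = []
    for entry in entries:
        try:
            # Source(value) raises ValueError if value is not a known term;
            # a missing 'source' key raises KeyError.
            partitions[Source(entry["source"])].append(entry)
        except (KeyError, ValueError):
            rejected.append(entry)
    return partitions, rejected


if __name__ == "__main__":
    demo = [
        {"gene": "gene_A", "source": "clinical"},
        {"gene": "gene_B", "source": "environmental"},
        {"gene": "gene_C", "source": "Clinical isolate"},  # free text, not in vocabulary
    ]
    parts, needs_review = partition_by_source(demo)
    print(len(parts[Source.CLINICAL]), len(needs_review))
```

The design choice worth noting is that entries with free-text or missing labels are not guessed at; they are returned separately so a subject-matter expert can decide where they belong.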
“Control” and “Quality” also refer indirectly to integration of data sources. Disparate AMR databases are cropping up all over, and the BDD Review tries to capture some of the better ones. But a significant problem remains around integrating these resources and around what end-users believe is “ground truth”. By extension, there’s a whole other topic of “trust”: the trust users develop when using a database they know is well-curated and maintained by experts (and not just bioinformaticians who can write slick code).
A stark example of this cropped up for me recently when I made what I thought would be a fairly simple search of NDARO’s AMR Isolate Database Browser. I was looking for the following: AMR genes from human clinical sources that had attributable antibiotic susceptibility testing (AST) data. Simple right?
Unfortunately – no. It was a mess. Apparently there are many different ways to write “homo sapiens”.
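For what it’s worth, this is the kind of cleanup I mean, as a rough sketch rather than anything NDARO actually does. The synonym list, the `host` and `ast_phenotypes` field names, and the record layout are all hypothetical; a real list of spellings would have to be built by looking at the actual metadata.

```python
import re

# Hypothetical spellings of the sort a free-text host field might contain.
# Illustrative only; not copied from any real database.
HUMAN_SYNONYMS = {"homo sapiens", "homo sapiens sapiens", "h sapiens", "human"}


def normalize_host(raw):
    """Return 'Homo sapiens' for recognized human spellings, else None."""
    if raw is None:
        return None
    # Lowercase, replace every run of non-letters with one space, trim the ends.
    cleaned = re.sub(r"[^a-z]+", " ", raw.lower()).strip()
    return "Homo sapiens" if cleaned in HUMAN_SYNONYMS else None


def human_with_ast(records):
    """Keep records whose host normalizes to human and that carry AST data.

    Records are assumed to be dicts with 'host' and 'ast_phenotypes' keys
    (hypothetical field names).
    """
    return [
        r for r in records
        if normalize_host(r.get("host")) == "Homo sapiens"
        and r.get("ast_phenotypes")
    ]
```

Anything that does not match exactly (say, a host value with extra free text tacked on) comes back as `None` instead of being guessed at, which leaves a pile of oddballs for a human to review. That is the point: the script narrows the mess, it does not replace curation.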
I may get flamed for posting this – so I’ll say that integrating other people’s data is hard. And what the NIH is doing is important: Integrate and Curate the AMR data of the world. This is hard, and their new AMRFinder Database and the NDARO site are pretty slick. With some careful clicking (or scripting – it’s available on github), you can find what you want. At least what you think you want.
Nonetheless, I would argue that when they write in their blog that you can “Browse a curated database of AMR genes“, they are suggesting Quality and Control have been applied to a high level. I was actually really excited for a moment when I read “curated”.
Unfortunately, the data here is not curated; it’s “mined” (see below) from 3rd party databases that are curated, each with their own standards. In short, they are using the term “curated” rather loosely.
The NIH AMRFinder pipeline uses the AMR Gene Database, which is essentially based on existing AMR sequence data from eight different resources, some of which were not available in the public domain until the owners gave the data to the NIH. The genes obtained from these resources, and their associated metadata, are then passed through a pipeline (mostly HMMER) aimed at assigning AMR classifications to each gene. The method reads:
“In addition, NCBI continually mines the literature for new reports of AMR proteins. The ResFams collection of AMR HMMs aided provided important help early in our efforts to develop the AMR protein hierarchy and the AMRFinder tool, but all models were rebuilt with new seed alignment sequences, new alignments, new cutoff scores, and new biocuration. Development continued until all reportable proteins were covered by at least one AMR HMM, and classification by HMM was sufficiently specific.”
One could argue that “mining” != “curation”, so I guess this is where I beg to differ on the use of the word “curation”. It’s also not clear what “continually” means – monthly? Hourly? Building HMMs on large databases is computationally expensive – so I’m guessing it’s not hourly.
I’m also left with this question: Was the benchmarking of these “new cutoff scores” published? And how did it compare to previously published pipelines? We’ve now come full circle.
- Biofx group publishes a pipeline and analysis method (in this case, the ResFams HMM profiles for AMR, built for use with HMMER by one of the BDD Review authors’ own groups).
- Someone else gets a bunch of new data and realizes that the existing pipeline doesn’t work so well.
- Researcher builds new tool with new parameters.
This is a typical, required part of scientific research and progress in general. New data leads to better methods (right?…). It worked for metagenomics methods, right… ? (not really)
It would be interesting to see the benchmarking of AMRFinder and how it compares to RGI and other tools. Our own analysis in my group at Q has determined that all of these databases suffer more from a pre-build curation issue than from a sequence ID/quality issue. Incorrectly curated genes or classifications can really muck up your clustering in the first place… (such as the notes above about the taxonomy classification issues in GenBank… more on this effort in a future post)
In closing – and at the risk of just rambling…
- Upfront _human curation_ matters, and should be done whenever integrating other people’s databases together. Using automated sequence based algos to help is fine, but don’t skip the step where you have a subject-matter expert (who may not be a bioinformatician) assist in the curation of your results.
- Decide on your quality standards before you peek at the database’s contents – whatever they may be. If you’re aiming for content that is focused on human clinical samples, then make sure you are very well covered – but don’t be afraid to remove entries that have nothing to do with the aim of your research.
For AMR genomics – simply rolling everything up into a big bucket database, with little to no curation of the metadata or the end-use, is not going to work well (maybe for basic research needs it would, but labs using this data should apply their own level of curation before accepting it at face value). This is because quality and control of the input data matters. I would love to see the NIH take on the task of curating (really curating) the content again – and then reapplying their HMM pipeline, sharing the benchmarking, and showing whether curation improved (or didn’t improve) the results.
It would also be nice to see databases and algorithms designed from the start with specific mechanisms of AMR in mind…
Oh wait… CARD is doing this already. 🙂 
REFERENCES and FOOTNOTES
- Boolchandani, Manish, Alaric W. D’Souza, and Gautam Dantas. “Sequencing-Based Methods and Resources to Study Antimicrobial Resistance.” Nature Reviews Genetics, March 18, 2019. https://doi.org/10.1038/s41576-019-0108-4.
- Ciufo, Stacy, Sivakumar Kannan, Shobha Sharma, Azat Badretdin, Karen Clark, Seán Turner, Slava Brover, Conrad L. Schoch, Avi Kimchi, and Michael DiCuccio. “Using Average Nucleotide Identity to Improve Taxonomic Assignments in Prokaryotic Genomes at the NCBI.” International Journal of Systematic and Evolutionary Microbiology 68, no. 7 (July 1, 2018): 2386–92. https://doi.org/10.1099/ijsem.0.002809.
- Well, I suppose they are public now – but what’s confusing is that these databases are all governed by different EULAs and Terms of Service for end-users (for example, CARD has a specific licensing restriction on any commercial use without a license). This could be potentially a big deal for commercial entities looking to use NIH’s AMRFinder pipeline and the AMR Database from the NIH. Their licensing page is probably just plain wrong – but I am not a lawyer so…
- Jia, Baofeng, Amogelang R. Raphenya, Brian Alcock, Nicholas Waglechner, Peiyao Guo, Kara K. Tsang, Briony A. Lago, et al. “CARD 2017: Expansion and Model-Centric Curation of the Comprehensive Antibiotic Resistance Database.” Nucleic Acids Research 45, no. D1 (January 2017): D566–73. https://doi.org/10.1093/nar/gkw1004.