And … where is it located?
I’ve been thinking about this topic a lot lately. One might quickly think that the answers are “a ton” and then “SRA?” and they would be partly right.
The real answer is “a shit ton” and “all over the place”. I mean literally millions of FASTQ files… located in databases far and wide way beyond just SRA. I’m attempting to catalog this data and capture a number of relevant additional data-points such as # projects, # samples, # libraries, types of libraries, etc. And for now – I’m mainly interested in data in the public domain, or that the public can access. Before I get to that – you may be asking “Why do you care?”
I previously wrote about the massive investment of venture capital into the microbiome startup space – I estimate this to be currently over $1.5B USD. It’s kind of nuts. There are more than 100 startups in this space right now – and many of them are using publicly available data to stand up their services or products. What’s interesting though is that many of these companies are also building their own proprietary data sets – especially in the consumer microbiome testing market (from perhaps consumers who are unaware of this or never read the T&C of their microbiome service). These proprietary datasets are being built largely for derivative commercialization opportunities with large biopharma and diagnostics companies who spend an inordinate amount of time ($$$) collecting, curating, collating, and otherwise re-organizing public data to the benefit of their R&D programs. Companies have a huge value-proposition if they can step forward and say “Hey BigPharma – we have 100,000 human microbiome datasets that are all from patients undergoing treatment for disease X, and we have all the patient and sample metadata curated and already organized.” The latter part here is a huge “must-have” feature of the data – and it’s also the most expensive part to potentially fix with data from public sources.
So – I’m interested in finding and organizing this data because everything I’ve heard about “how bad” public microbiome data is has been largely anecdotal. I also don’t believe I’ve seen a systematic review or survey that really focused on what data is out there, where it is located, etc. A page on the Rob Edwards lab’s website (aka @linsalrob) comes close, but mainly provides some really useful tips for mining data from SRA (thanks @mindMo for the pointer to his lab’s pages).
Naturally (for me) I first went to Twitter to ask for help in identifying various databases for microbiome or metagenomics data – the response was really nice and added 3 or 4 new databases I wasn’t already aware of. Here’s the current list:
SRA https://www.ncbi.nlm.nih.gov/sra
ENA https://www.ebi.ac.uk/ena
MGnify https://www.ebi.ac.uk/metagenomics/
MG-RAST https://www.mg-rast.org
IMG https://img.jgi.doe.gov/
iMicrobe https://www.imicrobe.us
National Microbiome Data Collaborative https://microbiomedata.org/
Human Microbiome Project https://portal.hmpdacc.org/
GOLD https://gold.jgi.doe.gov
Microbiome DB https://www.microbiomeDB.org
QIITA https://qiita.ucsd.edu
GigaDB http://gigadb.org
curatedMetagenomicData https://waldronlab.io/curatedMetagenomicData
Some really important caveats about this list
- There’s a significant but not complete overlap between each database and SRA.
- The “not complete” part is a major issue in a way – because while some databases indicate which data is shared with SRA or EBI (or you can tell from the accession numbers), many of them do not make it all that clear.
- They all use their own database schemas and a mix of unstructured or structured metadata. (#facepalm)
- Also – some of these offer both public and private data (e.g. MG-RAST) and others offer a hybrid where the data is “semi-public”, i.e. you have to sign up to access the data and are limited in how you can use it.
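As a rough illustration of the “you can tell from the accession numbers” point, here is a minimal Python sketch that flags identifiers which look like INSDC accessions (the kind shared between SRA, ENA, and DDBJ). The function names and the `mystudy_0042` identifier are made up for the example, and the patterns only cover the common study/sample/experiment/run and BioProject prefixes, so treat it as a starting point rather than a complete validator.

```python
import re

# INSDC archive accessions: S (NCBI), E (EBI) or D (DDBJ), then "R",
# then a type letter (P=study, S=sample, X=experiment, R=run) and digits.
_INSDC = re.compile(r"^(?P<archive>[SED])R(?P<kind>[PSXR])\d+$")
# BioProject accessions: PRJNA (NCBI), PRJEB (EBI), PRJDB (DDBJ).
_BIOPROJECT = re.compile(r"^PRJ(?P<archive>NA|EB|DB)\d+$")

ARCHIVES = {"S": "NCBI", "E": "EBI", "D": "DDBJ",
            "NA": "NCBI", "EB": "EBI", "DB": "DDBJ"}
KINDS = {"P": "study", "S": "sample", "X": "experiment", "R": "run"}


def classify(accession):
    """Return (archive, kind) for an INSDC-style accession, else None."""
    acc = accession.strip().upper()
    m = _INSDC.match(acc)
    if m:
        return ARCHIVES[m["archive"]], KINDS[m["kind"]]
    m = _BIOPROJECT.match(acc)
    if m:
        return ARCHIVES[m["archive"]], "bioproject"
    return None


def split_by_origin(accessions):
    """Separate IDs that look INSDC-shared from database-local IDs."""
    shared, local = [], []
    for acc in accessions:
        (shared if classify(acc) else local).append(acc)
    return shared, local


# "mystudy_0042" is a hypothetical database-local identifier.
shared, local = split_by_origin(["ERP012803", "PRJEB11419", "mystudy_0042"])
print(shared, local)
```

Anything that lands in `local` is where the real work starts, since a database-local ID says nothing about whether the underlying reads also live in SRA or ENA.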
Sebastian Schmidt also provides this insight…
Nonetheless, these databases potentially represent…
54,992 “studies” (46k of which are in SRA)
2,744,041 samples (~1.6M are in SRA)
3,306,321 sequencing libraries, the vast majority of which are DNA libraries (1.6M in SRA).
Over 3.3 million metagenomics sequencing libraries in the public domain.

Much of this data is redundant with data in SRA (see the first two caveats above, wondering how much storage waste there is here…) – but I’d venture that there is still something like 1.6 – 2.5M FASTQ files at a minimum that are unique. Perhaps I’ve miscalculated, but these numbers are based on combing these databases’ websites and capturing their current content.
It’s just incredible.
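To put the SRA share in perspective, here is a quick back-of-the-envelope calculation in Python using only the counts quoted above. The SRA figures are the rounded values from this post (46k, ~1.6M, 1.6M), so the percentages are approximate.

```python
# Totals across all databases, as listed in this post.
totals = {"studies": 54_992, "samples": 2_744_041, "libraries": 3_306_321}
# Rounded SRA counts from this post (approximate by construction).
in_sra = {"studies": 46_000, "samples": 1_600_000, "libraries": 1_600_000}


def sra_share(totals, in_sra):
    """Fraction of each count that is reported as being in SRA."""
    return {kind: in_sra[kind] / totals[kind] for kind in totals}


for kind, share in sra_share(totals, in_sra).items():
    outside = totals[kind] - in_sra[kind]
    print(f"{kind}: {share:.0%} in SRA, ~{outside:,} listed elsewhere")
```

That comes out to roughly 84% of studies but only about 58% of samples and 48% of libraries, which is why the overlap question matters so much for the sample and library counts.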
And this brings up another issue related to point #3 above – curation of the metadata.
Building curated databases is hard – especially when you have literally thousands of data providers/contributors. And SRA is not alone in its challenges with curation either. From what I can tell, all of these databases suffer from project, sample, and library/run level curation quality issues to some degree. And while fixed, controlled vocabularies and standardized data structures are fine and honorable goals to have – they shortchange the importance of giving researchers flexibility in sharing their own project data. Having flexible metadata requirements removes a barrier to entry for groups to share their data – which apparently over 50,000 groups have done.
Still though, organizing and curating all this data is something that should (and can) be tackled. I think there would be tremendous value in a resource that captured all publicly available microbiome data in a unified, structured framework and presented it in standardized formats, analyzed (reanalyzed) under a common set of standardized pipelines, and curated so that every single entry in the database met a common set of quality metrics. The QIITA platform comes close in terms of providing a potential framework for this sort of thing, but I don’t think they’re aiming to “reanalyze” everything in the public domain, it’s more of a platform for researchers to do this for their own data. 😊 It would be hugely time-consuming and expensive to do this for all 2.7M samples – but no less valuable IMHO.
I’ll stop rambling now and close with this — if you’re aware of other microbiome databases not listed above, please let me know by leaving a comment below or contacting me via Twitter. I’ll update this post accordingly too.
Thanks to some folks from Twitterville – I’ve been able to ID a couple of additional resources which added another 72 studies and 6,058 samples (and have updated the post above):
GigaDB http://gigadb.org
curatedMetagenomicData (cMD) https://waldronlab.io/curatedMetagenomicData/publications/
NIH Human Microbiome Project: https://hmpdacc.org/hmp/resources/data_browser.php
Human Oral Microbiome Database: http://www.homd.org/
Disbiome is a database covering microbial composition changes in different kinds of diseases, managed by Ghent University: https://disbiome.ugent.be/home
As a side note, the sequence data from the American Gut Project is available from the European Bioinformatics Institute (EBI) under accession number ERP012803/project number PRJEB11419 and Qiita study ID 10317.
Thanks Nita!
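For anyone who wants to see what sits behind an accession like the American Gut one (PRJEB11419), here is a small Python sketch that builds an ENA Portal API file report request and parses the tab-separated response into run accessions and FASTQ paths. The endpoint, the `read_run` result type, and the `run_accession`/`fastq_ftp` field names reflect my understanding of the ENA Portal API and should be checked against the current ENA documentation; the function names are just for this example.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed ENA Portal API endpoint; verify against the ENA docs.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build a file report URL for a study/project accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ paths.

    Paired-end runs list several files separated by ';'.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["run_accession"]: [p for p in row["fastq_ftp"].split(";") if p]
        for row in rows
    }


def fetch_fastq_paths(accession):
    """Download and parse the report (needs network access)."""
    with urlopen(filereport_url(accession)) as resp:
        return parse_filereport(resp.read().decode("utf-8"))


print(filereport_url("PRJEB11419"))
```

Calling `fetch_fastq_paths("PRJEB11419")` would then return a dictionary keyed by run accession, which is a convenient way to count how many FASTQ files a single project contributes.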
One more to add to the list – https://opendata.lifebit.ai/table/sgb contains data from the Segata group’s Cell paper from January: https://doi.org/10.1016/j.cell.2019.01.001
Thanks Mike!