013-gut_microbiome

Gut microbiome dataset

Description

Metagenomics is the study of genetic material from environmental samples, including microbial communities. It involves sequencing the DNA of all microorganisms in the sample, rather than isolating individual organisms. Metagenomics enables the identification and functional analysis of microorganisms in diverse environments, including soil, water, and the human body.

This dataset profiles each participant's gut microbiota by shotgun metagenomic sequencing of stool samples. The sequenced reads are then compared against reference databases of known gut microbes to estimate the relative abundance of specific microbes.

Introduction

The human gut, a complex and dynamic ecosystem, harbors a myriad of microorganisms collectively known as the gut microbiome. This community plays a critical role in physiological processes such as immune modulation and metabolic regulation, and even functions as an endocrine organ. However, the limitations of traditional culture methods have left much of the complexity and diversity of the gut microbiome uncharacterized. Recent advances in next-generation sequencing technologies, particularly metagenomics, have paved the way for a more comprehensive understanding of the gut microbiome.

Gut metagenomics refers to the study of the collective genetic material of all microorganisms present in the gut, gleaned directly from fecal samples. It provides a powerful tool for the identification and quantification of diverse microbial species and their functional roles, including their influence on metabolic pathways, their virulence factors, antibiotic resistance profiles, and more. Moreover, gut metagenomics can reveal the taxonomic diversity and community structure of the microbiome, offering insights into the intricate relationship between the microbiome and its host.

Measurement protocol

To measure the genetic makeup of the human gut flora from stool samples via metagenomics, the following steps are performed:

Collection of stool sample: For each visit, a stool sample is collected from the individual and stored appropriately to preserve the microbial community.
DNA extraction: DNA is extracted from the stool sample using specialized techniques to isolate the microbial DNA from other materials present in the sample.
DNA fragmentation and sequencing: The extracted DNA is then fragmented into small pieces and sequenced using high-throughput sequencing technologies.
Quality control: The resulting raw sequencing data is then pre-processed, removing low-quality reads and artifacts of the sequencing methodology.
Taxonomic classification: The processed sequencing data is then compared to databases of known microbial sequences to identify and classify the microbial species present and their respective abundances in the sample.
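
The quality control step can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: it assumes Phred+33 encoded FASTQ input and applies a simple mean-quality threshold, which is cruder than the trimming that dedicated tools perform. The function names, the threshold of 20, and the toy reads are all hypothetical.

```python
from statistics import mean


def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from FASTQ-formatted text."""
    lines = text.strip().splitlines()
    # Each FASTQ record spans exactly four lines: header, sequence, '+', quality.
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return mean(ord(ch) - offset for ch in quality)


def filter_reads(records, min_mean_quality=20):
    """Keep reads whose mean Phred score is at least min_mean_quality."""
    return [r for r in records if mean_phred(r[2]) >= min_mean_quality]


# Two toy reads: 'I' encodes Phred 40, '#' encodes Phred 2.
toy_fastq = """@read_good
ACGTACGT
+
IIIIIIII
@read_bad
ACGTACGT
+
########
"""

kept = filter_reads(parse_fastq(toy_fastq))
print([read_id for read_id, _, _ in kept])
```

Only `read_good` passes the filter here, since every base in `read_bad` has a Phred score of 2.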

Data availability

The information is stored in multiple parquet files:

gut_microbiome.parquet: Sequencing and QC statistics.
urs: Segal Lab relative abundance.
metaphlan_*: 8 tables with MetaPhlAn 4 relative abundances, separated by taxonomic levels.

graph LR;
    A(Raw FASTQ File) --> |Trimmomatic| B(Clean FASTQ File)
    A --> |FASTQC| C(QC HTML)
    
    B --> |BWA| D(Non Human Reads)
    B --> |BWA| E(Human Reads)
    
    D --> |MetaPhlAn 4| F(MetaPhlAn 4 Abundances<br>Tabular)
    D --> G("URS (Segal) Abundances<br>Tabular")
    
    E --> |GATK4| H(Human Variants<br>Plink)
    
    F --> I(Pipeline Metadata)
    G --> I
    H --> I
    C --> I
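
To see which upstream artifacts a given output depends on, the diagram can be written down as a small dependency map. The sketch below only transcribes the edges of the diagram above (with line breaks in node labels dropped). The `upstream` helper is a hypothetical name and is not part of the pipeline or of pheno_utils.

```python
# Edges copied from the pipeline diagram: each artifact maps to its direct inputs.
PIPELINE_INPUTS = {
    "Clean FASTQ File": ["Raw FASTQ File"],
    "QC HTML": ["Raw FASTQ File"],
    "Non Human Reads": ["Clean FASTQ File"],
    "Human Reads": ["Clean FASTQ File"],
    "MetaPhlAn 4 Abundances": ["Non Human Reads"],
    "URS (Segal) Abundances": ["Non Human Reads"],
    "Human Variants": ["Human Reads"],
    "Pipeline Metadata": [
        "MetaPhlAn 4 Abundances",
        "URS (Segal) Abundances",
        "Human Variants",
        "QC HTML",
    ],
}


def upstream(artifact):
    """Return every artifact that `artifact` depends on, nearest first."""
    seen = []
    queue = list(PIPELINE_INPUTS.get(artifact, []))
    while queue:
        node = queue.pop(0)  # breadth-first walk back towards the raw data
        if node not in seen:
            seen.append(node)
            queue.extend(PIPELINE_INPUTS.get(node, []))
    return seen


print(upstream("MetaPhlAn 4 Abundances"))
```

For the MetaPhlAn 4 abundances this walks back through the non-human reads and the clean FASTQ file to the raw FASTQ file.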

Relevant links

from pheno_utils import PhenoLoader

pl = PhenoLoader('gut_microbiome')
pl

PhenoLoader for gut_microbiome with
46 fields
2 tables: ['gut_microbiome', 'age_sex']

Data dictionary

pl.dict

	folder_id	feature_set	field_string	description_string	bulk_dictionary	relative_location	data_coding	stability	units	sampling_rate	...	array	debut	completed	transformation	list_of_tags	pandas_dtype	min_plausible_value	max_plausible_value	dependency	parent_dataframe
tabular_field_name
collection_timestamp	13.0	gut_microbiome	Sampled timestamp	Time sample was given	NaN	gut_microbiome/gut_microbiome.parquet	NaN	Accruing	Time	NaN	...	Single	NaN	NaN	NaN	Gut Microbiome	datetime64[ns, Asia/Jerusalem]	NaN	NaN	NaN	NaN
collection_date	13.0	gut_microbiome	Sampled date	Date sample was given	NaN	gut_microbiome/gut_microbiome.parquet	NaN	Accruing	Time	NaN	...	Single	NaN	NaN	NaN	Gut Microbiome	datetime64[ns]	NaN	NaN	NaN	NaN
timezone	13.0	gut_microbiome	Timezone	Timezone	NaN	gut_microbiome/gut_microbiome.parquet	NaN	Accruing	NaN	NaN	...	Single	NaN	NaN	NaN	Gut Microbiome	category	NaN	NaN	NaN	NaN
sample_name	13.0	gut_microbiome	Sample name	Sample Name	NaN	gut_microbiome/gut_microbiome.parquet	NaN	Accruing	NaN	NaN	...	Single	NaN	NaN	NaN	Gut Microbiome	string	NaN	NaN	NaN	NaN
urs_metadata_parquet	13.0	urs_abundances_aggregated	URS abundances metadata	Organism classification and taxonomy	NaN	gut_microbiome/gut_microbiome.parquet	NaN	Accruing	NaN	NaN	...	Single	NaN	NaN	NaN	Gut Microbiome	category	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
k__Bacteria\|p__Verrucomicrobia\|c__Verrucomicrobiae\|o__Verrucomicrobiales\|f__Akkermansiaceae\|g__GGB6529\|s__GGB6529_SGB9222\|t__SGB9222	NaN	NaN	k__Bacteria\|p__Verrucomicrobia\|c__Verrucomicro...	t__SGB9222	NaN	NaN	NaN	NaN	percent	NaN	...	NaN	NaN	NaN	NaN	NaN	float64	NaN	NaN	NaN	metaphlan_abundance_strain_parquet
k__Eukaryota\|p__Ascomycota\|c__Saccharomycetes\|o__Saccharomycetales\|f__Saccharomycetaceae\|g__Saccharomyces\|s__Saccharomyces_cerevisiae\|t__EUK4932	NaN	NaN	k__Eukaryota\|p__Ascomycota\|c__Saccharomycetes\|...	t__EUK4932	NaN	NaN	NaN	NaN	percent	NaN	...	NaN	NaN	NaN	NaN	NaN	float64	NaN	NaN	NaN	metaphlan_abundance_strain_parquet
k__Eukaryota\|p__Eukaryota_unclassified\|c__Eukaryota_unclassified\|o__Eukaryota_unclassified\|f__Entamoebidae\|g__Entamoeba\|s__Entamoeba_dispar\|t__EUK46681	NaN	NaN	k__Eukaryota\|p__Eukaryota_unclassified\|c__Euka...	t__EUK46681	NaN	NaN	NaN	NaN	percent	NaN	...	NaN	NaN	NaN	NaN	NaN	float64	NaN	NaN	NaN	metaphlan_abundance_strain_parquet
k__Eukaryota\|p__Eukaryota_unclassified\|c__Eukaryota_unclassified\|o__Eukaryota_unclassified\|f__Eukaryota_unclassified\|g__Blastocystis\|s__Blastocystis_sp_subtype_1\|t__EUK944036	NaN	NaN	k__Eukaryota\|p__Eukaryota_unclassified\|c__Euka...	t__EUK944036	NaN	NaN	NaN	NaN	percent	NaN	...	NaN	NaN	NaN	NaN	NaN	float64	NaN	NaN	NaN	metaphlan_abundance_strain_parquet
k__Eukaryota\|p__Eukaryota_unclassified\|c__Eukaryota_unclassified\|o__Eukaryota_unclassified\|f__Hexamitidae\|g__Giardia\|s__Giardia_intestinalis\|t__EUK5741	NaN	NaN	k__Eukaryota\|p__Eukaryota_unclassified\|c__Euka...	t__EUK5741	NaN	NaN	NaN	NaN	percent	NaN	...	NaN	NaN	NaN	NaN	NaN	float64	NaN	NaN	NaN	metaphlan_abundance_strain_parquet

9929 rows × 24 columns
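
The MetaPhlAn field names in the dictionary encode a full lineage, one rank per segment. A minimal sketch for splitting such a name into ranks is given below. It assumes the separator in the underlying data is a plain `|` (the backslashes in the rendered table are taken to be escaping) and that the prefixes follow the usual MetaPhlAn convention. The `t__` level is labelled "strain" here to match the `metaphlan_abundance_strain_parquet` table name. The helper names are hypothetical.

```python
# Conventional MetaPhlAn rank prefixes; "strain" follows the table naming above.
RANK_PREFIXES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species", "t": "strain",
}


def parse_lineage(field_name):
    """Split a MetaPhlAn-style lineage string into a {rank: name} dict."""
    ranks = {}
    for part in field_name.split("|"):
        prefix, _, name = part.partition("__")
        ranks[RANK_PREFIXES[prefix]] = name
    return ranks


def deepest_rank(field_name):
    """Return the most specific rank present in the lineage string."""
    last_prefix = field_name.rsplit("|", 1)[-1].partition("__")[0]
    return RANK_PREFIXES[last_prefix]


# Field name taken from the data dictionary, with unescaped separators.
field = ("k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|"
         "o__Verrucomicrobiales|f__Akkermansiaceae|g__GGB6529|"
         "s__GGB6529_SGB9222|t__SGB9222")

lineage = parse_lineage(field)
print(lineage["family"], lineage["strain"])
print(deepest_rank(field))
```

A helper like `deepest_rank` could be used to group dictionary rows by taxonomic level, mirroring the way the MetaPhlAn tables are separated.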