013-gut_microbiome

Gut microbiome dataset

Metagenomics is the study of genetic material recovered directly from environmental samples, such as microbial communities. It involves sequencing the DNA of all microorganisms in a sample rather than isolating and culturing individual organisms. Metagenomics enables the identification and functional analysis of microorganisms in diverse environments, including soil, water, and the human body.

This dataset profiles each participant's gut microbiota using shotgun metagenomic sequencing of stool samples. The resulting sequences are compared against reference databases of known gut microbes to estimate the relative abundance of specific taxa.

Gut microbiome metagenomics can be used to identify potential biomarkers of disease, develop personalized treatment strategies, and better understand the complex relationship between the gut microbiota and human physiology. It has already provided insights into the role of gut microbes in diseases such as obesity, diabetes, and inflammatory bowel disease (IBD).

To characterize the genetic makeup of the human gut flora from stool samples via metagenomics, the following steps are performed:

  1. Collection of stool sample: For each visit, a stool sample is collected from the individual and stored appropriately to preserve the microbial community.
  2. DNA extraction: DNA is extracted from the stool sample using specialized techniques to isolate the microbial DNA from other materials present in the sample.
  3. DNA fragmentation and sequencing: The extracted DNA is then fragmented into small pieces and sequenced using high-throughput sequencing technologies.
  4. Quality control: The resulting raw sequencing data is then pre-processed, removing low-quality reads and artifacts of the sequencing methodology.
  5. Taxonomic classification: The processed sequencing data is then compared to databases of known microbial sequences to identify and classify the microbial species present in the sample and estimate their respective abundances.
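
Steps 4 and 5 can be illustrated with a toy sketch. It is not the pipeline used to build this dataset: MetaPhlAn estimates abundances from clade-specific marker genes rather than from simple per-read counts, and real QC tools do far more than a mean-quality cutoff. The read list, the `MIN_MEAN_QUALITY` threshold, and the helper names `mean_phred` and `relative_abundance` are all hypothetical and exist only to show the idea of filtering reads and normalizing counts to percentages.

```python
from collections import Counter


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def relative_abundance(assignments):
    """Turn per-read taxon labels into percentages that sum to 100."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Hypothetical reads: (assigned taxon, quality string). Not real data.
reads = [
    ("Akkermansia", "IIII"),  # 'I' encodes Phred 40
    ("Akkermansia", "IIII"),
    ("Bacteroides", "IIII"),
    ("Bacteroides", "!!!!"),  # '!' encodes Phred 0, so this read is dropped
]

MIN_MEAN_QUALITY = 20  # assumed cutoff, chosen for illustration only

# Step 4 (toy QC): keep reads whose mean quality passes the cutoff.
kept = [taxon for taxon, qual in reads if mean_phred(qual) >= MIN_MEAN_QUALITY]

# Step 5 (toy profiling): normalize the surviving assignments to percent.
abundance = relative_abundance(kept)
print(abundance)
```

The percentages here play the same role as the `percent` units reported for the MetaPhlAn abundance fields: each sample's values describe shares of the profiled community rather than absolute counts.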

Data availability:

The information is stored in multiple parquet files:

  - gut_microbiome.parquet: Sequencing and QC statistics.
  - urs: Segal Lab relative abundance.
  - metaphlan_*: 8 tables with MetaPhlAn 4 relative abundances, separated by taxonomic levels.

from pheno_utils import PhenoLoader
pl = PhenoLoader('gut_microbiome')
pl
PhenoLoader for gut_microbiome with
46 fields
2 tables: ['gut_microbiome', 'age_sex']

Data dictionary

pl.dict
folder_id feature_set field_string description_string bulk_dictionary relative_location data_coding stability units sampling_rate ... array debut completed transformation list_of_tags pandas_dtype min_plausible_value max_plausible_value dependency parent_dataframe
tabular_field_name
collection_timestamp 13.0 gut_microbiome Sampled timestamp Time sample was given NaN gut_microbiome/gut_microbiome.parquet NaN Accruing Time NaN ... Single NaN NaN NaN Gut Microbiome datetime64[ns, Asia/Jerusalem] NaN NaN NaN NaN
collection_date 13.0 gut_microbiome Sampled date Date sample was given NaN gut_microbiome/gut_microbiome.parquet NaN Accruing Time NaN ... Single NaN NaN NaN Gut Microbiome datetime64[ns] NaN NaN NaN NaN
timezone 13.0 gut_microbiome Timezone Timezone NaN gut_microbiome/gut_microbiome.parquet NaN Accruing NaN NaN ... Single NaN NaN NaN Gut Microbiome category NaN NaN NaN NaN
sample_name 13.0 gut_microbiome Sample name Sample Name NaN gut_microbiome/gut_microbiome.parquet NaN Accruing NaN NaN ... Single NaN NaN NaN Gut Microbiome string NaN NaN NaN NaN
urs_metadata_parquet 13.0 urs_abundances_aggregated URS abundances metadata Organism classification and taxonomy NaN gut_microbiome/gut_microbiome.parquet NaN Accruing NaN NaN ... Single NaN NaN NaN Gut Microbiome category NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales|f__Akkermansiaceae|g__GGB6529|s__GGB6529_SGB9222|t__SGB9222 NaN NaN k__Bacteria|p__Verrucomicrobia|c__Verrucomicro... t__SGB9222 NaN NaN NaN NaN percent NaN ... NaN NaN NaN NaN NaN float64 NaN NaN NaN metaphlan_abundance_strain_parquet
k__Eukaryota|p__Ascomycota|c__Saccharomycetes|o__Saccharomycetales|f__Saccharomycetaceae|g__Saccharomyces|s__Saccharomyces_cerevisiae|t__EUK4932 NaN NaN k__Eukaryota|p__Ascomycota|c__Saccharomycetes|... t__EUK4932 NaN NaN NaN NaN percent NaN ... NaN NaN NaN NaN NaN float64 NaN NaN NaN metaphlan_abundance_strain_parquet
k__Eukaryota|p__Eukaryota_unclassified|c__Eukaryota_unclassified|o__Eukaryota_unclassified|f__Entamoebidae|g__Entamoeba|s__Entamoeba_dispar|t__EUK46681 NaN NaN k__Eukaryota|p__Eukaryota_unclassified|c__Euka... t__EUK46681 NaN NaN NaN NaN percent NaN ... NaN NaN NaN NaN NaN float64 NaN NaN NaN metaphlan_abundance_strain_parquet
k__Eukaryota|p__Eukaryota_unclassified|c__Eukaryota_unclassified|o__Eukaryota_unclassified|f__Eukaryota_unclassified|g__Blastocystis|s__Blastocystis_sp_subtype_1|t__EUK944036 NaN NaN k__Eukaryota|p__Eukaryota_unclassified|c__Euka... t__EUK944036 NaN NaN NaN NaN percent NaN ... NaN NaN NaN NaN NaN float64 NaN NaN NaN metaphlan_abundance_strain_parquet
k__Eukaryota|p__Eukaryota_unclassified|c__Eukaryota_unclassified|o__Eukaryota_unclassified|f__Hexamitidae|g__Giardia|s__Giardia_intestinalis|t__EUK5741 NaN NaN k__Eukaryota|p__Eukaryota_unclassified|c__Euka... t__EUK5741 NaN NaN NaN NaN percent NaN ... NaN NaN NaN NaN NaN float64 NaN NaN NaN metaphlan_abundance_strain_parquet

9929 rows × 24 columns
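
The MetaPhlAn field names in the dictionary above encode a full lineage, with levels separated by `|` and each level prefixed by a one-letter rank code followed by `__`. A minimal sketch of splitting such a name into its levels is shown below. The helper `parse_lineage` and the `RANKS` mapping are hypothetical and not part of `pheno_utils`; the label "strain" for the `t__` prefix follows the `metaphlan_abundance_strain_parquet` table name shown in the dictionary.

```python
# Rank codes as they appear in the field names; the long names are
# illustrative labels, not identifiers defined by the dataset.
RANKS = {
    "k": "kingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
    "t": "strain",
}


def parse_lineage(name: str) -> dict:
    """Split a 'k__...|p__...|...' field name into a {rank: value} dict."""
    levels = {}
    for part in name.split("|"):
        prefix, _, value = part.partition("__")
        # Fall back to the raw prefix if the rank code is not recognized.
        levels[RANKS.get(prefix, prefix)] = value
    return levels


field = (
    "k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales"
    "|f__Akkermansiaceae|g__GGB6529|s__GGB6529_SGB9222|t__SGB9222"
)
lineage = parse_lineage(field)
print(lineage["family"], lineage["strain"])
```

Parsing the names this way makes it straightforward to group or filter fields by a given rank, for example collecting every strain-level field that belongs to one family.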