024-rna_seq

RNA-Seq dataset

The RNA-Seq dataset includes bulk gene expression profiles measured in human peripheral blood mononuclear cells (PBMC) cells, sampled at each visit to the clinical testing center.

RNA sequencing (RNA-Seq) is a powerful high-throughput technique used to analyze the transcriptome, providing a snapshot of RNA presence and quantity in a biological sample at a given moment. Our bulk approach sequences RNA from a mixed population of cells, giving a cumulative overview of gene expression across the sample. It’s widely used for understanding complex biological processes and disease mechanisms.

This RNA-Seq dataset focuses on gene expression profiles in human peripheral blood mononuclear cells (PBMC), collected during patient visits to a clinical testing center. The aim is to uncover the gene expression dynamics in PBMCs under various clinical conditions. PBMCs, which include lymphocytes and monocytes, play a key role in the immune response. The gene expression patterns in these cells can provide valuable insights into immune system activities and pathophysiological states.

The dataset was generated using 3’-tagged bulk RNA sequencing technology, capturing a broad spectrum of gene expression in PBMCs from diverse clinical samples. The protocol was adapted from mcSCRB-seq The library preparation for our RNA-Seq dataset is based on a 3’-tagged bulk RNA-Sequencing protocol, adapted from mcSCRB-seq (Bagnoli et al.). This method incorporates unique molecular identifiers (UMIs), pool barcodes, and sample barcodes. UMIs are crucial for accurately quantifying transcript abundance, as they enable the differentiation of PCR duplicates from unique mRNA molecules. Pool barcodes facilitate multiplexing of several samples in a single sequencing run, enhancing throughput efficiency. Sample barcodes are used to track individual samples, ensuring precise sample identification and data integrity.

Data availability:

  • All tabular information is stored in a main parquet file: rna_seq.parquet
  • Read counts are stored in long-format parquet files per batch
  • Each sequencing batch includes metadata parquet, JSON and HTML files
from pheno_utils import PhenoLoader
pl = PhenoLoader('rna_seq')
pl
PhenoLoader for rna_seq with
103 fields
2 tables: ['rna_seq', 'age_sex']

Data dictionary

pl.dict
field_string description_string folder_id feature_set field_type strata data_coding array pandas_dtype bulk_file_extension ... transformation list_of_tags stability sexed debut completed min_plausible_value max_plausible_value dependency parent_dataframe
tabular_field_name
collection_timestamp Collection timestamp Time sample was given 24.0 rna_seq Datetime Collection time NaN Multiple datetime64[ns, Asia/Jerusalem] NaN ... NaN RNA-seq,transcriptomics,gene expression Accruing Both sexes 2021-02-28 NaN NaN NaN NaN NaN
collection_date Collection date Date sample was given 24.0 rna_seq Date Collection time NaN Multiple datetime64[ns] NaN ... NaN RNA-seq,transcriptomics,gene expression Accruing Both sexes 2021-02-28 NaN NaN NaN NaN NaN
timezone Timezone Timezone of the collection timestamp 24.0 rna_seq Categorical (single) Collection time 001_03 Multiple category NaN ... NaN RNA-seq,transcriptomics,gene expression Accruing Both sexes 2021-02-28 NaN NaN NaN NaN NaN
batch Batch Sequencing batch of the sample 24.0 rna_seq Categorical (single) Collection time NaN Multiple category NaN ... NaN RNA-seq,transcriptomics,gene expression Accruing Both sexes 2021-02-28 NaN NaN NaN NaN NaN
pool Pool Pool number within the batch 24.0 rna_seq Integer Auxiliary NaN Multiple Int64 NaN ... NaN RNA-seq,transcriptomics,gene expression Accruing Both sexes 2021-02-28 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
bcl2fastq__total_pools Bcl2fastq: total pools Bcl2fastq: total pools NaN NaN columns NaN NaN NaN float NaN ... NaN RNA-seq,transcriptomics,gene expression NaN NaN NaN NaN NaN NaN NaN batch_metadata_parquet
bcl2fastq__undetermined Bcl2fastq: undetermined Bcl2fastq: undetermined NaN NaN columns NaN NaN NaN float NaN ... NaN RNA-seq,transcriptomics,gene expression NaN NaN NaN NaN NaN NaN NaN batch_metadata_parquet
bcl2fastq__undetermined_percent Bcl2fastq: undetermined percent Bcl2fastq: undetermined percent NaN NaN columns NaN NaN NaN float NaN ... NaN RNA-seq,transcriptomics,gene expression NaN NaN NaN NaN NaN NaN NaN batch_metadata_parquet
bcl2fastq__yieldq30 Bcl2fastq: yieldq30 Bcl2fastq: yieldq30 NaN NaN columns NaN NaN NaN float NaN ... NaN RNA-seq,transcriptomics,gene expression NaN NaN NaN NaN NaN NaN NaN batch_metadata_parquet
bcl2fastq__yieldq30_pools Bcl2fastq: yieldq30 pools Bcl2fastq: yieldq30 pools NaN NaN columns NaN NaN NaN float NaN ... NaN RNA-seq,transcriptomics,gene expression NaN NaN NaN NaN NaN NaN NaN batch_metadata_parquet

60795 rows × 24 columns

Plot histogram and ecdf for htseq_count__assigned__unique at baseline visit

from pheno_utils.basic_plots import hist_ecdf_plots

col = "htseq_count__assigned__unique"
df = pl[[col] + ["age", "sex", "collection_date"]].loc[:,:,"00_00_visit",0,:]

# plot histogram and ecdf
hist_ecdf_plots(df.dropna(subset=[col,"sex", "age"]), col, gender_col="sex")

# stats
display(df[col].describe().to_frame().T)

count mean std min 25% 50% 75% max
htseq_count__assigned__unique 2824.0 2223589.202904 857712.561714 1463.0 1707855.25 2218835.0 2660384.5 11209544.0