from pheno_utils import PhenoLoader
024-rna_seq
RNA-Seq dataset
The RNA-Seq dataset includes bulk gene expression profiles measured in human peripheral blood mononuclear cells (PBMCs), sampled at each visit to the clinical testing center.
RNA sequencing (RNA-Seq) is a powerful high-throughput technique used to analyze the transcriptome, providing a snapshot of RNA presence and quantity in a biological sample at a given moment. Our bulk approach sequences RNA from a mixed population of cells, giving a cumulative overview of gene expression across the sample. It’s widely used for understanding complex biological processes and disease mechanisms.
This RNA-Seq dataset focuses on gene expression profiles in human PBMCs, collected during patient visits to a clinical testing center. The aim is to uncover the gene expression dynamics in PBMCs under various clinical conditions. PBMCs, which include lymphocytes and monocytes, play a key role in the immune response. The gene expression patterns in these cells can provide valuable insights into immune system activity and pathophysiological states.
The dataset was generated using 3’-tagged bulk RNA sequencing technology, capturing a broad spectrum of gene expression in PBMCs from diverse clinical samples. The library preparation follows a 3’-tagged bulk RNA-Seq protocol adapted from mcSCRB-seq (Bagnoli et al.). This method incorporates unique molecular identifiers (UMIs), pool barcodes, and sample barcodes. UMIs are crucial for accurately quantifying transcript abundance, as they make it possible to distinguish PCR duplicates from unique mRNA molecules. Pool barcodes allow several samples to be multiplexed in a single sequencing run, improving throughput. Sample barcodes track individual samples, ensuring precise sample identification and data integrity.
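To illustrate why UMIs matter for quantification, the following minimal sketch counts unique molecules per sample and gene by collapsing reads that share the same sample barcode, gene, and UMI. The read tuples, barcode strings, and the `count_unique_molecules` helper are hypothetical and are not part of the dataset's pipeline. The sketch ignores pool barcodes and does not correct sequencing errors in the UMI.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Count distinct UMIs per (sample_barcode, gene).

    `reads` is an iterable of (sample_barcode, gene, umi) tuples.
    Reads sharing all three values are treated as PCR duplicates
    of a single original mRNA molecule.
    """
    umis = defaultdict(set)
    for sample_barcode, gene, umi in reads:
        umis[(sample_barcode, gene)].add(umi)
    return {key: len(values) for key, values in umis.items()}


# Toy reads (hypothetical barcodes and UMIs)
reads = [
    ("SAMPLE_A", "GAPDH", "AACGT"),
    ("SAMPLE_A", "GAPDH", "AACGT"),  # PCR duplicate of the read above
    ("SAMPLE_A", "GAPDH", "TTGCA"),
    ("SAMPLE_B", "GAPDH", "AACGT"),  # same UMI, but a different sample
]
counts = count_unique_molecules(reads)
print(counts)
```

Without the UMI, the first sample would appear to have three GAPDH reads. With it, only two distinct molecules are counted.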
Data availability:
- All tabular information is stored in a main parquet file: rna_seq.parquet
- Read counts are stored in long-format parquet files per batch
- Each sequencing batch includes metadata parquet, JSON and HTML files
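Long-format count tables hold one row per sample and gene, while many analyses expect a samples-by-genes matrix. The sketch below shows one way to pivot such a table with pandas. The column names `sample_id`, `gene`, and `count` are assumptions made for illustration and may not match the actual schema of the batch files. The toy DataFrame stands in for the result of reading one batch file.

```python
import pandas as pd

# Toy long-format table. In practice this could come from pd.read_parquet
# on a batch counts file. Column names are assumptions, not the real schema.
long_counts = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s2", "s2"],
    "gene": ["GAPDH", "ACTB", "GAPDH", "ACTB", "CD3E"],
    "count": [120, 300, 95, 280, 12],
})

# Pivot to a samples x genes matrix; genes absent for a sample are filled with 0.
wide_counts = long_counts.pivot_table(
    index="sample_id",
    columns="gene",
    values="count",
    aggfunc="sum",
    fill_value=0,
)
print(wide_counts)
```

Long format avoids storing zeros for genes that were not detected, which is presumably why it suits per-batch storage. The wide matrix is more convenient for downstream statistics.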
pl = PhenoLoader('rna_seq')
pl
PhenoLoader for rna_seq with
103 fields
2 tables: ['rna_seq', 'age_sex']
Data dictionary
pl.dict
tabular_field_name | field_string | description_string | folder_id | feature_set | field_type | strata | data_coding | array | pandas_dtype | bulk_file_extension | ... | transformation | list_of_tags | stability | sexed | debut | completed | min_plausible_value | max_plausible_value | dependency | parent_dataframe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
collection_timestamp | Collection timestamp | Time sample was given | 24.0 | rna_seq | Datetime | Collection time | NaN | Multiple | datetime64[ns, Asia/Jerusalem] | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | Accruing | Both sexes | 2021-02-28 | NaN | NaN | NaN | NaN | NaN |
collection_date | Collection date | Date sample was given | 24.0 | rna_seq | Date | Collection time | NaN | Multiple | datetime64[ns] | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | Accruing | Both sexes | 2021-02-28 | NaN | NaN | NaN | NaN | NaN |
timezone | Timezone | Timezone of the collection timestamp | 24.0 | rna_seq | Categorical (single) | Collection time | 001_03 | Multiple | category | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | Accruing | Both sexes | 2021-02-28 | NaN | NaN | NaN | NaN | NaN |
batch | Batch | Sequencing batch of the sample | 24.0 | rna_seq | Categorical (single) | Collection time | NaN | Multiple | category | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | Accruing | Both sexes | 2021-02-28 | NaN | NaN | NaN | NaN | NaN |
pool | Pool | Pool number within the batch | 24.0 | rna_seq | Integer | Auxiliary | NaN | Multiple | Int64 | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | Accruing | Both sexes | 2021-02-28 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
bcl2fastq__total_pools | Bcl2fastq: total pools | Bcl2fastq: total pools | NaN | NaN | columns | NaN | NaN | NaN | float | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | NaN | NaN | NaN | NaN | NaN | NaN | NaN | batch_metadata_parquet |
bcl2fastq__undetermined | Bcl2fastq: undetermined | Bcl2fastq: undetermined | NaN | NaN | columns | NaN | NaN | NaN | float | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | NaN | NaN | NaN | NaN | NaN | NaN | NaN | batch_metadata_parquet |
bcl2fastq__undetermined_percent | Bcl2fastq: undetermined percent | Bcl2fastq: undetermined percent | NaN | NaN | columns | NaN | NaN | NaN | float | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | NaN | NaN | NaN | NaN | NaN | NaN | NaN | batch_metadata_parquet |
bcl2fastq__yieldq30 | Bcl2fastq: yieldq30 | Bcl2fastq: yieldq30 | NaN | NaN | columns | NaN | NaN | NaN | float | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | NaN | NaN | NaN | NaN | NaN | NaN | NaN | batch_metadata_parquet |
bcl2fastq__yieldq30_pools | Bcl2fastq: yieldq30 pools | Bcl2fastq: yieldq30 pools | NaN | NaN | columns | NaN | NaN | NaN | float | NaN | ... | NaN | RNA-seq,transcriptomics,gene expression | NaN | NaN | NaN | NaN | NaN | NaN | NaN | batch_metadata_parquet |
60795 rows × 24 columns
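The data dictionary is displayed as a table indexed by tabular_field_name, so it can be filtered like any pandas DataFrame. The sketch below assumes that behaviour and uses a toy stand-in built from a few rows of the output above. It separates fields that have no parent_dataframe listed from those that belong to batch_metadata_parquet. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the data dictionary, using a few rows from the output above.
data_dict = pd.DataFrame(
    {
        "field_type": ["Datetime", "Categorical (single)", "Integer", "columns", "columns"],
        "pandas_dtype": ["datetime64[ns, Asia/Jerusalem]", "category", "Int64", "float", "float"],
        "parent_dataframe": [np.nan, np.nan, np.nan,
                             "batch_metadata_parquet", "batch_metadata_parquet"],
    },
    index=pd.Index(
        ["collection_timestamp", "batch", "pool",
         "bcl2fastq__total_pools", "bcl2fastq__yieldq30"],
        name="tabular_field_name",
    ),
)

# Fields with no parent dataframe listed
main_fields = data_dict[data_dict["parent_dataframe"].isna()].index.tolist()

# Fields that belong to the batch metadata parquet
batch_fields = data_dict[
    data_dict["parent_dataframe"] == "batch_metadata_parquet"
].index.tolist()

print(main_fields)
print(batch_fields)
```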
Plot histogram and ecdf for htseq_count__assigned__unique at baseline visit
from pheno_utils.basic_plots import hist_ecdf_plots

col = "htseq_count__assigned__unique"
df = pl[[col] + ["age", "sex", "collection_date"]].loc[:,:,"00_00_visit",0,:]

# plot histogram and ecdf
hist_ecdf_plots(df.dropna(subset=[col,"sex", "age"]), col, gender_col="sex")
# stats
display(df[col].describe().to_frame().T)
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
htseq_count__assigned__unique | 2824.0 | 2223589.202904 | 857712.561714 | 1463.0 | 1707855.25 | 2218835.0 | 2660384.5 | 11209544.0 |
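For readers unfamiliar with the ECDF, it gives, for each value x, the fraction of observations that are less than or equal to x. The following is a minimal standard-library sketch of that definition. It is not the implementation behind hist_ecdf_plots, and the toy values are made up rather than taken from the dataset.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right gives the number of elements <= x in the sorted list
        return bisect_right(ordered, x) / n

    return F


# Hypothetical per-sample read counts
toy_counts = [1_500_000, 1_800_000, 2_200_000, 2_700_000, 3_100_000]
F = ecdf(toy_counts)
print(F(2_200_000))  # 3 of the 5 toy values are <= 2.2M
```

Read this way, the 25%, 50%, and 75% entries in the summary table are the values at which the ECDF of the real data reaches 0.25, 0.5, and 0.75.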