Foundation Models for Analyzing Single-Cell RNA Sequence data


Authors

Naziri, Amirreza

Abstract

Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, offering deep insight into cellular heterogeneity, development, and disease. Transformer-based foundation models have become central to scRNA-seq analysis, yet most rely on uniform random masking during pretraining, a strategy misaligned with the sparsity, heterogeneity, and zero inflation characteristic of scRNA-seq data. To assess how these models behave under realistic biological variation, we first perform a comprehensive evaluation of four widely used single-cell foundation models (Geneformer, scBERT, scFoundation, and scGPT) across three diverse datasets. This benchmarking reveals substantial variability in model performance, including systematic weaknesses on rare cell populations and degraded accuracy in clinically challenging conditions. Motivated by the broader limitations of random masking in foundation models, we introduce Multinomial Attention Masking (MAM), a biologically informed masking strategy that leverages trainable latent representations and cross-attention to identify informative gene positions during pretraining. Across all datasets, models pretrained with MAM consistently achieve higher downstream cell-type classification accuracy than those trained with uniform masking and, in several cases, outperform the original pretrained backbones. Biological validation further demonstrates that MAM preferentially selects highly expressed and functionally meaningful genes, indicating that its improvements stem from capturing biologically relevant structure rather than from increased algorithmic complexity. This work improves the reliability and utility of single-cell foundation models for researchers and clinicians alike.
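The abstract describes MAM as using trainable latent representations and cross-attention to pick informative gene positions, then sampling mask positions from the resulting distribution rather than uniformly. A minimal PyTorch sketch of that idea follows; the class name, dimensions, and mask ratio are illustrative assumptions, not the thesis's actual implementation.

```python
import torch

class MultinomialAttentionMasker(torch.nn.Module):
    """Hypothetical sketch of Multinomial Attention Masking (MAM).

    Learnable latent vectors cross-attend to per-gene embeddings; the
    averaged attention weights act as per-gene importance scores, and
    mask positions are drawn from a multinomial over those scores
    instead of uniformly at random.
    """

    def __init__(self, d_model: int = 64, n_latents: int = 8):
        super().__init__()
        # Trainable latent queries (assumed size; the real model may differ).
        self.latents = torch.nn.Parameter(torch.randn(n_latents, d_model))
        self.attn = torch.nn.MultiheadAttention(d_model, num_heads=1,
                                                batch_first=True)

    def forward(self, gene_emb: torch.Tensor,
                mask_ratio: float = 0.15) -> torch.Tensor:
        # gene_emb: (batch, n_genes, d_model) token embeddings for one cell.
        batch, n_genes, _ = gene_emb.shape
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # attn_weights: (batch, n_latents, n_genes), rows are softmax-normalized.
        _, attn_weights = self.attn(queries, gene_emb, gene_emb)
        scores = attn_weights.mean(dim=1)  # per-gene importance, non-negative
        n_mask = max(1, int(mask_ratio * n_genes))
        # Sample distinct positions proportionally to attention scores.
        idx = torch.multinomial(scores, n_mask, replacement=False)
        mask = torch.zeros(batch, n_genes, dtype=torch.bool)
        mask.scatter_(1, idx, True)
        return mask
```

During pretraining, the boolean mask would replace the uniform-random mask fed to the masked-gene-prediction objective, biasing reconstruction toward genes the latents deem informative.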

Keywords

Computer science
