Autoencoder-Based Nonlinear Dimension Reduction for Single-Cell RNA-Seq Data: A Comparative Study of t-SNE and UMAP
DOI:
https://doi.org/10.6000/1929-6029.2025.14.78
Keywords:
scRNA-seq data, Autoencoder, t-SNE, UMAP, Dimension Reduction, Visualization
Abstract
This paper proposes using an Autoencoder (AE) prior to t-SNE or UMAP visualization for scRNA-seq data. Direct application of t-SNE/UMAP to the raw, sparse expression matrix often yields unstable, poorly separated clusters. To address this, the framework first employs an AE to learn a denoised, compact latent representation. Subsequent t-SNE or UMAP embedding of this latent space produces more robust visualizations with enhanced cluster consistency and structural separability. A real-data-based comparison shows that, when using the same AE-derived latent space, UMAP outperforms t-SNE. It achieves better cluster cohesion, stronger global structure preservation, greater robustness to initialization and data perturbation, and lower computational cost. Statistical validation via a projection F-test confirms that clusters in the AE latent space exhibit significant between-group mean differences, quantifying the observed visual improvement. The study concludes that AE-based representation learning creates an effective input space for nonlinear embedding, with the AE-UMAP pipeline emerging as a particularly stable and efficient choice for scRNA-seq exploratory analysis.
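To make the statistical validation step concrete, the sketch below projects latent vectors onto a single direction and computes the classical one-way ANOVA F statistic on the projected values. This is a simplified stand-in, not the paper's procedure: the abstract does not specify how the projection direction is chosen or how the test is calibrated, so the helper names `project` and `one_way_f`, the direction, and the toy data are all assumptions made for illustration.

```python
import math


def project(points, direction):
    """Project each point onto the unit vector along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [sum(p_i * d_i for p_i, d_i in zip(p, direction)) / norm
            for p in points]


def one_way_f(values, labels):
    """Classical one-way ANOVA F statistic for 1-D values grouped by label."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group sum of squares: group sizes times squared mean offsets.
    ss_between = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
                     for vs in groups.values())
    # Within-group sum of squares: spread of each value around its group mean.
    ss_within = sum((v - sum(vs) / len(vs)) ** 2
                    for vs in groups.values() for v in vs)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical 2-D "latent" points for two clusters (toy data, not from the paper).
latent = [(1.0, 5.0), (2.0, -3.0), (3.0, 0.5),
          (4.0, 2.0), (5.0, -1.0), (6.0, 4.0)]
cluster = ["a", "a", "a", "b", "b", "b"]

# Assumed projection direction: the first latent axis.
projected = project(latent, (1.0, 0.0))
f_stat = one_way_f(projected, cluster)
```

A large F value indicates that the cluster means differ along the chosen direction relative to the within-cluster spread. Turning it into a p-value would require the F distribution with (k - 1, n - k) degrees of freedom, or a resampling scheme, neither of which is shown here.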
Purpose: This study investigates how effectively AE-based latent representations enhance nonlinear dimension reduction methods, namely t-SNE and UMAP, for single-cell gene expression data analysis. The performance of AE-based UMAP and AE-based t-SNE is systematically evaluated from multiple perspectives, including visualization quality, clustering consistency, structural preservation, and robustness.
Methods: This paper constructs a two-step dimension reduction framework for single-cell gene expression data analysis. First, an AE compresses the high-dimensional, sparse, and noisy gene expression data into a low-dimensional latent representation. Then, t-SNE and UMAP are applied to the learned AE latent space for nonlinear embedding and visualization. Performance is systematically evaluated under multiple experimental conditions using clustering consistency metrics, structure preservation measures, and a projection F-test.
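The first step of the framework can be illustrated with a very small autoencoder. The sketch below is a toy under stated assumptions, not the architecture used in the study: the abstract gives no layer sizes, loss, or training settings, so the single tanh encoder layer, linear decoder, mean squared error loss, learning rate, latent dimension of 8, and the synthetic two-cluster matrix are all hypothetical choices.

```python
import numpy as np


def train_autoencoder(X, latent_dim=8, lr=0.2, epochs=300, seed=0):
    """Fit a tiny autoencoder (tanh encoder, linear decoder) by gradient descent.

    Returns an `encode` function and the per-epoch reconstruction loss (MSE).
    All hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    W_enc = rng.normal(scale=0.1, size=(n_genes, latent_dim))
    b_enc = np.zeros(latent_dim)
    W_dec = rng.normal(scale=0.1, size=(latent_dim, n_genes))
    b_dec = np.zeros(n_genes)

    losses = []
    for _ in range(epochs):
        # Forward pass: encode to the latent space, then reconstruct.
        Z = np.tanh(X @ W_enc + b_enc)
        X_hat = Z @ W_dec + b_dec
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))

        # Backward pass: gradients of the mean squared error.
        g_out = 2.0 * err / err.size
        g_W_dec = Z.T @ g_out
        g_b_dec = g_out.sum(axis=0)
        g_pre = (g_out @ W_dec.T) * (1.0 - Z ** 2)  # tanh derivative
        g_W_enc = X.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)

        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec

    def encode(X_new):
        return np.tanh(X_new @ W_enc + b_enc)

    return encode, losses


# Synthetic stand-in for a preprocessed expression matrix:
# 60 "cells" x 20 "genes", two clusters differing in the first 5 genes.
rng = np.random.default_rng(0)
centers = np.zeros((2, 20))
centers[0, :5] = 1.0
centers[1, :5] = -1.0
labels = np.repeat([0, 1], 30)
X = centers[labels] + rng.normal(scale=0.5, size=(60, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each gene

encode, losses = train_autoencoder(X)
Z = encode(X)  # latent matrix, one row per cell
```

For the second step, the latent matrix `Z` would be passed to an embedding routine in place of the raw expression matrix, for example scikit-learn's `TSNE(n_components=2).fit_transform(Z)` or umap-learn's `UMAP(n_components=2).fit_transform(Z)`. Those calls are left out of the sketch so that it depends only on NumPy.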
Results: Experimental results indicate that directly applying t-SNE or UMAP to the original expression data fails to stably recover meaningful clustering structures, whereas nonlinear dimension reduction performed on AE latent representations substantially improves visualization quality and clustering stability. Within the same latent space, t-SNE and UMAP exhibit comparable clustering accuracy; however, UMAP performs better in cluster compactness, global structure preservation, stability across repeated experiments, and computational efficiency. Statistical testing further confirms the significance of between-cluster differences in the AE latent space.
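Clustering accuracy of the kind compared above is commonly scored with the adjusted Rand index (ARI), which measures agreement between two partitions and corrects for chance. The abstract does not name the specific metrics used, so treating ARI as one of them is an assumption. The following is a minimal pure-Python sketch of the standard pair-counting formula, with toy labels that are not taken from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially consistent

    # Contingency counts and their marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs grouped together in both, in truth, in prediction.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Toy example: the same partition under different label names,
# and a partition that cuts across the reference one.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because ARI is invariant to label names, it can compare reference cell types against cluster assignments produced on a t-SNE or UMAP embedding. Values near 1 indicate close agreement, values near 0 indicate chance-level agreement, and negative values indicate less agreement than expected by chance.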
Contribution: This study shows the critical role of AE latent representations in stabilizing nonlinear dimension reduction for single-cell data and provides a quantitative comparison between t-SNE and UMAP within a unified latent space. The results demonstrate that UMAP applied to AE latent representations achieves superior visualization stability and computational efficiency, offering a more robust two-step dimension reduction strategy for exploratory analysis of high-dimensional single-cell data.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.