Henry Hou
Abstract: Recent advancements in deep learning have significantly improved cell segmentation, with foundation models (FMs) emerging as a promising approach. Among these, CellViT, a model based on the Vision Transformer (ViT) architecture, has demonstrated strong generalization across diverse cell imaging modalities and datasets. However, despite its potential, CellViT's performance remains suboptimal on challenging segmentation tasks. In this study, the accuracy and robustness of CellViT are evaluated on the publicly available MoNuSeg dataset, a benchmark for cell segmentation. To enhance CellViT's performance, a fine-tuning framework is proposed that integrates both precisely annotated real cell images and synthetic images generated via a hierarchical diffusion model (DiffInfinite). The pipeline incorporates three key components: foundation model fine-tuning, human-in-the-loop feedback for synthetic image selection, and training on a combined dataset of real annotated and pseudo-labeled synthetic images. Experimental results demonstrate that the proposed method yields significant improvements in segmentation accuracy and generalization, particularly in domains where labeled data is scarce. This study underscores the potential of fine-tuning foundation models with synthetic data augmentation, providing a scalable approach for enhancing biomedical image analysis. The findings pave the way for more robust and precise segmentation models, with critical applications in disease diagnostics and biomedical research.
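The third pipeline component, training on a combined dataset of real annotated and pseudo-labeled synthetic images, can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the names `SegmentationDataset`, `fine_tune`, and `synth_weight` are hypothetical, the DiffInfinite generation and human-in-the-loop selection steps are assumed to have already produced the synthetic set, and CellViT's multi-head outputs are simplified to a single binary segmentation head.

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class SegmentationDataset(Dataset):
    """Pairs of (image, mask) tensors; `source` flags real (0) vs. synthetic (1)."""

    def __init__(self, images, masks, source):
        self.images, self.masks, self.source = images, masks, source

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.masks[idx], self.source


def fine_tune(model, real_ds, synthetic_ds, epochs=10, lr=1e-4, synth_weight=0.5):
    """Fine-tune a segmentation model on real plus pseudo-labeled synthetic data.

    Pseudo-labeled synthetic samples contribute with a reduced loss weight
    (`synth_weight`, an assumed hyperparameter), since their labels are
    model-generated and therefore noisier than manual annotations.
    """
    loader = DataLoader(ConcatDataset([real_ds, synthetic_ds]),
                        batch_size=8, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss(reduction="none")  # per-pixel loss

    model.train()
    for _ in range(epochs):
        for images, masks, source in loader:
            logits = model(images)                      # assumes (B, 1, H, W) output
            per_pixel = criterion(logits, masks)
            per_sample = per_pixel.mean(dim=(1, 2, 3))  # one loss value per image
            # Down-weight pseudo-labeled synthetic samples (source == 1).
            weights = torch.where(source == 1,
                                  torch.full_like(per_sample, synth_weight),
                                  torch.ones_like(per_sample))
            loss = (weights * per_sample).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

Down-weighting rather than excluding the synthetic samples is one plausible way to exploit the extra data while limiting the influence of imperfect pseudo-labels; the paper itself does not specify this weighting scheme.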
Keywords: Cell segmentation, Foundation models, Diffusion models, Synthetic data, Biomedical imaging
Date Published: December 5, 2025
DOI: 10.11159/jbeb.2025.014