Vision-based GUI testing: a systematic review of computer vision advances in software quality assurance

Maksym A. Chaikovskyi
Eugene V. Malakhov

Abstract

Ensuring the reliability and adaptability of graphical user interfaces (GUIs) has become a central challenge in contemporary software quality assurance, as modern applications increasingly rely on visually dense, dynamic, and cross-platform interaction patterns. Traditional script-based and DOM-dependent testing approaches are brittle and expensive to maintain, since even minor layout shifts can invalidate large portions of test logic. These limitations have accelerated interest in computer vision and artificial intelligence techniques that evaluate GUIs directly from rendered visuals, enabling more robust, flexible, and human-like validation. This study aims to systematize current research on vision-based GUI testing and to identify how emerging multimodal and language-guided methods are reshaping quality-assurance practices. Conducted as a structured review of academic and industrial work published between 2010 and 2025, the analysis synthesizes findings across five methodological families: classical image-processing and template-matching approaches, deep neural detectors, generative models for synthetic augmentation, reinforcement-learning agents for autonomous exploration, and transformer-based multimodal systems integrating large language models (LLMs). The review also includes an empirical-synthesis chapter that consolidates reported evaluation results for vision-transformer-based perception, LLM-guided test reasoning, and multimodal vision-language pipelines. The results reveal a consistent technological trajectory: from early perceptual matching to high-fidelity object detection, interactive workflow exploration, and, most recently, semantic interpretation of GUI tasks through natural-language grounding. Experimental evidence shows that vision transformers improve structural perception, reinforcement-learning agents enhance interaction coverage, and multimodal systems introduce reasoning capabilities that approach human-like test generation. At the same time, the review identifies persistent challenges, including dataset scarcity (particularly of multimodal screenshot-instruction-execution corpora), fragmented evaluation practices, cross-platform variability, and the computational overhead of multimodal inference. The study concludes that vision-based GUI testing is evolving into a cognitively informed quality-assurance paradigm that integrates perception, behavior, and semantic reasoning. Its novelty lies in providing a unified taxonomy of methods, a consolidated synthesis of empirical verification results, and a set of actionable future research directions. These contributions offer practical value for advancing reproducibility, robustness, and methodological coherence in next-generation GUI-testing systems.
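To make the earliest methodological family in the taxonomy concrete, the sketch below illustrates classical template matching for GUI element location. It uses OpenCV as a common, illustrative choice (the review does not prescribe a specific library); the file paths and the confidence threshold are assumptions for the example, not values from the study.

```python
# Minimal sketch of classical vision-based GUI element location via
# template matching (illustrative; not the authors' implementation).
import cv2

SCREENSHOT = "screenshot.png"   # assumed path to a rendered GUI capture
TEMPLATE = "submit_button.png"  # assumed path to the target widget image
THRESHOLD = 0.8                 # assumed confidence cutoff

screen = cv2.imread(SCREENSHOT, cv2.IMREAD_GRAYSCALE)
widget = cv2.imread(TEMPLATE, cv2.IMREAD_GRAYSCALE)

# Normalized cross-correlation: each entry in `scores` measures how well
# the template aligns with the screenshot at that top-left position.
scores = cv2.matchTemplate(screen, widget, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_loc = cv2.minMaxLoc(scores)

if best_score >= THRESHOLD:
    h, w = widget.shape
    center = (best_loc[0] + w // 2, best_loc[1] + h // 2)
    print(f"Element found at {center} (score {best_score:.2f})")
else:
    # The brittleness the abstract describes: minor restyling or layout
    # shifts push the score below the threshold and the check fails.
    print(f"Element not found (best score {best_score:.2f})")
```

The failure branch shows why the review characterizes this family as perceptual rather than semantic: the match is purely pixel-level, which the later neural, reinforcement-learning, and multimodal families were developed to overcome.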

Article Details

Section

Computer science and software engineering

Author Biographies

Maksym A. Chaikovskyi, Odesa I. I. Mechnikov National University, 2, Dvoryanska Str., Odesa, 65082, Ukraine

Postgraduate Student, Department of Mathematical Support of Computer Systems

Eugene V. Malakhov, Odesa I. I. Mechnikov National University, 2, Dvoryanska Str., Odesa, 65082, Ukraine

Doctor of Engineering Sciences, Professor, Head of the Department of Mathematical Support of Computer Systems

Scopus Author ID: 56905389000
