Hybrid detection of fuzzy duplicate texts: cosine similarity and transformers

Tetiana M. Zabolotnia
Nazarii V. Kozynets

Abstract

This paper addresses the challenge of detecting texts that share the same meaning but differ in wording and structure. Such “fuzzy duplicates” are increasingly prevalent in user-generated content, media articles, and academic materials. Traditional methods that apply cosine similarity to TF-IDF vectors are fast but often miss deeper semantic nuances, especially in languages with free word order and complex morphology (for example, Slavic languages such as Ukrainian or Bulgarian, and agglutinative languages such as Hungarian). Fully neural solutions (e.g., transformers) typically offer higher accuracy but can be slow and computationally demanding. To address these issues, we propose a hybrid approach that integrates a simplified neural component with classical cosine similarity. The workflow normalizes text variants (correcting spelling and inflectional forms), converts them into semantic vectors using a lightweight transformer model, and then applies a dynamic threshold mechanism tuned to text genre (e.g., news vs. social media). Experiments on Ukrainian-language datasets suggest that this method balances accuracy and speed more effectively than a fully neural pipeline. The novelty of the approach lies in combining domain-specific preprocessing with lightweight neural embeddings for fuzzy duplicate detection in text, achieving approximately ten to twelve percent higher detection accuracy than known solutions while running faster than a full BERT model. Preliminary tests in editorial and plagiarism-checking scenarios indicate that the system identifies paraphrased content more reliably than purely statistical methods, thereby reducing the burden of manual verification. Overall, the hybrid design offers a practical compromise between detection performance and computational requirements, which is especially beneficial for resource-constrained applications in morphologically rich languages such as Ukrainian and other Slavic languages. Future efforts will focus on extending morphological coverage to further improve reliability.
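
The workflow described in the abstract (normalize, embed, compare by cosine similarity against a genre-dependent threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold values, the function names, and the `toy_embed` character n-gram encoder are all assumptions introduced here for illustration. In a real system, `toy_embed` would be replaced by a lightweight transformer encoder and `normalize` by proper spelling and inflection normalization.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

# Hypothetical per-genre thresholds; the abstract does not state actual values.
GENRE_THRESHOLDS: Dict[str, float] = {"news": 0.80, "social": 0.65}
DEFAULT_THRESHOLD = 0.75


def normalize(text: str) -> str:
    """Lowercase and drop punctuation (a stand-in for spelling/inflection normalization)."""
    return " ".join(re.findall(r"\w+", text.lower()))


def toy_embed(text: str, n: int = 3) -> Counter:
    """Character n-gram counts (a placeholder for a lightweight transformer encoder)."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_fuzzy_duplicate(
    text_a: str,
    text_b: str,
    genre: str,
    embed: Callable[[str], Counter] = toy_embed,
) -> bool:
    """Flag a pair as a fuzzy duplicate if similarity meets the genre's threshold."""
    threshold = GENRE_THRESHOLDS.get(genre, DEFAULT_THRESHOLD)
    similarity = cosine(embed(normalize(text_a)), embed(normalize(text_b)))
    return similarity >= threshold
```

Passing the encoder as a parameter keeps the thresholding logic independent of the embedding model, so the toy encoder can be swapped for a neural one without changing the comparison step.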

Article Details

Section

Computer science and software engineering

Author Biographies

Tetiana M. Zabolotnia, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave., Kyiv, 03056, Ukraine

PhD, Associate Professor, Department of Computer Systems Software

Scopus Author ID: 6507406568

Nazarii V. Kozynets, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave., Kyiv, 03056, Ukraine

Master, Department of Computer Systems Software 
