Hybrid detection of fuzzy duplicate texts: cosine similarity and transformers

Tetiana M. Zabolotnia; Nazarii V. Kozynets

doi:10.15276/aait.08.2025.4

Published:
2025-04-04

DOI: https://doi.org/10.15276/aait.08.2025.4

Keywords:

hybrid methods, fuzzy duplicates, cosine similarity, transformer models, Ukrainian-language texts, content moderation systems

How to cite

Zabolotnia T. M.; Kozynets N. V. "Hybrid Detection of Fuzzy Duplicate Texts: Cosine Similarity and Transformers". Publ. Nauka i Tekhnika. Odesa: Ukraine. AAIT 8 (1), 48–61. https://doi.org/10.15276/aait.08.2025.4.

This article was updated to correct the Conflict of Interest statement
10.02.2026

Tetiana M. Zabolotnia

National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave. Kyiv, 03056, Ukraine

https://orcid.org/0000-0001-8570-7571

Nazarii V. Kozynets

National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave. Kyiv, 03056, Ukraine

https://orcid.org/0009-0009-1316-8340

Abstract

This paper addresses the challenge of detecting texts that share the same meaning but differ in wording and structure. Such “fuzzy duplicates” are increasingly prevalent in user-generated content, media articles, and academic materials. Traditional TF-IDF-based methods with cosine similarity process data swiftly but often overlook deeper semantic nuances, especially in languages with free word order and complex morphology (for example, Slavic languages such as Ukrainian or Bulgarian, and agglutinative languages such as Hungarian). Fully neural solutions (e.g., transformers) typically offer higher accuracy yet can be slow and computationally demanding. To tackle these issues, we propose a hybrid approach that integrates a simplified neural component with classical cosine similarity. The workflow normalizes text variants (correcting spelling and inflectional forms), converts them into semantic vectors using a lightweight transformer model, and then applies a dynamic threshold mechanism tuned to text genre (e.g., news vs. social media). Experiments on Ukrainian-language datasets suggest that this method balances accuracy and speed more effectively than a fully neural pipeline. The approach is novel in combining domain-specific preprocessing and lightweight neural embeddings for fuzzy duplicate detection in text, achieving approximately ten to twelve percent higher detection accuracy than known solutions while maintaining faster runtime than a full BERT model. Preliminary tests in editorial and plagiarism-checking scenarios indicate that the system identifies paraphrased content more reliably than purely statistical methods, thereby reducing the burden of manual verification. Overall, the hybrid design offers a practical compromise between detection performance and computational requirements, which is especially beneficial for resource-constrained applications in morphologically rich languages such as Ukrainian and other Slavic languages. Future efforts will focus on extending morphological coverage to further improve reliability.
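
The workflow outlined in the abstract (normalize, vectorize, compare by cosine similarity, apply a genre-dependent threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The threshold values, the function names, and the genre labels are assumptions made for illustration only. The `bag_of_words` embedder is a toy stand-in based on term counts, where the paper uses a lightweight transformer model, and `normalize` only lowercases and tokenizes, without the spelling and inflection correction the abstract describes.

```python
import math
import re
from collections import Counter

# Hypothetical per-genre thresholds; the abstract does not state actual values.
GENRE_THRESHOLDS = {"news": 0.80, "social": 0.65}
DEFAULT_THRESHOLD = 0.75


def normalize(text):
    """Stand-in for the normalization step: lowercase and tokenize only."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(text):
    """Toy embedder (term counts); a transformer encoder would replace this."""
    return Counter(normalize(text))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_u * norm_v)


def is_fuzzy_duplicate(a, b, genre, embed=bag_of_words):
    """Return (decision, score), using the threshold chosen for the genre."""
    threshold = GENRE_THRESHOLDS.get(genre, DEFAULT_THRESHOLD)
    score = cosine(embed(a), embed(b))
    return score >= threshold, score
```

With these assumed thresholds, the same pair of texts can be accepted as a duplicate in one genre and rejected in another, which is the point of a genre-tuned threshold. Any real embedder can be passed through the `embed` parameter as long as it returns a mapping from dimension to weight.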

Issue

Vol. 8 No. 1 (2025): Applied Aspects of Information Technology

Section

Computer science and software engineering

Author Biographies

Tetiana M. Zabolotnia, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave. Kyiv, 03056, Ukraine

PhD, Associate Professor, Department of Computer Systems Software

Scopus Author ID: 6507406568

Nazarii V. Kozynets, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave. Kyiv, 03056, Ukraine

Master, Department of Computer Systems Software
