Hybrid detection of fuzzy duplicate texts: cosine similarity and transformers

Authors

  • Tetiana M. Zabolotnia, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave., Kyiv, 03056, Ukraine
  • Nazarii V. Kozynets, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave., Kyiv, 03056, Ukraine

DOI:

https://doi.org/10.15276/aait.08.2025.4

Keywords:

hybrid methods, fuzzy duplicates, cosine similarity, transformer models, Ukrainian-language texts, content moderation systems

Abstract

This paper addresses the challenge of detecting texts that share the same meaning but differ in wording and structure. Such “fuzzy duplicates” are increasingly prevalent in user-generated content, media articles, and academic materials. Traditional TF-IDF-based methods with cosine similarity process data swiftly but often overlook deeper semantic nuances, especially in languages with free word order and complex morphology (for example, Slavic languages such as Ukrainian or Bulgarian, and agglutinative languages like Hungarian). Fully neural solutions (e.g., transformers) typically offer higher accuracy yet can be slow and computationally demanding. To tackle these issues, we propose a hybrid approach that integrates a simplified neural component with classical cosine similarity. The workflow normalizes text variants (correcting spelling and inflectional forms), converts them into semantic vectors using a lightweight transformer model, and then applies a dynamic threshold mechanism tuned to text genre (e.g., news vs. social media). Experiments on Ukrainian-language datasets suggest that this method balances accuracy and speed more effectively than a fully neural pipeline. The approach is novel in combining domain-specific preprocessing and lightweight neural embeddings for fuzzy duplicate detection in text, achieving approximately ten to twelve percent higher detection accuracy than known solutions while maintaining faster runtime than a full BERT model. Preliminary tests in editorial and plagiarism-checking scenarios indicate that the system more reliably identifies paraphrased content than purely statistical methods, thereby reducing the burden of manual verification. Overall, the hybrid design offers a practical compromise between detection performance and computational requirements, which is especially beneficial for resource-constrained applications in morphologically rich languages like Ukrainian or other Slavic languages. 
Future efforts will focus on extending morphological coverage to further improve reliability.
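
The cosine-similarity and genre-tuned threshold steps described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. To stay self-contained it builds smoothed TF-IDF vectors where the hybrid method would use embeddings from a lightweight transformer, the `normalize` function is only a stand-in for spelling and inflection normalization, and the values in `GENRE_THRESHOLDS` are hypothetical, since the abstract does not report the thresholds used.

```python
import math
from collections import Counter

# Hypothetical per-genre thresholds; the abstract does not state actual values.
GENRE_THRESHOLDS = {"news": 0.80, "social": 0.65}


def normalize(text):
    """Lowercase and keep alphabetic tokens only.

    Stand-in for the spelling and inflection normalization step;
    real morphological normalization is out of scope here.
    """
    tokens = []
    for raw in text.lower().split():
        token = "".join(ch for ch in raw if ch.isalpha())
        if token:
            tokens.append(token)
    return tokens


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumption: these vectors replace the transformer embeddings
    so that the sketch needs no external libraries.
    """
    tokenized = [normalize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed IDF keeps weights positive even for shared terms.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_fuzzy_duplicate(text_a, text_b, genre):
    """Flag a pair as a fuzzy duplicate if similarity meets the genre threshold."""
    vec_a, vec_b = tfidf_vectors([text_a, text_b])
    return cosine(vec_a, vec_b) >= GENRE_THRESHOLDS[genre]
```

A call such as `is_fuzzy_duplicate(text_a, text_b, "social")` applies the looser social-media threshold, while `"news"` applies the stricter one. Swapping `tfidf_vectors` for a function that returns transformer embeddings would leave `cosine` and the threshold lookup unchanged, provided the vectors are adapted to the same sparse-dict form or `cosine` is rewritten for dense arrays.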

Author Biographies

Tetiana M. Zabolotnia, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave., Kyiv, 03056, Ukraine

PhD, Associate Professor, Department of Computer Systems Software

Scopus Author ID: 6507406568

Nazarii V. Kozynets, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, Beresteiskyi Ave., Kyiv, 03056, Ukraine

Master, Department of Computer Systems Software  

Published

2025-04-04

How to Cite

[1]
Zabolotnia T.M., Kozynets N.V. “Hybrid detection of fuzzy duplicate texts: cosine similarity and transformers”. Applied Aspects of Information Technology. 2025; Vol. 8, No. 1: 48–61. DOI: https://doi.org/10.15276/aait.08.2025.4.