Semantic analysis and classification of malware for UNIX-like operating systems with the use of machine learning methods
DOI:
https://doi.org/10.15276/aait.05.2022.25
Keywords:
Malware detection, machine learning, semantic analysis, multiclass classification, text mining, operating system
Abstract
The paper focuses on malware classification based on semantic analysis of opcodes from the sections of disassembled binaries, using
n-grams, the TF-IDF indicator, and machine learning algorithms. The purpose of the research is to improve and extend the variety of methods
for identifying malware developed for UNIX-like operating systems. The task of the research is to create an algorithm that can identify
the types of threats in malicious binary files using n-grams, the TF-IDF indicator, and machine learning algorithms. Malware classification
process can be based on either static or dynamic signatures. Static signatures can be represented as byte-code sequences, binary-assembled
instructions, or imported libraries. Dynamic signatures can be represented as the sequence of actions performed by malware. We
use a static-signature strategy for semantic analysis and classification of malware. In this paper, we work with binary ELF files,
the most common executable file type for UNIX-like operating systems. For this research, we gathered 2999
malware ELF files, using data from the VirusShare and VirusTotal sites, and 959 non-malware program files from the /usr/bin directory of a Linux
operating system. Each malware file represents one of three malware families: Gafgyt, Mirai, and Lightaidra, which are popular and harmful
threats to UNIX systems. Each ELF file in the dataset was labelled according to its type. The proposed classification algorithm consists of
several preparation steps: disassembling every ELF binary file in the dataset, then semantically processing and vectorizing the assembly
instructions in each file section. To set the classification threshold, the Multinomial Naive Bayes model is used. Using the
classification threshold, we determine the n-gram size and the file section that give the best classification results. To
obtain the best score, multiple machine learning models are used together with hyperparameter optimization. Mean accuracy and
weighted F1 score are used as metrics of the designed algorithm's performance. Based on the obtained experimental results, an SVM model
trained with stochastic gradient descent was selected as the best-performing ML model. The developed algorithm was experimentally shown
to be effective for classifying malware for UNIX-like operating systems. The results were analyzed and used to draw conclusions and
make suggestions for future work.
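
The vectorization step described above can be illustrated with a minimal sketch of TF-IDF weighting over opcode n-grams. It is an illustration under stated assumptions, not the implementation used in the research: the function names `opcode_ngrams` and `tfidf_vectors`, the sample opcode sequences, and the particular TF-IDF variant (relative term frequency multiplied by ln(N/df), without smoothing or normalization) are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return the contiguous opcode n-grams, each joined by a space."""
    return [" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def tfidf_vectors(documents, n):
    """Compute TF-IDF weights for the opcode n-grams of each document.

    Assumed variant (the paper may use a different one):
      tf  = count of the n-gram in the document / number of n-grams in it
      idf = ln(N / df), df = number of documents containing the n-gram
    """
    grams_per_doc = [opcode_ngrams(doc, n) for doc in documents]

    # Document frequency: count each n-gram once per document.
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(set(grams))

    n_docs = len(documents)
    vectors = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = len(grams)
        vectors.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in counts.items()
        })
    return vectors


# Hypothetical opcode sequences standing in for disassembled file sections.
samples = [
    ["push", "mov", "call", "mov", "call", "ret"],
    ["push", "mov", "xor", "jmp", "ret"],
    ["mov", "call", "xor", "xor", "ret"],
]
vectors = tfidf_vectors(samples, n=2)
```

With this variant, an n-gram that occurs in every document receives weight zero, so only n-grams that distinguish some files from others contribute to the feature vectors passed to a classifier.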