[Paper] Windows malware classification using machine learning


kykyky 2024. 5. 4. 17:35

💡A Survey of Machine Learning Methods and Challenges for Windows Malware Classification

When machine learning is used for multi-class classification of malware,

the following are typically selected as features.

The analysis methods are as follows.

▶ N-gram
▶ Linear model
▶ Kernel method
▶ Decision tree
▶ Neural network
▶ Methods for sequences:
Hidden Markov model
byte similarity measure
CNN, RNN
Haar wavelet transform

💡Entropy analysis to classify unknown packing algorithms for malware detection

✅Characteristics of entropy

▶ Entropy indicates the state of the data (packed / unpacked / being unpacked)

▶ Initializing a memory region lowers its entropy,

whereas performing encryption or compression raises it
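
The entropy behaviour noted above can be checked with a minimal sketch. It assumes byte-level Shannon entropy (bits per byte), which these notes do not confirm as the paper's exact measure; `shannon_entropy`, `windowed_entropy`, and the 256-byte window are hypothetical names and values chosen for illustration, and random bytes are used only as a stand-in for encrypted or compressed data.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of -p * log2(p) over the byte values that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def windowed_entropy(data: bytes, window: int = 256) -> list:
    """Entropy of consecutive fixed-size chunks (window size is an assumption)."""
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data), window)]


zeroed = bytes(4096)            # freshly initialized (all-zero) memory
random_like = os.urandom(4096)  # stand-in for encrypted or compressed data

print(shannon_entropy(zeroed))       # 0.0
print(shannon_entropy(random_like))  # close to the 8.0 maximum
```

Tracking `windowed_entropy` over memory snapshots taken during execution would give the kind of entropy trace that can then be compared across packers.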

✅Two approaches to packer classification

▶ similarity classification (symbolic aggregate approximation (SAX))

: measure entropy ⇨ scale ⇨ convert to SAX ⇨ measure similarity between SAX patterns ⇨ group similar ones together ⇨ classify the packing algorithm

▶ incorporating common classification methods (Naive Bayes, SVM)
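
The SAX pipeline above can be sketched roughly as follows. This is an illustration under assumptions, not the paper's implementation: the alphabet size of 4, the segment count, and the position-wise similarity score are all hypothetical choices, and the entropy traces are made-up values.

```python
import statistics

# Breakpoints that split a standard normal distribution into 4 equal-probability
# regions (the usual SAX table for alphabet size 4).
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"


def sax(series, n_segments):
    """Convert a numeric series to a SAX word (len(series) must divide evenly)."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    # Scaling step: z-normalize (a flat series maps to all zeros).
    normalized = [0.0 if std == 0 else (x - mean) / std for x in series]
    seg_len = len(series) // n_segments
    word = []
    for i in range(n_segments):
        # Piecewise aggregate approximation: average each segment.
        avg = statistics.fmean(normalized[i * seg_len:(i + 1) * seg_len])
        # Symbol index = number of breakpoints the segment average exceeds.
        word.append(ALPHABET[sum(avg > b for b in BREAKPOINTS)])
    return "".join(word)


def word_similarity(a, b):
    """Fraction of matching positions; a simple stand-in similarity measure."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical entropy traces: high while packed, dropping after unpacking.
trace_1 = [7.9, 7.8, 7.9, 7.7, 3.1, 3.0, 2.9, 3.2]
trace_2 = [7.6, 7.7, 7.8, 7.5, 7.4, 3.3, 3.1, 3.0]

w1, w2 = sax(trace_1, 4), sax(trace_2, 4)
print(w1, w2, word_similarity(w1, w2))
```

Samples whose SAX words score as similar would be grouped together, and each group would then be associated with one packing algorithm.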

💡HYDRA: A multimodal deep learning framework for malware classification

✅ For malware detection & classification, combine the "hand-engineered" & "end-to-end" approaches

▶ "hand-engineered": feature engineering

⇨ learns relationship among API feature vectors

▶ "end-to-end": deep-learning

⇨ learns mnemonic / byte sequence

✅modality

▶ Assembly as a feature

API function call & system call, mnemonics ⇨ feature selection ⇨ multimodal deep learning ⇨ classification

※ mnemonics: the sequence of assembly instructions (mnemonic n-gram words taken from the mnemonic sequence in the assembly)

⇨ serves as a feature vector: each element of the vector = the number of times the corresponding mnemonic sequence appears
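
A minimal sketch of the mnemonic n-gram count vector described above. The function names, the sample mnemonic sequence, and the three-entry vocabulary are hypothetical; in practice the vocabulary would come from the feature selection step.

```python
from collections import Counter


def mnemonic_ngrams(mnemonics, n):
    """All contiguous n-grams of a mnemonic sequence, joined into 'words'."""
    return [" ".join(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]


def count_vector(mnemonics, vocabulary, n):
    """One element per vocabulary n-gram: how often it occurs in the sequence."""
    counts = Counter(mnemonic_ngrams(mnemonics, n))
    return [counts[gram] for gram in vocabulary]


# Hypothetical mnemonic sequence extracted from a disassembled sample.
sample = ["push", "mov", "sub", "mov", "sub", "call"]
# Hypothetical vocabulary of selected bigrams.
vocab = ["mov sub", "push mov", "xor xor"]

print(count_vector(sample, vocab, n=2))  # [2, 1, 0]
```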

▶ Hexadecimal sequence (machine code) as a feature

⇨ used as byte n-grams, entropy, and images ⇨ multimodal deep learning ⇨ classification

💡Using convolutional neural networks for classification of malware represented as images

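✅Advantages of using a grayscale image

captures minor changes & retains the global structure

: even when an attacker generates variants of a program, usually only a very small part is changed,

and the image representation still preserves the global structure well,

so those variants continue to be recognized as the same family and classification succeeds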

✅process

1 byte ⇨ 1 pixel (value = 0~255) ⇨ gray image constructed

⇨ visual similarity ⇨ same family
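
The byte-to-pixel step can be sketched as below. The fixed `width` and the zero padding of the last row are assumptions made for the example; the notes only state that one byte becomes one pixel with a value from 0 to 255.

```python
import math


def bytes_to_gray_image(data: bytes, width: int):
    """Lay bytes out row by row: 1 byte = 1 pixel with intensity 0..255."""
    height = math.ceil(len(data) / width)
    # Assumption: pad the final row with zeros (black) so every row is full.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Hypothetical 5-byte "executable" rendered at width 2.
image = bytes_to_gray_image(bytes([0, 255, 16, 32, 7]), width=2)
for row in image:
    print(row)
# [0, 255]
# [16, 32]
# [7, 0]
```

Because the layout depends only on byte order, a variant that changes a few bytes changes only a few pixels, and the rest of the image stays identical.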

✅CNN

▶ input: executable represented as gray image (w, h, d)

w: chosen for convenient viewing

h: determined by the file size

d = 1

▶ layers: detection filter for specific features or patterns

▶ output: expected malware category
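
To illustrate what "detection filter" means for a convolutional layer, here is a rough pure-Python sketch of a single filter sliding over a grayscale image. The kernel is hand-set for the example (a vertical-edge detector); in an actual CNN the kernel weights are learned during training, and the function names here are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses only, as a typical CNN activation does."""
    return [[max(0, v) for v in row] for row in feature_map]


# Hypothetical 3x4 grayscale patch with a dark-to-bright vertical boundary.
patch = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
# Hand-set kernel that responds where brightness increases left to right.
vertical_edge = [[-1, 1], [-1, 1]]

print(relu(conv2d_valid(patch, vertical_edge)))  # [[0, 510, 0], [0, 510, 0]]
```

The feature map is non-zero only at the column where the pattern occurs, which is the sense in which each filter detects a specific feature.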

💡Sequential Embedding-based Attentive (SEA) classifier for malware classification

malware detection using NLP

✅ process

▶ learning process

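sliding window ⇨ learn each opcode only from within the window

i) context: in what particular context was this opcode used?

Even the same opcode functions differently depending on the context in which it is used

ii) semantics

learn a vector representation that captures the meaning of a specific opcode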
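
The sliding-window idea can be sketched as generating (target, context) pairs, in the style of skip-gram word embeddings. That training style is an assumption borrowed from NLP; the notes do not say exactly how SEA trains its embeddings, and `context_pairs` and the sample opcodes are hypothetical.

```python
def context_pairs(opcodes, window):
    """Pair every opcode with its neighbours at most `window` positions away."""
    pairs = []
    for i, target in enumerate(opcodes):
        lo = max(0, i - window)
        hi = min(len(opcodes), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an opcode is not its own context
                pairs.append((target, opcodes[j]))
    return pairs


# Hypothetical opcode sequence from a disassembled function.
sequence = ["push", "mov", "call"]
print(context_pairs(sequence, window=1))
# [('push', 'mov'), ('mov', 'push'), ('mov', 'call'), ('call', 'mov')]
```

An embedding model trained on such pairs sees each opcode only together with its window neighbours, so opcodes used in similar contexts end up with similar vectors.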

▶ sequential blocks

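the learned representations (preserving both contextual & semantic information) are fed into sequential blocks (including an LSTM)

Attention is used at this stage

: when the ratio of malicious to benign code in the input is extreme,

much larger attention is assigned to the side with less data, which keeps the result from being skewed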
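
As a rough illustration of how attention can concentrate weight on a small part of the input, below is a generic dot-product attention pooling sketch over per-step hidden states. This is the general mechanism only: the notes do not describe how SEA computes its attention scores, and the `query` vector and hidden-state values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax: weights are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each time step by softmax(dot(state, query)) and sum the states."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical LSTM outputs: two unremarkable steps and one rare, salient step.
states = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0]]
weights, context = attention_pool(states, query=[1.0, 0.0])
print(weights)  # the third step receives most of the weight
print(context)
```

Even though the salient step is only one of three, it dominates the pooled vector, so the minority portion of the sequence is not averaged away.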
