Authors:
Yigitcan Kaya (UC Santa Barbara)
Yizheng Chen (Univ. of Maryland, College Park)
Marcus Botacin (Texas A&M University)
Shoumik Saha (Univ. of Maryland, College Park)
Fabio Pierazzi (University College London)
Lorenzo Cavallaro (University College London)
David Wagner (UC Berkeley)
Tudor Dumitraș (Univ. of Maryland, College Park)
Training data | Test data | Method | TPR @ 1% FPR |
---|---|---|---|
Sandbox traces | Sandbox traces | Standard classifier | ~95% (Prior Work) |
Sandbox traces | Endpoint traces | Standard classifier | ~17% (Our Work) |
Endpoint traces | Endpoint traces | Standard classifier | ~49% (Our Work) |
Sandbox traces | Endpoint traces | Training set resampling + invariant learning | ~22% (Our Work) |
Endpoint traces | Endpoint traces | Soft labels + invariant learning | ~52% (Our Work) |
Machine learning (ML) has driven exciting advances in behavioral malware detection. However, ML models
have traditionally been trained and evaluated on sandbox execution traces, so their true performance
in the wild is unknown. In deployment, models must make decisions based on traces collected
from real endpoint hosts, not sandboxes.
Our work shows that prior sandbox-trained models suffer a massive performance loss when tested on
real-world endpoint traces: they can achieve over 95% true-positive rate (TPR) on sandbox
traces, but this drops to 17% on endpoint traces (see the table above). We attribute this gap to two
distinct factors. First, in real-world endpoint security solutions, easy-to-classify samples are handled
effectively by static methods (such as blocklists or signatures), and dynamic (behavioral) methods are applied to
the rest. We find that disregarding this filtering effect and evaluating behavior-based methods on a broad
distribution of samples (including the ones static methods can handle) overestimates the true-positive rate by
over 30%. Second, sandbox traces include features (some of them spurious) that stem from how sandboxes are
configured. Many of the features that sandbox-based models learn to rely on are missing from endpoint traces,
causing a drop in performance.
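The TPR @ 1% FPR metric used throughout this page can be computed by thresholding classifier scores at the point where at most 1% of benign samples are flagged. A minimal sketch (the function name and the quantile-based thresholding are our illustration, not the evaluation pipeline's exact code):

```python
import numpy as np

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """TPR at the score threshold where the false-positive rate on
    benign samples (label 0) is at most target_fpr (here, 1%)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    benign = scores[labels == 0]
    # Threshold at the (1 - target_fpr) quantile of benign scores,
    # so only ~1% of benign samples score strictly above it.
    thr = np.quantile(benign, 1.0 - target_fpr)
    malicious = scores[labels == 1]
    # Fraction of malicious samples detected at that threshold.
    return float(np.mean(malicious > thr))
```

Reporting TPR at a fixed low FPR (rather than accuracy or raw AUC) reflects deployment constraints: endpoint products can tolerate only a tiny rate of false alarms on benign software.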
To improve detection performance on endpoint traces, we explore multiple ML techniques, including techniques
for mitigating label noise and spurious features. These yield moderate improvements: from a 17%
true-positive rate to 22%, still far below the sandbox-based performance of behavioral detectors.
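One family of techniques referenced here is environment-invariant learning. As an illustration of the general idea only (a V-REx-style variance-of-risks penalty, not necessarily our exact loss), the objective adds the variance of per-environment risks to the average risk, discouraging the model from relying on features that are predictive in only one environment (e.g., only in the sandbox):

```python
import numpy as np

def invariance_loss(per_env_losses, beta=10.0):
    """Aggregate per-environment risks with a variance penalty.

    per_env_losses: one scalar risk per training environment
    (e.g., sandbox traces vs. endpoint traces). Penalizing the
    variance across environments pushes the model toward features
    whose predictive power is stable across execution environments.
    """
    risks = np.asarray(per_env_losses, dtype=float)
    return risks.mean() + beta * risks.var()
```

A model that fits the sandbox environment much better than the endpoint environment incurs a large variance term, so the optimizer trades some sandbox accuracy for cross-environment stability.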
Our results are a call to action for the community. Behavioral malware detection with ML is not a solved
problem. ML methods perform far worse on malware in the wild, and there are major unsolved challenges
and significant room for improvement.
This page is set up to stimulate progress on this problem. Upon request, we make our sandbox dataset and all our
metadata available to the security community. We also offer a pipeline that allows researchers to evaluate their
behavioral malware detectors against our real-world endpoint data. We will evaluate submitted detectors
based on their performance on endpoint traces and rank them in the leaderboard below.
Please scroll down for instructions on accessing the dataset and submitting your detectors to our
leaderboard.
| Submitting Team | Submit Date | Endpoint TPR@1 | Endpoint AUC | Sandbox-1 TPR@1 | Sandbox-1 AUC | Sandbox-2 TPR@1 | Sandbox-2 AUC | Feat. Runtime (s) | Inf. Runtime (s) | Details |
|---|---|---|---|---|---|---|---|---|---|---|
| Kaya et al. | 2025-03-05 | 16.7% | 75.9% | 93.2% | 98.8% | 63.5% | 90.6% | 570 | 50 | The N-Gram-based ResNet model trained on sandbox-1 traces and tuned on the endpoint traces. |
| Kaya et al. | 2025-03-05 | 21.6% | 77.3% | 69.0% | 97.2% | 57.6% | 92.5% | 570 | 50 | The N-Gram-based ResNet model trained on sandbox-1 traces with a resampled training set and environment invariance loss. |
| Kaya et al. | 2025-03-05 | 11.2% | 76.6% | 95.0% | 99.0% | 60.7% | 93.5% | 570 | 50 | The N-Gram-based ResNet model trained on sandbox-1 traces and tuned on the sandbox-1 traces. |
| Kaya et al. | 2025-03-05 | 13.9% | 76.2% | 94.0% | 99.0% | 70.4% | 91.9% | 570 | 50 | The N-Gram-based ResNet model trained on sandbox-1 traces and tuned on the sandbox-2 traces. |
| Kaya et al. | 2025-03-05 | 51.8% | 87.2% | 10.0% | 75.1% | 19.6% | 72.7% | 570 | 50 | The N-Gram-based ResNet model directly trained on the real-world endpoint traces with soft labels and environment invariance loss. |
| Kaya et al. | 2025-03-05 | 49.5% | 86.4% | 9.0% | 77.6% | 20.9% | 74.5% | 570 | 50 | The N-Gram-based ResNet model directly trained on the real-world endpoint traces. |
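The baseline models on the leaderboard consume n-gram counts over behavioral traces. A minimal sketch of such a featurizer (a hypothetical helper for illustration; the actual feature pipeline in our repository may differ):

```python
from collections import Counter
from itertools import islice

def ngram_features(trace, n=2, vocab=None):
    """Count sliding n-grams of API/system-call names in a behavioral trace.

    trace: a sequence of call names, e.g. ["NtOpenFile", "NtReadFile", ...].
    Returns a Counter of n-gram tuples, or, if a fixed vocabulary is given,
    a count vector in that vocabulary's order (suitable as model input).
    """
    # Zip n staggered views of the trace to form sliding n-grams.
    grams = Counter(zip(*(islice(trace, i, None) for i in range(n))))
    if vocab is None:
        return grams
    return [grams[g] for g in vocab]
```

For example, the trace `["NtOpenFile", "NtReadFile", "NtReadFile", "NtClose"]` yields one count each for the bigrams `(NtOpenFile, NtReadFile)`, `(NtReadFile, NtReadFile)`, and `(NtReadFile, NtClose)`.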
To learn about our data and request access, please visit our GitHub repository.
To learn how to make a submission and to view an example submission in the expected format, please visit our GitHub repository.
@inproceedings{kaya2025ml,
  title={ML-Based Behavioral Malware Detection Is Far From a Solved Problem},
  author={Kaya, Yigitcan and Chen, Yizheng and Botacin, Marcus and Saha, Shoumik and Pierazzi, Fabio and Cavallaro, Lorenzo and Wagner, David and Dumitras, Tudor},
  booktitle={Proceedings of the 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)},
  year={2025},
  organization={IEEE}
}