Malware Detection In the Wild

ML-Based Behavioral Malware Detection Is Far From a Solved Problem

Accepted to SaTML 2025

• Dates: April 9-11, 2025 • Location: Copenhagen, Denmark

Authors:

Yigitcan Kaya (UC Santa Barbara)
Yizheng Chen (Univ. of Maryland, College Park)
Marcus Botacin (Texas A&M University)
Shoumik Saha (Univ. of Maryland, College Park)
Fabio Pierazzi (University College London)
Lorenzo Cavallaro (University College London)
David Wagner (UC Berkeley)
Tudor Dumitraș (Univ. of Maryland, College Park)

Performance of ML-based behavioral malware detection under different settings

| Training data   | Test data       | Method                                       | TPR @ 1% FPR      |
| Sandbox traces  | Sandbox traces  | Standard classifier                          | ~95% (Prior Work) |
| Sandbox traces  | Endpoint traces | Standard classifier                          | ~17% (Our Work)   |
| Endpoint traces | Endpoint traces | Standard classifier                          | ~49% (Our Work)   |
| Sandbox traces  | Endpoint traces | Training-set resampling + invariant learning | ~22% (Our Work)   |
| Endpoint traces | Endpoint traces | Soft labels + invariant learning             | ~52% (Our Work)   |

Machine learning (ML) has driven thrilling advances in behavioral malware detection. However, ML models have traditionally been trained and evaluated on sandbox execution traces, so their true performance in the wild is unknown. In real deployments, models must make decisions from traces collected on real endpoint hosts, not in sandboxes.

Our work shows that prior sandbox-based models suffer a massive performance loss when tested on real-world endpoint traces: they can achieve over 95% true-positive rate (TPR) on sandbox traces, but this drops to 17% on endpoint traces (see the table above). We attribute this gap to two distinct factors. First, in real-world endpoint security solutions, easy-to-classify samples are handled effectively by static methods (such as blocklists or signatures), and dynamic (behavioral) methods are applied only to the rest. We find that disregarding this filtering effect and evaluating behavior-based methods on a broad distribution of samples (including the ones static methods can handle) overestimates the true-positive rate by over 30%. Second, sandbox traces include features (some of which are spurious) that stem from how sandboxes are configured. Most of the features sandbox-based models learn to rely on are missing from endpoint traces, causing a further drop in performance.
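The TPR @ 1% FPR metric reported throughout can be computed from raw detector scores as sketched below. This is a minimal NumPy illustration of the metric's definition; the function name and implementation are ours, not part of the released evaluation pipeline:

```python
import numpy as np

def tpr_at_fpr(scores_benign, scores_malicious, target_fpr=0.01):
    """True-positive rate at a fixed false-positive budget.

    The detection threshold is set so that at most `target_fpr` of
    benign samples are (wrongly) flagged; the TPR is then the share
    of malicious samples whose score clears that threshold.
    """
    scores_benign = np.asarray(scores_benign, dtype=float)
    # Threshold = the (1 - target_fpr) quantile of the benign scores.
    threshold = np.quantile(scores_benign, 1.0 - target_fpr)
    return float(np.mean(np.asarray(scores_malicious) > threshold))
```

For example, a detector that assigns high scores to 90% of malware while staying within the 1% benign false-alarm budget would report a TPR@1 of 90%.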

To improve detection performance on endpoint traces, we explore multiple ML techniques, including techniques that mitigate label noise and spurious features. These techniques yield moderate improvements: from a 17% true-positive rate to 22%, still far below the sandbox-based performance of behavioral detectors.
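As one illustration of the label-noise techniques mentioned above, "soft labels" replace hard 0/1 targets with graded confidence values during training. The sketch below is a hypothetical plain-NumPy example; the label source (e.g., scanner agreement), the logistic-regression model, and the hyperparameters are our assumptions for illustration, not the paper's exact setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_soft_label_logreg(X, y_soft, lr=0.5, epochs=2000):
    """Logistic regression trained against soft (probabilistic) targets.

    `y_soft` holds values in [0, 1] -- e.g., a hypothetical confidence
    signal such as the fraction of scanners flagging each sample --
    instead of hard 0/1 malware labels.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad = p - y_soft                  # d(cross-entropy)/d(logit)
        w -= lr * (X.T @ grad) / len(X)
        b -= lr * grad.mean()
    return w, b
```

Because the targets never reach exactly 0 or 1, the model is discouraged from becoming overconfident on samples whose labels may be noisy.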

Our results are a call to action for the community. Behavioral malware detection with ML is not a solved problem. ML methods perform far worse on malware in the wild, and there are major unsolved challenges and significant room for improvement.

This page is set up to stimulate progress on this problem. Upon request, we make our sandbox dataset and all our metadata available to the security community. We also offer a pipeline that allows researchers to evaluate their behavioral malware detectors against our real-world endpoint data. We will evaluate submitted detectors based on their performance on endpoint traces and rank them in the leaderboard below.

Please scroll down for instructions on accessing the dataset and submitting your detectors to our leaderboard.

Leaderboard

| Submitting Team | Submit Date | Endpoint TPR@1 | Endpoint AUC | Sandbox-1 TPR@1 | Sandbox-1 AUC | Sandbox-2 TPR@1 | Sandbox-2 AUC | Feat. Runtime (s) | Inf. Runtime (s) |
| Kaya et al.     | 2025-03-05  | 16.7%          | 75.9%        | 93.2%           | 98.8%         | 63.5%           | 90.6%         | 570               | 50               |
| Kaya et al.     | 2025-03-05  | 21.6%          | 77.3%        | 69.0%           | 97.2%         | 57.6%           | 92.5%         | 570               | 50               |
| Kaya et al.     | 2025-03-05  | 11.2%          | 76.6%        | 95.0%           | 99.0%         | 60.7%           | 93.5%         | 570               | 50               |
| Kaya et al.     | 2025-03-05  | 13.9%          | 76.2%        | 94.0%           | 99.0%         | 70.4%           | 91.9%         | 570               | 50               |
| Kaya et al.     | 2025-03-05  | 51.8%          | 87.2%        | 10.0%           | 75.1%         | 19.6%           | 72.7%         | 570               | 50               |
| Kaya et al.     | 2025-03-05  | 49.5%          | 86.4%        | 9.0%            | 77.6%         | 20.9%           | 74.5%         | 570               | 50               |

Request Access

To learn about our data and request access, please visit our GitHub repository.

Submission Instructions


To learn how to make a submission, and to view an example submission in the expected format, please visit our GitHub repository.

Relevant Research


Citation


@inproceedings{kaya2025ml,
  title={ML-Based Behavioral Malware Detection Is Far From a Solved Problem},
  author={Kaya, Yigitcan and Chen, Yizheng and Botacin, Marcus and Saha, Shoumik and Pierazzi, Fabio and Cavallaro, Lorenzo and Wagner, David and Dumitras, Tudor},
  booktitle={Proceedings of the 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)},
  year={2025},
  organization={IEEE}
}

Contact


Yigitcan Kaya
UC Santa Barbara
yigitcan at ucsb dot edu