Authors:
Yigitcan Kaya (UC Santa Barbara)
Yizheng Chen (Univ. of Maryland, College Park)
Marcus Botacin (Texas A&M University)
Shoumik Saha (Univ. of Maryland, College Park)
Fabio Pierazzi (University College London)
Lorenzo Cavallaro (University College London)
David Wagner (UC Berkeley)
Tudor Dumitraș (Univ. of Maryland, College Park)
Training data | Test data | Method | TPR @ 1% FPR |
---|---|---|---|
Sandbox traces | Sandbox traces | Standard classifier | ~95% (Prior Work) |
Sandbox traces | Endpoint traces | Standard classifier | ~17% (Our Work) |
Endpoint traces | Endpoint traces | Standard classifier | ~49% (Our Work) |
Sandbox traces | Endpoint traces | Training set resampling + invariant learning | ~22% (Our Work) |
Endpoint traces | Endpoint traces | Soft labels + invariant learning | ~52% (Our Work) |
Machine learning (ML) has driven exciting advances in behavioral malware detection. However, ML models
have traditionally been trained and evaluated on sandbox execution traces, so their true performance
in the wild is unknown. In deployment, models must make decisions based on traces collected
from real endpoint hosts, not sandboxes.
Our work shows that prior sandbox-trained models suffer a massive performance loss when tested on
real-world endpoint traces: they can achieve over 95% true-positive rate (TPR) on sandbox
traces, but this drops to 17% on endpoint traces (see the table above). We attribute this gap to two
distinct factors. First, in real-world endpoint security solutions, easy-to-classify samples are handled
effectively by static methods (such as blocklists or signatures), and dynamic (behavioral) methods are applied to
the rest. We find that disregarding this filtering effect and evaluating behavior-based methods on a broad
distribution of samples (including the ones static methods can handle) overestimates the true-positive rate by
over 30%. Second, sandbox traces include features (some of them spurious) that stem from how sandboxes are
configured. Many of the features that sandbox-based models learn to rely on are missing from endpoint traces,
causing a drop in performance.
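The TPR @ 1% FPR metric used throughout this page can be computed by thresholding classifier scores at the point where at most 1% of benign samples are flagged. A minimal sketch (the function name and the quantile-based thresholding are our illustration, not the evaluation pipeline's exact code):

```python
import numpy as np

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """TPR at the score threshold where the false-positive rate on
    benign samples (label 0) is at most target_fpr (here, 1%)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    benign = scores[labels == 0]
    # Threshold at the (1 - target_fpr) quantile of benign scores,
    # so only ~1% of benign samples score strictly above it.
    thr = np.quantile(benign, 1.0 - target_fpr)
    malicious = scores[labels == 1]
    # Fraction of malicious samples detected at that threshold.
    return float(np.mean(malicious > thr))
```

Reporting TPR at a fixed low FPR (rather than accuracy or raw AUC) reflects deployment constraints: endpoint products can tolerate only a tiny rate of false alarms on benign software.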
To improve detection performance on endpoint traces, we explore multiple ML techniques, including techniques
for mitigating label noise and spurious features. These yield moderate improvements: from a 17%
true-positive rate to 22%, still far below the sandbox-based performance of behavioral detectors.
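One family of techniques referenced here is environment-invariant learning. As an illustration of the general idea only (a V-REx-style variance-of-risks penalty, not necessarily our exact loss), the objective adds the variance of per-environment risks to the average risk, discouraging the model from relying on features that are predictive in only one environment (e.g., only in the sandbox):

```python
import numpy as np

def invariance_loss(per_env_losses, beta=10.0):
    """Aggregate per-environment risks with a variance penalty.

    per_env_losses: one scalar risk per training environment
    (e.g., sandbox traces vs. endpoint traces). Penalizing the
    variance across environments pushes the model toward features
    whose predictive power is stable across execution environments.
    """
    risks = np.asarray(per_env_losses, dtype=float)
    return risks.mean() + beta * risks.var()
```

A model that fits the sandbox environment much better than the endpoint environment incurs a large variance term, so the optimizer trades some sandbox accuracy for cross-environment stability.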
Our results are a call to action for the community. Behavioral malware detection with ML is not a solved
problem. ML methods perform far worse on malware in the wild, and there are major unsolved challenges
and significant room for improvement.
This page is set up to stimulate progress on this problem. Upon request, we make our sandbox dataset and all our
metadata available to the security community. We also offer a pipeline that allows researchers to evaluate their
behavioral malware detectors against our real-world endpoint data. We will evaluate submitted detectors
based on their performance on endpoint traces and rank them in the leaderboard below.
Please scroll down for instructions on accessing the dataset and submitting your detectors to our
leaderboard.
| Submitting Team | Submit Date | Endpoint TPR@1 | Endpoint AUC | Sandbox-1 TPR@1 | Sandbox-1 AUC | Sandbox-2 TPR@1 | Sandbox-2 AUC | Feat. Runtime (s) | Inf. Runtime (s) | Details |
|---|---|---|---|---|---|---|---|---|---|---|
| Kaya et al. | 2025-03-05 | 16.7% | 75.9% | 93.2% | 98.8% | 63.5% | 90.6% | 570 | 50 | The N-Gram-based ResNet model trained on sandbox-1 traces and tuned on the endpoint traces. |
| Kaya et al. | 2025-03-05 | 21.6% | 77.3% | 69.0% | 97.2% | 57.6% | 92.5% | 570 | 50 | The N-Gram-based ResNet model trained on sandbox-1 traces with a resampled training set and environment invariance loss. |
| Kaya et al. | 2025-03-05 | 11.2% | 76.6% | 95.0% | 99.0% | 60.7% | 93.5% | 570 | 50 | The N-Gram-based ResNet model trained on sandbox-1 traces and tuned on the sandbox-1 traces. |
| Kaya et al. | 2025-03-05 | 13.9% | 76.2% | 94.0% | 99.0% | 70.4% | 91.9% | 570 | 50 | The N-Gram-based ResNet model trained on sandbox-1 traces and tuned on the sandbox-2 traces. |
| Kaya et al. | 2025-03-05 | 51.8% | 87.2% | 10.0% | 75.1% | 19.6% | 72.7% | 570 | 50 | The N-Gram-based ResNet model directly trained on the real-world endpoint traces with soft labels and environment invariance loss. |
| Kaya et al. | 2025-03-05 | 49.5% | 86.4% | 9.0% | 77.6% | 20.9% | 74.5% | 570 | 50 | The N-Gram-based ResNet model directly trained on the real-world endpoint traces. |
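The baseline models on the leaderboard consume n-gram counts over behavioral traces. A minimal sketch of such a featurizer (a hypothetical helper for illustration; the actual feature pipeline in our repository may differ):

```python
from collections import Counter
from itertools import islice

def ngram_features(trace, n=2, vocab=None):
    """Count sliding n-grams of API/system-call names in a behavioral trace.

    trace: a sequence of call names, e.g. ["NtOpenFile", "NtReadFile", ...].
    Returns a Counter of n-gram tuples, or, if a fixed vocabulary is given,
    a count vector in that vocabulary's order (suitable as model input).
    """
    # Zip n staggered views of the trace to form sliding n-grams.
    grams = Counter(zip(*(islice(trace, i, None) for i in range(n))))
    if vocab is None:
        return grams
    return [grams[g] for g in vocab]
```

For example, the trace `["NtOpenFile", "NtReadFile", "NtReadFile", "NtClose"]` yields one count each for the bigrams `(NtOpenFile, NtReadFile)`, `(NtReadFile, NtReadFile)`, and `(NtReadFile, NtClose)`.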
To learn about our data and request access, please visit our GitHub repository.
To learn how to make a submission and to view an example submission in the expected format, please visit our GitHub repository.
@inproceedings{kaya2025ml,
  title={ML-Based Behavioral Malware Detection Is Far From a Solved Problem},
  author={Kaya, Yigitcan and Chen, Yizheng and Botacin, Marcus and Saha, Shoumik and Pierazzi, Fabio and Cavallaro, Lorenzo and Wagner, David and Dumitras, Tudor},
  booktitle={Proceedings of the 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)},
  year={2025},
  organization={IEEE}
}