Dynamic Malware Behavior Dataset (DMBD) July 2025
Overview
This dataset provides comprehensive behavioral data from 65,416 software samples (both malware and legitimate software) to support machine learning research in cybersecurity. The dataset was generated by Lawrence Livermore National Laboratory (LLNL) and includes detailed event logs collected during 5-minute executions in a controlled cyber range environment.
Key Dataset Features
- Size: 65,416 total samples with approximately 39 million event log entries
- Composition: 40,000 training samples and 25,416 test samples
- Data Collection: Host logs generated using LLNL's Wintap tool
- Labeling: Classifications (malicious/benign) determined by consensus voting from ~80 anti-virus engines
- Event Types: Process creation/termination, image loading, file creation, network connections, and registry modifications
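The consensus-voting labeling scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the dataset's actual labeling code: the `consensus_label` helper and the 50% threshold are assumptions.

```python
# Hypothetical sketch of consensus labeling: a sample is marked
# "malicious" when more than `threshold` of anti-virus engines flag it.
# The function name and threshold are illustrative assumptions.
def consensus_label(verdicts, threshold=0.5):
    """verdicts: list of booleans, True = engine flagged the sample."""
    if not verdicts:
        raise ValueError("no verdicts to vote on")
    flagged = sum(verdicts)
    return "malicious" if flagged / len(verdicts) > threshold else "benign"
```

With ~80 engines voting, a sample flagged by 60 of 80 engines would be labeled malicious, while one flagged by only 10 would be labeled benign.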
Data Structure
The dataset consists of two main components:
- Event data: Detailed logs of software behavior in a parent-child format
- Label data: Classification of each sample as "benign" or "malicious"
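A typical workflow joins the two components on a shared sample identifier. The column names below (`sample_id`, `event_type`, `label`) are assumptions for illustration, not the dataset's actual schema:

```python
import pandas as pd

# Illustrative join of event data and label data. Column names are
# assumed; consult the dataset documentation for the real schema.
events = pd.DataFrame({
    "sample_id": ["a", "a", "b"],
    "event_type": ["ProcessCreate", "FileCreate", "NetworkConnect"],
})
labels = pd.DataFrame({
    "sample_id": ["a", "b"],
    "label": ["benign", "malicious"],
})

# Left join keeps every event row and attaches its sample's label.
labeled_events = events.merge(labels, on="sample_id", how="left")
```

Joining labels onto events this way lets per-event features be aggregated per sample before training a classifier.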
Baseline Performance
Initial machine learning models have achieved:
- 95.0% accuracy with 10-fold cross-validation
- 95.4% accuracy on the test set
- ROC AUC scores of 0.959 (validation) and 0.970 (test)
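For reference, the two reported metrics can be computed from predictions without any library. This is a minimal pure-Python sketch (ROC AUC via the rank-sum / Mann-Whitney formulation), not the baseline's actual evaluation code:

```python
# Minimal sketch of the two reported metrics: accuracy and ROC AUC.
# ROC AUC here is the probability that a random positive sample
# receives a higher score than a random negative one (ties count 0.5).
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that ranks every malicious sample above every benign one scores a ROC AUC of 1.0 regardless of its accuracy at any single threshold, which is why both metrics are reported.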
Research Challenge
While baseline results are strong, performance deteriorates when testing on newer samples. The dataset specifically includes 2024 malware samples in the test set but not in the training data (which covers 2017-2023). This presents an opportunity for researchers to develop models with improved temporal generalization capabilities.
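The temporal split described above can be sketched as a simple filter by sample year. The `(sample_id, year)` tuples below are illustrative, not real dataset entries:

```python
# Sketch of the temporal split: samples dated 2017-2023 form the
# training pool, while 2024 samples appear only in the test set.
# Sample IDs and years are made up for illustration.
samples = [("s1", 2018), ("s2", 2021), ("s3", 2023), ("s4", 2024)]

train_ids = [sid for sid, year in samples if 2017 <= year <= 2023]
held_out_new = [sid for sid, year in samples if year >= 2024]
```

Evaluating separately on the held-out 2024 samples exposes how much a model's performance degrades on behavior it has never seen, which is the temporal-generalization gap the challenge targets.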
This dataset serves as a benchmark for cybersecurity researchers to develop more effective machine learning approaches for malware detection based on dynamic behavioral analysis.
Additional Information
A GitHub repository provides additional details about the dataset and a sample notebook.