Dynamic Malware Behavior Dataset (DMBD) July 2025
Overview
This dataset provides comprehensive behavioral data from 65,416 software samples (both malware and legitimate software) to support machine learning research in cybersecurity. The dataset was generated by Lawrence Livermore National Laboratory (LLNL) and includes detailed event logs collected during 5-minute executions in a controlled cyber range environment.
Key Dataset Features
- Size: 65,416 total samples with approximately 39 million event log entries
- Composition: 40,000 training samples and 25,416 test samples
- Data Collection: Host logs generated using LLNL's Wintap tool
- Labeling: Classifications (malicious/benign) determined by consensus voting from ~80 anti-virus engines
- Event Types: Process creation/termination, image loading, file creation, network connections, and registry modifications
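The consensus-voting labeling scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the dataset's actual labeling code: the `consensus_label` helper and the 50% threshold are assumptions.

```python
# Hypothetical sketch of consensus labeling: a sample is marked
# "malicious" when more than `threshold` of anti-virus engines flag it.
# The function name and threshold are illustrative assumptions.
def consensus_label(verdicts, threshold=0.5):
    """verdicts: list of booleans, True = engine flagged the sample."""
    if not verdicts:
        raise ValueError("no verdicts to vote on")
    flagged = sum(verdicts)
    return "malicious" if flagged / len(verdicts) > threshold else "benign"
```

With ~80 engines voting, a sample flagged by 60 of 80 engines would be labeled malicious, while one flagged by only 10 would be labeled benign.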
Data Structure
The dataset consists of two main components:
- Event data: Detailed logs of software behavior in a parent-child format
- Label data: Classification of each sample as "benign" or "malicious"
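A typical workflow joins the two components on a shared sample identifier. The column names below (`sample_id`, `event_type`, `label`) are assumptions for illustration, not the dataset's actual schema:

```python
import pandas as pd

# Illustrative join of event data and label data. Column names are
# assumed; consult the dataset documentation for the real schema.
events = pd.DataFrame({
    "sample_id": ["a", "a", "b"],
    "event_type": ["ProcessCreate", "FileCreate", "NetworkConnect"],
})
labels = pd.DataFrame({
    "sample_id": ["a", "b"],
    "label": ["benign", "malicious"],
})

# Left join keeps every event row and attaches its sample's label.
labeled_events = events.merge(labels, on="sample_id", how="left")
```

Joining labels onto events this way lets per-event features be aggregated per sample before training a classifier.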
Baseline Performance
Initial machine learning models have achieved:
- 95.0% accuracy with 10-fold cross-validation
- 95.4% accuracy on the test set
- ROC AUC scores of 0.959 (validation) and 0.970 (test)
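For reference, the two reported metrics can be computed from predictions without any library. This is a minimal pure-Python sketch (ROC AUC via the rank-sum / Mann-Whitney formulation), not the baseline's actual evaluation code:

```python
# Minimal sketch of the two reported metrics: accuracy and ROC AUC.
# ROC AUC here is the probability that a random positive sample
# receives a higher score than a random negative one (ties count 0.5).
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that ranks every malicious sample above every benign one scores a ROC AUC of 1.0 regardless of its accuracy at any single threshold, which is why both metrics are reported.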
Research Challenge
While baseline results are strong, performance deteriorates when testing on newer samples. The dataset specifically includes 2024 malware samples in the test set but not in the training data (which covers 2017-2023). This presents an opportunity for researchers to develop models with improved temporal generalization capabilities.
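The temporal split described above can be sketched as a simple filter by sample year. The `(sample_id, year)` tuples below are illustrative, not real dataset entries:

```python
# Sketch of the temporal split: samples dated 2017-2023 form the
# training pool, while 2024 samples appear only in the test set.
# Sample IDs and years are made up for illustration.
samples = [("s1", 2018), ("s2", 2021), ("s3", 2023), ("s4", 2024)]

train_ids = [sid for sid, year in samples if 2017 <= year <= 2023]
held_out_new = [sid for sid, year in samples if year >= 2024]
```

Evaluating separately on the held-out 2024 samples exposes how much a model's performance degrades on behavior it has never seen, which is the temporal-generalization gap the challenge targets.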
This dataset serves as a benchmark for cybersecurity researchers to develop more effective machine learning approaches for malware detection based on dynamic behavioral analysis.
Additional Information
A GitHub repository provides additional details about the dataset and a sample notebook.