Dynamic Malware Behavior Dataset (DMBD) July 2025

Overview

This dataset provides behavioral data from 65,416 software samples, both malware and legitimate software, to support machine learning research in cybersecurity. It was generated by Lawrence Livermore National Laboratory (LLNL) and includes detailed event logs collected during 5-minute executions in a controlled cyber range environment.

Key Dataset Features

  • Size: 65,416 total samples with approximately 39 million event log entries
  • Composition: 40,000 training samples and 25,416 test samples
  • Data Collection: Host logs generated using LLNL's Wintap tool
  • Labeling: Classifications (malicious/benign) determined by consensus voting from ~80 anti-virus engines
  • Event Types: Process creation/termination, image loading, file creation, network connections, and registry modifications
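The consensus labeling listed above can be sketched as a simple vote over per-engine verdicts. The source does not state the voting threshold or how ties are handled, so the `consensus_label` function and its `threshold` parameter are assumptions made purely for illustration, not the labeling procedure actually used.

```python
def consensus_label(verdicts, threshold=0.5):
    """Return "malicious" or "benign" from a list of engine verdicts.

    verdicts:  list of booleans, True when an engine flags the sample.
    threshold: ASSUMED fraction of engines that must flag the sample;
               the real threshold is not given in the dataset description.
    """
    if not verdicts:
        raise ValueError("at least one engine verdict is required")
    flagged_fraction = sum(1 for v in verdicts if v) / len(verdicts)
    # Strictly greater than the threshold, so an exact tie counts as benign
    # (also an assumption).
    return "malicious" if flagged_fraction > threshold else "benign"


# Illustrative votes from 80 hypothetical engines.
mostly_flagged = [True] * 50 + [False] * 30
rarely_flagged = [True] * 5 + [False] * 75
print(consensus_label(mostly_flagged))
print(consensus_label(rarely_flagged))
```

A stricter or looser threshold changes which borderline samples end up labeled malicious, which is worth keeping in mind when interpreting label noise.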

Data Structure

The dataset consists of two main components:

  1. Event data: Detailed logs of software behavior in a parent-child format
  2. Label data: Classification of each sample as "benign" or "malicious"
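As a rough illustration of how the two components might be combined, the sketch below groups parent-child process events per sample and attaches each sample's label. The field names (`sample_id`, `pid`, `parent_pid`, `event_type`) and the records themselves are hypothetical; the actual schema is documented with the dataset and may differ.

```python
from collections import Counter, defaultdict

# Hypothetical event records; real column names and values may differ.
events = [
    {"sample_id": "s1", "pid": 100, "parent_pid": None, "event_type": "process_create"},
    {"sample_id": "s1", "pid": 101, "parent_pid": 100, "event_type": "process_create"},
    {"sample_id": "s1", "pid": 101, "parent_pid": 100, "event_type": "file_create"},
    {"sample_id": "s2", "pid": 200, "parent_pid": None, "event_type": "process_create"},
    {"sample_id": "s2", "pid": 200, "parent_pid": None, "event_type": "network_connect"},
]
# Hypothetical label table keyed by sample identifier.
labels = {"s1": "malicious", "s2": "benign"}


def children_by_parent(events):
    """Map (sample_id, parent_pid) to the sorted list of child pids."""
    tree = defaultdict(set)
    for e in events:
        if e["event_type"] == "process_create" and e["parent_pid"] is not None:
            tree[(e["sample_id"], e["parent_pid"])].add(e["pid"])
    return {key: sorted(pids) for key, pids in tree.items()}


def event_counts_with_labels(events, labels):
    """Count event types per sample and attach the sample's label."""
    counts = defaultdict(Counter)
    for e in events:
        counts[e["sample_id"]][e["event_type"]] += 1
    return {
        sid: {"label": labels[sid], "counts": dict(c)}
        for sid, c in counts.items()
    }


print(children_by_parent(events))
print(event_counts_with_labels(events, labels))
```

Per-sample event counts are only one possible feature representation; the process tree itself could also feed sequence or graph models.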

Baseline Performance

Initial machine learning models have achieved:

  • 95.0% accuracy with 10-fold cross-validation
  • 95.4% accuracy on the test set
  • ROC AUC scores of 0.959 (validation) and 0.970 (test)
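For readers who want to reproduce metrics of this kind, the following is a minimal, dependency-free sketch of accuracy, ROC AUC, and a 10-fold index split. It is not the baseline code: the function names are hypothetical, the folds are unshuffled, and the model that produces the scores is left out entirely.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half (quadratic-time pairwise form)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for k contiguous folds.
    No shuffling is done here; shuffle indices beforehand if needed."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


# Toy labels (1 = malicious, 0 = benign) and model scores.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

With the 40,000 training samples, `k_fold_indices(40000, 10)` would give ten validation folds of 4,000 samples each.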

Research Challenge

While baseline results are strong, performance deteriorates when models are tested on newer samples. The test set specifically includes 2024 malware samples that are absent from the training data, which covers 2017-2023. This presents an opportunity for researchers to develop models with improved temporal generalization capabilities.
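One simple way to inspect temporal generalization is to break test accuracy down by sample year. The sketch below assumes each evaluated record carries a `year`, a true `label`, and a `predicted` label; these field names and the toy records are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def accuracy_by_year(records):
    """Return {year: accuracy} for records with ASSUMED keys
    "year", "label", and "predicted"."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["year"]] += 1
        correct[r["year"]] += int(r["label"] == r["predicted"])
    return {year: correct[year] / totals[year] for year in sorted(totals)}


# Toy evaluation records, invented for illustration only.
records = [
    {"year": 2023, "label": "malicious", "predicted": "malicious"},
    {"year": 2023, "label": "benign", "predicted": "benign"},
    {"year": 2024, "label": "malicious", "predicted": "benign"},
    {"year": 2024, "label": "malicious", "predicted": "malicious"},
]
print(accuracy_by_year(records))
```

A gap between years seen in training and the held-out 2024 samples would point to temporal drift rather than a uniformly weak model.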

This dataset serves as a benchmark for cybersecurity researchers to develop more effective machine learning approaches for malware detection based on dynamic behavioral analysis.

Additional Information

A GitHub repository provides additional details about the dataset and a sample notebook.