AcmeTrace Project Overview
The AcmeTrace project, hosted by the Shanghai AI Lab, provides a public repository of trace data covering a range of workloads collected between March and August 2023. The traces were gathered during the development of large language models (LLMs) in datacenters, and the project is aimed at supporting academic research in this area. For researchers interested in detailed analyses, AcmeTrace is a key resource; questions and discussion are welcome via email or GitHub.
AcmeTrace Dataset Details
The AcmeTrace dataset captures job executions across multiple GPU clusters. Here is a breakdown of its highlights:
Main Characteristics:
- Total Dataset Size: The full dataset on HuggingFace is approximately 80 GB, while a smaller 109 MB portion is available directly in the repository.
- Trace Duration: Six months (March to August 2023).
- GPU Clusters: Two independent clusters, Seren and Kalos.
- Total Jobs: 880,740 jobs in total, of which 470,497 are GPU jobs.
Structural Composition
The dataset is organized so that users can easily navigate job traces, cluster utilization data, and visualization examples. The structure is divided as follows:
- Data Folder: Contains job_trace (with files such as trace_kalos.csv and trace_seren.csv), utilization data collected with tools such as DCGM and Prometheus, and processed files used for plotting (a minimal loading sketch follows this list).
- Figure Folder: Presents examples of trace visualizations, such as bar graphs illustrating job states.
- Scripts and Support Files: Includes parsing scripts and utilities to analyze and generate visual representations of the data.
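As an illustration, here is a minimal sketch of loading one of the job trace files with pandas. The directory layout and column set are assumptions based on the folder description above and should be checked against the repository.

```python
# Minimal sketch: load the Seren job trace and inspect it.
# The path below assumes the layout described above (data/job_trace/);
# adjust it to match the actual repository structure.
import pandas as pd

trace = pd.read_csv("data/job_trace/trace_seren.csv")

print(trace.shape)              # (number of jobs, number of columns)
print(trace.columns.tolist())   # list the available fields
print(trace.head())             # peek at the first few jobs
```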
Schema Explanation
The schema captures detailed job execution parameters:
1. Job Trace
The job trace files (trace_seren.csv and trace_kalos.csv) provide extensive details on each job, including the following fields (a short analysis sketch follows this list):
- Job Identification: Unique job_id and user identifiers.
- Resource Requirements: Requested nodes, GPUs, CPUs, and memory configuration.
- Job Lifecycle: Timestamps from submit_time to end_time, including derived fields such as duration and queue time.
- Status Tracking: The job's final state, such as COMPLETED, CANCELLED, or FAILED.
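Building on these fields, the sketch below computes a few summary statistics. The column names (state, queue, duration) and units (seconds) are assumptions inferred from the field descriptions above; verify them against the CSV headers.

```python
# Sketch: summarize job outcomes and lifecycle timing from a trace file.
# Column names and units are assumed from the schema description above.
import pandas as pd

trace = pd.read_csv("data/job_trace/trace_kalos.csv")

# Share of jobs in each final state (COMPLETED, CANCELLED, FAILED, ...).
print(trace["state"].value_counts(normalize=True))

# Median queueing delay and run duration, assuming values are in seconds.
print("median queue time (s):", trace["queue"].median())
print("median duration (s):", trace["duration"].median())
```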
2. Resource Utilization
The project also tracks critical utilization metrics across the cluster, offering insights into CPU and GPU resource consumption.
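The precise file format of the utilization data is not detailed here; as a rough sketch, assuming it can be read as a CSV with timestamp and gpu_util columns (both hypothetical names), a time series could be plotted like this:

```python
# Sketch: plot a GPU-utilization time series.
# The file name and columns ('timestamp', 'gpu_util') are hypothetical;
# check the actual files in the utilization folder before running.
import pandas as pd
import matplotlib.pyplot as plt

util = pd.read_csv("data/utilization/gpu_util_example.csv",
                   parse_dates=["timestamp"])

util.plot(x="timestamp", y="gpu_util", legend=False)
plt.ylabel("GPU utilization (%)")
plt.title("Cluster GPU utilization over time")
plt.tight_layout()
plt.show()
```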
Access and Usage
Because of repository size constraints, the full cluster utilization files are hosted on HuggingFace rather than bundled with the GitHub repository. Researchers can use this data to study resource consumption and job execution behavior, guided by the provided examples and metrics.
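To fetch the HuggingFace-hosted portion, the huggingface_hub client can be used. The repository ID below is a placeholder, since it is not stated here; use the dataset link from the project's GitHub page.

```python
# Sketch: download the HuggingFace-hosted dataset files locally.
# The repo_id is a placeholder; replace it with the dataset ID linked
# from the AcmeTrace GitHub repository.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/<acmetrace-dataset>",  # placeholder, not the real ID
    repo_type="dataset",
)
print("dataset downloaded to:", local_dir)
```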
Concluding Note
AcmeTrace is a valuable resource for anyone studying LLM development in high-performance computing environments. With a well-organized, detailed dataset and open channels for community engagement, it supports collaborative and reproducible research. For the latest updates and further information, see the project's NSDI '24 paper or visit its GitHub repository.