AcmeTrace Project Overview
The AcmeTrace project, hosted by the Shanghai AI Lab, provides a public repository of trace data covering a range of workloads collected between March and August 2023. The traces were gathered during the development of large language models (LLMs) in datacenters, and the project is aimed at supporting academic research in this area. For researchers interested in detailed analyses, AcmeTrace is a key resource; questions and discussion are welcome via email or GitHub.
AcmeTrace Dataset Details
The AcmeTrace dataset captures job executions across multiple GPU clusters. Here is a breakdown of its highlights:
Main Characteristics:
- Total Dataset Size: The full dataset on HuggingFace is approximately 80 GB, while a smaller 109 MB portion is available directly in the repository.
- Trace Duration: Six months (March to August 2023).
- GPU Clusters: Two independent clusters, Seren and Kalos.
- Total Jobs: 880,740 jobs in total, of which 470,497 are GPU jobs.
Structural Composition
The dataset is organized so that users can easily navigate job traces, cluster utilization data, and visualization examples. The structure is divided as follows:
- Data Folder: Contains job_trace (with files such as trace_kalos.csv and trace_seren.csv), utilization data collected with tools such as DCGM and Prometheus, and processed files used for plotting (a minimal loading sketch follows this list).
- Figure Folder: Presents examples of trace visualizations, such as bar graphs illustrating job states.
- Scripts and Support Files: Includes parsing scripts and utilities to analyze and generate visual representations of the data.
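As an illustration, here is a minimal sketch of loading one of the job trace files with pandas. The directory layout and column set are assumptions based on the folder description above and should be checked against the repository.

```python
# Minimal sketch: load the Seren job trace and inspect it.
# The path below assumes the layout described above (data/job_trace/);
# adjust it to match the actual repository structure.
import pandas as pd

trace = pd.read_csv("data/job_trace/trace_seren.csv")

print(trace.shape)              # (number of jobs, number of columns)
print(trace.columns.tolist())   # list the available fields
print(trace.head())             # peek at the first few jobs
```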
Schema Explanation
The schema captures detailed job execution parameters:
1. Job Trace
The job trace files (trace_seren.csv and trace_kalos.csv) provide extensive details on each job, including the following fields (a short analysis sketch follows this list):
- Job Identification: Unique job_id and user identifiers.
- Resource Requirements: Requested nodes, GPUs, CPUs, and memory configuration.
- Job Lifecycle: Timestamps from submit_time to end_time, including derived fields such as duration and queue time.
- Status Tracking: The job's final state, such as COMPLETED, CANCELLED, or FAILED.
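Building on these fields, the sketch below computes a few summary statistics. The column names (state, queue, duration) and units (seconds) are assumptions inferred from the field descriptions above; verify them against the CSV headers.

```python
# Sketch: summarize job outcomes and lifecycle timing from a trace file.
# Column names and units are assumed from the schema description above.
import pandas as pd

trace = pd.read_csv("data/job_trace/trace_kalos.csv")

# Share of jobs in each final state (COMPLETED, CANCELLED, FAILED, ...).
print(trace["state"].value_counts(normalize=True))

# Median queueing delay and run duration, assuming values are in seconds.
print("median queue time (s):", trace["queue"].median())
print("median duration (s):", trace["duration"].median())
```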
2. Resource Utilization
The project also tracks critical utilization metrics across the cluster, offering insights into CPU and GPU resource consumption.
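The precise file format of the utilization data is not detailed here; as a rough sketch, assuming it can be read as a CSV with timestamp and gpu_util columns (both hypothetical names), a time series could be plotted like this:

```python
# Sketch: plot a GPU-utilization time series.
# The file name and columns ('timestamp', 'gpu_util') are hypothetical;
# check the actual files in the utilization folder before running.
import pandas as pd
import matplotlib.pyplot as plt

util = pd.read_csv("data/utilization/gpu_util_example.csv",
                   parse_dates=["timestamp"])

util.plot(x="timestamp", y="gpu_util", legend=False)
plt.ylabel("GPU utilization (%)")
plt.title("Cluster GPU utilization over time")
plt.tight_layout()
plt.show()
```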
Access and Usage
Because of repository size constraints, the full cluster utilization files are hosted on HuggingFace rather than bundled with the GitHub repository. Researchers can use this data to study resource consumption and job execution behavior, guided by the provided examples and metrics.
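To fetch the HuggingFace-hosted portion, the huggingface_hub client can be used. The repository ID below is a placeholder, since it is not stated here; use the dataset link from the project's GitHub page.

```python
# Sketch: download the HuggingFace-hosted dataset files locally.
# The repo_id is a placeholder; replace it with the dataset ID linked
# from the AcmeTrace GitHub repository.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/<acmetrace-dataset>",  # placeholder, not the real ID
    repo_type="dataset",
)
print("dataset downloaded to:", local_dir)
```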
Concluding Note
AcmeTrace is a valuable resource for anyone studying LLM development in high-performance computing environments. With a well-organized, detailed dataset and open channels for community engagement, it supports collaborative and reproducible research. For the latest updates and further information, see the project's NSDI '24 paper or visit its GitHub repository.