Domain Adapted Language Modeling Toolkit
Simple Overview
A significant gap exists between general-purpose Large Language Models (LLMs) and the vector databases that supply them with contextual data. Bridging this gap is vital to embedding AI into efficient, information-centric fields, where these systems are valued not only for their flexibility but for precise, domain-specific capabilities. To that end, Arcee has developed an open-source toolkit called Domain Adapted Language Model (DALM). This toolkit helps developers customize Arcee's open-source Domain Pretrained (DPT) LLMs, so organizations can tailor AI systems to their unique needs and knowledge environments.
Demonstration Models
The Arcee team has crafted several example DALMs to showcase the toolkit's potential:
- DALM-Patent: Designed for patent-related queries.
- DALM-PubMed: Tailored to navigate and extract information from medical publications.
- DALM-SEC: Focused on financial and securities-related documents.
- DALM-Yours: A customizable template for individual needs.
These models exemplify how DALM can be adapted to different sectors, leveraging open-source LLMs to refine domain-specific responses.
Research and Development
The DALM project hosts a comprehensive code repository for fine-tuning a fully differentiable Retrieval Augmented Generation (RAG-end2end) model. This model, for the first time, adapts RAG technology to work with decoder-only language models such as Llama, Falcon, or GPT. Key features include:
- Use of the in-batch negatives concept to make contrastive training efficient (a minimal sketch follows this list).
- Training routines for both the retriever-only and RAG-end2end setups using contrastive learning, detailed in the repository's training folder.
- Evaluation tools for both retrievers and generators.
- Data processing scripts suitable for handling diverse datasets.
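To make the in-batch negatives idea concrete, here is a minimal PyTorch sketch of a contrastive loss in which every other passage in the batch serves as a negative for a given query. This illustrates the general technique rather than the toolkit's actual implementation; the temperature value and random embeddings are stand-ins.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              passage_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss with in-batch negatives.

    query_emb, passage_emb: (batch, dim) tensors where row i of each is a
    matching query/passage pair. Every other passage in the batch acts as
    a negative for query i, so no explicit negative mining is needed.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    # (batch, batch) similarity matrix: entry [i, j] compares query i
    # with passage j. The diagonal holds the positive pairs.
    logits = query_emb @ passage_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings standing in for encoder outputs.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, p).item())
```

Because the positives of other examples double as negatives, a single batch of size N yields N-1 negatives per query at no extra encoding cost, which is what makes this approach efficient.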
System Requirements and Installation
Hardware requirements vary with the chosen retriever, generator model, and batch size. For example, with the retriever `BAAI/bge-large-en` and the generator `meta-llama/Llama-2-7b-hf` on an A100 GPU (80GB), a dataset of roughly 200k examples can be processed in approximately seven hours.
Installation is straightforward via `pip`, or by cloning the repository for more hands-on development. A successful install can be verified with the `dalm version` command.
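Alongside the `dalm version` CLI command, a small Python snippet can check the installed package programmatically. The distribution name `indomain` below is an assumption; confirm the exact package name against the repository's install instructions.

```python
# Minimal sketch: confirm the toolkit is installed and report its version.
# Assumes the distribution name "indomain"; verify against the repository.
from importlib.metadata import PackageNotFoundError, version

try:
    print("DALM toolkit version:", version("indomain"))
except PackageNotFoundError:
    print("Toolkit not found; install it with pip or from a repository clone.")
```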
Data Setup and Training
Data preparation for training and evaluation is straightforward: all that is needed is a CSV file with a Passage column, a Query column, and optionally an Answer column (a sketch of the format follows this list). The available training scripts support:
- Training the retriever only, optimizing it for locating relevant passages using contrastive learning.
- Training the retriever and generator together for comprehensive RAG-end2end learning.
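As an illustration of the expected dataset layout, the sketch below writes a tiny training file with the three columns named above. The rows are hypothetical placeholders; real data would come from your domain corpus, and the Answer column can be omitted for retriever-only training.

```python
import csv

# Hypothetical example rows standing in for real domain data.
rows = [
    {
        "Passage": "Claim 1 of the patent describes a rechargeable battery cell...",
        "Query": "What does claim 1 of the patent cover?",
        "Answer": "A rechargeable battery cell design.",
    },
    {
        "Passage": "The study enrolled 250 patients with type 2 diabetes...",
        "Query": "How many patients were enrolled in the study?",
        "Answer": "250 patients.",
    },
]

# Write the CSV with the column names the toolkit expects.
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Passage", "Query", "Answer"])
    writer.writeheader()
    writer.writerows(rows)
```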
Evaluation
Evaluation measures how effectively the model finds and retrieves relevant passages. Across several retriever models, metrics such as recall and hit rate were assessed, showing significant improvements for the end-to-end trained models.
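For intuition about these metrics, the sketch below computes hit rate and mean recall@k over a set of queries given ranked retrieval results. It is a generic illustration rather than the toolkit's evaluation code, and the example ids are made up.

```python
def hit_rate_and_recall_at_k(ranked_ids, relevant_ids, k=10):
    """Compute hit rate and mean recall@k over a set of queries.

    ranked_ids: per-query list of retrieved passage ids, best first.
    relevant_ids: per-query list of the gold (relevant) passage ids.
    """
    hits, recalls = 0, []
    for ranked, gold in zip(ranked_ids, relevant_ids):
        found = set(ranked[:k]) & set(gold)
        if found:
            hits += 1                            # at least one gold passage in top k
        recalls.append(len(found) / len(gold))   # fraction of gold passages retrieved
    n = len(ranked_ids)
    return hits / n, sum(recalls) / n

# Toy example: two queries, one gold passage each.
ranked = [["p3", "p7", "p1"], ["p9", "p2", "p5"]]
gold = [["p7"], ["p4"]]
print(hit_rate_and_recall_at_k(ranked, gold, k=2))  # (0.5, 0.5)
```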
Contribution
To contribute to this open-source project, developers can follow the guidelines outlined in the CONTRIBUTING documentation within the repository.
Through DALM, developers can harness advanced language models in niche areas, effectively bridging the divide between general machine learning capabilities and specific, fact-based applications. This toolkit empowers organizations to leverage bespoke AI systems, fine-tuning them to mirror their unique knowledge bases and operational goals.