Introduction to the KG-RAG Project
What is KG-RAG?
KG-RAG stands for Knowledge Graph-based Retrieval Augmented Generation. The framework combines the explicit knowledge captured in a Knowledge Graph (KG) with the implicit knowledge and language understanding of a Large Language Model (LLM). In practice, KG-RAG injects specialized knowledge context into a general-purpose language model's prompt, improving the accuracy and relevance of its responses.
The Core Component: SPOKE
At the heart of KG-RAG is a large biomedical knowledge graph known as SPOKE (Scalable Precision Medicine Open Knowledge Engine). SPOKE acts as the framework's backbone, supplying the biomedical context it needs. It integrates more than 40 biomedical knowledge repositories, covering concepts such as genes, proteins, drugs, compounds, and diseases, and comprises more than 27 million nodes and 53 million edges. This detailed, interconnected knowledge base is what allows KG-RAG to support precise query responses.
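To make that graph structure concrete, here is a minimal sketch of a SPOKE-style subgraph built with networkx; the node names, node types, and relation labels are illustrative assumptions for demonstration, not actual SPOKE identifiers or relationship types.

```python
# Illustrative only: a toy, SPOKE-style subgraph. Node names, node types,
# and relation labels are assumptions, not actual SPOKE vocabulary.
import networkx as nx

kg = nx.MultiDiGraph()

# Typed nodes, analogous to SPOKE's Disease / Compound / Gene concepts
kg.add_node("Bardet-Biedl Syndrome", type="Disease")
kg.add_node("setmelanotide", type="Compound")
kg.add_node("MC4R", type="Gene")

# Typed, directed edges capturing biomedical relationships
kg.add_edge("setmelanotide", "Bardet-Biedl Syndrome", relation="TREATS")
kg.add_edge("setmelanotide", "MC4R", relation="TARGETS")

# One-hop neighborhood lookup: the kind of context KG-RAG pulls for a disease node
disease = "Bardet-Biedl Syndrome"
for src, dst, data in kg.in_edges(disease, data=True):
    print(f"{src} -[{data['relation']}]-> {dst}")
```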
How Does KG-RAG Work?
KG-RAG operates by extracting "prompt-aware context" from the SPOKE KG: the minimal yet sufficient context required to answer a user's query. With that context in hand, the language model can produce responses that are domain-specific and factually grounded, a marked improvement over the same general-purpose model used without graph context.
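As a rough illustration of that flow, the sketch below assembles a prompt-aware context from a tiny in-memory stand-in for SPOKE. All function names and data here are hypothetical; the actual framework relies on an LLM for disease entity extraction and a vector database for matching mentions to SPOKE disease nodes.

```python
# A minimal sketch of the "prompt-aware context" idea. The function names and
# the tiny in-memory "graph context" below are hypothetical stand-ins, not
# KG-RAG's actual API or data.
from typing import List

# Toy stand-in for SPOKE context keyed by disease node (illustrative content only)
GRAPH_CONTEXT = {
    "Bardet-Biedl Syndrome": "setmelanotide TREATS Bardet-Biedl Syndrome",
}

def extract_disease_entities(question: str) -> List[str]:
    # KG-RAG uses an LLM for entity extraction; a keyword match stands in here.
    return [d for d in GRAPH_CONTEXT if d.lower() in question.lower()]

def fetch_graph_context(disease_node: str) -> str:
    # In KG-RAG this step involves vector-similarity matching against a disease
    # database, followed by retrieval of the node's SPOKE neighborhood.
    return GRAPH_CONTEXT[disease_node]

def build_augmented_prompt(question: str) -> str:
    # Keep only the context relevant to the question ("minimal yet sufficient").
    chunks = [fetch_graph_context(d) for d in extract_disease_entities(question)]
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_augmented_prompt(
    "Is setmelanotide approved for weight management in Bardet-Biedl Syndrome?"
))
```

The augmented prompt produced this way is what the GPT or Llama model actually sees.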
Example Use Case
A practical demonstration of KG-RAG's capabilities is the example of the drug setmelanotide. Queried without KG-RAG, the model gives only a generic response. With KG-RAG, the same model correctly reports the FDA approval of setmelanotide for weight management in patients with Bardet-Biedl Syndrome, illustrating how targeted knowledge extraction deepens and sharpens LLM responses.
Running KG-RAG
To utilize KG-RAG in practice, users can follow a structured setup process:
- Clone the Repository: Begin by cloning the KG-RAG repository, which houses all necessary biomedical data.
- Create a Virtual Environment: Set up a controlled environment using Python version 3.10.9.
- Install Dependencies: Use the requirements file to install all required packages.
- Configure Settings: Update the config.yaml file with machine-specific information needed to run the scripts.
- Run the Setup Script: This prepares the necessary components, such as creating the disease vector database and optionally downloading the Llama model.
- Execute KG-RAG: Run KG-RAG with a GPT or Llama model, in either normal mode or an interactive mode that prints step-by-step intermediate output (see the sketch after this list).
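For orientation, here is a hedged sketch of the setup-and-run sequence scripted from Python. The repository URL, module paths, and command-line flags are assumptions and should be verified against the KG-RAG README.

```python
# Illustrative only: the repository URL, module paths, and CLI flags below
# are assumptions and should be checked against the KG-RAG README.
import subprocess

# 1. Clone the repository (URL assumed)
subprocess.run(["git", "clone", "https://github.com/BaranziniLab/KG_RAG.git"], check=True)
repo = "KG_RAG"

# 2-3. Inside a Python 3.10.9 environment, install the dependencies
subprocess.run(["pip", "install", "-r", "requirements.txt"], check=True, cwd=repo)

# 4. Edit config.yaml by hand with machine-specific paths/keys before continuing.

# 5. Run the setup script (builds the disease vector database and can
#    optionally download the Llama model)
subprocess.run(["python", "-m", "kg_rag.run_setup"], check=True, cwd=repo)

# 6. Run KG-RAG with a GPT model in interactive mode (module path and flags assumed)
subprocess.run(
    ["python", "-m", "kg_rag.rag_based_generation.GPT.text_generation",
     "-i", "True", "-g", "gpt-4"],
    check=True,
    cwd=repo,
)
```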
BiomixQA: A Benchmark Dataset
To validate KG-RAG's effectiveness across different LLMs, the project also introduces BiomixQA, a benchmark dataset of biomedical questions in multiple formats, including multiple-choice and true/false questions. BiomixQA supports testing and comparing models for biomedical natural language processing and question answering.
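If BiomixQA is published on the Hugging Face Hub, loading it for evaluation might look like the sketch below; the dataset identifier and configuration name are assumptions, not confirmed release details.

```python
# Illustrative only: the Hugging Face dataset path and configuration name
# are assumptions and should be checked against the BiomixQA release.
from datasets import load_dataset

# Load the multiple-choice portion of BiomixQA (identifier and config assumed)
mcq = load_dataset("kg-rag/BiomixQA", "mcq")

# Inspect one record to see which fields (question, options, answer) it carries
print(mcq["train"][0])
```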
Conclusion
KG-RAG represents a significant step forward in integrating specialized knowledge into general-purpose language models. By grounding responses in a comprehensive biomedical graph like SPOKE, it shows how language models can deliver accurate, context-aware answers. As research and development continue, KG-RAG's approach could extend beyond biomedicine to other domains that require precise contextual knowledge augmentation.