Introduction to Towhee
Towhee is an open-source framework designed to simplify the processing of unstructured data. Using Large Language Model (LLM) based pipeline orchestration, Towhee transforms raw data such as text, images, audio, and video files into structured outputs like text, images, or embeddings. This processed data can then be seamlessly loaded into storage systems such as vector databases. Towhee offers a user-friendly Pythonic API for prototyping data processing pipelines and optimizing them for production.
Multi Modalities
One of Towhee's standout features is its ability to handle various data types proficiently, whether it's working with image data, video clips, text, audio files, or even complex structures like molecules.
LLM Pipeline Orchestration
Towhee excels in adapting to different Large Language Models (LLMs). It allows users to host open-source large models locally and provides tools for prompt management and knowledge retrieval, making interactions with LLMs more effective.
Rich Operators
Towhee is equipped with over 140 ready-to-use state-of-the-art models across five domains: computer vision, natural language processing, multimodal, audio, and medical. Whether you're dealing with video decoding, audio slicing, or dimensionality reduction, Towhee makes building robust data processing pipelines straightforward.
Prebuilt ETL Pipelines
Towhee offers readily available ETL (Extract, Transform, Load) pipelines. These pipelines simplify tasks such as Retrieval-Augmented Generation (RAG), text-image search, and video copy detection, making them accessible to developers who may not specialize in AI.
High-performance Backend
With the Triton Inference Server, Towhee accelerates model serving on CPUs and GPUs through backends such as TensorRT, PyTorch, and ONNX. Users can turn their Python pipelines into high-performance Docker containers with minimal effort, facilitating efficient deployment and scalability.
Pythonic API
Towhee provides a Pythonic method-chaining API that makes describing custom data processing pipelines intuitive. It supports schemas that simplify processing unstructured data akin to managing tabular data.
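The method-chaining style can be illustrated with a tiny stand-in class. Note this is not Towhee's API, only a sketch of the pattern it follows: each call transforms the data and returns a new chainable object.

```python
class Chain:
    """Minimal stand-in illustrating the map/filter chaining pattern."""
    def __init__(self, items):
        self.items = list(items)

    def map(self, fn):
        # Apply fn to every item, returning a new chain.
        return Chain(fn(x) for x in self.items)

    def filter(self, pred):
        # Keep only items satisfying the predicate.
        return Chain(x for x in self.items if pred(x))

    def to_list(self):
        return self.items

result = (
    Chain(["a.jpg", "b.png", "c.jpg"])
    .filter(lambda f: f.endswith(".jpg"))
    .map(str.upper)
    .to_list()
)
print(result)  # -> ['A.JPG', 'C.JPG']
```

Towhee's real pipelines follow the same shape, but each `.map` step names input and output schema columns and wraps an operator rather than a bare function.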
Getting Started
Towhee requires Python 3.7 or later. It can be installed using pip:
pip install towhee towhee.models
Pipeline Options
Pre-defined Pipelines
Towhee offers various pre-defined pipelines to help users implement common functions quickly; they can be found on the Towhee Hub. Here is a simple example of using the sentence_embedding pipeline:
from towhee import AutoPipes, AutoConfig

# Load the default configuration for the sentence_embedding pipeline,
# then override the model and device (0 selects the first GPU; -1 runs on CPU).
config = AutoConfig.load_config('sentence_embedding')
config.model = 'paraphrase-albert-small-v2'
config.device = 0

sentence_embedding = AutoPipes.pipeline('sentence_embedding', config=config)

# Embed a single sentence.
embedding = sentence_embedding('how are you?').get()

# Embed a batch of sentences.
embeddings = sentence_embedding.batch(['how are you?', 'how old are you?'])
embeddings = [e.get() for e in embeddings]
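The returned embeddings are plain vectors, so downstream similarity math needs no Towhee-specific code. As a minimal numpy sketch (the vectors below are stand-ins, not real model output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; in practice these would come from sentence_embedding(...).get()
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.1, 0.4]
print(cosine_similarity(v1, v2))
```

A score near 1 indicates near-identical meaning; scores near 0 indicate unrelated sentences.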
Custom Pipelines
In cases where a preferred pipeline is not available in the Towhee Hub, one can utilize Towhee's Python API to create custom pipelines. Below is an example of a cross-modal retrieval pipeline using CLIP:
from towhee import ops, pipe, DataCollection

# Image insertion pipeline: decode each image, embed it with CLIP,
# L2-normalize the vector, and insert it into a FAISS index on disk.
p = (
    pipe.input('file_name')
    .map('file_name', 'img', ops.image_decode.cv2())
    .map('img', 'vec', ops.image_text_embedding.clip(model_name='clip_vit_base_patch32', modality='image'))
    .map('vec', 'vec', ops.towhee.np_normalize())
    .map(('vec', 'file_name'), (), ops.ann_insert.faiss_index('./faiss', 512))
    .output()
)

for f_name in ...: # Add image URLs or file paths here
    p(f_name)
p.flush()  # persist the FAISS index to disk
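The normalization step matters: once every vector has unit L2 norm, inner-product search in FAISS behaves like cosine similarity. A hedged numpy equivalent, assuming np_normalize performs plain L2 normalization:

```python
import numpy as np

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (what an L2-normalize op typically does)."""
    vec = np.asarray(vec, dtype=float)
    return vec / max(np.linalg.norm(vec), eps)

v = l2_normalize([3.0, 4.0])
print(v)  # -> [0.6 0.8]
print(np.linalg.norm(v))  # ~1.0
```

After this step, the ranking produced by inner product equals the ranking produced by cosine similarity, which is why the pipeline normalizes before inserting into the index.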
# Text-to-image search pipeline: embed the query text with CLIP, normalize it,
# retrieve the top-3 nearest vectors from the FAISS index built above,
# and decode the matching images for display.
decode = ops.image_decode.cv2('rgb')
p = (
    pipe.input('text')
    .map('text', 'vec', ops.image_text_embedding.clip(model_name='clip_vit_base_patch32', modality='text'))
    .map('vec', 'vec', ops.towhee.np_normalize())
    .map('vec', 'row', ops.ann_search.faiss_index('./faiss', 3))
    .map('row', 'images', lambda x: [decode(item[2][0]) for item in x])
    .output('text', 'images')
)

DataCollection(p('puppy Corgi')).show()
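Conceptually, the search step finds the stored vectors with the highest inner product against the query vector. A toy brute-force numpy version of that idea (real FAISS uses optimized index structures; the function name here is illustrative):

```python
import numpy as np

def top_k_inner_product(index_vecs, query, k=3):
    """Return (indices, scores) of the k stored vectors most similar to query."""
    index_vecs = np.asarray(index_vecs, dtype=float)
    query = np.asarray(query, dtype=float)
    scores = index_vecs @ query       # inner products; == cosine if all unit-norm
    order = np.argsort(-scores)[:k]   # highest scores first
    return order, scores[order]

stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
idx, scores = top_k_inner_product(stored, [1.0, 0.0], k=2)
print(idx)  # -> [0 2]
```

An exact scan like this is fine for small collections; approximate nearest-neighbor indexes such as FAISS exist to keep this lookup fast at millions of vectors.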
Core Concepts
Towhee is constructed around four main components: Operators, Pipelines, the DataCollection API, and the Engine.
- Operators: These are the fundamental units of a neural data processing pipeline.
- Pipelines: These consist of several operators connected as a directed acyclic graph (DAG), implementing capabilities such as feature extraction and data analysis.
- DataCollection API: Offers a method-chaining style interface for custom pipeline creation and data processing.
- Engine: The driving force behind dataflow management, task scheduling, and resource monitoring.
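To make the DAG idea concrete, a toy scheduler can run operators in topological order, feeding each node's output to its dependents. This is an illustrative sketch using Python's standard-library graphlib, not Towhee's actual engine:

```python
from graphlib import TopologicalSorter

# Toy pipeline DAG: each node maps to (operator function, upstream dependencies).
graph = {
    "decode":    (lambda: "pixels", []),
    "embed":     (lambda pixels: f"vec({pixels})", ["decode"]),
    "normalize": (lambda vec: f"unit({vec})", ["embed"]),
}

def run(graph):
    """Execute nodes in dependency order, passing upstream outputs downstream."""
    order = TopologicalSorter({n: deps for n, (_, deps) in graph.items()}).static_order()
    results = {}
    for node in order:
        fn, deps = graph[node]
        results[node] = fn(*(results[d] for d in deps))
    return results

print(run(graph)["normalize"])  # -> unit(vec(pixels))
```

A real engine adds batching, parallel scheduling across branches, and resource monitoring on top of this basic dependency-ordered execution.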
For those looking to explore further or contribute, Towhee welcomes contributions ranging from code development to documentation improvements; more information can be found on the project's contributing page.
For vector storage solutions, consider exploring Milvus.