meerkat - Explore and Annotate Unstructured Datasets with Python

Project Introduction: Meerkat

Overview

Meerkat is an open-source Python library for visualizing, exploring, and annotating datasets. It is particularly suited to unstructured data such as free text, images, video, and PDFs, and it uses machine learning models to help users work with that data. Developed at the Hazy Research lab at Stanford, Meerkat aims to make large volumes of complex data easier to inspect and interact with.

Key Features

Low Overhead

Getting started takes only four lines of Python. Meerkat integrates with common data tools such as Pandas, Arrow, HF Datasets, Ibis, and SQL, so users can work with data where it already lives instead of uploading it to an external database or reformatting it.

import meerkat as mk
df = mk.from_csv("paintings.csv")
df["image"] = mk.files("image_url")
df

Diverse Data Types

Meerkat supports a wide variety of data types, enabling users to visualize and annotate text, images, audio, video, MRI scans, PDFs, HTML files, JSON files, and more, through its interactive interfaces.

Intelligent User Interfaces

Meerkat makes it straightforward to build machine learning models, including large language models, into the user interface. These embedded models power features such as search, grouping, and autocomplete.

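# Embed each image with CLIP (assumes df has an image column named "img")
df["embedding"] = mk.embed(df["img"], engine="clip")
# Search component that matches a query against the stored embeddings
match = mk.gui.Match(df, against="embedding", engine="clip")
# Sort rows so the best matches come first
sorted_df = mk.sort(df, by=match.criterion.name, ascending=False)
# Display the search component together with a gallery of the sorted rows
gallery = mk.gui.Gallery(sorted_df)
mk.gui.html.div([match, gallery])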
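
To make the search step concrete, here is a minimal, library-free sketch of what ranking rows against a query embedding amounts to: score each row by cosine similarity to the query vector, then sort in descending order. The three-dimensional vectors, the file names, and the `rank_by_similarity` helper are made up for illustration. This is not Meerkat's implementation, and real CLIP embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(rows, query_embedding):
    """Return row names ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(row["embedding"], query_embedding), row["name"])
        for row in rows
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Toy data: hypothetical rows with hand-written 3-d "embeddings".
rows = [
    {"name": "sunset.jpg", "embedding": [0.9, 0.1, 0.0]},
    {"name": "forest.jpg", "embedding": [0.1, 0.9, 0.1]},
    {"name": "beach.jpg", "embedding": [0.7, 0.2, 0.3]},
]
query = [1.0, 0.0, 0.0]  # stands in for the embedding of a text query

ranking = rank_by_similarity(rows, query)
print(ranking)
```

Sorting in descending order mirrors the `ascending=False` argument in the snippet above: the rows most similar to the query appear first.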

Customizable and Composable

Meerkat lets users build custom interfaces from declarative visualization components, in the style of libraries such as Seaborn. The components are composable and can be modified to fit the task.

# plot_df is assumed to be a Meerkat DataFrame with "umap_1" and "umap_2" columns
plot = mk.gui.plotly.Scatter(df=plot_df, x="umap_1", y="umap_2")

@mk.gui.reactive
def filter(selected: list, df: mk.DataFrame):
    return df[df.primary_key.isin(selected)]

filtered_df = filter(plot.selected, plot_df)
table = mk.gui.Table(filtered_df, classes="h-full")

mk.gui.html.flex([plot, table], classes="h-[600px]")
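
The reactive pattern in the snippet above can be illustrated without Meerkat. The sketch below uses a hypothetical `Store` class and `reactive` decorator, neither of which is Meerkat's API or implementation, to show the general idea: a derived value is recomputed automatically whenever one of its inputs changes, which is what lets a table follow the current plot selection.

```python
class Store:
    """Holds a value and notifies subscribers when it changes (illustrative only)."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def get(self):
        return self._value

    def set(self, value):
        self._value = value
        for callback in list(self._subscribers):
            callback()

    def subscribe(self, callback):
        self._subscribers.append(callback)


def reactive(fn):
    """Wrap fn so its result is a Store that stays in sync with any Store inputs."""

    def wrapper(*args):
        def compute():
            # Unwrap Store arguments to their current values before calling fn.
            return fn(*[a.get() if isinstance(a, Store) else a for a in args])

        result = Store(compute())
        for arg in args:
            if isinstance(arg, Store):
                # Recompute the result every time an input Store changes.
                arg.subscribe(lambda: result.set(compute()))
        return result

    return wrapper


@reactive
def filter_rows(selected, rows):
    """Keep only the rows whose id is in the current selection."""
    return [row for row in rows if row["id"] in selected]


rows = [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}, {"id": 3, "title": "C"}]
selected = Store([])                    # nothing selected yet
filtered = filter_rows(selected, rows)  # derived value, initially empty

selected.set([1, 3])                    # simulates selecting two points in a plot
print(filtered.get())
```

The calling code never re-runs `filter_rows` by hand; updating `selected` is enough to refresh `filtered`.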

Ideal Use Cases

Exploratory analysis over unstructured data
Spot-checking performance of large language models like GPT-3
Identifying systematic errors in machine learning models
Quick labeling of validation data

Limitations

While Meerkat shines with unstructured data, it may not be the best fit if:

The task involves only structured data (numerical or categorical).
The goal is a straightforward demo of a machine learning model that does not need frequent data visualization.
The project involves large-scale manual data labeling, for which dedicated labeling tools are more suitable.

About the Team

Meerkat is developed by machine learning PhD students at Stanford's Hazy Research lab, whose work focuses on making models more reliable when processing unstructured datasets. The team welcomes questions and collaborations and can be contacted by email for details on using Meerkat in a project.

For more detail, users can visit Meerkat's website and consult its documentation.