Lance: A Modern Data Format for Machine Learning
Lance is a columnar data format designed for machine learning (ML) workflows and datasets. It provides fast, efficient data access and management, which suits applications such as building search engines, creating feature stores, and handling complex data types. Its performance gains are largest in tasks that require high-speed input/output operations and data shuffling during large-scale ML training.
Key Features of Lance
- High-performance Random Access: Random access is up to 100 times faster than with Parquet, without sacrificing scan performance.
- Vector Search Capabilities: Lance finds nearest neighbors in milliseconds and lets users combine Online Analytical Processing (OLAP) queries with vector search.
- Automatic Versioning: Lance versions data automatically, so users can manage different versions of a dataset without additional infrastructure.
- Ecosystem Integrations: Lance is compatible with popular data processing tools such as Apache Arrow, Pandas, Polars, and DuckDB, with more integrations planned.
Getting Started with Lance
To start with Lance, users can install it using pip:
pip install pylance
For those interested in trying the latest features, a preview release can be installed:
pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
Example Workflow: Converting to Lance
Users can quickly convert their datasets to the Lance format using only a few lines of code. Here’s how you can convert a Parquet dataset:
import lance
import pandas as pd
import pyarrow as pa
import pyarrow.dataset

# Build a small example table and write it out as a Parquet dataset
df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

# Open the Parquet dataset and rewrite it in the Lance format
parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
Reading with Lance
Lance makes it effortless to read data, integrating easily with various data systems and query engines such as Pandas and DuckDB. For example, to read a Lance dataset using Pandas:
dataset = lance.dataset("/tmp/test.lance")
df = dataset.to_table().to_pandas()
df
Or using DuckDB:
import duckdb
# DuckDB resolves the table name `dataset` to the Python variable defined above
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()
Advanced Features
Lance supports more advanced data operations such as vector search: quickly finding the items most similar to a query, which is a critical task in applications like personalized search or recommendation systems. Vector indexes are created and queried with Lance's own tools.
Why Another Data Format?
The machine-learning lifecycle involves numerous stages, each requiring different data representations. Existing formats like Parquet and ORC excel in certain areas but fall short in others, especially when dealing with ML use cases. Lance provides a unified solution that optimizes every stage of the ML development cycle, from collection and exploration to training and deployment, bridging gaps that traditionally required multiple formats and transitions.
Benchmarks and Performance
Lance performs well in benchmarks. In tests on the Oxford Pet dataset, Lance ran analytical queries 50-100 times faster than reading the raw metadata and delivered significantly faster random access than Parquet.
Community Usage and Support
Lance is already powering real-world applications, from serverless ML vector databases like LanceDB to complex data management in e-commerce and autonomous vehicle companies. The project welcomes community contributions and runs active development and discussions through various platforms including Discord and Twitter.
For those interested in diving deeper into the technical details or contributing to the project, Lance offers comprehensive documentation and community resources to get started.