Introduction to wllama
wllama is a WebAssembly binding for the llama.cpp project, designed to run inference directly in web browsers. It stands out because it requires no backend server and no GPU, relying instead on WebAssembly SIMD for performance.
Recent Changes
The development team for wllama is continuously working on improvements and enhancements. Some of the notable recent updates include:
- Version 1.14.0: Introduced the capability to use cached models when devices go offline and brought experimental support for encoder-decoder architectures.
- Version 1.10.0: Improved model loading by accepting Blob types and enhanced file caching performance through the Origin Private File System (OPFS).
- Version 1.9.0: Added a custom logger function, the getModelMetadata() method, and support for certain tokens and stopping conditions in text completions.
For a detailed account of all the changes, users are encouraged to visit the releases page.
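The Blob loading and getModelMetadata() additions mentioned above can be combined roughly as follows. This is a minimal sketch, not the project's official example: it assumes the Wllama constructor takes a map of WASM asset URLs (the CONFIG_PATHS keys and paths below are placeholders), that loadModel() accepts Blob input as the 1.10.0 notes suggest, and that getModelMetadata() returns the parsed GGUF metadata; check the documentation for the exact signatures.

```ts
import { Wllama } from '@wllama/wllama';

// Placeholder map of wllama WASM asset names to URLs; the exact keys and paths
// depend on your bundler setup and are documented by the project.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '/assets/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/assets/multi-thread/wllama.wasm',
};

async function loadFromBlobAndInspect(ggufBlob: Blob): Promise<void> {
  const wllama = new Wllama(CONFIG_PATHS);

  // Assumption: loadModel() accepts Blob input, per the v1.10.0 notes above.
  await wllama.loadModel([ggufBlob]);

  // getModelMetadata() was added in v1.9.0; here we only log whatever it returns.
  const metadata = await wllama.getModelMetadata();
  console.log('GGUF metadata:', metadata);
}
```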
Key Features
wllama is packed with numerous features that facilitate various operations crucial for running and managing machine learning models in-browser:
- TypeScript Support: Improved developer experience with built-in TypeScript type definitions.
- In-Browser Inference: Run model inference directly in your browser without a backend or GPU.
- No Runtime Dependencies: The package ships with no external runtime dependencies.
- High-Level and Low-Level APIs: Offers both easy-to-use APIs for general tasks and more fine-grained controls for advanced users (see the sketch after this list).
- Efficient Model Handling: Supports splitting models into smaller files for efficient parallel loading.
- Automatic Threading: Switches between single-thread and multi-thread builds depending on browser capabilities to improve performance.
- Pre-Built NPM Package: Easy installation via npm with the @wllama/wllama package.
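To make the high-level versus low-level split concrete, the sketch below contrasts a single createCompletion() call with manual tokenization. The method names follow the project's documentation, but the option names (nPredict, sampling) and the exact return types of tokenize() and detokenize() should be treated as assumptions and verified against the docs.

```ts
import { Wllama } from '@wllama/wllama';

// Assumes `wllama` already has a model loaded (see the loading examples
// elsewhere in this article).
async function demoApis(wllama: Wllama): Promise<void> {
  // High-level API: one call that handles tokenization, sampling, and decoding.
  // Option names (nPredict, sampling) are illustrative and may differ by version.
  const completion = await wllama.createCompletion('Tell me a short joke.', {
    nPredict: 64,
    sampling: { temp: 0.7, top_p: 0.9 },
  });
  console.log('completion:', completion);

  // Low-level API: work with the token stream directly (assumed method names).
  const tokens = await wllama.tokenize('Tell me a short joke.');
  console.log('token count:', tokens.length);
  const detok = await wllama.detokenize(tokens);
  console.log('detokenized:', detok);
}
```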
Limitations
Despite its many strengths, wllama does have a few limitations:
- Headers for Multi-threading: To enable multi-threading, the page must be served with the Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy headers (see the sketch after this list).
- WebGL Support: Currently, wllama does not support WebGL, though this might change in future versions.
- File Size Restrictions: The maximum size of a single model file is 2GB, so larger models must be split into smaller files.
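For the multi-threading limitation, the headers in question are the standard cross-origin isolation pair. The snippet below is a minimal sketch for a Vite dev server (Vite is an assumption here, not a wllama requirement); any server that sends the same two headers will work.

```ts
// vite.config.ts
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    // Cross-origin isolation headers needed for SharedArrayBuffer, which the
    // multi-thread build of wllama relies on.
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
});
```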
Using wllama
To integrate wllama into a React TypeScript project, users can install it via npm. The library is highly versatile and includes examples ranging from basic usage to more advanced applications involving embeddings and cosine distances. Detailed instructions and examples can be found in the documentation.
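A minimal integration might look like the sketch below, after installing the @wllama/wllama package with npm. The WASM asset paths, the model URL, and the nPredict value are placeholders; loadModelFromUrl() and createCompletion() are the method names used in the wllama documentation, but their exact options should be checked there.

```tsx
import { useState } from 'react';
import { Wllama } from '@wllama/wllama';

// Placeholder asset map and model URL; adjust both to your own setup.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '/assets/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/assets/multi-thread/wllama.wasm',
};
const MODEL_URL = 'https://example.com/models/your-model.gguf';

export function CompletionDemo() {
  const [output, setOutput] = useState('');
  const [busy, setBusy] = useState(false);

  async function runCompletion() {
    setBusy(true);
    try {
      const wllama = new Wllama(CONFIG_PATHS);
      await wllama.loadModelFromUrl(MODEL_URL);
      // nPredict is illustrative; see the docs for the full option list.
      const text = await wllama.createCompletion(
        'Explain WebAssembly in one sentence.',
        { nPredict: 48 },
      );
      setOutput(text);
    } finally {
      setBusy(false);
    }
  }

  return (
    <div>
      <button onClick={runCompletion} disabled={busy}>
        Run completion
      </button>
      <pre>{output}</pre>
    </div>
  );
}
```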
Compiling Binary Files
For those wishing to compile the binaries themselves—perhaps because they prefer not to use the pre-built binaries or require the latest changes—the project provides a straightforward process using Docker.
Future Plans
wllama aims to expand in several directions:
- Develop more practical examples demonstrating wllama's capabilities.
- Explore GPU inference support via WebGL.
- Implement multi-modal and multi-sequence features, dependent on advancements in WebAssembly's resource management.
By leveraging the power of WebAssembly and the flexibility of the llama.cpp framework, wllama is well-positioned to provide cutting-edge solutions for running machine learning models directly in browsers, enhancing accessibility and ease of use for developers around the world.