mars - Enhance Large-Scale Data Computation with a Scalable Framework

Mars Project: A Comprehensive Overview

Mars is a tensor-based, unified framework for large-scale data computation. It scales well-known libraries such as NumPy, Pandas, Scikit-learn, and many others, so that the same style of code can process both small and very large datasets.

Installation

Mars is straightforward to install using Python's package manager. To get started, one simply needs to run:

pip install pymars

To contribute to the Mars project, install it from source in editable mode together with the development dependencies:

git clone https://github.com/mars-project/mars.git
cd mars
pip install -e ".[dev]"

More installation details are available in the Mars documentation.

Architecture

Mars runs computations inside a session. A session can be created locally on a single machine or attached to a running Mars cluster, so the same code can serve both small and large workloads.

Getting Started

To start using Mars, users can initiate a new session either locally or by connecting to an existing Mars cluster:

import mars
mars.new_session()
# or connect to a cluster
mars.new_session('http://<web_ip>:<ui_port>')

Key Features

Mars Tensor

Mars provides a tensor interface similar to NumPy, allowing users familiar with NumPy to adapt easily. Here’s a comparison:

NumPy Approach:

import numpy as np
N = 200_000_000
a = np.random.uniform(-1, 1, size=(N, 2))
print((np.linalg.norm(a, axis=1) < 1).sum() * 4 / N)

Mars Tensor Approach:

import mars.tensor as mt
N = 200_000_000
a = mt.random.uniform(-1, 1, size=(N, 2))
print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / N).execute())

Mars can use multiple cores even on a laptop, and can be faster still when the work is distributed across a cluster.
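
The execute() call at the end of the Mars version reflects deferred evaluation: the expression is first described, and the computation runs only when execution is requested. The sketch below illustrates that idea in plain Python. It is not Mars code and makes no claim about how Mars is implemented; the Lazy class and its methods are hypothetical names invented for this example.

```python
import operator


class Lazy:
    """Hypothetical deferred value: records an operation instead of computing it."""

    def __init__(self, func, *inputs):
        self.func = func        # operation to apply when executed
        self.inputs = inputs    # upstream Lazy nodes this one depends on

    @classmethod
    def constant(cls, value):
        # Leaf node with no inputs; it returns the wrapped value when executed.
        return cls(lambda: value)

    def _binary(self, op, other):
        # Wrap plain Python values so every operand is a graph node.
        other = other if isinstance(other, Lazy) else Lazy.constant(other)
        return Lazy(op, self, other)

    def __add__(self, other):
        return self._binary(operator.add, other)

    def __mul__(self, other):
        return self._binary(operator.mul, other)

    def __truediv__(self, other):
        return self._binary(operator.truediv, other)

    def execute(self):
        # Evaluate the inputs first, then apply this node's operation.
        return self.func(*(node.execute() for node in self.inputs))


expr = (Lazy.constant(3) + 4) * 2 / 7   # builds a small graph; nothing runs yet
result = expr.execute()                 # the whole graph is evaluated here
print(result)
```

Because the full expression is known before anything runs, a framework built this way is free to decide how to split and schedule the work, which is what makes a single execute() call a natural boundary between describing a computation and running it.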

Mars DataFrame

Mars offers a DataFrame interface akin to Pandas:

Pandas:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100000000, 4), columns=list('abcd'))
print(df.sum())

Mars DataFrame:

import mars.tensor as mt
import mars.dataframe as md
df = md.DataFrame(mt.random.rand(100000000, 4), columns=list('abcd'))
print(df.sum().execute())

Mars Learn

Mars Learn provides an interface that mirrors Scikit-learn for machine learning tasks:

Scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
X, y = make_blobs(n_samples=100000000, n_features=3)
pca = PCA(n_components=3)
pca.fit(X)

Mars Learn:

from mars.learn.datasets import make_blobs
from mars.learn.decomposition import PCA
X, y = make_blobs(n_samples=100000000, n_features=3)
pca = PCA(n_components=3)
pca.fit(X)

Mars also integrates with popular libraries such as TensorFlow, PyTorch, XGBoost, LightGBM, Joblib, and Statsmodels.

Mars Remote

Mars Remote runs ordinary Python functions in parallel, so independent pieces of work can execute concurrently instead of one after another.
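
The pattern behind this is fan-out and fan-in: split a job into independent chunks, run each chunk as its own task, then combine the partial results. The sketch below shows that pattern for the Monte Carlo estimate of pi used earlier, using only the Python standard library rather than Mars. Mars documents a spawn function in its mars.remote module for submitting such tasks; it is deliberately not used here, and the names calc_chunk and calc_pi are illustrative choices for this example.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def calc_chunk(n, seed):
    """Count how many of n random points in [-1, 1] x [-1, 1] fall inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y < 1:
            inside += 1
    return inside


def calc_pi(counts, total):
    """Combine the per-chunk counts into an estimate of pi."""
    return sum(counts) * 4 / total


N = 200_000   # total samples, kept small so the sketch runs quickly
n = 20_000    # samples per chunk

with ThreadPoolExecutor() as pool:
    # Fan out: one independent task per chunk, each with its own seed.
    futures = [pool.submit(calc_chunk, n, i) for i in range(N // n)]
    # Fan in: gather the partial counts and reduce them to one number.
    pi_estimate = calc_pi([f.result() for f in futures], N)

print(pi_estimate)
```

Threads are used here only to show the structure of the computation. For CPU-bound work like this, CPython threads do not run in parallel, so a real speedup needs separate processes or machines, which is the role a distributed framework plays.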

Scalability

Mars scales from a single machine to a cluster with few code changes: the computation stays the same, and only the session it runs in changes. Scaling out helps when a dataset outgrows one machine or when a job needs to finish faster.
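
One way to keep the move between a single machine and a cluster down to a configuration change is to choose the session target outside the code. The sketch below is an assumption-laden illustration, not part of Mars: the helper new_session_args and the environment variable MARS_ENDPOINT are hypothetical names invented for this example. It only builds the arguments for mars.new_session, using the two call forms shown in Getting Started.

```python
import os


def new_session_args(environ=None):
    """Return the positional arguments to pass to mars.new_session().

    An empty tuple means a local session; a one-element tuple carries
    the address of an existing cluster.
    """
    environ = os.environ if environ is None else environ
    # MARS_ENDPOINT is a hypothetical variable name chosen for this sketch.
    endpoint = environ.get("MARS_ENDPOINT", "").strip()
    return (endpoint,) if endpoint else ()


local_args = new_session_args({})
cluster_args = new_session_args({"MARS_ENDPOINT": "http://<web_ip>:<ui_port>"})
print(local_args, cluster_args)

# With Mars installed, the session would then be created as:
#   import mars
#   mars.new_session(*new_session_args())
```

With this arrangement the tensor, DataFrame, and learn code stays untouched, and pointing the program at a cluster means setting one variable before it starts.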

Deployment Options

Mars supports multiple deployment scenarios:

Bare Metal: Mars can be scaled out to a cluster by starting its distributed components on different machines.
Kubernetes: Detailed deployment guidance is provided for Kubernetes environments.
YARN: Mars can also be deployed on a Hadoop YARN cluster.

Community and Contribution

Mars thrives on community involvement. Developers and users can join the discussion through Slack, mailing lists, or by submitting GitHub issues or pull requests. Comprehensive guides and resources are available for those looking to dive deeper.

The Mars project is growing continually, welcoming contributors and users to join its expanding ecosystem.