
#cross-modal retrieval

Chinese-CLIP is a Chinese adaptation of the CLIP model, trained on around 200 million image-text pairs for tasks such as image-text feature extraction, cross-modal retrieval, and zero-shot classification. Built on the open_clip project, it is tailored to Chinese data and adds Core ML model conversion, fine-tuning through knowledge distillation, and deployment with ONNX and TensorRT. The model performs strongly on benchmark tasks such as text-to-image retrieval and zero-shot classification.
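To make the zero-shot classification idea concrete, here is a minimal sketch in plain Python. It assumes the image and label-prompt embeddings have already been produced by some CLIP-style encoder; the toy vectors, the `zero_shot_classify` helper, and the temperature of 100 are illustrative assumptions, not part of the Chinese-CLIP API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, label_embs, temperature=100.0):
    """Return a probability per label from scaled cosine similarities.

    image_emb: embedding of one image (hypothetical, from any CLIP-style encoder).
    label_embs: dict mapping label name -> text embedding of a prompt for that label.
    temperature: assumed logit scale; real models learn their own value.
    """
    img = l2_normalize(image_emb)
    logits = []
    for emb in label_embs.values():
        txt = l2_normalize(emb)
        logits.append(temperature * sum(a * b for a, b in zip(img, txt)))
    # Numerically stable softmax over the label logits.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(label_embs, exps)}


# Toy 3-dimensional embeddings standing in for real encoder outputs.
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best, probs)
```

No label-specific training happens here: the classes are defined only by the text embeddings, which is what makes the classification zero-shot.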

ImageBind maps images, text, audio, depth, thermal, and IMU data into a single embedding space, which enables cross-modal retrieval and composition of data across modalities. It supports zero-shot classification and multi-modal generation, and ships a ready-to-use PyTorch implementation with pretrained models for AI developers and researchers.
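Once every modality lives in one embedding space, cross-modal retrieval reduces to a nearest-neighbor search. The sketch below illustrates this with hand-written toy vectors; the `retrieve` helper, the item names, and the embeddings are hypothetical and do not come from ImageBind itself.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, gallery, top_k=2):
    """Rank gallery items of any modality by similarity to the query embedding."""
    scored = [(cosine(query_emb, emb), item_id) for item_id, emb in gallery.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(item_id, score) for score, item_id in scored[:top_k]]


# Hypothetical embeddings of items from different modalities in one shared space.
gallery = {
    "image:beach.jpg": [0.8, 0.2, 0.1],
    "audio:waves.wav": [0.7, 0.3, 0.0],
    "depth:office.png": [0.0, 0.1, 0.9],
}
text_query = [0.9, 0.1, 0.0]  # stand-in for the embedding of a text query

results = retrieve(text_query, gallery)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

The query and the gallery items never need to share a modality; only their embeddings are compared. At realistic gallery sizes the linear scan would typically be replaced by an approximate nearest-neighbor index.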

Uni3D is a scalable 3D pretraining framework for large-scale representation learning, scaled up to one billion parameters. It uses a 2D-initialized ViT to align 3D point cloud features with the features of image-text models. By reusing 2D pretrained models and image-text alignment, Uni3D extends the capabilities of 2D models to 3D and sets new standards on various 3D tasks. The open-source project includes tools for semantic coherence, model weights, evaluation code, and more, encouraging community collaboration and progress in multimodal intelligence.
