Introduction to XPretrain
XPretrain is a project centered on multi-modality learning, in which models process and analyze diverse forms of data, such as images, video, and text, in a unified manner. The project is developed by the Multimedia Search and Mining (MSM) group at Microsoft Research and focuses on pre-training models that improve the integration of video, language, and image data.
Multi-modality Learning
Multi-modality learning in XPretrain is divided into two major areas: Video & Language, and Image & Language.
Video & Language
Datasets
- HD-VILA-100M: XPretrain features the HD-VILA-100M dataset, a large-scale, high-resolution video-language dataset of roughly 100 million clip-text pairs, providing rich and diverse data for studying video-language pre-training.
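To make the shape of such video-language data concrete, the following is a minimal sketch of how clip/transcript pairs from a dataset like HD-VILA-100M might be represented for pre-training. The field names (`clip_path`, `transcript`, `start_sec`, `end_sec`) and the JSON-lines layout are illustrative assumptions, not the dataset's actual release format.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class VideoTextPair:
    """One video clip paired with its transcribed text.

    Field names are illustrative; the real HD-VILA-100M metadata layout may differ.
    """
    clip_path: str      # path or URL of the video clip
    transcript: str     # transcript aligned to the clip
    start_sec: float    # clip start time within the source video
    end_sec: float      # clip end time within the source video

def load_pairs(jsonl_path: str) -> List[VideoTextPair]:
    """Read clip/transcript pairs from a hypothetical JSON-lines metadata file."""
    pairs = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            pairs.append(VideoTextPair(
                clip_path=record["clip_path"],
                transcript=record["transcript"],
                start_sec=record["start_sec"],
                end_sec=record["end_sec"],
            ))
    return pairs
```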
Pre-training Models
- HD-VILA (CVPR 2022): A video-language pre-training model designed to handle high-resolution video data effectively.
- LF-VILA (NeurIPS 2022): A model for long-form video-language pre-training, suited to extended video content.
- CLIP-ViP (ICLR 2023): A model that adapts image-language pre-training methodologies to video-language contexts (a simplified sketch of this idea follows the list).
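To illustrate the idea of reusing an image-language model for video, the sketch below shows one common approach: encode a few sampled frames with an image encoder, mean-pool them into a clip-level embedding, and train with a symmetric contrastive loss against text embeddings. This is a simplified illustration under assumed encoder outputs, not the released CLIP-ViP implementation.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(frame_feats: torch.Tensor,
                                text_feats: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between video clips and texts.

    frame_feats: (batch, num_frames, dim) per-frame embeddings from an
                 image encoder (e.g. a CLIP-style visual backbone).
    text_feats:  (batch, dim) sentence embeddings from a text encoder.
    """
    # Pool per-frame features into one clip-level embedding.
    video_feats = frame_feats.mean(dim=1)                 # (batch, dim)
    video_feats = F.normalize(video_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Cosine-similarity logits between every clip and every text in the batch.
    logits = video_feats @ text_feats.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matched clip/text pairs lie on the diagonal; contrast both directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```

In practice the frame and text features would come from a pre-trained image-text backbone; only the pooling and loss are sketched here.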
Image & Language
Pre-training Models
- Pixel-BERT: An end-to-end model that aligns image pixels directly with language, laying a foundation for the group's image-language pre-training work (a rough sketch of this fusion idea appears after this list).
- SOHO (CVPR 2021 oral): An improved end-to-end pre-training model that uses quantized visual tokens.
- VisualParsing (NeurIPS 2021): A transformer-based end-to-end model that refines image-language pre-training.
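As a rough illustration of the end-to-end fusion these image-language models share, the sketch below embeds text tokens, extracts a grid of visual features with a small convolutional backbone, and runs a single transformer encoder over the concatenated sequence. The module sizes and the toy backbone are assumptions chosen for brevity; this is not the published Pixel-BERT, SOHO, or VisualParsing architecture.

```python
import torch
import torch.nn as nn

class SimpleVisionLanguageFusion(nn.Module):
    """Joint transformer over visual features and text tokens.

    A rough sketch of end-to-end image-language fusion: a CNN produces a
    grid of visual features, the text is embedded per token, and a single
    transformer encoder attends over the concatenated sequence.
    Dimensions are illustrative only.
    """

    def __init__(self, vocab_size: int = 30522, dim: int = 256):
        super().__init__()
        # Tiny visual backbone turning an image into a grid of features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify-like conv
            nn.ReLU(),
        )
        self.text_embed = nn.Embedding(vocab_size, dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                   batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=4)

    def forward(self, images: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W); token_ids: (batch, seq_len)
        vis = self.backbone(images)                 # (batch, dim, h, w)
        vis = vis.flatten(2).transpose(1, 2)        # (batch, h*w, dim)
        txt = self.text_embed(token_ids)            # (batch, seq_len, dim)
        fused = torch.cat([vis, txt], dim=1)        # joint visual-text sequence
        return self.fusion(fused)                   # contextualized features
```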
Recent Updates
The XPretrain project has reached several key milestones:
- March 2023: code for the CLIP-ViP and LF-VILA models was released.
- CLIP-ViP was accepted at ICLR 2023, adapting image-language pre-training to video-language contexts.
- September 2022: LF-VILA was accepted at NeurIPS 2022 for its work on long-form video-language pre-training.
- March 2022: the HD-VILA-100M dataset was released, and the HD-VILA model was accepted at CVPR 2022.
Contributions and Community Engagement
XPretrain welcomes contributions from the community. Contributors must agree to Microsoft's Contributor License Agreement (CLA), which confirms that they have, and grant, the rights needed for their contributions to be used. The project follows the Microsoft Open Source Code of Conduct, fostering an inclusive and respectful environment for collaboration.
Trademark Information
Use of Microsoft trademarks or logos in the XPretrain project must follow Microsoft's Trademark & Brand Guidelines. Any use of third-party trademarks or logos is subject to those third parties' policies.
Contact Information
For queries related to the pre-trained models, users are encouraged to submit an issue report. For further communication, Bei Liu ([email protected]) and Jianlong Fu ([email protected]) can be contacted for more details.
XPretrain remains an active effort in multi-modality learning, advancing how models understand and process diverse forms of data.