PhoGPT: Generative Pre-training for Vietnamese
PhoGPT is an open-source project that introduces a state-of-the-art generative model for the Vietnamese language. It comprises two models: a pre-trained base monolingual model, PhoGPT-4B, and its chat-focused variant, PhoGPT-4B-Chat.
Model Structure
- PhoGPT-4B: The base model, with about 3.7 billion parameters, pre-trained on 102 billion tokens from a comprehensive Vietnamese text corpus. It has a context length of 8192 and a 20,000-token vocabulary.
- PhoGPT-4B-Chat: The conversational variant, fine-tuned from the PhoGPT-4B base model on a dataset of 70,000 instructional prompt-response pairs plus an additional 290,000 conversations. This training allows PhoGPT-4B-Chat to excel at interactive, dialogue-based tasks.
Model Download
PhoGPT models can be accessed and downloaded via Hugging Face:
- PhoGPT-4B: The base model, suitable for a broad range of Vietnamese NLP tasks.
- PhoGPT-4B-Chat: Tailored for chat and instruction-following applications, ideal for interactive use cases.
These models require approximately 7GB of VRAM when loaded in float16 format, making them accessible to a wide range of hardware configurations.
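As a rough sanity check on that figure, the weight memory can be estimated as the parameter count times two bytes per float16 value (activations and the KV cache add more on top):

```python
# Back-of-the-envelope estimate of weight memory for PhoGPT-4B in float16.
# Assumes ~3.7 billion parameters and 2 bytes per parameter.
params = 3.7e9
bytes_per_param = 2  # float16
vram_gib = params * bytes_per_param / 1024**3
print(f"~{vram_gib:.1f} GiB")  # roughly 6.9 GiB, consistent with the ~7GB figure
```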
Running the Model
PhoGPT supports various inference engines enabling easy deployment:
- vLLM, Text Generation Inference, and llama.cpp are among the supported frameworks, facilitating the model’s usage in different programming environments and platforms.
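As one illustration, batched generation through vLLM might be wrapped as follows. This is a hedged sketch: the model identifier `vinai/PhoGPT-4B-Chat` and the `trust_remote_code` flag are assumptions based on standard Hugging Face conventions, so verify them against the model card before use.

```python
# Hypothetical vLLM wrapper for PhoGPT-4B-Chat (requires `pip install vllm`).
def generate_with_vllm(prompts, model_name="vinai/PhoGPT-4B-Chat",
                       temperature=0.7, max_tokens=256):
    # Imported lazily so the rest of the module works without vLLM installed.
    from vllm import LLM, SamplingParams

    # trust_remote_code is assumed necessary because PhoGPT ships a custom
    # model implementation on the Hugging Face Hub.
    llm = LLM(model=model_name, trust_remote_code=True)
    params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    return [out.outputs[0].text for out in llm.generate(prompts, params)]
```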
With llama.cpp
- Follow a few straightforward steps: compile llama.cpp, install its dependencies, and convert the model checkpoint into a format it can execute.
- Optional quantization further reduces model size, with 4- and 8-bit formats supported for efficient inference.
Using Transformers Library
- The Transformers library provides a robust framework for running PhoGPT models, particularly for instruction following and casual conversation.
- The provided Python code lets users interact with the model directly, performing tasks such as text generation and editing.
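A minimal sketch of instruction-following generation with Transformers is shown below. The model identifier `vinai/PhoGPT-4B-Chat` and the prompt template are assumptions modeled on PhoGPT's published chat format; check both against the official model card before relying on them.

```python
# Hedged sketch of chat-style generation with PhoGPT-4B-Chat via Transformers.
# ASSUMPTIONS: the model id and prompt template below should be verified
# against the official PhoGPT model card.
PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"

def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the assumed chat prompt template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

def generate(instruction: str, model_name: str = "vinai/PhoGPT-4B-Chat") -> str:
    # Lazy imports so the prompt helper works without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # trust_remote_code is needed because PhoGPT ships a custom architecture;
    # float16 keeps the weights near the ~7GB VRAM figure noted above.
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, trust_remote_code=True
    )
    model.eval()

    inputs = tokenizer(build_prompt(instruction), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Keep only the answer that follows the template's response marker.
    return text.split("### Trả lời:")[-1].strip()
```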
Fine-tuning the Model
For users who wish to tailor the PhoGPT model to specific applications or datasets, fine-tuning is supported with:
- Example configurations and datasets, allowing for straightforward model customization and enhancement.
- Options to employ tools such as llm-foundry, transformers, and other external libraries dedicated to model fine-tuning.
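Whichever tool is used, instruction-response pairs typically need to be flattened into plain training strings first. The helper below is a hypothetical sketch; the template is a placeholder and should be matched to the format the chosen fine-tuning tool expects.

```python
# Hedged sketch: turning instruction/response pairs into training text for
# supervised fine-tuning. The template is a hypothetical placeholder; adapt
# it to the format your fine-tuning tool (llm-foundry, transformers, ...)
# actually expects.
def format_example(instruction: str, response: str) -> str:
    return f"### Câu hỏi: {instruction}\n### Trả lời: {response}"

def build_dataset(pairs):
    """pairs: iterable of (instruction, response) tuples."""
    return [format_example(i, r) for i, r in pairs]

samples = build_dataset([("Thủ đô của Việt Nam là gì?", "Hà Nội.")])
```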
Limitations
Despite its capabilities, PhoGPT has some constraints:
- It may struggle with tasks that require a high degree of reasoning, such as complex mathematics or programming.
- Users should also be cautious of the potential for harmful content, hate speech, or biased replies, and should verify that the model's output is appropriate and factually accurate.
By releasing PhoGPT to the community, the developers aim to advance research and practical applications for the Vietnamese language. They encourage users to cite their technical report when employing PhoGPT in research or other software. The project is shared under the Apache License 2.0, permitting both academic and commercial use under its terms.