Introducing Darwin: A Specialized GPT for Scientific Innovation
Darwin is an open-source project developed by the University of New South Wales AI4Science and GreenDynamics AI. It aims to improve how well language models understand and generate scientific content, primarily in materials science, chemistry, and physics. Darwin fine-tunes the LLaMA model on scientific literature and datasets to make language models more precise and more useful in scientific research.
Key Features
- Scientific Focus: Darwin is specifically designed to process and generate scientific information, making it a valuable tool for research in materials science, chemistry, and physics.
- Integration of Knowledge: The model combines structured and unstructured scientific data to improve its understanding and output in scientific contexts.
- Open Source and Research-Oriented: Darwin is released under the CC BY-NC 4.0 license and is intended for non-commercial research use only.
Recent Updates
Darwin continues to evolve and achieve significant milestones:
- As of February 15, 2024, Darwin reports state-of-the-art results on experimental bandgap prediction and metallic classification, outperforming existing models such as fine-tuned GPT-3.5.
- A Google Colab version was made available in September 2023, enabling wider usage and experimentation with the model.
Model Insights
Darwin is built on the 7B-parameter LLaMA model and trained on over 100,000 data points. These are generated by the Darwin Scientific Instruction Generator (SIG) from FAIR datasets and a scientific literature corpus. According to the project, this approach lets Darwin outperform GPT-4 on scientific question-and-answer tasks and fine-tuned GPT-3 on chemistry problem-solving.
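To make the idea of instruction data generated from structured datasets more concrete, here is a minimal sketch of how one structured record could be turned into an instruction-style data point. This is not the SIG implementation: the function name, the field names (`formula`, `band_gap_ev`), and the instruction/input/output layout (borrowed from the Stanford Alpaca convention) are all assumptions made for illustration.

```python
import json


def record_to_instruction(record: dict) -> dict:
    """Turn one structured row into an instruction/input/output triple.

    The layout and field names are hypothetical, chosen for illustration.
    """
    return {
        "instruction": "What is the experimental band gap of the given material, in eV?",
        "input": record["formula"],
        "output": f"{record['band_gap_ev']:.2f} eV",
    }


# Illustrative row; a real pipeline would read rows from a dataset file.
row = {"formula": "GaAs", "band_gap_ev": 1.42}
example = record_to_instruction(row)
print(json.dumps(example, indent=2))
```

Applying a function like this across a whole dataset yields many question-and-answer pairs in one consistent format, which is the shape of data that instruction fine-tuning expects.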
Development and Usage
The Darwin project is a work in progress, with ongoing efforts to advance its capabilities and safety measures. Users are encouraged to explore Darwin's potential and contribute feedback to enhance its development.
For those interested in using Darwin:
- Detailed installation instructions and requirements are provided to facilitate set-up and use.
- The model requires significant computational resources: inference needs at least 10GB of GPU memory.
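As a rough way to reason about GPU memory for a model of this size, the sketch below estimates the memory needed to hold the weights alone at a few numeric precisions. The nominal parameter count of 7 billion and the helper name `weight_memory_gb` are assumptions for illustration, not figures or code from the project.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: a nominal "7B" model has about 7 billion parameters.
N_PARAMS = 7e9

for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: {weight_memory_gb(N_PARAMS, nbytes):.0f} GB")
```

These figures cover weights only. Activations and the key-value cache add to them, and the numeric precision behind the project's 10GB figure is not stated here.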
Fine-tuning and Data Sources
Researchers can further fine-tune the Darwin model using different datasets. The project employs extensive scientific literature and FAIR datasets as primary data sources, supporting advanced scientific experimentation.
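When fine-tuning on a new dataset, each record is typically rendered into a single prompt string before training. The sketch below uses a template modeled on the one published with Stanford Alpaca; whether Darwin uses exactly this template is an assumption, and `build_prompt` is a hypothetical helper rather than part of the project's code.

```python
# Templates modeled on the Stanford Alpaca prompt format (assumed, not confirmed for Darwin).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(example: dict) -> str:
    """Render one record; use the shorter template when the input field is empty."""
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format(**example)


# Illustrative record with no separate input field.
record = {"instruction": "Name the element with the symbol Fe.", "input": "", "output": "Iron"}
prompt = build_prompt(record)
print(prompt)
```

During training, the `output` field is appended after the `### Response:` marker as the target text, so a custom dataset only needs to supply records in the same three-field shape.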
The Team Behind Darwin
The project boasts a strong collaborative team from UNSW and GreenDynamics along with key advisors from UNSW Engineering. These contributors bring together expertise from multiple disciplines to drive the project's success.
Acknowledgements and Future Opportunities
Darwin's development has benefited from various influential open-source projects, including LLaMA and Stanford Alpaca, as well as gptchem. The team acknowledges the support of NCI Australia for high-performance computing resources.
For those interested in joining the project, PhD, postdoctoral, and other positions are available. Interested readers are encouraged to contact the project leaders for more information.
Darwin is a notable step in applying AI to scientific research, and its continued development is expected to make language models more useful to the scientific community.