awesome_Chinese_medical_NLP: An Introduction
The "awesome_Chinese_medical_NLP" project is a curated repository of Chinese medical natural language processing (NLP) resources. It catalogs publicly accessible resources including terminology collections, corpora, word vectors, pre-trained models, knowledge graphs, named entity recognition datasets, question-answering datasets, and information extraction tools. It serves as a starting point for researchers and practitioners working in Chinese medical NLP.
Benchmark
The project highlights the Chinese Biomedical Language Understanding Evaluation (CBLUE), a benchmark for Chinese medical information processing tasks. CBLUE is a collaborative effort aimed at advancing Chinese medical NLP technology and its community, with contributions from Alibaba Cloud Tianchi and universities including Peking University and Zhengzhou University, among others.
Terminology Collections and Corpora
For researchers in need of comprehensive data, this project compiles various collections and corpora:
- Medical News: A collection of Chinese medical news articles.
- Medical Books: An open collection of Chinese medical textbooks in LaTeX format.
- Medical Vocabularies: A large compendium of medical terminology from the THUOCL team at Tsinghua University.
- International Classification of Diseases (ICD): Chinese versions of the 9th, 10th, and 11th revisions.
- Specific Datasets: Task-specific corpora, such as an annotated Chinese diabetes dataset.
Word Vectors and Pre-trained Models
The project lists pre-trained models and word vector resources that underpin many medical NLP tasks:
- ChineseEHRBert: A pre-trained BERT model tailored for Chinese electronic medical records.
- MC-BERT: A pre-trained model released as part of ChineseBLUE, which provides datasets and models for medical tasks.
- Medical Word2Vec: Offers word vector resources specialized for the biomedical domain in Chinese.
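Word vectors are typically compared with cosine similarity, so that related medical terms end up close to one another. The sketch below illustrates that comparison using only the standard library. The three-dimensional vectors and the `most_similar` helper are invented for illustration and do not come from any of the resources listed above; real embeddings have hundreds of dimensions and would be loaded from the published files.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, made up for this example only (not real embeddings).
toy_vectors = {
    "糖尿病": [0.9, 0.1, 0.0],  # diabetes
    "胰岛素": [0.8, 0.3, 0.1],  # insulin
    "骨折": [0.0, 0.2, 0.9],    # fracture
}

ranking = most_similar("糖尿病", toy_vectors)
```

With these toy values, "胰岛素" ranks above "骨折" as a neighbor of "糖尿病", which is the kind of behavior one would hope for from domain-specific embeddings.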
Segmentation Tools
Because Chinese text is written without spaces between words, word segmentation is usually the first processing step. The list includes segmentation tools suited to Chinese medical texts:
- PKUSEG: A multi-domain segmentation tool with dedicated support for medical texts.
- CMeKG Tools: Offers medical text segmentation capabilities.
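To make the role of a medical vocabulary in segmentation concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline. This is not the algorithm used by PKUSEG or CMeKG Tools, which rely on trained models; the function name `forward_max_match` and the toy vocabulary are assumptions made for this example.

```python
def forward_max_match(text, vocabulary, max_len=None):
    """Segment `text` greedily, always taking the longest dictionary word.

    Characters not covered by any dictionary word become single-character
    tokens.
    """
    if max_len is None:
        # Longest word in the dictionary bounds the candidate window.
        max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary, invented for illustration only.
toy_vocab = {"糖尿病", "患者", "胰岛素", "注射", "需要"}
tokens = forward_max_match("糖尿病患者需要注射胰岛素", toy_vocab)
```

Without "糖尿病" and "胰岛素" in the dictionary, this baseline would split those terms into single characters, which is why domain vocabularies and domain-trained segmenters matter for medical text.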
Knowledge Graphs and Relationship Extraction
This section focuses on tools and datasets for constructing and utilizing medical knowledge graphs:
- CMeKG: A Chinese Medical Knowledge Graph that encodes relationships among medical concepts.
- OMAHA: Provides datasets for understanding drug indications through knowledge graphs.
Named Entity Recognition (NER)
The compilation includes extensive resources for recognizing entities in Chinese medical records:
- CCKS: Multiple datasets from the China Conference on Knowledge Graph and Semantic Computing provide corpora for NER tasks.
- CHIP2020: Datasets from the China Health Information Processing Conference (CHIP) 2020 for identifying entities in Chinese medical text.
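NER corpora are commonly distributed either as character offsets or as per-character BIO tags, and converting between the two is a routine preprocessing step. The sketch below assumes a BIO-tagged input; the exact format of each dataset listed above is not specified here and should be checked against its documentation. The label names (`DISEASE`, `DRUG`) and the function `bio_to_spans` are hypothetical.

```python
def bio_to_spans(chars, tags):
    """Convert per-character BIO tags into (text, start, end, label) tuples.

    `end` is exclusive. An I- tag that does not continue an open entity of
    the same label is treated as outside any entity.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, label))  # close the previous entity
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == label:
            continue  # still inside the current entity
        else:
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:
        spans.append((start, len(tags), label))  # entity runs to end of text
    return [("".join(chars[s:e]), s, e, lab) for s, e, lab in spans]


# Hand-written example sentence and tags, for illustration only.
chars = list("患者有糖尿病史")
tags = ["O", "O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "O"]
entities = bio_to_spans(chars, tags)
```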
Question Answering
The repository contains datasets and systems for building and evaluating Chinese medical QA systems:
- cMedQA: A dataset series dedicated to question-answering tasks on medical topics.
- KGQA: Leverages medical knowledge graphs to enhance QA systems.
Terminology Standardization
Datasets such as CHIP2019 support clinical terminology standardization, that is, mapping the varied ways a term is written in practice to a single standard form. This is critical for maintaining consistency in medical language processing.
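As a rough illustration of the task, the sketch below maps a free-text mention to the closest entry in a list of standard terms using string similarity from Python's `difflib`. This is only a naive baseline under stated assumptions: the standard-term list is a toy example rather than an actual coding system, `normalize_term` is a hypothetical helper, and systems built for this task typically use learned models instead of character overlap.

```python
import difflib


def normalize_term(mention, standard_terms, cutoff=0.6):
    """Return the standard term most similar to `mention`, or None.

    Similarity is difflib's SequenceMatcher ratio; candidates scoring below
    `cutoff` are rejected.
    """
    matches = difflib.get_close_matches(mention, standard_terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Toy list of standard terms, for illustration only.
toy_standard_terms = ["2型糖尿病", "高血压", "急性心肌梗死"]

mapped = normalize_term("二型糖尿病", toy_standard_terms)
unmapped = normalize_term("感冒", toy_standard_terms)
```

Here "二型糖尿病" maps to "2型糖尿病" because four of its five characters match, while "感冒" shares no characters with any candidate and is left unmapped.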
Similar Sentence Pair Determination and Text Classification
This section covers competition datasets, such as a task on determining similar sentence pairs held during the COVID-19 pandemic, and text classification datasets focused on clinical trial selection.
Other Resources
Lastly, the project lists resources for other NLP tasks, such as identifying question intent in patient consultations and reading comprehension over popular medical science content.
In summary, awesome_Chinese_medical_NLP is a broad collection of resources, tools, and datasets for Chinese medical NLP. It spans tasks from segmentation and entity recognition to question answering and terminology standardization, giving developers, researchers, and academics a single place to find material for this specialized field.