Professor Lianwen Jin's Team Wins the EvaHan2023 Classical Chinese Translation Competition

The Machine Translation Summit (MTS 2023) concluded on September 8 in Macau, China. The first Ancient Chinese Machine Translation Competition (EvaHan 2023), held at the summit, attracted more than 20 teams from well-known research institutions, including Peking University, Nanjing University, South China University of Technology, The Chinese University of Hong Kong, and Beijing Institute of Technology. The team led by Professor Lianwen Jin from our school proposed a solution based on a Large Language Model (LLM) and ultimately won the competition.

 

Classical Chinese is a carrier of traditional Chinese culture, and automatic translation between classical and modern Chinese is of great help in understanding ancient Chinese history and passing on China's outstanding traditional culture. However, significant differences in grammatical structure, expression habits, and other aspects between classical and modern Chinese make translation from classical to modern Chinese challenging. In addition, elliptical sentence structures are widely used in classical Chinese, and restoring the omitted parts during translation requires the translation system to have rich prior knowledge.

 

To address these difficulties, Professor Jin's team proposed the following solution. First, building on the large-scale pre-trained language model LLaMA, they expanded its vocabulary with classical Chinese data and innovatively fused the pre-trained model's word embeddings to initialize the expanded classical Chinese vocabulary, making full use of the knowledge stored in the pre-trained model. Second, by integrating and refining existing classical Chinese corpora, they constructed a large-scale classical Chinese dataset and used it for incremental unsupervised pre-training of the vocabulary-expanded model, equipping the model with rich prior knowledge of classical Chinese. Finally, they conducted multi-stage supervised training on the competition data, achieving a BLEU score of 29.68 and a chrF score of 26.14 on the machine translation metrics, winning the competition by a clear margin. This result also significantly surpasses Baidu Translate's BLEU score of 25.57.
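The vocabulary-expansion step described above can be illustrated with a minimal sketch. This is not the team's actual implementation; it assumes a common embedding-fusion strategy in which each newly added token's embedding is initialized as the average of the pretrained embeddings of the subword pieces the base tokenizer would split it into, so that knowledge already stored in the pretrained model is reused. The function names (`fuse_embedding`, `expand_vocab`) and the toy tokenizer are hypothetical.

```python
# Illustrative sketch (hypothetical, not the team's code): expand a model's
# vocabulary and initialize each new token's embedding by averaging the
# pretrained embeddings of its subword pieces ("embedding fusion").

def fuse_embedding(subword_ids, embedding_table):
    """Average the pretrained embedding rows of the given subword ids."""
    dim = len(embedding_table[0])
    fused = [0.0] * dim
    for tid in subword_ids:
        for j, value in enumerate(embedding_table[tid]):
            fused[j] += value
    return [v / len(subword_ids) for v in fused]

def expand_vocab(embedding_table, new_tokens, tokenize):
    """Append one fused embedding row per new token.

    `tokenize` maps a new token to the subword ids the base tokenizer
    would split it into. Returns the enlarged table and a mapping from
    each new token to its new row index.
    """
    table = [list(row) for row in embedding_table]
    token_to_id = {}
    for token in new_tokens:
        token_to_id[token] = len(table)
        table.append(fuse_embedding(tokenize(token), table))
    return table, token_to_id

# Toy demo: a 4-token base vocabulary with 2-dimensional embeddings.
base = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
# Pretend the base tokenizer splits the new classical-Chinese word into
# subword pieces with ids 2 and 3.
table, ids = expand_vocab(base, ["新词"], lambda tok: [2, 3])
print(ids["新词"], table[ids["新词"]])  # → 4 [3.0, 1.0]
```

After this initialization the enlarged embedding table is further trained (here, by the incremental unsupervised pre-training on classical Chinese text described above), so the fused rows serve only as an informed starting point rather than random initialization.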

 

The team members include Jiahuan Cao (master's student), Dezhi Peng (doctoral student), Yongxin Shi (doctoral student), Zongyuan Jiang (master's student), and Professor Jin (supervisor).

(Text: Jiahuan Cao, Dezhi Peng; Preliminary reviewer: Shushu Zeng; Final reviewer: Jian Zhang)

Fig. 1 Certificate