MCI's response to PQ on National Multimodal Large Language Model Programme
Parliament Sitting on 10 January 2024
QUESTION FOR WRITTEN ANSWER
55. Mr Dennis Tan Lip Fong: To ask the Prime Minister whether the National Multimodal Large Language Model Programme will incorporate non-textual training data for commonly used languages in Singapore such as Singlish, Malay, Tamil and Chinese dialects to enhance its voice recognition capabilities and widen the accessibility of this programme to Singaporeans.
Answer:
The National Multimodal LLM Programme aims to develop Large Language Models (LLMs) that are more suited for our context. The Southeast Asian Languages in One Network (SEA-LION) model that was recently released was trained on a dataset that has more than 10 languages, including colloquial English (or Singlish), Chinese, Malay and Tamil.
In the next phase, the Programme will look at techniques to incorporate speech data containing non-verbal cues such as tone and pitch, to augment SEA-LION. For this, the Programme will first evaluate model performance when non-textual data in standard and colloquial English are added, before moving on to other languages. As we build our local expertise in developing and training regional LLMs through this effort, we will closely monitor ongoing developments and adapt our plans as the technologies in the field evolve.