VBART: The Turkish LLM
arXiv:2403.01308 · cs.CL, cs.AI, cs.LG · Submitted March 2, 2024
We present VBART, the first Turkish sequence-to-sequence Large Language Models pre-trained from scratch on a large corpus. VBART comes in two sizes (Large and XLarge), based on ideas from BART and mBART.
Fine-tuned VBART models surpass prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering, and question generation. Key findings:
- A monolingual Turkish LLM outperforms multilingual models by up to 3x
- A monolingual tokenizer is up to 11x more efficient than multilingual tokenizers
- Introduces a method to enlarge an existing pre-trained LLM
- 135 GB cleaned vngrs-web-corpus publicly released
The fine-tuned models, tokenizer, and corpus are available on HuggingFace.
Links: arXiv · PDF · HuggingFace