VBART: The Turkish LLM

arXiv:2403.01308 · cs.CL, cs.AI, cs.LG · Submitted March 2, 2024

We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained from scratch on a large corpus. VBART comes in two sizes, Large and XLarge, and builds on ideas from BART and mBART.

Fine-tuned VBART models surpass prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering, and question generation. Key findings:

  • A monolingual Turkish LLM outperforms multilingual models by up to 3x
  • The monolingual tokenizer is up to 11x more efficient than multilingual tokenizers
  • Introduces a method for enlarging an existing pre-trained LLM (sketched below)
  • The 135 GB cleaned vngrs-web-corpus is publicly released
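
The enlargement method itself is detailed in the paper; as a loose illustration of the general idea, the sketch below warm-starts a deeper seq2seq model by tiling pre-trained encoder and decoder layers. The tiling scheme, the facebook/bart-large stand-in checkpoint, and the target depth of 24 are illustrative assumptions, not VBART's actual recipe.

```python
import copy

import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM


def grow_depth(layers: nn.ModuleList, target_depth: int) -> nn.ModuleList:
    """Warm-start a deeper layer stack by cycling through pre-trained layers.

    The tiling scheme here is an illustrative assumption, not necessarily
    the paper's exact enlargement recipe.
    """
    return nn.ModuleList(
        copy.deepcopy(layers[i % len(layers)]) for i in range(target_depth)
    )


# Stand-in checkpoint; the pre-trained VBART-Large would be used in practice.
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
model.model.encoder.layers = grow_depth(model.model.encoder.layers, 24)
model.model.decoder.layers = grow_depth(model.model.decoder.layers, 24)
model.config.encoder_layers = 24
model.config.decoder_layers = 24
# The enlarged model is then pre-trained further before fine-tuning.
```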

Fine-tuned models, tokenizer, and corpus are available on HuggingFace.
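
Since VBART follows the BART architecture, the released checkpoints should load with the standard transformers seq2seq classes. The repo id below is a guess at the naming convention, so check the organization page on HuggingFace for the exact model names.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical repo id; see the vngrs-ai organization on HuggingFace
# for the exact names of the released fine-tuned checkpoints.
repo = "vngrs-ai/VBART-Large-Summarization"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)

article = "..."  # a Turkish news article to summarize
inputs = tokenizer(article, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```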

Links: arXiv · PDF · HuggingFace