VBART: The Turkish LLM

arXiv:2403.01308 · cs.CL, cs.AI, cs.LG · Submitted March 2, 2024

We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained from scratch on a large corpus. VBART comes in two sizes, Large and XLarge, and builds on ideas from BART and mBART.

Fine-tuned VBART models surpass prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering, and question generation. Key findings:

  • A monolingual Turkish LLM outperforms multilingual models by up to 3x
  • The monolingual tokenizer is up to 11x more efficient than multilingual tokenizers
  • Introduces a method for enlarging an existing pre-trained LLM (sketched below)
  • The 135 GB cleaned vngrs-web-corpus is publicly released
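
The enlargement method itself is detailed in the paper; as a loose illustration of the general idea, the sketch below warm-starts a deeper seq2seq model by tiling pre-trained encoder and decoder layers. The tiling scheme, the facebook/bart-large stand-in checkpoint, and the target depth of 24 are illustrative assumptions, not VBART's actual recipe.

```python
import copy

import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM


def grow_depth(layers: nn.ModuleList, target_depth: int) -> nn.ModuleList:
    """Warm-start a deeper layer stack by cycling through pre-trained layers.

    The tiling scheme here is an illustrative assumption, not necessarily
    the paper's exact enlargement recipe.
    """
    return nn.ModuleList(
        copy.deepcopy(layers[i % len(layers)]) for i in range(target_depth)
    )


# Stand-in checkpoint; the pre-trained VBART-Large would be used in practice.
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
model.model.encoder.layers = grow_depth(model.model.encoder.layers, 24)
model.model.decoder.layers = grow_depth(model.model.decoder.layers, 24)
model.config.encoder_layers = 24
model.config.decoder_layers = 24
# The enlarged model is then pre-trained further before fine-tuning.
```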

Fine-tuned models, tokenizer, and corpus are available on HuggingFace.
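
Since VBART follows the BART architecture, the released checkpoints should load with the standard transformers seq2seq classes. The repo id below is a guess at the naming convention, so check the organization page on HuggingFace for the exact model names.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical repo id; see the vngrs-ai organization on HuggingFace
# for the exact names of the released fine-tuned checkpoints.
repo = "vngrs-ai/VBART-Large-Summarization"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)

article = "..."  # a Turkish news article to summarize
inputs = tokenizer(article, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```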

Links: arXiv · PDF · HuggingFace