Large language models for text translation
In recent years, machine translation has come a long way. Thanks to advances in artificial intelligence and natural language processing (NLP), it’s now possible to translate text from one language to another quickly and accurately. However, traditional approaches to machine translation have their limitations. They often rely on rule-based systems or statistical models that can struggle with complex sentence structures and idiomatic expressions.
That’s where generative large language models (LLMs) come in. These powerful tools use neural networks to generate human-like translations on the fly. In this blog post, I will show how open large language models like Llama-2, which are instruction-tuned generative models, can be used effectively to perform translation tasks.
Setting the scene
Before diving into the specifics of generative LLMs for translation, let’s first understand what they are and how they work. An LLM is essentially a neural network that has been trained on a massive corpus of text data to simply predict the next word. Often this is followed by one or more fine-tuning steps that provide general-purpose capabilities through an instruction or chat interface. For translation, we can instruct these models to perform translation in an automated fashion.
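To make this concrete, here is a minimal sketch of what such an instruction could look like, using the Llama-2 chat prompt format that also appears later in this post (the wording and example sentence are purely illustrative):
#A minimal illustration of instructing a chat-tuned model to translate
#(the exact wording is just an example, not a recommended prompt)
prompt = ("[INST] Translate the following text from Danish to English. "
          "Only return the translation.\n\n"
          "'Hej, hvordan har du det?' [/INST]")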
The beauty of generative LLMs lies in their ability to handle complexities that traditional approaches struggle with. Translation between languages can be challenging for a number of reasons, and nuances are often lost. However, despite being trained for general-purpose tasks mainly in English, I find that these models perform surprisingly well for translation between English and Danish.
Translation using language models
In the code below, a GPTQ-quantized version (8-bit) of the Llama-2-13b-chat model is used for translation with the transformers and AutoGPTQ libraries. I find that the 13b version performs much better for my use case, and similarly that the 8-bit version is significantly better than the 4-bit version. It runs well on a 24 GB GPU; another option is to use the GGUF models in combination with llama-cpp-python.
The code is also available as a GitHub Gist.
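As a rough sketch of the llama-cpp-python alternative mentioned above, loading and running a GGUF model could look like the following (the file name, context size, and sampling parameters are placeholders, not settings used in this post):
#Sketch: running a GGUF model with llama-cpp-python instead of AutoGPTQ
#(the model file name below is hypothetical; download a GGUF build of
#Llama-2-13b-chat and adjust the path and parameters to your hardware)
from llama_cpp import Llama

llm = Llama(model_path="llama-2-13b-chat.Q8_0.gguf",  #hypothetical local file
            n_ctx=4096,       #Llama-2 context window
            n_gpu_layers=-1)  #offload all layers to the GPU if it fits

out = llm("[INST] Translate 'Hej verden' from Danish to English. [/INST]",
          max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])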
The steps for translating a large text file (a book of approximately 300 pages) from Danish to English:
- Setup and load model
#Imports necessary libraries
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer, pipeline, logging
from tqdm import tqdm
#Path to model
#Here, a Llama-2-13b-chat quantized using GPTQ is used
model_path = "TheBloke_Llama-2-13B-chat-GPTQ_gptq-8bit-128g-actorder_True"
#Instantiate model
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
quant_cfg = BaseQuantizeConfig.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_quantized(model_path,
                                           use_safetensors=True,
                                           device="cuda:0",
                                           quantize_config=quant_cfg,
                                           disable_exllama=True)
logging.set_verbosity(logging.CRITICAL)
#Instantiate a pipeline for text-generation task with some
#reasonable default parameters
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                temperature=0.7,
                repetition_penalty=1.15,
                top_p=0.9,
                max_new_tokens=2048,
                top_k=20,
                do_sample=True)
- Load and chunk the text to be translated
#Reads the contents of "book.txt" and splits it into chunks
with open("book.txt") as src:
    book = src.read()
book_list = [i.strip() for i in book.split("\n") if len(i.strip()) != 0]
#Split the content of book into chunks while respecting paragraphs
def line_chunks(lines, chunk_limit):
    chunks = []
    chunk = []
    chunk_len = 0
    for line in lines:
        if len(line) + chunk_len < chunk_limit:
            chunk.append(line)
            chunk_len += len(line)
        else:
            chunks.append(chunk)
            chunk = [line]
            chunk_len = len(line)
    chunks.append(chunk)
    return chunks
#Using a chunk size of 4000 characters as Llama-2 uses approximately
#4 characters per token, i.e. ~1000 tokens per chunk
#This leaves plenty of space for the generated content as
#the Llama-2 context window is 4096 tokens
chunk_size = 4000
#Chunk text and merge sub-chunks
book_chunks = line_chunks(book_list, chunk_size)
book_chunks_merge = ["\n".join([j.strip() for j in i]) for i in book_chunks]
- Define a function that prepares the prompt and calls the model
#Defines a function that performs the translation
#Builds the prompt from a template that I have found to be appropriate
#Translates the text using the pipeline
#Finally cuts the chunk such that only the translated text is returned
def translate(chunks, from_language, to_language, pipeline):
    prompt_template = "<s>[INST] <<SYS>>\nYou are a helpful assistant and an expert translator. You translate text from {from_lang} to {to_lang} while keeping the text in a scientific tone.\n<</SYS>>\n\nTranslate text from {from_lang} to {to_lang}. Only return the translated text without any further explanation. Text in {from_lang}:\n\n'{text}' [/INST] Sure! Here's the translation of the text from {from_lang} to {to_lang}:\n\n'"
    chunks_translated = []
    for i in tqdm(chunks):
        prompt = prompt_template.format(text=i,
                                        from_lang=from_language.lower().capitalize(),
                                        to_lang=to_language.lower().capitalize())
        prompt_len = len(prompt)
        trans = pipeline(prompt)
        chunks_translated.append((prompt_len, trans[0]["generated_text"]))
    chunks_translated_cut = [text[begin:(len(text)-1)] for begin, text in chunks_translated]
    return chunks_translated_cut
- Translate text chunks
#Translate book chunks from Danish to English
book_translated = translate(book_chunks_merge, "danish", "english", pipe)
- Join translated chunks and write to file
#Write the translated chunks to file
with open('book_translated.txt', 'w') as f:
    result_txt = "\n\n".join([i.strip() for i in book_translated])
    f.write(result_txt)
I have used the code snippet above to translate a book from Danish to English. While not perfect, it provides a very good starting point. The most common errors are translations of some Danish-specific words and nouns, while the meaning of the paragraphs was captured well.
Concluding remarks
Generative LLMs represent a significant leap forward in machine translation technology. Their ability to learn directly from raw text data and adapt to new tasks makes them well suited for handling complex sentence structures and idiomatic expressions across various languages. However, it does require some experimentation to get the right prompt when using general-purpose LLMs, especially when the process of extracting the translated text needs to be automated. I find that using open models is a huge advantage due to the complete control over the prompt. For example, I found that adding an output indicator, the part of the prompt following the end-of-instruction token (" [/INST]"), increased the predictability of the output and made extraction easier. There are likely some challenges with this method for less common languages, but my experiments have generally turned out well.
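To illustrate the point about the output indicator: because the model continues generating directly after a known prefix, the translation can also be recovered by splitting on that prefix instead of slicing by prompt length. A small sketch, assuming the pipe and prompt template defined above and a Danish-to-English translation:
#Sketch of extracting the translation via the output indicator
#(equivalent in spirit to the prompt-length slicing used in translate())
indicator = "Here's the translation of the text from Danish to English:\n\n'"
generated = pipe(prompt)[0]["generated_text"]  #assumes a prompt built as above
translation = generated.split(indicator, 1)[-1].rstrip().rstrip("'")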