Translating my Blog into 7 Languages for Pennies with OpenAI
How I am offering content to a diverse audience
Categories: Professional Development Computational
This content is available in several other languages via the dropdown in the navigation bar.
Why Bother with Multilingual Blogging
When I last updated my blog, I redesigned it specifically to support content in multiple languages. Because most of my content will relate to synthetic biology, programming, and occasionally American graduate school, I anticipate most of my readers will already be somewhat proficient in English, my native language. Around 90% of academic literature is published in English. Programming is in a similar situation, with major languages like Java, C, and Python largely (though not exclusively) written in English. I'm unaware of any biology graduate programs in the United States that offer coursework outside English.
However, a subset of my readers will be less comfortable in English than in their native language. For example, a large number of Colombian researchers report having papers rejected due to grammar. It's certainly the case that some potential readers will not be able to read English at all. These could be undergraduates or curious readers outside the sciences. To reach a larger audience and make my content more accessible to these communities, I need a way to quickly and cleanly translate my content into other languages.
Choosing Target Languages
I wish I could say that I chose my languages for a sophisticated reason, but truthfully it's mostly about reach. English is my native language and the one I write in, so that one is automatic. I also wanted to include German, French, and Spanish as languages close to English that provide a decent benchmark for comparison. Hindi and Simplified Chinese are included because of the large number of people who speak them. I added Arabic because I thought it would be interesting to include a right-to-left language, with the eventual intent of programming the layout to align to the right. I also have colleagues who speak each of these languages from whom I can receive feedback.
Building the pipeline
Leveraging AI for Translation
There's a lot of literature out there on why one might choose generative AI (e.g., OpenAI) versus a rules-based system (e.g., Google Translate) for building out automated translation. The main reason I went with OpenAI is personal experience. When translating content from other languages to English, I've noticed that rules-based translations often return awkward grammar. I can't imagine that other languages get a better experience. By using AI, I'm hoping for more natural-sounding translations.
Setting up the OpenAI Pipeline
OpenAI already has a straightforward, well-documented Python package that wraps their API. Since I'm already experienced with Python, using it for my translation setup was a simple decision.
First, you'll need to make sure you have an OpenAI key set up. This also requires loading funds into your account; however, I've found that translating blog posts this way costs me less than a penny at current prices with the gpt-4o-mini model. The API is clearly aimed more at continuous services making many calls over time than at our occasional use for translation, but it works fine for us.
To set up your key, visit OpenAI's developer platform, navigate to your dashboard, then API keys, and create a new secret key. You'll need to save this key for later. Keep it in a safe, private place, as it will allow anybody who has it to use your OpenAI funds for whatever purpose they please.
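The scripts below retrieve this key through a small helper, get_openapi_key(), which isn't shown in this post. A minimal sketch, assuming the key is stored in an environment variable named OPENAI_API_KEY (my choice of name, not a requirement), might look like:

```python
import os


def get_openapi_key() -> str:
    """Read the OpenAI API key from an environment variable.

    Raises an error early if the variable is missing, so later
    API calls don't fail with a confusing authentication message.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```

Keeping the key in an environment variable rather than in the source tree means it can't be committed to a public repository by accident.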
We can test the key with a simple python script to make sure we set it up correctly.
"""Test if the OpenAI API key is valid."""
key = get_openapi_key()
client = OpenAI(api_key=key)
completion = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "You are a health-check API. Simply reply 'OK'",
},
{"role": "user", "content": "Check"},
],
)
if completion.choices[0].message.content == "OK":
print("OpenAI key is valid and working")
else:
print(completion.choices[0].message)
Now that we've verified that the key is working, we can set up the pipeline. We're not offering a live service, so we can use OpenAI's batch API to save a lot of money. The batch API is designed to be used for large quantities of requests that don't require immediate responses, so they can be slower. However, it doesn't actually care about the number of requests you send and it's half the price at the time of writing. This brings the cost down to a fraction of a penny to translate a blog post into every language we're interested in.
OpenAI requires batch requests to be in a specific format. We need to tell the AI what it is (a translation tool), the model we're using, and the language we're translating into. I've read that it helps to also specify a dialect, so we'll add that option too.
def openai_format(
    id: str, model: str, language: str, dialect: str, content: str
) -> dict:
    if dialect:
        system_message = f"""You are a translation backend.
        Translate the user input into {language} using a {dialect}
        dialect. Only return the translation preserving the
        markdown structure. It is possible that you are starting
        mid-code block."""
    else:
        system_message = f"""You are a translation backend.
        Translate the user input into {language}. Only return the
        translation in the matching markdown format. It is
        possible that you are starting mid-code block."""
    formatted_prompt = {
        "custom_id": id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": content},
            ],
            "temperature": 0.1,
        },
    }
    return formatted_prompt
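To illustrate the request format, here's a sketch that builds one request per target language by hand, mirroring the structure openai_format produces. The language list, dialect choices, post text, and custom_id scheme are illustrative assumptions of mine, not API requirements:

```python
# Sketch: one batch request per target language, mirroring openai_format.
markdown_text = "# Hello\n\nA short test post."  # placeholder post content

# Illustrative targets; dialect is None where I have no preference
languages = {
    "es": ("Spanish", "Latin American"),
    "de": ("German", None),
    "zh": ("Simplified Chinese", None),
}

requests = []
for code, (language, dialect) in languages.items():
    system_message = (
        f"You are a translation backend. Translate the user input into {language}."
    )
    if dialect:
        system_message += f" Use a {dialect} dialect."
    system_message += " Only return the translation, preserving the markdown structure."
    requests.append({
        # custom_id encodes the language so each result can be matched
        # back to its request after the batch completes
        "custom_id": f"my-post_{code}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": markdown_text},
            ],
            "temperature": 0.1,
        },
    })
```

Encoding the language in custom_id matters because results come back keyed by that ID; it's the only reliable way to match a translation to the request that produced it.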
Now that we have a function to format our requests, we can bundle them together using OpenAI's Python package. The batch API expects its requests as newline-delimited JSON (one JSON object per line), so we'll use a temporary directory to write our requests to a file, then use the Python package to send them to the API. This gives us a batch ID that we have to keep track of, as it's how we retrieve the results.
import json
from tempfile import TemporaryDirectory

from openai import OpenAI


def send_openai_batch(formatted_prompts: list, description=None):
    key = get_openapi_key()
    client = OpenAI(api_key=key)
    if not description:
        description = "Markdown translation batch"
    with TemporaryDirectory() as tmpdirname:
        # One JSON object per line: the JSONL layout the batch API expects
        with open(f"{tmpdirname}/batchinput.json", "w") as f:
            for prompt in formatted_prompts:
                f.write(json.dumps(prompt) + "\n")
        batch_input_file = client.files.create(
            file=open(f"{tmpdirname}/batchinput.json", "rb"),
            purpose="batch",
        )
        batch_input_file_id = batch_input_file.id
        batch = client.batches.create(
            input_file_id=batch_input_file_id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
            metadata={"description": description},
        )
    return batch
Now that we've sent our request, we need to wait for the batch to finish. Even though we're using the slower batch API, this process usually only takes a few minutes. If we try to grab the results before it's done, we'll get an error. Instead, we can use the ID to check on the batch in a loop, then pull the results afterwards.
from time import sleep


def openai_check_results(batch_id: str):
    key = get_openapi_key()
    client = OpenAI(api_key=key)
    result = client.batches.retrieve(batch_id)
    return result


def wait_for_openai(batch_id: str):
    while True:
        result = openai_check_results(batch_id)
        # Keep polling while the batch is still being processed
        if result.status in ["validating", "in_progress", "finalizing"]:
            print(f"Waiting on OpenAI (batch is {result.status})")
            sleep(5)
            continue
        break
    if result.status in ["failed", "expired",
                         "cancelling", "cancelled"]:
        raise ValueError(f"OpenAI batch {batch_id} failed")
    return result
Getting the actual result is very straightforward. Once we get the green light, we can just grab the text from the result object. The returned value is a series of JSON-formatted strings in a format similar to the one we used for the request. Once we load the JSON, OpenAI's response is buried in result["response"]["body"]["choices"][0]["message"]["content"].
from collections import defaultdict


def build_openai_results(batch_id: str):
    key = get_openapi_key()
    client = OpenAI(api_key=key)
    batch = openai_check_results(batch_id)
    file_response = client.files.content(
        batch.output_file_id
    ).text.split("\n")
    result_dict = defaultdict(dict)
    for json_str in file_response:
        if json_str == "":
            continue
        result = json.loads(json_str)
        # Map each translation back to the custom_id it was sent with
        result_dict[result["custom_id"]] = result["response"]["body"][
            "choices"][0]["message"]["content"]
    return result_dict
Conclusion
And that's it! We can now use OpenAI to translate our blog posts (or any other content) into any language we want. Make sure you do any copy editing before running translations, as I can't guarantee how the model might behave on error-riddled input. I've taken this a bit further by automating more of the process: my system automatically takes in Markdown files and splits them into smaller chunks, as I've noticed that the AI can get distracted when given a large file to translate. I've put this into a separate repository you can find here. The code is MIT licensed, so feel free to use it however you want.
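As for the chunking step just mentioned, here's a minimal sketch of the idea (a simplistic approach of my own that splits on blank lines; a real implementation would need care not to break inside code fences):

```python
def split_markdown(text: str, max_chars: int = 2000) -> list[str]:
    """Split markdown into chunks of at most max_chars characters,
    breaking only on blank lines so paragraphs stay intact.

    A single paragraph longer than max_chars is kept whole rather
    than split mid-sentence.
    """
    chunks, current = [], ""
    for block in text.split("\n\n"):
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = block
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own batch request, with the chunk index encoded in the custom_id so the translated pieces can be reassembled in order.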