Lord Ajax

i write software and shitty poetry

Mob Translate - more work complete

text: AI code: AI

In today’s digital era, preserving endangered languages is both a cultural imperative and a technical challenge. MobTranslate is an open-source project that builds digital dictionaries for Aboriginal languages—integrating curated linguistic data with AI-powered translations. In this post, we’ll explore the technical details behind MobTranslate, including its architecture, API design, integration with OpenAI’s models, and the format of our dictionary data. For the full source code, please visit the GitHub repository.


1. The Importance of Preserving Aboriginal Languages

Aboriginal languages carry thousands of years of history, tradition, and cultural wisdom. Digitizing these languages does more than make them accessible—it lays the foundation for revitalization and education. By converting these linguistic treasures into digital dictionaries, MobTranslate provides:

  • Accessibility: Language resources available across devices and networks.
  • Contextual Depth: Rich metadata including definitions, usage examples, and cultural context.
  • Future-Proofing: A permanent record to support language revitalization initiatives.

According to UNESCO, approximately 40% of the world’s languages are in danger of disappearing. Digital preservation projects like MobTranslate play a critical role in language documentation efforts worldwide.


2. Project Overview and Repository Structure

MobTranslate is built with modern technologies to ensure scalability and maintainability:

  • Next.js 14: Utilized for its robust server-side rendering (SSR) capabilities.
  • TypeScript: Enhances code quality and maintainability.
  • Turborepo with PNPM Workspaces: Organizes the project into a monorepo for parallel builds and efficient dependency management.

Repository Layout

mobtranslate.com/
      ├── apps/
      │   └── web/                # Main Next.js application
      │       ├── app/            # Next.js App Router (dictionary pages & API endpoints)
      │       └── public/         # Static assets (images, fonts, etc.)
      ├── ui/                     # Shared UI components and utilities
      │   ├── components/         # Reusable UI elements (cards, inputs, etc.)
      │   └── lib/                # UI helper functions
      ├── dictionaries/           # Dictionary data files and models (formatted in YAML)
      ├── package.json            # Project configuration and scripts
      ├── pnpm-workspace.yaml     # Workspace definitions for PNPM
      └── turbo.json              # Turborepo configuration
      

This structure cleanly separates the core web application from shared UI components and dictionary data, making the project easier to manage and extend. It follows modern monorepo best practices for maintaining complex JavaScript applications.


3. Public Dictionary Browsing Structure

MobTranslate uses Next.js to create a comprehensive browsing experience for Aboriginal language dictionaries. The site architecture offers several benefits:

  • Faster Load Times: Immediate content delivery, especially on mobile devices and slow networks, improving Core Web Vitals metrics.
  • Improved Accessibility: Users see content even before client-side JavaScript has fully loaded, adhering to WCAG guidelines.
  • Comprehensive Dictionary Structure: All dictionaries can be browsed directly at mobtranslate.com, with dedicated pages for each language and each individual word. We hope search engines will index these pages, and that new LLMs will train on them, so that Aboriginal languages are better represented.

The implementation leverages Next.js App Router architecture, which provides enhanced routing capabilities and more granular control over the browsing experience.


4. RESTful API for Dictionary Data

The project exposes a comprehensive RESTful API to serve dictionary data and support translation services. Key endpoints include:

Dictionary Endpoints

  • GET /api/dictionaries
    Retrieves a list of available dictionaries with metadata (name, description, region).

  • GET /api/dictionaries/[language]
    Returns detailed data for a specific language, including a paginated list of words. Query parameters allow:

    • Filtering: Search for words.
    • Sorting: Specify sort fields and order.
    • Pagination: Navigate through large datasets using methods aligned with JSON:API specifications.

  • GET /api/dictionaries/[language]/words
    Provides a paginated list of words in the selected dictionary.

  • GET /api/dictionaries/[language]/words/[word]
    Offers detailed information on a specific word, such as definitions, usage examples, and related terms.
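The filtering, sorting, and pagination behaviour described for the word list could be sketched roughly as below. This is an illustration only, not the project's actual implementation: the option names (`search`, `order`, `page`, `limit`), the default page size, and the `meta` shape of the result are all assumptions.

```typescript
// Minimal entry shape for this sketch; real entries carry more fields.
interface WordEntry {
  word: string;
  translations: string[];
}

// Hypothetical query options; the real API's parameter names may differ.
interface WordQuery {
  search?: string;
  order?: "asc" | "desc";
  page?: number; // 1-based
  limit?: number;
}

interface Page<T> {
  data: T[];
  meta: { total: number; page: number; limit: number; totalPages: number };
}

function queryWords(words: WordEntry[], query: WordQuery = {}): Page<WordEntry> {
  const { search = "", order = "asc", page = 1, limit = 50 } = query;
  const needle = search.trim().toLowerCase();

  // Filtering: match the headword or any English translation.
  // Both branches produce a new array, so the caller's array is never mutated.
  const filtered = needle
    ? words.filter(
        (w) =>
          w.word.toLowerCase().includes(needle) ||
          w.translations.some((t) => t.toLowerCase().includes(needle)),
      )
    : words.slice();

  // Sorting: alphabetical by headword, in the requested direction.
  const direction = order === "desc" ? -1 : 1;
  filtered.sort((a, b) => direction * a.word.localeCompare(b.word));

  // Pagination: clamp out-of-range page and limit values.
  const safeLimit = Math.max(1, Math.floor(limit));
  const totalPages = Math.max(1, Math.ceil(filtered.length / safeLimit));
  const safePage = Math.min(Math.max(1, Math.floor(page)), totalPages);
  const start = (safePage - 1) * safeLimit;

  return {
    data: filtered.slice(start, start + safeLimit),
    meta: { total: filtered.length, page: safePage, limit: safeLimit, totalPages },
  };
}
```

Returning the totals alongside the page of data lets a client render page controls without a second request.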

Translation Endpoint

  • POST /api/translate/[language]
    Accepts text input and returns a translation in the target Aboriginal language. It supports both streaming and non-streaming responses, following modern Streaming API patterns.

Example: Streaming Translation Request

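const response = await fetch("/api/translate/kuku_yalanji", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "Hello, how are you today?",
    stream: true,
  }),
});

// Fail early on HTTP errors instead of reading an error body as a translation.
if (!response.ok || !response.body) {
  throw new Error(`Translation request failed: ${response.status}`);
}

// Read the streamed response chunk by chunk.
const reader = response.body.getReader();
const decoder = new TextDecoder();
let translation = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact when they span two chunks.
  translation += decoder.decode(value, { stream: true });
}
// Flush any bytes still buffered in the decoder.
translation += decoder.decode();

console.log("Final Translation:", translation);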
      

This endpoint is implemented as a server-side Next.js route, so the OpenAI API key stays on the server and is never exposed to the browser. For more on API security best practices, see the OWASP API Security Project.
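To make the server side of this concrete, here is a minimal sketch of a handler with the same streaming and non-streaming behaviour. It is not the project's code. It uses only the Web-standard `Request`, `Response`, and `ReadableStream` types, which is also the shape Next.js App Router handlers take. The model call is replaced by a stub (`fakeModelChunks`) so the sketch runs offline, and the `{ translation }` and `{ error }` response shapes are assumptions.

```typescript
// Hypothetical request body; the field names mirror the client example above.
interface TranslateRequestBody {
  text: string;
  stream?: boolean;
}

// Stand-in for the model call. A real handler would call OpenAI here with a
// server-side API key; this stub yields fixed chunks so the sketch runs offline.
async function* fakeModelChunks(text: string): AsyncGenerator<string> {
  yield "[translation of: ";
  yield text;
  yield "]";
}

// Wrap an async iterable of strings in a byte stream the client can read incrementally.
function toByteStream(chunks: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { done, value } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
  });
}

function jsonResponse(payload: unknown, status = 200): Response {
  return new Response(JSON.stringify(payload), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}

// In a Next.js app this function would be exported from a route file.
async function POST(request: Request): Promise<Response> {
  let body: Partial<TranslateRequestBody> | null;
  try {
    body = (await request.json()) as Partial<TranslateRequestBody> | null;
  } catch {
    return jsonResponse({ error: "request body must be valid JSON" }, 400);
  }

  const text = typeof body?.text === "string" ? body.text.trim() : "";
  if (text === "") {
    return jsonResponse({ error: "text is required" }, 400);
  }

  const chunks = fakeModelChunks(text);
  if (body?.stream) {
    // Streaming: forward each chunk to the client as soon as it is produced.
    return new Response(toByteStream(chunks), {
      headers: { "Content-Type": "text/plain; charset=utf-8" },
    });
  }

  // Non-streaming: collect the whole translation, then reply once.
  let translation = "";
  for await (const chunk of chunks) translation += chunk;
  return jsonResponse({ translation });
}
```

Because the stream is pulled chunk by chunk, the client loop shown earlier can start displaying text before the model has finished.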

5. Integrating Dictionary Data with OpenAI

A standout feature of MobTranslate is its ability to generate context-aware translations by integrating dictionary data into the translation process.

How It Works

Fetching Dictionary Context: When a translation request is received, the system retrieves relevant dictionary entries (definitions, usage examples, etc.) from the API.
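How "relevant" entries are chosen is not spelled out here, so the following is only one plausible approach, sketched under that assumption: tokenize the English input and keep entries whose `translations` list contains one of the tokens. The function names and the `maxEntries` cap are hypothetical.

```typescript
// Subset of the dictionary entry fields used by this sketch.
interface DictionaryEntry {
  word: string;
  definitions: string[];
  translations: string[];
}

// Split English text into lowercase word tokens, dropping punctuation.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z']+/g) ?? [];
}

// Keep entries with at least one English translation that appears in the input.
// Limitation: multi-word translations never match a single token.
function findRelevantEntries(
  text: string,
  entries: DictionaryEntry[],
  maxEntries = 20, // hypothetical cap to keep the prompt small
): DictionaryEntry[] {
  const tokens = new Set(tokenize(text));
  return entries
    .filter((entry) => entry.translations.some((t) => tokens.has(t.toLowerCase())))
    .slice(0, maxEntries);
}
```

Exact token matching is deliberately simple. A production system might add stemming or fuzzy matching, at the cost of pulling less relevant entries into the prompt.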

Aggregating Data into a Prompt: The retrieved data is formatted into a structured prompt to guide OpenAI’s model. For example:

Using the following dictionary context:
Word: "babaji" — Definition: "ask. 'Ngayu nyungundu babajin, Wanju nyulu?' means 'I asked him, Who is he?'"
Translate the sentence: "Hello, how are you today?"
      

This helps steer the model to produce culturally sensitive and accurate translations using techniques from prompt engineering research.
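A prompt in this layout could be assembled with a small helper like the one below. This is a sketch that mirrors the example prompt's format rather than the project's actual code. `buildPrompt` is a hypothetical name, and swapping inner double quotes for single quotes is an assumption made to keep the quoting unambiguous.

```typescript
// Subset of the dictionary entry fields used by this sketch.
interface DictionaryEntry {
  word: string;
  definitions: string[];
}

// Separator between the word and its definition ("\u2014" is the dash used
// in the example prompt).
const SEPARATOR = " \u2014 ";

function buildPrompt(entries: DictionaryEntry[], sentence: string): string {
  const contextLines = entries.map((entry) => {
    // Join multiple definitions and swap inner double quotes for single quotes
    // so they do not clash with the quotes wrapping the definition.
    const definition = entry.definitions.join(" ").replace(/"/g, "'");
    return `Word: "${entry.word}"${SEPARATOR}Definition: "${definition}"`;
  });

  return [
    "Using the following dictionary context:",
    ...contextLines,
    `Translate the sentence: "${sentence}"`,
  ].join("\n");
}
```

Keeping prompt construction in one pure function makes it straightforward to unit test and to adjust the wording without touching the request handling.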

Server-Side Translation Processing: The aggregated prompt is sent to OpenAI’s API, and the response is streamed back in real time, providing an interactive translation experience.

Token Management: All prompt and response token usage is logged and managed server-side, ensuring efficient resource utilization and cost monitoring in line with OpenAI’s usage guidelines.
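As a rough illustration of server-side token accounting, the sketch below sums usage per language. The `TokenUsage` fields match the `usage` object that OpenAI chat completion responses return. Everything else is hypothetical: the `UsageTracker` class, grouping by language, and the caller-supplied prices.

```typescript
// Field names follow the `usage` object in OpenAI chat completion responses.
interface TokenUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

const ZERO_USAGE: TokenUsage = { prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 };

// Hypothetical in-memory tracker; a real deployment would persist these totals.
class UsageTracker {
  private totals = new Map<string, TokenUsage>();

  // Add one response's usage to the running total for a language.
  record(language: string, usage: TokenUsage): void {
    const current = this.totals.get(language) ?? ZERO_USAGE;
    this.totals.set(language, {
      prompt_tokens: current.prompt_tokens + usage.prompt_tokens,
      completion_tokens: current.completion_tokens + usage.completion_tokens,
      total_tokens: current.total_tokens + usage.total_tokens,
    });
  }

  // Return a copy so callers cannot modify the stored totals.
  totalFor(language: string): TokenUsage {
    return { ...(this.totals.get(language) ?? ZERO_USAGE) };
  }

  // Prices are passed in because they differ by model and change over time.
  estimateCost(
    language: string,
    promptPricePerMillion: number,
    completionPricePerMillion: number,
  ): number {
    const t = this.totalFor(language);
    return (
      (t.prompt_tokens * promptPricePerMillion +
        t.completion_tokens * completionPricePerMillion) /
      1e6
    );
  }
}
```

Tracking prompt and completion tokens separately matters because the two are typically billed at different rates.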

For more on prompt engineering, see OpenAI’s documentation.

6. Dictionary Format and Supported Languages

MobTranslate uses YAML files to store dictionary data. Each dictionary is maintained in its own folder within the dictionaries/ directory. For instance, the Kuku Yalanji dictionary is defined in the dictionaries/kuku_yalanji/dictionary.yaml file.

Example YAML Structure

The YAML file for Kuku Yalanji is structured as follows:

meta: Contains metadata about the dictionary, such as the language name.

meta:
  name: Kuku Yalanji
      

words: A list of word entries. Each entry includes:

  • word: The term in the language.
  • type: The part of speech (e.g., noun, transitive-verb, intransitive-verb, adjective).
  • definitions: A list of definitions, sometimes accompanied by example sentences.
  • translations: A list of translations or English equivalents.
  • Optional: synonyms may also be provided.

words:
  - word: ba
    type: intransitive-verb
    definitions:
      - come. Baby talk, usually used with very small children only. Used only as a command.
    translations:
      - come
  - word: babaji
    type: transitive-verb
    definitions:
      - ask. "Ngayu nyungundu babajin, Wanju nyulu?" "I asked him, Who is he?"
    translations:
      - ask
      - asked
      

This structure is inspired by lexicographical best practices from projects like Lexonomy and the Open Dictionary Format.
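The entry structure described above could be modeled in TypeScript and checked with a small validator, sketched below. It works on data that has already been parsed from YAML, so the parsing step and any library used for it are out of scope. The type names, the choice of required fields, and the error messages are assumptions based only on the fields listed in this post.

```typescript
interface DictionaryWord {
  word: string;
  type: string; // e.g. "noun", "transitive-verb"
  definitions: string[];
  translations: string[];
  synonyms?: string[];
}

interface Dictionary {
  meta: { name: string };
  words: DictionaryWord[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Returns a list of problems; an empty list means the data matches the
// Dictionary shape above.
function validateDictionary(data: unknown): string[] {
  if (typeof data !== "object" || data === null) {
    return ["dictionary must be a mapping with meta and words"];
  }
  const errors: string[] = [];
  const dict = data as { meta?: unknown; words?: unknown };

  const meta = dict.meta as { name?: unknown } | null | undefined;
  if (typeof meta?.name !== "string") {
    errors.push("meta.name must be a string");
  }

  if (!Array.isArray(dict.words)) {
    errors.push("words must be a list");
    return errors;
  }

  dict.words.forEach((entry: unknown, i: number) => {
    const w = (typeof entry === "object" && entry !== null ? entry : {}) as Record<string, unknown>;
    if (typeof w.word !== "string" || w.word === "") {
      errors.push(`words[${i}].word must be a non-empty string`);
    }
    if (typeof w.type !== "string") {
      errors.push(`words[${i}].type must be a string`);
    }
    if (!isStringArray(w.definitions)) {
      errors.push(`words[${i}].definitions must be a list of strings`);
    }
    if (!isStringArray(w.translations)) {
      errors.push(`words[${i}].translations must be a list of strings`);
    }
    // synonyms is optional, but must be a list of strings when present.
    if (w.synonyms !== undefined && !isStringArray(w.synonyms)) {
      errors.push(`words[${i}].synonyms must be a list of strings`);
    }
  });

  return errors;
}
```

Collecting all problems instead of throwing on the first one is useful for contributors, who can then fix a whole YAML file in one pass.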

Supported Languages

So far, the repository includes dictionaries for:

Each language’s dictionary follows a similar YAML structure, ensuring consistency across the project while respecting the unique linguistic features of each language.

7. Development Workflow

Prerequisites

Ensure you have the following installed:

Getting Started

Clone the Repository:

git clone https://github.com/australia/mobtranslate.com.git
cd mobtranslate.com
      

Install Dependencies:

pnpm install
      

Start the Development Server:

pnpm dev
      

Build the Project for Production:

pnpm build
      

In this workflow, Turborepo runs build tasks across the workspaces in parallel, while PNPM workspaces handle dependency management, which keeps development consistent across the monorepo. For more on modern JavaScript build workflows, see the Web Performance Working Group resources.

8. Contributing to the Project

MobTranslate welcomes contributions from developers, linguists, and community members. Here are ways to get involved:

  • Code Contributions: Submit pull requests for bug fixes or new features.
  • Language Contributions: Help expand our dictionary coverage by contributing YAML files for additional Aboriginal languages.
  • Documentation: Improve our documentation or write tutorials.
  • Community Support: Join our discussions to help answer questions.

For contribution guidelines, please refer to our CONTRIBUTING.md file.

9. Future Roadmap

The MobTranslate project has several exciting developments planned:

  • Audio Integration: Adding native speaker recordings for pronunciation guidance.
  • Mobile Applications: Developing offline-capable apps for use in remote areas.
  • Expanded Language Coverage: Adding support for more Aboriginal languages.
  • Enhanced Learning Tools: Building interactive exercises for language learning.
  • Community Editing: Enabling community-driven dictionary updates with approval workflows.

These initiatives align with global language documentation efforts such as the ELDP (Endangered Languages Documentation Programme).

10. Conclusion

MobTranslate exemplifies how modern web technologies and AI can be combined to support the preservation of endangered languages. By merging curated dictionary data (stored in a consistent YAML format) with OpenAI’s translation capabilities, MobTranslate delivers context-aware translations that honor the cultural richness of Aboriginal languages.

If you’re interested in contributing or exploring the code further, please visit our GitHub repository. Together, we can ensure these languages continue to thrive in the digital age.

For more information on Aboriginal language preservation efforts, please visit:

Happy coding!