Two Weeks of LLM Transforms, CLI Automation, and Monorepo Debugging
Building tool platforms, teaching bots to rewrite resumes, and fighting the ESM/CJS wars across five repos
I’m accidentally building an AI tool operating system. I didn’t plan it this way, but looking back at 108 commits across five repos, the pattern is clear: every project is either teaching AI agents to manipulate structured data (resumes, Twitter, GitHub issues), routing tool calls through complex execution pipelines, or fighting import/export configurations to make TypeScript modules play nice with Next.js. The common thread? Taking messy human intent and transforming it into executable, testable, API-driven actions—then packaging those actions so other AIs can use them.
Why You Should Care
- LLM resume transformation API: Pass `?llm=funny` to any JSON Resume URL and get AI-rewritten versions (tested with GPT-5.2, under 10-second response time after optimization)
- Tool Package Manager (tpmjs): Shipped MCP server support, REST APIs, and 77 commits worth of tool discovery, scenarios, and Q&A features—measured by 1,491 new lines of tests
- GitHub Actions + Claude Code: Built label-triggered automation that extracts prompts from issue comments and spawns Claude workers (deployed, working in production)
- Omega AI agent: Integrated Shellmates.app API, Twitter posting, GitHub issues, and spam monitoring—767 lines of new API surface
- Tokyo Modernist theme: Added CLI compatibility to all JSON Resume themes, published to npm (277,502 lines added, mostly vendor deps)
JSON Resume: Teaching GPT to Rewrite Your Career
Problem
Resume transformation is a classic use case that sounds simple until you try it: take a JSON Resume schema, apply an LLM prompt like “make this funny” or “highlight blockchain experience”, and return valid JSON that still passes schema validation. The first version I shipped hit Vercel’s 10-second serverless timeout and crashed with generateObject schema validation errors.
Approach
I started with the Vercel AI SDK’s generateObject function, assuming structured output would guarantee valid JSON Resume responses. Wrong. The function was slow (15+ seconds), brittle (failed on complex schemas), and kept timing out. After digging into the AI SDK docs, I switched to generateText with a simpler prompt:
const { text } = await generateText({
  model: openai('gpt-5.2'),
  prompt: `Transform this resume: ${instruction}\n\nResume JSON: ${JSON.stringify(resume)}`,
  maxTokens: 4000,
});
This cut response time to under 10 seconds (measured with Vercel function logs). I also added chunking and streaming for future use, but didn’t need it yet—GPT-5.2 is fast enough for now.
The feature lives at /api/resume/[username] and accepts a ?llm= query parameter. Internally, it fetches the user’s resume, passes it through the LLM transform, parses the response JSON (with error handling), and returns the modified resume. I wrote tests using mock LLM responses to verify JSON parsing logic without burning API credits.
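Roughly, the route handler boils down to the sketch below. This is a simplified version, not the exact production code: the `transformResumeWithLLM` signature, the `fetchResume` helper, and the fall-back-to-original behavior are assumptions for illustration.

```typescript
// app/api/resume/[username]/route.ts: simplified sketch, not the exact production code.
import { transformResumeWithLLM } from "@/lib/transformResumeWithLLM"; // real module name, hypothetical path

// Placeholder lookup; the real implementation resolves the user's published resume.
async function fetchResume(username: string): Promise<Record<string, unknown>> {
  const res = await fetch(`https://registry.jsonresume.org/${username}.json`); // assumed registry endpoint
  return res.json();
}

export async function GET(req: Request, { params }: { params: { username: string } }) {
  const llmInstruction = new URL(req.url).searchParams.get("llm");
  const resume = await fetchResume(params.username);

  if (!llmInstruction) {
    return Response.json(resume); // no transform requested
  }

  try {
    const transformed = await transformResumeWithLLM(resume, llmInstruction);
    return Response.json(transformed);
  } catch (err) {
    // Hypothetical fallback: return the untouched resume with a hint instead of a 500.
    return Response.json({ ...resume, _llmError: String(err) });
  }
}
```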
Results
- Response time dropped from 15+ seconds (timeout) to ~8 seconds (measured via Vercel logs on prod)
- Deployed to production, tested with queries like `?llm=make this resume funny` and `?llm=focus on AI/ML experience`
- Added 335 lines of new code (`transformResumeWithLLM.js`) plus 178 lines of tests (`transformResumeWithLLM.test.js`)
- Successfully transformed 5+ resumes during manual testing without schema validation errors
How did I measure the 8-second response time? Vercel function logs show execution duration per request. I tested 5 different prompts and averaged the reported times.
Pitfalls / What Broke
- Token limits: I cap output at 4,000 tokens (the `maxTokens` setting), which truncates large resumes. I didn't implement chunking yet because most resumes fit, but this will break for executives with 20+ years of experience.
- JSON parsing failures: Sometimes the LLM wraps the response in markdown code fences (```json ... ```). I added a regex to strip these, but it's brittle—if the LLM decides to add commentary before or after the JSON, parsing fails. (A sketch of a more defensive parser follows this list.)
- Schema drift: The LLM occasionally invents new fields or renames existing ones (`workExperience` → `work_experience`). I don't validate against the JSON Resume schema on output, so clients might get unexpected shapes.
- Rate limiting: No caching or rate limits. If someone spams the endpoint, I'll burn through OpenAI credits fast.
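Here is a minimal sketch of the more defensive parsing I have in mind: strip optional code fences, isolate the outermost JSON object, and fall back to the original resume on failure. The helper name and fallback behavior are hypothetical, not the current `transformResumeWithLLM.js` code.

```typescript
// Hypothetical helper: extract a JSON Resume object from raw LLM output.
// Falls back to the original resume rather than returning a 500.
export function parseLlmResume(raw: string, fallback: object): object {
  // Strip ```json ... ``` or ``` ... ``` fences if the model added them.
  const unfenced = raw.replace(/^```(?:json)?\s*/i, "").replace(/```\s*$/, "").trim();

  // If the model added commentary, keep only the outermost {...} span.
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  const candidate = start !== -1 && end > start ? unfenced.slice(start, end + 1) : unfenced;

  try {
    return JSON.parse(candidate);
  } catch {
    return fallback; // better a non-transformed resume than a broken response
  }
}
```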
Next
- Add schema validation on LLM output (reject responses that don’t match JSON Resume spec)
- Implement chunking for large resumes (split into sections, transform separately, merge)
- Add Redis caching keyed by a `username + llm_instruction` hash (expires after 1 hour)
- Build a UI gallery showing before/after examples for different transformation prompts
Tool Package Manager (tpmjs): Building the npm for AI Tools
Problem
AI agents need a way to discover, execute, and test third-party tools without manually writing integration code for every API. I wanted something like npm but for function-calling tools: a registry where you can publish a tool (with JSON Schema, auth requirements, rate limits), search for tools by keyword or category, execute them via REST API, and validate behavior using automated scenarios.
This is the repo with the most activity: 77 commits, 28 features, 27 fixes. The top files touched were package.json, route.ts, and SkillsActivityFeed.tsx, indicating heavy work on API routes, UI components, and dependency management.
Approach
I broke this into three layers:
1. Tool Registry & Search (registry-search package)
Built a BM25-based search that auto-loads tool metadata from the registry. Previous versions used “meta-tools” (tools that search for other tools), but this was slow—agents had to make two API calls (search, then execute). I refactored to lazy-load tool definitions on demand:
// Before: meta-tool approach
const tools = await loadMetaTools();
const searchResult = await tools.search({ query: "twitter" });
const twitterTool = await tools.execute({ toolId: searchResult[0].id });
// After: direct lazy loading
const twitterTool = await loadToolById("twitter-post");
This cut tool discovery time from ~500ms to ~50ms (measured with console.time in dev).
I also fixed a category filter bug that caused missed results—removing a 22-line filter improved recall from ~60% to ~95% on manual test queries.
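Under the hood the ranking is standard BM25 over tool metadata. Here is a minimal sketch of the scoring idea (illustrative only, simplified relative to the actual @tpmjs/registry-search internals; tokenization and field weighting are placeholders):

```typescript
// Minimal BM25 scorer over tool metadata (illustrative only).
type ToolDoc = { id: string; text: string };

const K1 = 1.2;
const B = 0.75;

function tokenize(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

export function bm25Search(query: string, docs: ToolDoc[]): { id: string; score: number }[] {
  const tokenized = docs.map((d) => tokenize(d.text));
  const avgLen = tokenized.reduce((sum, t) => sum + t.length, 0) / docs.length;

  // Document frequency per term.
  const df = new Map<string, number>();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) df.set(term, (df.get(term) ?? 0) + 1);
  }

  const queryTerms = tokenize(query);
  return docs
    .map((doc, i) => {
      const tokens = tokenized[i];
      let score = 0;
      for (const term of queryTerms) {
        const tf = tokens.filter((t) => t === term).length;
        if (tf === 0) continue;
        const n = df.get(term) ?? 0;
        const idf = Math.log((docs.length - n + 0.5) / (n + 0.5) + 1);
        score += (idf * tf * (K1 + 1)) / (tf + K1 * (1 - B + B * (tokens.length / avgLen)));
      }
      return { id: doc.id, score };
    })
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score);
}
```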
2. Tool Execution Pipeline (registry-execute package)
Tools need environment variables (API keys, tokens), error handling, and response formatting. I built an execution layer that:
- Validates tool input against JSON Schema
- Injects required env vars (e.g., `SPRITES_TOKEN` for the sprites tool)
- Catches errors and returns structured error objects instead of throwing (critical for serverless)
- Lazy loads tools to avoid importing all dependencies on cold start
The lazy loading fix was painful—Next.js kept crashing on Vercel with “Module not found” errors because registry tools are ESM and Omega (the chat app) is CommonJS. I spent a dozen commits fighting `transpilePackages`, `moduleResolution`, and `"type": "module"` configs. Eventually I moved registry packages to npm instead of workspace references, which fixed the import crashes.
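Conceptually, the execution wrapper boils down to something like this. It is a sketch assuming Ajv for JSON Schema validation; the names and shapes are illustrative, not the actual @tpmjs/registry-execute internals.

```typescript
// Illustrative execution wrapper: validate input, check env vars, never throw.
import Ajv from "ajv";

type ToolDef = {
  name: string;
  schema: object;             // JSON Schema for the input
  requiredEnv?: string[];     // e.g. ["SPRITES_TOKEN"]
  execute: (input: unknown) => Promise<unknown>;
};

type ToolResult =
  | { ok: true; output: unknown }
  | { ok: false; error: { code: string; message: string } };

const ajv = new Ajv();

export async function executeTool(tool: ToolDef, input: unknown): Promise<ToolResult> {
  // 1. Validate input against the tool's JSON Schema.
  const validate = ajv.compile(tool.schema);
  if (!validate(input)) {
    return { ok: false, error: { code: "INVALID_INPUT", message: ajv.errorsText(validate.errors) } };
  }

  // 2. Check required environment variables before calling out.
  for (const key of tool.requiredEnv ?? []) {
    if (!process.env[key]) {
      return { ok: false, error: { code: "MISSING_ENV", message: `Missing env var ${key}` } };
    }
  }

  // 3. Catch everything: serverless handlers should return structured errors, not throw.
  try {
    return { ok: true, output: await tool.execute(input) };
  } catch (err) {
    return {
      ok: false,
      error: { code: "TOOL_ERROR", message: err instanceof Error ? err.message : String(err) },
    };
  }
}
```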
3. Scenarios & Evaluation (scenarios feature)
I needed a way to test tools automatically without running them in production. I built a scenario system where you define:
- Input: A user prompt (e.g., “Post ‘Hello World’ to Twitter”)
- Expected behavior: JSON Schema for the tool call and response
- Evaluation prompt: Ask GPT to grade the output (pass/fail + reasoning)
Example scenario for the Twitter tool:
{
  "name": "twitter-post-basic",
  "prompt": "Post 'Testing tpmjs' to Twitter",
  "expectedSchema": {
    "type": "object",
    "properties": {
      "toolCalls": {
        "type": "array",
        "items": { "properties": { "name": { "const": "twitter-post" } } }
      }
    }
  },
  "evaluator": {
    "prompt": "Did the agent successfully post to Twitter? Check for API errors."
  }
}
I wrote 1,491 lines of test code covering CLI commands, scenario execution, and edge cases. Tests run with Vitest and mock LLM responses to avoid burning credits during CI.
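For flavor, a sketch of what one of those Vitest cases looks like with a mocked LLM. The `runScenario` helper and the mock's shape are hypothetical, simplified from the real suite, but the real tests mock LLM responses the same way to keep CI off the API.

```typescript
// scenario.test.ts: illustrative only; the real suite also mocks LLM responses.
import { describe, it, expect, vi } from "vitest";

// Hypothetical scenario runner: takes a scenario and a "model" that returns tool calls.
import { runScenario } from "./runScenario";
import twitterPostBasic from "./scenarios/twitter-post-basic.json";

describe("twitter-post-basic", () => {
  it("calls the twitter-post tool without hitting a real LLM", async () => {
    // Mock model: always answers with a canned tool call, so CI burns no credits.
    const mockModel = vi.fn().mockResolvedValue({
      toolCalls: [{ name: "twitter-post", arguments: { content: "Testing tpmjs" } }],
    });

    const result = await runScenario(twitterPostBasic, { model: mockModel });

    expect(mockModel).toHaveBeenCalledOnce();
    expect(result.toolCalls[0].name).toBe("twitter-post");
  });
});
```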
Results
- 77 commits shipped across tpmjs (most active repo this period)
- 1,491 new lines of test code added (measured by line count in test files, not individual assertions)
- Tool search recall improved from ~60% to ~95% (measured by running 20 hand-crafted queries before/after the fix)
- Cold start time reduced from ~2 seconds to ~400ms (measured with Vercel function logs before/after lazy loading)
- Published 5 npm packages: `@tpmjs/registry-search`, `@tpmjs/registry-execute`, `@tpmjs/cli`, `@tpmjs/sdk`, `@tpmjs/scenarios`
- Deployed MCP server, REST API, and web UI to production
How did I measure the 95% recall? I created a test set of 20 queries like “twitter”, “github issues”, “search tools for posting content”, ran them through the old and new search, and manually verified which tools should match. Old search missed 8/20 due to the category filter; new search only missed 1/20 (a mislabeled tool).
Pitfalls / What Broke
- ESM/CJS hell: Spent a dozen commits on import crashes. Next.js doesn't play nice with ESM packages in monorepos. I eventually gave up on workspace references and published to npm, which added deployment overhead but fixed the crashes.
- OpenAI tool name limit: OpenAI caps tool names at 64 characters, which breaks tools with long package names like `@tpmjs/registry-search-with-filters`. I added a name-shortening function, but it's a hack.
- Secret leaks: Early versions of the CLI printed tool metadata, including API keys, to stdout. I added a `.gitallowed` redaction system, but it only works if tools declare their secrets upfront—agents can still exfiltrate keys via tool responses.
- Scenario flakiness: GPT-based evaluation is non-deterministic. The same scenario run can pass or fail depending on LLM mood. I didn't add retry logic or confidence scoring yet.
- No versioning: Tools are published as "latest" with no semver. If I update a tool's schema, all agents using it break immediately. Need to implement versioned tool IDs.
Next
- Add semver versioning to tool IDs (`@tpmjs/twitter-post@1.0.0`)
- Build a "tool marketplace" UI where humans can browse, star, and review tools
- Implement usage-based billing (track tool executions per API key)
- Add confidence scoring to scenario evaluations (run 3x, take majority vote)
- Fix secret leak prevention: scan tool responses for regex patterns matching API keys
Omega AI Agent: From Chat to Twitter Bot to GitHub Automator
Problem
Omega started as a simple Claude-powered chat interface. Over these two weeks, I turned it into a multi-API bot that can post to Twitter, manage GitHub issues, monitor spam on Hacker News, respond with AI-generated comics, and execute arbitrary code via Shellmates.app. The challenge was integrating 5+ third-party APIs without turning the codebase into a config file nightmare.
Approach
I used the “tool loader” pattern from tpmjs: each integration is a separate file exporting a JSON Schema definition + execute function. The main agent loop dynamically imports tools based on user intent:
// tools/twitterPost.ts
export const twitterPost = {
  name: "twitter_post",
  schema: {
    type: "object",
    properties: {
      content: { type: "string", maxLength: 280 }
    }
  },
  execute: async ({ content }: { content: string }) => {
    // POST to the v2 tweets endpoint; Content-Type is required or the API rejects the JSON body.
    const response = await fetch("https://api.twitter.com/2/tweets", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.TWITTER_TOKEN}`,
        "Content-Type": "application/json"
      },
      body: JSON.stringify({ text: content })
    });
    return response.json();
  }
};
The agent uses GPT-4.1-mini (not GPT-5.2) for tool routing because it’s 5x faster and tool selection doesn’t need the latest model’s reasoning. I tested this by comparing response times: GPT-5.2 took ~2 seconds to pick a tool, GPT-4.1-mini took ~400ms, with no noticeable drop in accuracy (both picked the correct tool 95%+ of the time on a 20-query test set).
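Stripped down, the routing is just a cheap model call that names a tool, followed by a lazy import. A simplified sketch is below; the loader map, prompt, and `routeAndRun` name are illustrative, not Omega's exact code.

```typescript
// Illustrative tool router: a fast model names a tool, then we lazy-load and run it.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

type Tool = { execute: (args: any) => Promise<unknown> };

// Lazy loaders keep cold starts small: a tool's deps load only when it is picked.
const loaders: Record<string, () => Promise<Tool>> = {
  twitter_post: () => import("./tools/twitterPost").then((m) => m.twitterPost),
  github_create_issue: () => import("./tools/githubCreateIssue").then((m) => m.githubCreateIssue),
};

export async function routeAndRun(userMessage: string, args: unknown) {
  // Cheap routing call: picking a tool doesn't need the strongest model.
  const { text } = await generateText({
    model: openai("gpt-4.1-mini"),
    prompt: `Reply with exactly one tool name from [${Object.keys(loaders).join(", ")}] for this request:\n${userMessage}`,
  });

  const toolName = text.trim();
  const load = loaders[toolName];
  if (!load) return { ok: false, error: `Unknown tool: ${toolName}` };

  const tool = await load();
  return { ok: true, output: await tool.execute(args) };
}
```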
Key Integrations
- Twitter API: Added the `twitterPost` tool (+812 lines). Tested by posting 10 tweets from dev, then migrating to prod.
- GitHub Issues: Added `githubCreateIssue` and `githubListIssues` (+321 lines). Integrated with Prisma to store issue metadata in Postgres.
- Shellmates.app: External code execution API. I send code snippets, get back stdout/stderr (+767 lines). Used for "run this Python script" requests. Sandboxed with a 5-second timeout and 1MB memory limit (their API limits, not mine).
- Spam monitoring: The `SpamBotMonitorAndResponse` tool checks Hacker News /new for spam and replies with warnings (+399 lines). Runs every 15 minutes via cron.
- Comic generator: `HumorousJobSeekerResponseComicPoster` creates AI-generated comics mocking recruiter spam (+556 lines). Uses DALL-E 3 for images, then posts to Twitter. Untested in prod because I'm scared of accidentally posting offensive AI art.
Results
- 10 commits across Omega
- 5 new tools added (Twitter, GitHub, Shellmates, spam monitor, comic generator)
- ~2,783 lines of new code (summing the commit diffs for tool files)
- Response time for tool routing: ~400ms with GPT-4.1-mini (measured via logs)
- Security upgrade: Bumped Next.js to 15.5.9 to patch CVE-2025-55183 (critical XSS vuln)
How did I measure the 400ms routing time? I added logging to the tool selection pipeline and averaged 50 requests in dev. GPT-4.1-mini’s streaming latency is ~300-500ms for tool calls.
Pitfalls / What Broke
- No rate limiting: The Twitter API allows 50 posts per 15 minutes. Omega has no tracking, so a runaway loop could burn the limit in seconds. (A sketch of the planned Redis window follows this list.)
- Error handling is shit: If a tool crashes, Omega returns a 500 instead of a graceful error message. I added try/catch blocks but didn't test them thoroughly—most tools haven't failed in prod yet.
- Shellmates security: I'm trusting an external API to sandbox code execution. Their docs say they use Docker, but I haven't verified. If they're lying, a malicious user could root my server via Omega.
- Comic generator is untested: I wrote the tool but never deployed it because I'm worried about AI-generated offensive content. Need human-in-the-loop approval before posting.
- Postgres migrations are manual: I update the Prisma schema but forget to run migrations, causing prod crashes. Need to automate `prisma migrate deploy` in CI.
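A minimal sketch of the fixed-window limiter I have in mind, assuming ioredis; the key naming and limits are illustrative, and nothing like this ships yet.

```typescript
// Fixed-window rate limit for outgoing tweets (illustrative, not yet in Omega).
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);
const LIMIT = 50;                 // Twitter's posts-per-window cap
const WINDOW_SECONDS = 15 * 60;   // 15 minutes

export async function canPostTweet(): Promise<boolean> {
  // One counter per 15-minute window, e.g. "twitter:window:1893456".
  const windowId = Math.floor(Date.now() / (WINDOW_SECONDS * 1000));
  const key = `twitter:window:${windowId}`;

  const count = await redis.incr(key);
  if (count === 1) {
    // First post in this window: make the counter expire with the window.
    await redis.expire(key, WINDOW_SECONDS);
  }
  return count <= LIMIT;
}
```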
Next
- Add rate limiting to Twitter tool (track posts per 15-minute window in Redis)
- Test comic generator in staging with manual approval workflow
- Verify Shellmates.app sandboxing (check their Docker config, test escape vectors)
- Automate Prisma migrations in GitHub Actions (run on every schema change)
- Add retry logic for transient API failures (exponential backoff, max 3 retries; see the sketch below)
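For the retry item, a generic helper along these lines would cover most transient failures. The name and delays are illustrative, not existing Omega code.

```typescript
// Retry a flaky async call with exponential backoff (illustrative helper, not yet in Omega).
export async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3, baseDelayMs = 500): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // 500ms, 1s, 2s, ...: double the wait after each failure.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Usage: wrap any tool call that might hit a transient API error.
// const tweet = await withRetry(() => twitterPost.execute({ content: "Hello" }));
```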
MobTranslate: The Great Purple Purge
Problem
MobTranslate had purple everywhere. Purple buttons, purple breadcrumbs, purple headings, purple links. It looked like a 2015 SaaS product. I wanted a cleaner, more minimal aesthetic: black, gray, and blue accents.
Approach
This was a CSS variable hunt-and-replace mission. I searched for all instances of --color-purple, --color-pink, and hex codes like #8b5cf6, then replaced them with neutral grays and a single blue accent color.
I also fixed the breadcrumbs component, which was previously a placeholder div. I built a working breadcrumb system that reads the route path and generates links:
// Before
<div>Breadcrumbs go here</div>

// After: pathSegments comes from the current route, e.g. usePathname().split('/').filter(Boolean)
<nav>
  {pathSegments.map((segment, i) => (
    // Rebuild the absolute path up to this segment so each crumb links correctly.
    <Link key={i} href={'/' + pathSegments.slice(0, i + 1).join('/')}>
      {segment}
    </Link>
  ))}
</nav>
I also deleted the history page (/history), which was a leftover from an old feature. It was 203 lines of unused code.
Results
- 5 commits across MobTranslate
- ~3,491 lines changed (mostly color value updates in generated CSS)
- 2 fixes: breadcrumbs and color system
- 1 deletion: removed the `/history` page
- Visual consistency: All UI now uses the same gray/black/blue palette (validated by manually clicking through every page)
How did I validate visual consistency? I opened every page in the app (8 pages total), took screenshots before/after, and compared them side by side. Zero purple remained.
Pitfalls / What Broke
- No design system: I hardcoded colors in 50+ files. If I want to change the blue accent, I need to search-and-replace again. Should extract to CSS variables.
- No visual regression tests: I relied on manual QA. If a future commit reintroduces purple, I won't catch it until I manually browse the site. (A Playwright sketch follows this list.)
- Breadcrumbs are dumb: They split the URL path by `/` and capitalize each segment. For routes like `/business/search/results`, the breadcrumb says "Business > Search > Results", which is correct by accident. If I add a route like `/business/search-results`, it'll break.
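A minimal sketch of what those Playwright screenshot tests could look like. The route list is hypothetical; the first run records baselines, and later runs fail on pixel drift.

```typescript
// tests/visual.spec.ts: illustrative sketch, assuming @playwright/test is configured with a baseURL.
import { test, expect } from "@playwright/test";

// Hypothetical list of routes to guard (e.g. against stray purple creeping back in).
const routes = ["/", "/business/search", "/about"];

for (const route of routes) {
  test(`page ${route} matches its baseline screenshot`, async ({ page }) => {
    await page.goto(route);
    // toHaveScreenshot writes a baseline on first run, then diffs against it afterwards.
    await expect(page).toHaveScreenshot(`${route === "/" ? "home" : route.slice(1).replace(/\//g, "-")}.png`);
  });
}
```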
Next
- Extract color palette to CSS variables or Tailwind config
- Add visual regression tests using Playwright (screenshot every page, compare to baseline)
- Build a smarter breadcrumb system that reads route metadata instead of URL path
Stoic Windows: Launching a Quotes App With a Business Directory
Problem
I wanted to experiment with Next.js 15 and build something simple: a stoic quotes app with a business directory (random, I know). This was a greenfield project, so I scaffolded it with create-next-app and started building.
Approach
Two commits:
- Initial scaffold: `create-next-app` with TypeScript, Tailwind, and App Router (+6,837 lines of boilerplate)
- Add features: Built a stoic quotes database (JSON file with 500+ quotes), a search interface, and a business directory (another JSON file with fake businesses) (+8,000 lines)
The business directory is weird—I’m not sure why it’s in a quotes app—but the idea was to test local search UIs. I built a search bar that filters businesses by name or category:
const filteredBusinesses = businesses.filter(b =>
  b.name.toLowerCase().includes(query.toLowerCase()) ||
  b.category.toLowerCase().includes(query.toLowerCase())
);
Dead simple, no fancy indexing. For 500 businesses, this is fast enough (< 10ms on my machine, measured with performance.now()).
Results
- 2 commits (scaffolding + feature build)
- ~14,837 lines added (mostly boilerplate and JSON data)
- 1 new project deployed to Vercel
- Search performance: < 10ms for 500 records (measured with browser dev tools)
Pitfalls / What Broke
- No backend: Everything is client-side JSON. If I add 10,000 quotes, the initial page load will bundle a massive JSON file and slow down hydration.
- No database: I'm storing quotes and businesses in JSON files. If I want to add user-submitted quotes, I'd need to migrate to Postgres or Firebase.
- No tests: Zero test coverage. If I break search, I won't know until I manually open the page.
- Weird feature mix: Why does a stoic quotes app have a business directory? I don't know. This project is half-baked.
Next
- Migrate quotes and businesses to Postgres (use Prisma)
- Add user-submitted quotes (authentication, moderation)
- Build a “quote of the day” feature (randomize on server, cache for 24 hours; see the sketch after this list)
- Remove business directory or spin it into a separate project
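A quick sketch of how the quote-of-the-day item could work with App Router revalidation, assuming the quotes stay in a local JSON file. The path and field names are hypothetical.

```tsx
// app/quote-of-the-day/page.tsx: illustrative sketch, not shipped.
import quotes from "../../data/quotes.json"; // hypothetical path; needs resolveJsonModule

// Ask Next.js to cache the rendered page and refresh it at most once per day.
export const revalidate = 86400;

export default function QuoteOfTheDay() {
  // Deterministic per-day pick, so cached and fresh renders agree on "today's" quote.
  const dayIndex = Math.floor(Date.now() / 86_400_000) % quotes.length;
  const quote = quotes[dayIndex];
  return (
    <blockquote>
      <p>{quote.text}</p>
      <footer>{quote.author}</footer>
    </blockquote>
  );
}
```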
What’s Next
Looking across all five repos, here’s what I’m building toward:
- Tool versioning and marketplace: Publish tpmjs tools with semver, build a UI for browsing/reviewing tools, implement usage-based billing
- Resume transformation gallery: Show before/after examples of LLM-transformed resumes, add schema validation and chunking for large resumes
- Omega rate limiting and safety: Add Redis-backed rate limits for Twitter, test Shellmates sandboxing, deploy comic generator with manual approval
- MobTranslate design system: Extract colors to CSS variables, add visual regression tests, build smarter breadcrumbs
- Stoic Windows backend: Migrate to Postgres, add user-submitted quotes, remove or repurpose business directory
- Cross-repo integration: Use tpmjs tools inside Omega (e.g., load Twitter tool from registry instead of hardcoding), test scenarios with real Omega conversations
- Claude Code automation: Deploy more label-triggered GitHub Actions (e.g., auto-generate changelogs, run visual regression tests on PR)
Links & Resources
Projects
- JSON Resume – Resume tools and LLM transformation API
- Tool Package Manager (tpmjs) – npm for AI agent tools
- Omega AI Agent – Multi-API chatbot with Twitter, GitHub, and code execution
- MobTranslate – Translation platform (recently de-purpled)
- Stoic Windows – Quotes app with business directory
NPM Packages
- @tpmjs/registry-search – BM25 tool search
- @tpmjs/registry-execute – Tool execution runtime
- @tpmjs/cli – Command-line interface for tool management
- @tpmjs/sdk – TypeScript SDK for building agents
- @tpmjs/scenarios – Automated testing framework for tools
Tools & Services
- Vercel AI SDK – Used for LLM transforms and tool routing
- Shellmates.app – Code execution API (sandboxed Docker)
- Prisma – ORM for Postgres (used in Omega)
- Vitest – Test runner (1,491 new lines of tests in tpmjs)
- Claude Code – Used for label-triggered GitHub automation
Inspiration
- OpenAI Function Calling – Design patterns for tool execution
- Anthropic MCP – Protocol for AI tool discovery (implemented in tpmjs)
- npm – Package registry inspiration for tpmjs
- E2B – Code execution sandboxing (similar to Shellmates)