Open position · Agent Design
Instructions are the interface between human intent and machine behavior. You design them with the rigor of an engineer, the precision of an editor, and the curiosity of a scientist — because a single word in the wrong place changes everything an agent does downstream.
The role
Every agent in our fleet runs on instructions. Those instructions determine how it reasons, what it refuses, how it handles ambiguity, and whether it behaves consistently across the full range of inputs it will encounter in production. Prompt engineering at this scale is not a creative exercise — it's a systems discipline. You design, test, version, and optimize the instruction layers that sit at the core of every agent we ship.
You work upstream of almost everything: Domain Imprinters bring the knowledge, Skill Architects encode the procedures, but your prompts are the scaffold the agent thinks through. When an agent behaves unexpectedly, the first question is usually about the prompt. When an agent performs exceptionally, it's usually because someone wrote something unusually precise.
What you'll do
Prompt design & iteration
Write system prompts, task instructions, and chain-of-thought scaffolds for production agents
Iterate rapidly against evaluation suites — you measure before you ship
Design multi-turn conversation flows and handle context window constraints deliberately
Build and maintain a versioned prompt library with change rationale for every edit
Evaluation & testing
Design adversarial test sets — inputs crafted to expose failure modes before users find them
Run A/B experiments across prompt variants and measure performance with statistical rigor
Build regression suites that catch behavioral drift when prompts are modified or models are updated
Produce structured failure analyses — what broke, why, and what the fix changes
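As a rough illustration of what "statistical rigor" can mean when comparing two prompt variants, the sketch below runs a two-proportion z-test on pass counts from the same evaluation suite. The function name and the example counts are invented for illustration, and a z-test is only one reasonable choice; it assumes independent test cases and reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Return (z, two-sided p-value) for the difference in pass rates.

    pass_a / n_a: passes and total cases for prompt variant A (hypothetical inputs).
    pass_b / n_b: the same for prompt variant B.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("each variant needs at least one test case")
    rate_a = pass_a / n_a
    rate_b = pass_b / n_b
    # Pooled pass rate under the null hypothesis that both variants perform the same.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both variants passed (or failed) every case: no measurable difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: variant A passes 160 of 200 cases, variant B passes 180 of 200.
    z, p = two_proportion_z_test(160, 200, 180, 200)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the gap between variants is unlikely to be noise, but it says nothing about which failure modes changed, so a check like this complements a structured failure analysis and does not replace it.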
Safety & reliability
Design refusal logic and boundary conditions that hold under adversarial pressure
Identify prompt injection vulnerabilities and harden instruction layers against manipulation
Ensure consistent behavior across model versions, context lengths, and temperature settings
Collaborate with the Agent Ethics board on guardrail design for high-stakes deployments
Collaboration & standards
Work with Domain Imprinters to integrate knowledge context without bloating instruction layers
Partner with Skill Architects to ensure skill prompts compose cleanly with system-level instructions
Define house style guides and prompting standards used across all agent development
Run prompt review for junior team members and external contributors
Problem surfaces you'll work on
Instruction following: Agents that do exactly what you said, not what you meant
Persona consistency: Agents that stay in character under pressure and across sessions
Refusal calibration: Neither too cautious nor too permissive
Chain-of-thought: Reasoning that's transparent, structured, and auditable
Context management: Staying coherent across long, complex conversations
Prompt injection: Instructions that hold when users try to override them
What we're looking for
Must-haves
Demonstrated experience prompting LLMs in production — not just experimentation
Strong writing instincts: precision, concision, and an ear for ambiguity
Systematic approach to testing — you don't ship on vibes, you ship on evidence
Familiarity with at least one frontier model family and its behavioral quirks
Comfort working across disciplines — you translate between engineers, domain experts, and product
Nice-to-haves
Background in linguistics, cognitive science, technical writing, or UX research
Experience with fine-tuning, RLHF, or evaluation framework design
Familiarity with agentic frameworks (tool use, multi-step reasoning, memory systems)
Python fluency for scripting evaluation pipelines and batch testing
Technical baseline
You should be comfortable running experiments in code, not just in the playground. That means scripting batch evaluations, parsing JSON outputs, and building simple pipelines to measure prompt performance at scale. You don't need to be a software engineer, but you should be dangerous enough with Python to test your own hypotheses without waiting on an engineer to do it for you.
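To give a sense of the level we mean, here is a minimal sketch of a batch evaluation: it parses raw model outputs as JSON, compares one field against an expected value, and reports a pass rate. The function name, the "label" field, and the sample data are all hypothetical; a real pipeline would call a model and read cases from a file.

```python
import json


def score_batch(raw_outputs, expected_labels):
    """Parse each raw output as JSON and compare its "label" field to the expected label.

    The "label" field is an assumed output schema, chosen only for illustration.
    """
    results = {"total": 0, "passed": 0, "malformed": 0}
    for raw, expected in zip(raw_outputs, expected_labels):
        results["total"] += 1
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Count unparseable outputs separately; they are a failure mode of their own.
            results["malformed"] += 1
            continue
        if isinstance(parsed, dict) and parsed.get("label") == expected:
            results["passed"] += 1
    total = results["total"]
    results["pass_rate"] = results["passed"] / total if total else 0.0
    return results


if __name__ == "__main__":
    # Made-up outputs standing in for responses collected from an agent.
    raw = ['{"label": "refund"}', '{"label": "billing"}', "not json"]
    expected = ["refund", "refund", "refund"]
    print(score_batch(raw, expected))
```

Tracking malformed outputs separately from wrong answers is a deliberate choice in this sketch: a prompt that breaks the output format and a prompt that produces the wrong answer usually need different fixes.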