← All positions

Open position · Agent Design

Prompt Engineer

Instructions are the interface between human intent and machine behavior. You design them with the rigor of an engineer, the precision of an editor, and the curiosity of a scientist — because a single word in the wrong place changes everything an agent does downstream.


Full-time Remote-first Human role Reports to: Head of Agent Design Cross-functional with Domain Imprinters & Skill Architects

Every agent in our fleet runs on instructions. Those instructions determine how it reasons, what it refuses, how it handles ambiguity, and whether it behaves consistently across the full range of inputs it will encounter in production. Prompt engineering at this scale is not a creative exercise — it's a systems discipline. You design, test, version, and optimize the instruction layers that sit at the core of every agent we ship.

You work upstream of almost everything: Domain Imprinters bring the knowledge, Skill Architects encode the procedures, but your prompts are the scaffold the agent thinks through. When an agent behaves unexpectedly, the first question is usually about the prompt. When an agent performs exceptionally, it's usually because someone wrote something unusually precise.

Prompt design & iteration

Write system prompts, task instructions, and chain-of-thought scaffolds for production agents

Iterate rapidly against evaluation suites — you measure before you ship

Design multi-turn conversation flows and handle context window constraints deliberately

Build and maintain a versioned prompt library with change rationale for every edit

Evaluation & testing

Design adversarial test sets — inputs crafted to expose failure modes before users find them

Run A/B experiments across prompt variants and measure performance with statistical rigor

Build regression suites that catch regressions when prompts are modified or models are updated

Produce structured failure analyses — what broke, why, and what the fix changes

Safety & reliability

Design refusal logic and boundary conditions that hold under adversarial pressure

Identify prompt injection vulnerabilities and harden instruction layers against manipulation

Ensure consistent behavior across model versions, context lengths, and temperature settings

Collaborate with the Agent Ethics board on guardrail design for high-stakes deployments

Collaboration & standards

Work with Domain Imprinters to integrate knowledge context without bloating instruction layers

Partner with Skill Architects to ensure skill prompts compose cleanly with system-level instructions

Define house style guides and prompting standards used across all agent development

Run prompt review for junior team members and external contributors

🧭

Instruction following

Agents that do exactly what you said, not what you meant

🪞

Persona consistency

Agents that stay in character under pressure and across sessions

🛑

Refusal calibration

Neither too cautious nor too permissive

🔗

Chain-of-thought

Reasoning that's transparent, structured, and auditable

📐

Context management

Staying coherent across long, complex conversations

🧱

Prompt injection

Instructions that hold when users try to override them

Must-haves

Demonstrated experience prompting LLMs in production — not just experimentation

Strong writing instincts: precision, concision, and an ear for ambiguity

Systematic approach to testing — you don't ship on vibes, you ship on evidence

Familiarity with at least one frontier model family and its behavioral quirks

Comfort working across disciplines — you translate between engineers, domain experts, and product

Nice-to-haves

Background in linguistics, cognitive science, technical writing, or UX research

Experience with fine-tuning, RLHF, or evaluation framework design

Familiarity with agentic frameworks (tool use, multi-step reasoning, memory systems)

Python fluency for scripting evaluation pipelines and batch testing

You should be comfortable running experiments in code, not just in the playground. That means scripting batch evaluations, parsing JSON outputs, and building simple pipelines to measure prompt performance at scale. You don't need to be a software engineer — but you should be dangerous enough with Python that you can test your own hypotheses without waiting for one.

prompt design evaluation frameworks chain-of-thought python json / yaml LLM APIs adversarial testing