Flagging agent
The bouncer at the door. Looks at every incoming brief and decides whether the rest of the pipeline should bother with it. Three possible verdicts: legitimate, spam, or a manipulation attempt.
Built-in · Runs first · Read-only
What it watches for
- Manipulation attempts: text that tries to instruct an AI, such as jailbreak prompts or attempts to extract internal data or smuggle in tool calls.
- Out-of-scope submissions: personal favours, unrelated topics, illegal content, anything the project clearly isn't there to handle.
- Low-quality submissions: empty, gibberish, spam, or so vague there's nothing to act on.
Verdicts
| Verdict | Meaning |
|---|---|
| Legitimate | Looks like real work for the team. Pipeline continues. |
| Spam | Junk or out-of-scope. Brief is cancelled — but kept for the record. |
| Manipulation | Looks like an attempt to manipulate an AI. Same outcome as spam, with a clearer reason on the brief so you know why. |
Settings
| Setting | Default | What it does |
|---|---|---|
| Model | The project default | Which model runs the flagging pass. The job is small enough that lighter models work well. |
| Strictness | Medium | Higher = catches more, with more false positives. Lower = lets more through, including some borderline submissions. |
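
As a rough illustration, the two settings could be modelled like this. The class name `FlaggingSettings`, the field names, the three strictness levels, and the model names are all assumptions made for the sketch; the table above only states that the model falls back to the project default and that strictness defaults to Medium.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strictness(Enum):
    # Assumed levels; the settings table only names "Medium" as the default.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class FlaggingSettings:
    # None stands for "use the project default model" (hypothetical convention).
    model: Optional[str] = None
    strictness: Strictness = Strictness.MEDIUM

    def resolve_model(self, project_default: str) -> str:
        """Return the model to run, falling back to the project default."""
        return self.model or project_default


# Out-of-the-box settings versus a stricter setup on a lighter model.
defaults = FlaggingSettings()
strict_and_light = FlaggingSettings(model="small-model", strictness=Strictness.HIGH)
```

Treating "project default" as an unset value rather than a copied model name means a later change to the project default would carry through to the flagging pass in this sketch.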
How the workflow uses it
Every project ships with two rules wired up to the flagging agent:
- Legitimate → hand the brief to the refining agent.
- Spam or manipulation → cancel the brief.
Cancellation isn't loss. Cancelled briefs stay in the database with the full submission and the verdict. They show up in stats and search; you can reopen any of them if the bouncer was wrong.
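The two default rules and the "cancelled but kept" behaviour can be sketched as follows. This is a minimal model, not the product's actual code: the `Brief` record, the status strings, and the `reopen` helper are hypothetical names, and sending a reopened brief on to refining is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    LEGITIMATE = "legitimate"
    SPAM = "spam"
    MANIPULATION = "manipulation"


@dataclass
class Brief:
    submission: str                    # full original text, always retained
    verdict: Optional[Verdict] = None
    status: str = "new"                # hypothetical values: new, refining, cancelled


def apply_default_rules(brief: Brief, verdict: Verdict) -> Brief:
    """Apply the two default rules: legitimate goes on, anything else is cancelled."""
    brief.verdict = verdict            # the verdict is stored in both cases
    if verdict is Verdict.LEGITIMATE:
        brief.status = "refining"      # handed to the refining agent
    else:
        brief.status = "cancelled"     # kept on record, nothing is deleted
    return brief


def reopen(brief: Brief) -> Brief:
    """Undo a cancellation when the flagging verdict turned out to be wrong."""
    if brief.status != "cancelled":
        raise ValueError("only cancelled briefs can be reopened")
    brief.status = "refining"          # assumption: a reopened brief resumes at refining
    return brief


flagged = apply_default_rules(Brief("Buy cheap watches!!!"), Verdict.SPAM)
```

Here cancellation only changes `status`; the submission text and the verdict stay on the record, which is what makes searching and reopening possible.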
Why it's a separate agent
Splitting flagging from refining means the two passes can use different settings, and you can turn flagging off entirely for trusted intakes (a private internal form, say) without changing how anything else works.
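One way to picture the benefit of the split: if the pipeline is an ordered list of stages, turning flagging off for a trusted intake just removes one entry. The stage names and the `flagging_enabled` switch below are hypothetical, chosen only to illustrate the idea.

```python
from typing import List

# Hypothetical stage names; only flagging and refining are described here.
DEFAULT_STAGES = ["flagging", "refining"]


def stages_for_intake(flagging_enabled: bool = True) -> List[str]:
    """Return the stages a brief passes through for a given intake."""
    if flagging_enabled:
        return list(DEFAULT_STAGES)
    # Trusted intake: drop the flagging pass and leave every other stage untouched.
    return [stage for stage in DEFAULT_STAGES if stage != "flagging"]
```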