
Flagging agent

The bouncer at the door. Looks at every incoming brief and decides whether the rest of the pipeline should bother with it. Three possible verdicts: legitimate, spam, or manipulation.

Built-in · Runs first · Read-only

What it watches for

Manipulation attempts: text that tries to instruct an AI, jailbreak prompts, and attempts to extract internal data or smuggle in tool calls.
Out-of-scope submissions: personal favours, unrelated topics, illegal content, anything the project clearly isn't there to handle.
Low-quality submissions: empty, gibberish, spam, or so vague there's nothing to act on.

Verdicts

Verdict	Meaning
Legitimate	Looks like real work for the team. Pipeline continues.
Spam	Junk or out-of-scope. Brief is cancelled — but kept for the record.
Manipulation	Looks like an attempt to manipulate an AI. Same outcome as spam, with a clearer reason on the brief so you know why.

Settings

Setting	Default	What it does
Model	The project default	The flagging job is small enough that lighter models work well.
Strictness	Medium	Higher = catches more, with more false positives. Lower = lets more through, including some borderline submissions.
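
One way to picture the Strictness trade-off, as a sketch only. It assumes the agent produces a suspicion score between 0 and 1 and that each strictness level maps to a cut-off. The score, the threshold values, and every name below (`FlaggingSettings`, `THRESHOLDS`, `is_flagged`) are hypothetical and chosen for illustration; the docs do not describe the actual mechanism.

```python
from dataclasses import dataclass

# Hypothetical cut-offs: a brief is flagged when its suspicion score
# reaches the threshold for the configured strictness level.
THRESHOLDS = {"low": 0.8, "medium": 0.6, "high": 0.4}


@dataclass(frozen=True)
class FlaggingSettings:
    model: str = "project-default"  # placeholder for "the project default"
    strictness: str = "medium"      # documented default


def is_flagged(score: float, settings: FlaggingSettings) -> bool:
    """Return True when the assumed suspicion score crosses the cut-off."""
    return score >= THRESHOLDS[settings.strictness]


# A borderline brief is let through at low and medium, caught at high.
borderline = 0.5
results = {
    level: is_flagged(borderline, FlaggingSettings(strictness=level))
    for level in ("low", "medium", "high")
}
```

Under these assumed numbers, raising strictness lowers the cut-off, so more borderline briefs are caught and more legitimate ones risk being flagged by mistake.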

How the workflow uses it

Every project ships with two rules wired up to the flagging agent:

Legitimate → hand the brief to the refining agent.
Spam or manipulation → cancel the brief.
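
The two default rules amount to a lookup from verdict to next step. The sketch below illustrates that mapping; the `Verdict` and `Action` enums and the `next_action` function are hypothetical names, not the product's API.

```python
from enum import Enum


class Verdict(Enum):
    LEGITIMATE = "legitimate"
    SPAM = "spam"
    MANIPULATION = "manipulation"


class Action(Enum):
    HAND_TO_REFINING = "hand_to_refining"
    CANCEL = "cancel"


# The two default rules, written as a lookup table (illustrative only).
DEFAULT_RULES = {
    Verdict.LEGITIMATE: Action.HAND_TO_REFINING,
    Verdict.SPAM: Action.CANCEL,
    Verdict.MANIPULATION: Action.CANCEL,
}


def next_action(verdict: Verdict) -> Action:
    """Return the step the default rules take for a given verdict."""
    return DEFAULT_RULES[verdict]
```

Spam and manipulation share the same action; per the verdict table, the difference is the reason recorded on the brief.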

Cancellation isn't loss. Cancelled briefs stay in the database with the full submission and the verdict. They show up in stats and search; you can reopen any of them if the bouncer was wrong.
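
A minimal sketch of what "cancelled but kept" could look like. The `Brief` record, its status values, and the `cancel` and `reopen` helpers are assumptions made for illustration, not the real schema. Keeping the old verdict after a reopen is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Brief:
    submission: str
    status: str = "open"           # hypothetical status values
    verdict: Optional[str] = None  # set by the flagging agent


def cancel(brief: Brief, verdict: str) -> None:
    """Mark the brief cancelled. Nothing is deleted."""
    brief.status = "cancelled"
    brief.verdict = verdict


def reopen(brief: Brief) -> None:
    """Undo a wrong call. The earlier verdict stays on the record (assumed)."""
    brief.status = "open"


briefs = [Brief("Add dark mode to the dashboard"), Brief("asdf asdf")]
cancel(briefs[1], "spam")

# Cancelled briefs are still there to search and count.
cancelled = [b.submission for b in briefs if b.status == "cancelled"]

reopen(briefs[1])
```

The point of the sketch is that cancelling changes a status field rather than removing a row, which is what makes search, stats, and reopening possible.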

Why it's a separate agent

Splitting flagging from refining means the two passes can use different settings, and you can turn flagging off entirely for trusted intakes (a private internal form, say) without changing how anything else works.
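
The separation can be pictured as an optional first stage in front of refining. In this sketch, `run_intake`, `toy_flag`, and `toy_refine` are hypothetical stand-ins; the real agents are not plain functions like these.

```python
from typing import Callable, Optional


def run_intake(
    brief: str,
    flag: Optional[Callable[[str], str]],
    refine: Callable[[str], str],
) -> Optional[str]:
    """Run refining, with an optional flagging pass in front of it."""
    if flag is not None and flag(brief) != "legitimate":
        return None  # cancelled; refining never sees the brief
    return refine(brief)


def toy_flag(brief: str) -> str:
    # Stand-in check: treat an empty submission as spam.
    return "legitimate" if brief.strip() else "spam"


def toy_refine(brief: str) -> str:
    # Stand-in refinement: trim surrounding whitespace.
    return brief.strip()


public_result = run_intake("  fix login  ", toy_flag, toy_refine)
blocked_result = run_intake("", toy_flag, toy_refine)
trusted_result = run_intake("", None, toy_refine)  # flagging turned off
```

Passing `None` for the flagging stage leaves the refining call untouched, which mirrors the idea that a trusted intake can skip flagging without changing anything downstream.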
