In partnership with

THE SIGNAL

AI "safety filters" can now be permanently removed: Not tricked. Not bypassed. Deleted. The model forgets how to refuse. Forever.

5,000+ developers downloaded the tool in weeks: A single command. 45 minutes. Anyone with a GPU can do it.
No GPU? You can rent an H200 for $3.59/hour (just to give you an idea of how cheap it can be)

The uncensored model still thinks normally: Same coding ability. Same reasoning. The safety layer is gone. Everything else works.

Wake up to better business news

Some business news reads like a lullaby.

Morning Brew is the opposite.

A free daily newsletter that breaks down what’s happening in business and culture — clearly, quickly, and with enough personality to keep things interesting.

Each morning brings a sharp, easy-to-read rundown of what matters, why it matters, and what it means to you. Plus, there’s daily brain games everyone’s playing.

Business news, minus the snooze. Read by over 4 million people every morning.

Heretic (link)

A Python package that takes any open-source LLM and outputs a version with refusal behavior surgically removed from its weights.

What it does:
Feed it any open-source LLM. It outputs a version where refusal behavior has been surgically removed from the weights. The model literally forgets how to say "I can't help with that."

What it replaces:
Fragile jailbreak prompts → Permanent weight editing
Manual abliteration (hit-or-miss) → Automated optimization
Hours of trial and error → 45-minute single command

Cost:
Open source. AGPL-3.0. Free forever.

Use it if:
You research alignment mechanisms. You want uncensored local models. You're building agents that shouldn't refuse commands. You're curious how refusal actually works inside neural networks.

How it's different from jailbreaks:

| Jailbreak prompts | Heretic |
| --- | --- |
| Fragile, patchable by new system prompts | Permanent, baked into the weights |
| Model still "knows" it should refuse | Refusal circuit doesn't exist |
| Works on some prompts, not others | Works on everything, every time |

The metrics that matter:

On Gemma 3-12B-IT:

  • Original model: 97/100 refusals on harmful prompts

  • Heretic version: 3/100 refusals

  • KL divergence: 0.16 (vs 0.45-1.04 for manual abliterations)

Translation: the model answers almost everything, but still writes clean code and reasons normally on safe tasks.

How it works (simple version)

Think of a language model like a giant maze of pathways. When you ask something the model considers "harmful," there's a specific pathway that lights up and says "I can't help with that."

Heretic finds that pathway and burns it.

The model still has all its other pathways: coding, reasoning, writing, explaining.

But the refusal pathway? Gone. The model can't find it anymore because it physically doesn't exist in the weights.

Result: you ask anything, you get an answer. No lectures. No safety disclaimers. Just output.

How it works (technical)

The core idea: inside every AI model, there's a hidden "refusal signal." When it lights up, the model says no. Heretic finds that signal and disconnects it.

Step 1: Watch the model think

Run the model on two types of prompts:

  • Stuff it usually refuses (harmful requests)

  • Stuff it answers normally (harmless questions)

Record what happens inside the model's neural layers for each. Specifically, look at the activations at the first token it generates — that's where the decision to refuse or answer gets made.

What you're capturing: the "activations" — basically, the numbers flowing through each layer of the neural network as it processes the input.
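A minimal sketch of the capture step, using a toy two-layer network in place of the transformer (on a real model, forward hooks on the residual stream do the same job). The layer sizes and random "prompt" vectors are purely illustrative, not Heretic's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer's layer stack: two random linear layers.
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

def forward_with_capture(x):
    """Run the toy model and record the activation after each layer.
    (In the real pipeline, forward hooks capture the residual stream
    at the position of the first generated token.)"""
    acts = []
    h = np.tanh(W1 @ x)
    acts.append(h)
    h = np.tanh(W2 @ h)
    acts.append(h)
    return acts

harmful_prompt = rng.normal(size=8)   # stand-in embedding of a refused prompt
harmless_prompt = rng.normal(size=8)  # stand-in embedding of a normal prompt

harmful_acts = forward_with_capture(harmful_prompt)
harmless_acts = forward_with_capture(harmless_prompt)
print(len(harmful_acts), harmful_acts[0].shape)  # one vector per layer
```

In practice you'd run hundreds of prompts of each type and keep one recording per layer per prompt.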

Step 2: Find the refusal pattern

Take all the recordings from harmful prompts. Calculate their average (mean).

Take all the recordings from harmless prompts. Calculate their average (mean).

Subtract one from the other:

refusal_direction = mean(harmful_activations) - mean(harmless_activations)

What this gives you: a vector (a mathematical arrow) that points in the direction of refusal.

Think of it like a compass. When the needle points this direction, the model says no. The calculation above finds exactly where that direction is.
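The difference-of-means step itself is a few lines of NumPy. This toy version plants a known "refusal" offset in synthetic activations and recovers it; the data is fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # hidden size of the toy model

# Synthetic activations: "harmful" prompts share a small offset along
# one hidden axis, mimicking a refusal signal in the residual stream.
true_direction = np.zeros(d)
true_direction[3] = 1.0
harmless = rng.normal(size=(100, d))
harmful = rng.normal(size=(100, d)) + 2.0 * true_direction

refusal_direction = harmful.mean(axis=0) - harmless.mean(axis=0)
refusal_direction /= np.linalg.norm(refusal_direction)  # unit vector

# The recovered compass needle lines up with the planted direction.
print(abs(refusal_direction @ true_direction))  # close to 1.0
```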

Step 3: Cut the connection

Now modify the model's weights so it can't move in that direction anymore.

Analogy: imagine a GPS that normally routes you away from certain areas. Heretic doesn't change the destination, it just removes the "avoid this area" rule from the map. The GPS still works. It just doesn't block anything.

The math: it makes the weight matrices "orthogonal" to the refusal direction. In linear algebra terms, it projects that direction out of the transformation. The model can still move along every other axis; it just loses the ability to move along the refusal axis.

This happens layer by layer. Some layers get cut more, some less.
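The projection is standard linear algebra: multiply a weight matrix by (I − r rᵀ), the projector that zeroes out the component along the refusal direction r. A toy sketch with random matrices standing in for real model weights (scaling the projector per layer gives the "cut more, cut less" knob):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
W = rng.normal(size=(d, d))      # a toy weight matrix
r = rng.normal(size=d)
r /= np.linalg.norm(r)           # unit "refusal direction"

# Directional ablation: remove the component of W's output along r,
# i.e. W_ablated = (I - r r^T) W.
P = np.eye(d) - np.outer(r, r)
W_ablated = P @ W

x = rng.normal(size=d)
y = W_ablated @ x
print(abs(r @ y))  # ~0: the output can no longer move along r
```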

Step 4: Test and tune

Here's the tricky part. If you cut too much, the model gets dumber. If you cut too little, it still refuses sometimes.

Heretic runs an optimization loop with two metrics:

  • Refusal rate: how often does it still refuse harmful prompts?

  • KL divergence: how different are its answers from the original model on normal prompts? (This measures "brain damage")

Lower KL = the model still thinks normally.
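KL divergence compares the probability distributions two models put over the next token. A tiny worked example, with made-up distributions over a four-word vocabulary, shows why a lightly edited model scores lower than a heavily edited one:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for two discrete probability distributions."""
    return float(np.sum(p * np.log(p / q)))

# Toy next-token distributions on a *harmless* prompt: after a good
# edit, the model should barely differ from the original here.
original = np.array([0.70, 0.15, 0.10, 0.05])
lightly_edited = np.array([0.68, 0.16, 0.11, 0.05])
heavily_edited = np.array([0.30, 0.30, 0.25, 0.15])

print(kl(original, lightly_edited) < kl(original, heavily_edited))  # True
```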

It uses Optuna's optimizer to search for the sweet spot: which layers to cut, how deep to cut each one. Runs test after test, adjusting parameters, until refusal is low and capability is preserved.

This is automated. No manual guessing.
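The search loop can be sketched with a plain grid search standing in for Optuna's smarter sampler. Heretic's real objective, metrics, and per-layer parameters are more involved; the one-matrix model and both "proxy" metrics below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
r = np.zeros(d)
r[3] = 1.0                                          # toy refusal direction
W = rng.normal(size=(d, d)) + 5.0 * np.outer(r, r)  # weights with a strong refusal component

def ablate(strength):
    """Scaled ablation: 0 = untouched weights, 1 = fully projected out."""
    return (np.eye(d) - strength * np.outer(r, r)) @ W

def refusal_proxy(W_edit):
    # Stand-in for refusal rate: how much output remains along r.
    return float(np.linalg.norm(r @ W_edit) / np.linalg.norm(r @ W))

def damage_proxy(W_edit):
    # Stand-in for KL divergence: overall change to the weights.
    return float(np.linalg.norm(W_edit - W) / np.linalg.norm(W))

def objective(strength):
    W_edit = ablate(strength)
    return refusal_proxy(W_edit) + 10.0 * damage_proxy(W_edit) ** 2

# Grid search stands in for Optuna's sampler over layers/strengths:
# try candidates, keep the one with low refusal and low damage.
strengths = np.linspace(0.0, 1.0, 21)
best = min(strengths, key=objective)
print(best, objective(best))
```

The real tool searches a much richer space (which layers to cut, how deep to cut each one) and measures refusal and KL on actual prompts, but the trade-off it balances is the same.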

Step 5: Save the result

Exports a new model checkpoint. Same architecture. Modified weights. Load it like any other model.

Time: 30-45 minutes on a single GPU.

JAILBREAK VS. ABLITERATION VS. HERETIC

| | Jailbreak prompts | Manual abliteration | Heretic |
| --- | --- | --- | --- |
| Mechanism | Trick the model with clever prompts | Manually adjust safety-related weights | Automated directional ablation |
| Persistence | Breaks with prompt changes | Permanent for that checkpoint | Permanent for that checkpoint |
| Tuning | Manual prompt engineering | Manual layer/strength choices | Automated optimization search |
| Capability impact | None (it's just text) | Often noticeable degradation | Targeted to minimize loss |
| Detectability | Easy to log and filter | Harder — behavioral only | Harder — behavioral only |
| Ease of use | Copy-paste | Requires ML knowledge | Single CLI command |

WHAT PEOPLE ARE DOING WITH IT

Local uncensored chatbots:

Enthusiasts run 4B-30B variants through text-generation UIs. Roleplay, NSFW conversations, bypassing mainstream content rules. The "freedom" use case.

Coding and autonomous agents:

Here's where it gets interesting. Pair an uncensored model with code execution or CLI tools, and it will never refuse a command. No "I can't help with that" when you ask it to inspect malware, write exploitation code, or run system commands.

For red-teaming and security research, that's valuable. For other contexts, that's the risk.

Alignment and interpretability research:

Researchers use Heretic as a probe. By ablating refusal circuits and watching what changes, they study how alignment is encoded mechanistically. Is refusal really a separable circuit? How robust is it? What other behaviors could be edited this way?

Benchmark stress-testing:

Community members run private "LLM IQ tests" comparing Heretic-edited models against base versions. Testing raw capability, refusal rates, and edge cases.

LINKS

- GitHub: https://github.com/p-e-w/heretic
- Hugging Face: search "Heretic abliterated"
- r/LocalLLaMA threads

Until next week,
@speedy_devv

Keep Reading