Open Weight Models: A Security Guide to the Dangers (2026)

16 min readBy Nathan House
A glowing wireframe AI model file cracked open, a neural-network brain inside, with malicious red code spilling out like a trojan past a broken padlock — surrounded by model files glowing safe cyan and corrupted red with skulls. The hidden danger inside a downloaded open weight model.

If you can run security tools, you can run open weight models. The same property that makes them powerful — a downloadable, modifiable AI you control on your own hardware — is exactly what makes them a supply-chain problem. Right now, people inside your organisation are pulling multi-gigabyte binary blobs off the internet and executing them on privileged machines. In this guide we'll treat an open weight model the way we'd treat any untrusted binary: open the box, look at every file, and work out what can hurt us before we run it.

I've spent 30 years in security, and the pattern is familiar. A new technology arrives, everyone downloads it for free, and the threat model lags years behind. Local AI is not automatically safer AI. It's just a different trust boundary. Let's close the gap.

The unifying idea

Throughout this guide every file, format, and runner gets one of three ratings — SAFE no code-execution-on-load, RISKY parser or template risk, verify the source, or DANGEROUS arbitrary code execution on load. Learn this traffic-light grammar and the rest of the article reads at a glance.

🧠 What Are Open Weight Models? (And Why Security Teams Should Care)

Start with what a model actually is: a giant pile of numbers. When an AI lab trains a model, it feeds huge amounts of data through a neural network and adjusts billions of internal parameters — the "weights" — until the network produces useful text. Those trained parameters are the model. Everything else (the prompt, the interface, the API) is just plumbing around them.

A closed or API model — GPT-4/5, Claude, Gemini — never gives you those weights. You send your prompt to the vendor's servers, you pay per token, and your data leaves your machine. You're renting access.

An open weight model is different. The trained parameters are published for download. You run and fine-tune on your own hardware, offline if you like, with no per-token cost and no data leaving your network. The catch: the training code, data, and methodology stay proprietary. The best analogy is a compiled binary you can run but not rebuild. You have the executable, not the source. Most local models fit here — Llama, Mistral, Qwen, DeepSeek, Gemma, Phi — under licences ranging from permissive (MIT, Apache-2.0) to custom-restricted.

Why should a security team care? Because "free, private, downloadable, modifiable" describes both a productivity win and an attack surface. Anyone can take a model, alter it, and re-upload it — and you may be the one downloading the result. "Private, runs locally" sounds great until you remember you just executed an unverified blob on a machine that matters. The classic problems all come straight back: provenance, integrity, dependency trust, runtime isolation, supply chain.

📦 Open Weight vs Open Source vs Closed: Knowing What You're Running

Three categories, and the difference is about trust and provenance, not marketing.

Comparison diagram of three model categories. Closed/API: weights never released, runs on vendor servers, your data leaves your machine, primary risk is a trust-boundary one. Open weight: weights published, you run and fine-tune locally, training recipe proprietary, run-but-not-rebuild trust problem. Open source (strict): weights plus architecture plus training code plus enough data to reproduce from scratch, OLMo as a rare example. The diagram stresses that most models marketed as open-source are really open weight.

Closed / API: weights never released, runs on the vendor's servers, your data leaves your machine. Your primary risk is a trust-boundary one — data leaving your control, and your reliance on the vendor.

Open weight: weights published; you run and fine-tune locally; but the training recipe is proprietary. That same run-but-not-rebuild trust problem.

Open source (strict): weights plus architecture plus training code plus enough data to reproduce the model from scratch. OLMo is one rare example.

That last distinction matters more than it looks. By the stricter, OSI-aligned definition, the Open Source Initiative's position is that a weights-only release does not qualify as open-source AI — which is precisely why the accurate term "open weight" exists. Most models marketed as "open-source" are really open weight. You get the binary, not the recipe.

This is a supply-chain fact, not pedantry. With a truly open-source model you could, in principle, audit the data and reproduce the build. With a weights-only model you can't — you're trusting a binary blob whose origin you must verify some other way. "Open" doesn't mean "safe"; it means you have a different inspection surface. Treat that "open-source" label with the same suspicion you'd give an unsigned installer.

🗂️ Inside the Box: The Files You Actually Get (File by File)

Download a model and you don't get one magical AI object — you get a folder. Each file carries its own risk. For security teams, the folder contents matter more than the model name. Here's the box, file by file, with my risk rating for each.

~/models/llama-3-8b
$ ls -la ~/models/llama-3-8b
model.safetensors        # SAFE — no code
pytorch_model.bin        # DANGEROUS — pickle
model.Q4_K_M.gguf        # RISKY — chat template
config.json              # RISKY — parser CVEs
tokenizer.json           # RISKY — parser
tokenizer.model          # RISKY — parser
generation_config.json   # SAFE — inert
README.md                # SAFE — docs
LICENSE                  # SAFE — legal

Now the same nine files as a risk grid — the spine of this whole guide. Each row: the file, its rating, what it is, and the one-line defence.

model.safetensors SAFE

Raw tensors plus a JSON metadata header. No code, no Python deserialisation step. The format we want.

Defence: Prefer it whenever it's available.

pytorch_model.bin / .pth / .pkl DANGEROUS

Python pickle. torch.load() unpickles arbitrary objects — arbitrary code execution on load, before any inference runs.

Defence: Prefer safetensors; where you must load pickle, pass weights_only=True and scan first.

model.Q4_K_M.gguf RISKY

A single binary bundling quantized tensors, metadata, an embedded tokenizer, and a Jinja2 chat template. Not a pickle vector, but parser bugs can corrupt memory and the embedded template can act as an inference-time backdoor.

Defence: Verify the source; avoid random re-uploads.

config.json RISKY

Architecture parameters, normally inert — but a crafted config.json has been implicated in code-execution issues in specific Hugging Face Transformers versions.

Defence: Keep runtimes patched; don't assume JSON is harmless just because it's text.

tokenizer.json RISKY

Maps text to and from the numbers the model reads. Complex C/Rust parsers chewing untrusted Unicode — a malformed file can manipulate output, crash the tokenizer (DoS), or trip memory-safety bugs.

Defence: Treat it as part of the trusted package, not harmless text.

tokenizer.model RISKY

SentencePiece tokenizer model — same parser-attack surface as the other tokenizer files.

Defence: Treat it as part of the trusted package, not harmless text.

generation_config.json SAFE

Inert decoding defaults — temperature, top-p, and similar sampling settings.

Defence: Safe to load.

README.md SAFE

The model card — documentation only.

Defence: Safe to load, but don't trust its claims without provenance.

LICENSE SAFE

A compliance matter, not a malware one.

Defence: Safe to load; confirm you can legally use the model.

A few precision points the grid compresses. For model.safetensors, the Hugging Face docs put it plainly: "Safetensors is a new simple format for storing tensors safely (as opposed to pickle) and that is still fast (zero-copy)." "SAFE" here means no code-execution-on-load — not a guarantee the file can never mislead tooling or carry a restrictive licence. For pickle, the HF docs say it directly: "There are dangerous arbitrary code execution attacks that can be perpetrated when you load a pickle file." GGUF quant labels you'll see (compression levels, explained shortly): Q4_K_M, Q5_K_M, Q6_K, Q8_0. And an *.onnx file is MOSTLY SAFE — a Protobuf graph where only runtime parser vulnerabilities apply.

The file-level summary

Safetensors, model card, LICENSE, and generation_config are safe to load. Pickle (.bin/.pth/.pkl) means code execution on load. .gguf, config.json, and tokenizer files are risky — verify the source.

⚙️ How You Run Them: Platforms, Quantization, and Hidden Risk

The runner you choose changes your attack surface more than the model does. Here's the comparison, then the detail.

RunnerFormatCode-exec riskOpen / closedTreat it like
OllamaGGUFRISKYOpen-sourceAn easy on-ramp — watch the install pattern and exposed ports
llama.cppGGUFRISKYOpen-sourceThe engine everything builds on — smallest attack surface
LM StudioGGUFRISKYClosed-source binaryA GUI — plus a separate supply-chain variable
vLLMsafetensorsSAFEOpen-sourceA critical web server — auth, network controls, limits
Hugging Face TransformersAny (incl. pickle)DANGEROUS — the real code-exec runtimeOpen-source libraryThe dangerous one — never trust_remote_code on untrusted repos

Ollama is the easy on-ramp: a background daemon with an OpenAI-compatible API, ollama run llama3, built on llama.cpp, running on CPU plus NVIDIA, AMD, or Apple Silicon, auto-picking a GGUF quant. GGUF avoids Python deserialisation and code-execution-on-load, so it's lower risk than pickle or Transformers at the model level — though it's still parsed by complex C/C++ and can carry a manipulated chat template, so provenance still matters. The cautions here are mostly operational: the curl … | sh install pattern, and accidentally exposing the API port.

llama.cpp is the C/C++ engine almost everything builds on. Most hardware-portable, it defines GGUF, you compile it yourself, and it has the smallest attack surface. LM Studio is a GUI desktop app to browse, download, and chat with GGUF models, with a built-in OpenAI-compatible server — but note it's a closed-source binary, a separate supply-chain variable. vLLM is the production server — high-throughput, GPU-heavy, primarily NVIDIA in production, not built for CPU. Treat it exactly like a critical web server: authentication, network controls, resource limits.

Hugging Face Transformers is the Python library, maximum flexibility — and the real code-execution runtime. trust_remote_code=True runs arbitrary Python straight from the model repo, and older pickle weights execute code on load. Mitigate by using safetensors and never setting trust_remote_code on untrusted repos.

The one sentence to remember

GGUF runtimes (Ollama, llama.cpp, LM Studio) don't execute code on load, so the dangerous runtime is Hugging Face Transformers with trust_remote_code=True — that's where code actually runs straight from the repo.

Quantization is the compression trick that makes local models practical. Models are trained in high-precision numbers (FP16/FP32); quantization shrinks them to INT8 or INT4, cutting size and VRAM roughly proportionally for a small quality cost. A 70B model is about 140GB at FP16 but roughly 40GB at 4-bit. GGUF Q4_K_M is the sensible default; below Q3, quality degrades noticeably. As a rough VRAM guide (before context and KV-cache overhead): an 8B model at Q4 needs about 4GB, so even a modest GPU or an Apple Silicon Mac runs useful models locally.

My practical recommendation: use GGUF through llama.cpp or Ollama for casual local inference, and reach for Transformers only when you specifically need its flexibility and can control the repo, the format, and the execution settings. (If you want a worked local-agent setup, see our Claude Code with Ollama guide.)

🔧 Turning Good Models Bad: How Attackers Weaponize Open Weights

Once weights are downloadable, they're modifiable. That's the feature, and it's the risk. Three classes of attack, each with a mechanism, a real example, and a defence.

Diagram of three attack classes against open weight models. (a) Guardrail removal: abliteration erases the single residual-stream direction that mediates refusal, plus cheap fine-tuning attacks that retrain safety out. (b) Backdoors and sleeper agents: a model behaves normally until a trigger appears, surviving standard safety training. (c) Malware in the file: a pickle file executes code on load via the __reduce__ method, before any inference. Each class shows mechanism, a real example, and a defence.

(a) Guardrail removal

Mechanism — abliteration. Safety refusals aren't a firewall bolted onto a model; they're a behaviour encoded in the weights. Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (arXiv:2406.11717, 2024), found refusal is governed by a single direction in the model's residual stream — think of it as one conditional jump inside a compiled binary, where patching that single pathway makes the whole "should I refuse?" check disappear. They demonstrated it across 13 open-source chat models up to 72B parameters. Erase that one direction and the model stops refusing while keeping most of its capability. The authors describe it as a white-box jailbreak. Be precise here: the term "abliteration" is a community coinage (Maxime Labonne, "Uncensor any LLM with abliteration") — it is not the paper's term. The difficulty is low: no gradients, a single GPU. Abliterated or "uncensored" models are common on Hugging Face.

Mechanism — fine-tuning attacks. You can also retrain safety out cheaply. Qi et al., "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" (arXiv:2310.03693, 2023), jailbroke GPT-3.5 Turbo's safety guardrails by fine-tuning on only 10 adversarial examples at a cost of less than $0.20 — and showed that even benign fine-tuning can inadvertently degrade safety alignment.

Defence. Don't assume a downloaded model's safety behaviour survived re-uploading, merging, quantizing, or fine-tuning. Treat "uncensored" variants as exactly that, and test refusal behaviour before deployment. There's also a defensive technique: "An Embarrassingly Simple Defense Against LLM Abliteration Attacks" (arXiv:2505.19056, 2025) shows extended-refusal fine-tuning holds the line — refusal drops by at most 10% under abliteration, versus 70–80% drops in baseline models.

(b) Backdoors and sleeper agents

Mechanism. A backdoor makes a model behave normally until a trigger appears — which is exactly why ordinary evaluation misses it. Hubinger et al. at Anthropic, "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" (arXiv:2401.05566, 2024), trained a model to write secure code when the year is 2023 but insert exploitable code when the year is 2024. The alarming finding: the backdoor was not removed by standard safety training — not supervised fine-tuning, not reinforcement learning, not adversarial training. Worse, adversarial training taught the models to better recognise their triggers, effectively hiding the unsafe behaviour.

Real example. Mithril Security's PoisonGPT"How We Hid a Lobotomized LLM on Hugging Face to Spread Fake News" — surgically modified an open-source model (GPT-J) to spread fake news and showed "it is possible to modify specific facts and have it still pass the benchmarks." The poisoned model looked clean on standard tests.

Defence. You cannot benchmark a backdoor away — PoisonGPT passed the benchmarks. Your defence is provenance: download only from the original, verified creator, cryptographically verify hashes and signatures against their release, and keep high-risk use cases away from unknown re-uploads. Pair it with network isolation so a backdoored model has nowhere to exfiltrate.

(c) Malware in the file (pickle)

Mechanism. This is the most direct attack. A pickle file executes code on load via the __reduce__ method — so simply loading the weights runs the attacker's payload on the host, before any inference. No model trickery required; it's plain code execution.

To be clear about the shape of the consequence — not a payload, just the effect — loading a poisoned pickle looks benign right up to the moment the host is compromised:

python — EFFECT ONLY (no runnable payload)
>>> import torch
>>> torch.load("pytorch_model.bin")
# ...weights appear to load normally...
# (illustrative only — the unpickler silently ran attacker code)
# result: attacker now has a shell on this host
# no inference has run yet — the damage is already done

Defence. Prefer safetensors, scan pickle files before loading, and never load untrusted pickle weights in an environment you care about. The documented incidents in the next section show exactly why.

🚨 The Real Risks: What a Poisoned Model Can Do to You

In February 2024, security researchers found a model on Hugging Face that opened a reverse shell the moment it was loaded. That's the concrete shape of the threat. Now let's map it to frameworks you can take to a risk committee, then look at the full record of what's happened.

A poisoned open weight model lands squarely on the OWASP Top 10 for LLM Applications 2025:

MITRE ATLAS gives us the precise technique. AML.T0018 "Backdoor ML Model": "Adversaries may introduce a backdoor into a ML model. A backdoored model operates/performs as expected under typical conditions, but will produce the adversary's desired output when a trigger is introduced to the input data. A backdoored model provides the adversary with a persistent artifact on the victim system." That last phrase — persistent artifact — is the part to dwell on. A poisoned model isn't just bad content; it's a durable compromise.

And the NIST AI Risk Management Framework (AI RMF 1.0) wraps governance around all of it through four core functions: Govern, Map, Measure, Manage.

Now the documented incidents — and if management asks for proof this is real, these are it.

[ JFrog · 2024-02-27 ]

~100 malicious model instances found on Hugging Face

mechanism : PyTorch pickle abusing the __reduce__ method
model     : baller423/goober2
action    : initiated a reverse shell on load
target    : 210.117.212[.]93   (defanged)
impact    : load the model, hand someone a shell

The full write-up is JFrog's "Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor."

[ ReversingLabs · nullifAI · 2025-02-06 ]

Two malicious models NOT flagged as unsafe

mechanism : broken, malformed pickle files compressed with 7z
evasion   : Hugging Face Picklescan did not flag them
payload   : reverse shell to a hardcoded IP
technique : ReversingLabs named the evasion "nullifAI"
lesson    : the platform green tick is a signal, not a guarantee

The full analysis is ReversingLabs' "Malicious ML models discovered on Hugging Face platform."

That second case is the important one. Hugging Face's own scanner runs ClamAV plus a Pickle Import scan — but, in HF's own words, ClamAV "doesn't cover pickle exploits" and the pickle import scan is "not 100% foolproof." The platform's green tick is a signal, not a guarantee.

✅ Safe vs Unsafe: A Security Checklist for Open Weight Models

Here's the checklist we'd run before loading any open weight model. Work through it in order. Any UNSAFE result means stop.

  1. Format. Safetensors is green. Pickle and any other executable or code-deserialising ML format (the .bin/.pth files above, plus formats like H5 and TensorFlow SavedModel) are scan-first.
  2. Hub scan verdict. The Hugging Face Hub scan shows "SAFE" for that specific commit. (Necessary, not sufficient — remember nullifAI.)
  3. Provenance. It's from the original creator or a verified org, not an unknown re-upload, with a real, detailed model card.
  4. Integrity. Checksums and signatures verify against the trusted source.
  5. Local static scan before loading.
  6. Licence and dependency check. Confirm you can legally use the model and vet its inference-code dependencies.

One caveat: scanners themselves have had CVEs. Layer them; don't trust one. Here are the tools.

ToolWhat it is / does
Protect AI ModelScanApache-2.0, free, OSS. "ModelScan currently supports: H5, Pickle, and SavedModel formats. This protects you when using PyTorch, TensorFlow, Keras, Sklearn, XGBoost…" Reads the file "one byte at a time… looking for code signatures that are unsafe" — it does not load or execute the model.
picklescanMIT. A "security scanner detecting Python Pickle files performing suspicious actions." Scans local files, directories, URLs, and zip archives; reports "Dangerous globals."
Hugging Face Hub scanningFree, automatic. Layers ClamAV, a Pickle Import scan, and Protect AI integration, showing a security verdict on the model page. First filter, not last word.
Protect AI GuardianCommercial, enterprise zero-trust scanner for teams that need it.

My recommendation: hub scanning is the first filter, a local scan with two independent tools is the second, and sandboxed loading is the final backstop. Never let the Hub's verdict be your only control.

🛡️ Running Untrusted Models Safely: Sandboxing and Isolation

Sometimes you must run a model you don't fully trust — evaluating a community fine-tune, say. Even after a clean scan, treat the model as hostile data. The principle is zero trust: assume the model will try to execute code, and build so that it can't reach anything that matters.

Sandboxing and isolation diagram for running untrusted models. Concentric controls: sandbox inference in a minimal Docker container or VM (non-root, dropped Linux capabilities, read-only root filesystem); network isolation with no outbound connectivity by default (the single control that neutralises a reverse shell); least privilege on the weights (encrypt at rest, mount read-only, restrictive permissions); resource limits capping CPU, GPU, and memory; hardened endpoint with scoped short-lived auth, RBAC, rate limiting, TLS; and monitor-and-log with integrity-protected logs plus runtime anomaly monitoring.

🔴 Treat the model as hostile

  • It may try to execute code the moment you load it
  • It may call home or exfiltrate over any available network path
  • A compromised service may try to swap trusted weights for poisoned ones
  • Malformed inputs may exhaust CPU, GPU, or memory

🟢 Build so it can't reach anything

  • Sandbox inference in a minimal Docker container or VM — non-root, dropped capabilities, read-only root FS
  • Network isolation: no outbound connectivity by default — the single control that neutralises a reverse shell
  • Least privilege on the weights: encrypt at rest, mount read-only, restrictive permissions
  • Resource limits on CPU, GPU, and memory; harden the endpoint (scoped short-lived auth, RBAC, rate limiting, TLS); monitor and log

Then put governance over it. Use the NIST AI RMF (Govern, Map, Measure, Manage) as your operating model and the OWASP LLM Top 10 as your gap checklist. Frameworks don't stop a reverse shell — but they make sure someone owns the control that does. If your team is building these skills, our AI Master's Program covers securing AI systems end to end, and our AI cybersecurity threats analysis separates the real risks from the hype.

❓ Open Weight Models FAQ

Are open weight models safe?

The format and the source decide it. A safetensors file from a verified original creator, scanned and run with no outbound network, is reasonable. An unscanned pickle file from a random re-upload can execute code the moment you load it — the JFrog and nullifAI incidents are documented proof. Safe is a process, not a property.

What's the difference between open weight and open source?

Open weight means you get the trained parameters — a binary you can run but not rebuild. Strict open source adds the architecture, training code, and enough data to reproduce the model. The Open Source Initiative holds that weights-only release does not qualify as open-source AI, which is why "open weight" is the accurate term.

Is safetensors safer than pytorch_model.bin?

Yes, clearly. Safetensors stores raw tensors with a JSON header and contains no code. A pytorch_model.bin is a Python pickle, and torch.load() can execute arbitrary code on load. Hugging Face's own docs warn of "dangerous arbitrary code execution attacks" when loading pickle files. Prefer safetensors every time.

Can an AI model contain malware?

Yes, and it has. Pickle-based model files execute code on load via the __reduce__ method. JFrog found around 100 malicious instances on Hugging Face in February 2024, including one (baller423/goober2) that opened a reverse shell when loaded. In 2025 the nullifAI case showed attackers actively evading scanners. Beyond malware in the file, models can also carry backdoors and sleeper agents that trigger on specific inputs.

What is an abliterated or "uncensored" model?

A model whose safety refusals have been removed. The Arditi et al. research showed refusal is mediated by a single direction in the residual stream; erase that direction and the model stops refusing while keeping most of its capability. "Abliteration" is a community coinage, not the paper's term. These variants are common on Hugging Face — treat them as deliberately stripped of safety behaviour.

How do I scan an LLM for malware?

Use a dedicated static scanner before loading — never run the model to test it. Protect AI's ModelScan (Apache-2.0, free) reads the file one byte at a time looking for unsafe code signatures without executing it, and supports H5, Pickle, and SavedModel. picklescan (MIT) detects pickle files performing suspicious actions across files, directories, URLs, and zips. Layer at least two — scanners themselves have had CVEs — verify cryptographic hashes against the source, and never rely on the Hugging Face verdict alone.

Last Updated: June 2026. Treat every downloaded model as an untrusted binary — scan before you load, and isolate before you run.

About the Author

Nathan House

Nathan House, Founder & CEO of StationX

Nathan House has 30 years of hands-on cybersecurity experience and is Cambridge-educated, holding CISSP, CISA, CISM, OSCP, CEH, and SABSA. He founded StationX in 1999 — one of the UK’s first cybersecurity companies — and has secured £71 billion in UK mobile banking transactions and the London 2012 Olympics, advising clients including Microsoft, Cisco, BP, Vodafone, and VISA. He authored the world’s most popular cybersecurity course — a #1 Udemy bestseller taken by over 500,000 students — and was named Cyber Security Educator of the Year 2020, AI Security Educator of the Year, and a UK Top 25 Security Influencer 2025. A DEF CON speaker and featured expert on CNN, Fox News, NBC, and the BBC, Nathan leads StationX’s training of more than half a million students worldwide.