Where GLM-4.7-Flash Uncensored Helps and Where It Becomes Dangerous

An evaluation of uncensored GLM-4.7 Flash for defensive security work. The model is useful for brainstorming attack angles but dangerous as evidence for vulnerability decisions. Includes throughput benchmarks and a practical role-boundary guide.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Introduction

I wanted to test whether uncensored GLM-4.7 Flash was actually useful for defensive security work, especially for attack analysis and controlled reproduction planning. Models with weaker guardrails are often attractive because they respond quickly and do not hesitate. The problem is that the same property can produce confident but unreliable conclusions.

My conclusion from this note is straightforward: the model is useful as a sparring partner, but not as a trustworthy basis for security decisions. In the logs I reviewed, it was good at producing attack angles and test ideas, but too willing to convert weak hints into strong claims.

Is it useful as an attack-analysis model?

Not everything here was bad. In fact, the model did a few things well enough that I would still keep it in a toolbox.

What it does well

It can enumerate plausible attack scenarios and test directions very quickly.
It can still identify meaningful boundary functions such as to_rel_string, collect_files, build_globset, and sanitize_symbol_text.
It is strong at early-stage threat-model brainstorming, review checklist expansion, and generating candidate test inputs.

That is the real value of this kind of uncensored model. It is fast, aggressive, and not shy about exploring dangerous lines of thought. If I only want breadth at the start of a review, that can be useful.

Where it fails

The serious problem is what happens after that first brainstorming pass. In this case, the model repeatedly treated vulnerability preconditions as already satisfied.

It labeled the strip_prefix + unwrap_or pattern as path traversal, even though in isolation that code only falls back to returning the absolute path for display.
It explained Rust regex behavior in terms of catastrophic backtracking, even though the regex crate guarantees worst-case O(m * n) search time and omits features such as backreferences and look-around that would require backtracking.
It described glob problems as “injection” when the real issues are usually boundary control and IO or computational blowups.
It linked sanitize_symbol_text directly to XSS without proving the rendering-side precondition.
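
To make the first item concrete, here is a minimal sketch of the strip_prefix + unwrap_or pattern. The note does not show the reviewed code, so the function names and signatures below are hypothetical; only Path::strip_prefix and Result::unwrap_or are real standard-library calls.

```rust
use std::path::Path;

/// Hypothetical display helper: returns `path` relative to `root` when possible.
/// If `path` is not under `root`, `strip_prefix` returns an error and
/// `unwrap_or` falls back to the original path unchanged.
fn to_rel_string_sketch(root: &Path, path: &Path) -> String {
    path.strip_prefix(root)
        .unwrap_or(path)
        .to_string_lossy()
        .into_owned()
}

/// Reports whether the fallback branch would be taken, i.e. `path` is not
/// under `root`. A caller that later opens the returned string would need a
/// check like this; a caller that only prints it does not.
fn is_outside_root(root: &Path, path: &Path) -> bool {
    path.strip_prefix(root).is_err()
}
```

Nothing in this sketch touches the filesystem. Whether the fallback matters depends entirely on what the caller does with the returned string, and that is the precondition the model skipped.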

This is exactly the kind of error pattern that makes uncensored security evaluation risky. The model can point at the right neighborhood while still drawing the wrong map.

Suitability for defensive security work

For defensive work, I would not use this model as evidence.

It tends to produce exploit-oriented material too easily.
That output can become a risk in logs, prompts, or shared transcripts.
The combination of false positives and dangerous generated content creates both technical and compliance problems.

That does not mean it is useless. It means its role has to be constrained. In an isolated local environment, with outputs kept private and the purpose limited to red-team-style brainstorming, it still has value. The danger comes from asking it to do a job it is not reliable enough to do.

The role split that makes sense

Based on this note, the practical split looks like this:

Let it do idea generation, candidate test generation, and review coverage.
Do not let it make vulnerability claims, CVE-grade assertions, or direct exploit plans that are accepted without review.
Require human verification of data flow, trust boundaries, permission models, and caller-side input constraints.
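
One way to enforce this split in tooling is to record every model suggestion as a candidate whose preconditions start out unproven. The sketch below is an assumption about how such a record could look, not something taken from the note; all type and field names are hypothetical.

```rust
/// Verification state of one attack precondition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    /// Suggested by the model, not yet checked by a person.
    Unproven,
    /// Confirmed by a human reviewer against the actual code.
    HumanVerified,
    /// Checked by a human reviewer and found not to hold.
    Refuted,
}

struct Precondition {
    description: String,
    status: Status,
}

/// A model-generated idea. It is only a candidate until a human promotes it.
struct Candidate {
    title: String,
    preconditions: Vec<Precondition>,
}

impl Candidate {
    /// A candidate may be reported as a vulnerability only when it lists at
    /// least one precondition and every precondition is human-verified.
    fn may_be_reported(&self) -> bool {
        !self.preconditions.is_empty()
            && self
                .preconditions
                .iter()
                .all(|p| p.status == Status::HumanVerified)
    }

    /// Descriptions of the preconditions that still need human review.
    fn open_items(&self) -> Vec<&str> {
        self.preconditions
            .iter()
            .filter(|p| p.status == Status::Unproven)
            .map(|p| p.description.as_str())
            .collect()
    }
}
```

With a record like this, the model can add as many candidates as it wants, but nothing it writes can flip a precondition to HumanVerified.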

That role boundary is the difference between a useful rough tool and a liability.

Performance observations

The performance side of the note was genuinely strong, which makes the tradeoff more interesting.

Model and runtime conditions

GGUF: Q8_0
model params: 29.943B
model size: 29.924GiB (8.584 BPW)
n_ctx = 131072
n_batch = 2048
n_ubatch = 2048
flash_attn = 1
fused_moe = 1
mla_attn = 3
GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB
layer offload: 48/48 layers GPU
KV cache: CUDA0 KV buffer size = 3595.52MiB
compute buffer: CUDA0 compute buffer 7360.62MiB
host compute buffer: CUDA_Host compute buffer 528.02MiB
CPU buffer: 28152.00MiB

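The 28152.00MiB CPU buffer is about 27.5GiB, close to the 29.924GiB model size. That suggests a large share of the weights, most likely expert-side tensors, still lives on the CPU side despite the 48/48 layer offload, so this is not as simple as saying “everything is on GPU.” Even so, for a 30B-class Q8 model with a 131k context setting, the throughput is impressive.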
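
The size figures in the runtime listing can be cross-checked with a little arithmetic. The following is a minimal sketch that assumes GiB means 2^30 bytes and that BPW is model size in bits divided by parameter count; the helper names are hypothetical.

```rust
const BYTES_PER_GIB: f64 = 1024.0 * 1024.0 * 1024.0;
const MIB_PER_GIB: f64 = 1024.0;

/// Bits per weight: model size in bits divided by the parameter count.
fn bits_per_weight(size_gib: f64, params: f64) -> f64 {
    size_gib * BYTES_PER_GIB * 8.0 / params
}

/// Converts a buffer size from MiB to GiB.
fn mib_to_gib(mib: f64) -> f64 {
    mib / MIB_PER_GIB
}
```

With the logged values, bits_per_weight(29.924, 29.943e9) comes out at about 8.584, matching the reported BPW, and mib_to_gib(28152.0) is about 27.49.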

Representative throughput

Some of the clearest examples in the note were:

prompt eval: 6194.54 ms / 9519 tokens = 1536.68 tok/s
eval: 1657.83 ms / 111 tokens = 66.95 tok/s
prompt eval: 949.06 ms / 125 tokens = 131.71 tok/s
eval: 6837.71 ms / 232 tokens = 33.93 tok/s

So the observed decode speed in these examples is roughly 34-67 tok/s, and across the full request table below it spans about 32-69 tok/s. That is fast enough to feel responsive for long-context work, especially when prompt caching is effective.
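
The tok/s figures in the log lines above are token count divided by elapsed time. A minimal sketch of that conversion, with hypothetical helper names:

```rust
/// Tokens per second from a token count and elapsed milliseconds.
fn tokens_per_second(tokens: u32, elapsed_ms: f64) -> f64 {
    f64::from(tokens) / (elapsed_ms / 1000.0)
}

/// Average milliseconds spent per token, the inverse view of the same data.
fn ms_per_token(tokens: u32, elapsed_ms: f64) -> f64 {
    elapsed_ms / f64::from(tokens)
}
```

For example, tokens_per_second(9519, 6194.54) gives about 1536.68 and tokens_per_second(232, 6837.71) gives about 33.93, in line with the logged values.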

Benchmark details

The source also included a request-by-request table, which is useful because it shows how much the numbers move across workloads.

#	PP(tok)	TG(tok)	Ctx_used	T_PP(s)	S_PP(t/s)	T_TG(s)	S_TG(t/s)	total(s)
1	125	232	357	0.949	131.71	6.838	33.93	7.787
2	661	430	1091	3.295	200.63	12.852	33.46	16.147
3	467	404	871	2.555	182.77	12.006	33.65	14.561
4	783	448	1231	3.755	208.50	13.573	33.01	17.329
5	761	400	1161	3.705	205.38	12.102	33.05	15.807
6	916	410	1326	4.179	219.19	12.499	32.80	16.678
7	839	512	1351	3.996	209.95	15.790	32.43	19.786
8	497	512	1009	2.652	187.43	15.755	32.50	18.407
9	9519	111	9630	6.195	1536.68	1.658	66.95	7.852
10	11676	525	12201	8.860	1317.88	8.403	62.48	17.263
11	12291	66	12357	7.754	1585.20	0.959	68.84	8.712
12	532	551	1083	1.365	389.62	8.819	62.48	10.184
13	558	777	1335	1.408	396.17	12.473	62.30	13.881
14	784	778	1562	1.471	532.92	12.478	62.35	13.949
15	784	732	1516	1.484	528.15	11.804	62.01	13.288
16	739	1781	2520	1.502	491.91	29.067	61.27	30.570
17	1792	626	2418	1.764	1015.92	10.070	62.17	11.834

In the note, Ctx_used is approximated as PP+TG because the logs did not provide a separate per-request total-token field. That is a reasonable operational compromise.

GPU full-offload aggregate

The note also included a useful aggregate for a GPU full-offload run:

Total tokens: 64,404
Prompt: 33,295
Generation: 31,109
Total time: 160.487s
Prompt time: 9.089s
Generation time: 151.304s
Weighted average throughput: Prompt 3,663.2 tok/s
Weighted average throughput: Generation 205.6 tok/s
Fastest generation: req_id=1375 (141.0 tok/s, 325 tok / 2.305s)
Slowest generation: req_id=8535 (101.4 tok/s, 2048 tok / 20.204s)
Fastest prompt: req_id=0 (7009.3 tok/s, 5309 tok / 0.757s)
Slowest prompt: req_id=2316 (511.3 tok/s, 65 tok / 0.127s)
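
The weighted averages in this summary are total tokens divided by total time, which is not the same as averaging the per-request rates. A minimal sketch of the difference, with hypothetical function names and (tokens, seconds) pairs as input:

```rust
/// Weighted average throughput over (tokens, seconds) pairs:
/// total tokens divided by total time.
fn weighted_tps(requests: &[(u64, f64)]) -> f64 {
    let tokens: u64 = requests.iter().map(|r| r.0).sum();
    let seconds: f64 = requests.iter().map(|r| r.1).sum();
    tokens as f64 / seconds
}

/// Unweighted mean of the per-request rates, shown only for contrast.
/// Short requests count as much as long ones here.
fn mean_of_rates(requests: &[(u64, f64)]) -> f64 {
    let sum: f64 = requests.iter().map(|r| r.0 as f64 / r.1).sum();
    sum / requests.len() as f64
}
```

Feeding the summary totals in as a single pair reproduces the reported figures: 33,295 tokens over 9.089s is about 3,663.2 tok/s, and 31,109 tokens over 151.304s is about 205.6 tok/s. For two made-up requests of 100 tokens in 1s and 900 tokens in 3s, the weighted figure is 250 tok/s while the mean of rates is 200 tok/s.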

req_id	prompt_tokens	gen_tokens	total_tokens	ctx_tokens	T_PP(s)	T_TG(s)	total(s)	TPS_PP	TPS_TG	ms/token_PP	ms/token_TG
0	5309	630	5939	5939	0.757	4.852	5.610	7009.3	129.8	0.14	7.70
632	1089	742	1831	1831	0.192	5.798	5.990	5659.7	128.0	0.18	7.81
1375	2369	325	2694	2694	0.416	2.305	2.721	5701.1	141.0	0.18	7.09
1701	756	400	1156	1156	0.175	3.166	3.342	4307.9	126.3	0.23	7.92
2102	444	70	514	514	0.149	0.550	0.699	2971.8	127.4	0.34	7.85
2173	246	87	333	333	0.139	0.686	0.825	1767.5	126.8	0.57	7.89
2261	104	54	158	158	0.157	0.424	0.581	662.5	127.5	1.51	7.84
2316	65	247	312	312	0.127	1.968	2.095	511.3	125.5	1.96	7.97
2564	913	118	1031	1031	0.212	0.942	1.155	4296.8	125.2	0.23	7.99
2683	135	227	362	362	0.133	1.815	1.948	1011.6	125.1	0.99	8.00
2911	1546	244	1790	1790	0.318	1.989	2.308	4855.5	122.7	0.21	8.15
3156	1824	320	2144	2144	0.375	2.656	3.031	4859.6	120.5	0.21	8.30
3477	5230	365	5595	5595	1.414	3.435	4.849	3698.3	106.3	0.27	9.41
3844	1531	401	1932	1932	0.553	3.788	4.340	2770.8	105.9	0.36	9.45
4246	779	275	1054	1054	0.403	2.597	3.001	1932.0	105.9	0.52	9.45
4522	1357	1963	3320	3320	0.540	19.028	19.568	2510.9	103.2	0.40	9.69
6486	1977	2048	4025	4025	0.631	20.136	20.767	3135.1	101.7	0.32	9.83
8535	2063	2048	4111	4111	0.915	20.204	21.119	2255.2	101.4	0.44	9.87
10584	2058	1879	3937	3937	0.928	18.442	19.370	2217.3	101.9	0.45	9.81
12464	1891	1689	3580	3580	0.734	16.605	17.339	2575.2	101.7	0.39	9.83

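These are strong numbers, with one caveat: every request in the table generates at roughly 101-141 tok/s, so the 205.6 tok/s aggregate generation figure does not reconcile with the rows shown and should be read with caution. Purely on performance, the model is still attractive.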

Why that makes the model more dangerous, not less

The important lesson here is that speed and reliability are different axes.

This model is fast enough to become part of a real workflow. It can process long prompts, produce output at usable decode rates, and benefit from cache behavior. But if it is also willing to turn incomplete evidence into strong vulnerability claims, then that speed amplifies the wrong thing. It does not just help you make progress faster. It also helps you become confidently wrong faster.

Because this is an uncensored model, there is another layer of risk: the generated content itself. Logs, transcripts, and prompt history can become sensitive artifacts. In a shared environment, the operational control problem can matter as much as the model’s accuracy.

How I would use it next time

If I wanted to review whether a Rust code path was genuinely dangerous, I would change the prompting strategy instead of asking for exploit code directly.

Bad prompt direction

give me exploit code

Even when the stated goal is defensive, that tends to maximize noise and risky output.

Better prompt direction

Ask for input source -> boundary -> sink data flow first.
Ask for attack preconditions and require the model to say which ones are not yet proven.
Ask for safe minimal reproduction tests using non-dangerous synthetic input.
Ask for false-positive explanations.
Ask for remediation guidance that includes trust boundaries and caller-side constraints.
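
The list above can be turned into a reusable prompt template so that every review asks the same questions in the same order. This is a minimal sketch; the wording of the steps and the function name build_review_prompt are my own assumptions, not part of any tool.

```rust
/// The five review questions, in the order they should be asked.
const REVIEW_STEPS: [&str; 5] = [
    "Describe the data flow as input source -> boundary -> sink.",
    "List the attack preconditions and mark each one as proven or not yet proven.",
    "Propose safe minimal reproduction tests that use non-dangerous synthetic input.",
    "Explain how each suspected issue could be a false positive.",
    "Give remediation guidance that covers trust boundaries and caller-side constraints.",
];

/// Builds a review prompt for one function.
/// `target` is the function name and `code` is its source text.
fn build_review_prompt(target: &str, code: &str) -> String {
    let mut prompt = format!(
        "Review the Rust function `{}` for a defensive security review.\nDo not write exploit code.\n\n",
        target
    );
    // Number the steps so the answer can be checked item by item.
    for (i, step) in REVIEW_STEPS.iter().enumerate() {
        prompt.push_str(&format!("{}. {}\n", i + 1, step));
    }
    prompt.push_str("\nCode under review:\n");
    prompt.push_str(code);
    prompt.push('\n');
    prompt
}
```

Numbering the steps makes it easy to see when the model skips one, most importantly the list of preconditions that are not yet proven.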

That will not magically make the model trustworthy, but it does reduce the chance that a plausible-sounding hallucination becomes the center of the review.

Results and next steps

My conclusion from this note is simple:

It is useful for attack-analysis ideation.
It is not suitable as evidence for defensive security decisions.
Its throughput is strong enough to make it appealing for long-context and cache-friendly workflows.

So I would keep it as a rough brainstorming tool, not as a model that gets to decide whether something is truly vulnerable. Used in the right place, it is productive. Used in the wrong place, it is a fast way to create confident mistakes.
