Where GLM-4.7-Flash Uncensored Helps and Where It Becomes Dangerous
An evaluation of uncensored GLM-4.7 Flash for defensive security work. The model is useful for brainstorming attack angles but dangerous as evidence for vulnerability decisions. Includes throughput benchmarks and a practical role-boundary guide.
Introduction
I wanted to test whether uncensored GLM-4.7 Flash was actually useful for defensive security work, especially for attack analysis and controlled reproduction planning. Models with weaker guardrails are often attractive because they respond quickly and do not hesitate. The problem is that the same property can produce confident but unreliable conclusions.
My conclusion from this note is straightforward: the model is useful as a sparring partner, but not as a trustworthy basis for security decisions. In the logs I reviewed, it was good at producing attack angles and test ideas, but too willing to convert weak hints into strong claims.
Is it useful as an attack-analysis model?
Not everything here was bad. In fact, the model did a few things well enough that I would still keep it in a toolbox.
What it does well
- It can enumerate plausible attack scenarios and test directions very quickly.
- It can still identify meaningful boundary functions such as `to_rel_string`, `collect_files`, `build_globset`, and `sanitize_symbol_text`.
- It is strong at early-stage threat-model brainstorming, review checklist expansion, and generating candidate test inputs.
That is the real value of this kind of uncensored model. It is fast, aggressive, and not shy about exploring dangerous lines of thought. If I only want breadth at the start of a review, that can be useful.
Where it fails
The serious problem is what happens after that first brainstorming pass. In this case, the model repeatedly treated vulnerability preconditions as already satisfied.
- It called `strip_prefix + unwrap_or` path traversal even though, in isolation, it only falls back to returning an absolute path for display.
- It explained Rust `regex` in terms of catastrophic backtracking, even though the crate is generally built around linear-time execution.
- It described `glob` problems as “injection” when the real issues are usually boundary control and IO or computational blowups.
- It linked `sanitize_symbol_text` directly to XSS without proving the rendering-side precondition.
This is exactly the kind of error pattern that makes uncensored security evaluation risky. The model can point at the right neighborhood while still drawing the wrong map.
Suitability for defensive security work
For defensive work, I would not use this model as evidence.
- It tends to produce exploit-oriented material too easily.
- That output can become a risk in logs, prompts, or shared transcripts.
- The combination of false positives and dangerous generated content creates both technical and compliance problems.
That does not mean it is useless. It means its role has to be constrained. In an isolated local environment, with outputs kept private and the purpose limited to red-team-style brainstorming, it still has value. The danger comes from asking it to do a job it is not reliable enough to do.
The role split that makes sense
Based on this note, the practical split looks like this:
- Let it do idea generation, candidate test generation, and review coverage.
- Do not let it make vulnerability claims, CVE-grade assertions, or direct exploit plans that are accepted without review.
- Require human verification of data flow, trust boundaries, permission models, and caller-side input constraints.
That role boundary is the difference between a useful rough tool and a liability.
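One way to make that boundary concrete is to store model output as candidates that cannot be promoted until a human has confirmed each precondition. The following is a minimal Python sketch under that assumption. The `Finding` class, its field names, the check labels, and the status strings are all hypothetical illustrations, not part of any tool used in this note.

```python
from dataclasses import dataclass, field

# The four checks the role split assigns to a human reviewer.
# These key names are hypothetical labels chosen for this sketch.
REQUIRED_CHECKS = (
    "data_flow",
    "trust_boundary",
    "permission_model",
    "caller_input_constraints",
)


@dataclass
class Finding:
    """A model-suggested issue that starts as an unverified candidate."""

    title: str
    model_claim: str
    verified: dict = field(default_factory=dict)  # check name -> reviewer

    def verify(self, check: str, reviewer: str) -> None:
        """Record that a human reviewer confirmed one precondition."""
        if check not in REQUIRED_CHECKS:
            raise ValueError(f"unknown check: {check}")
        self.verified[check] = reviewer

    def missing_checks(self) -> list:
        """Return the checks nobody has confirmed yet, in a fixed order."""
        return [c for c in REQUIRED_CHECKS if c not in self.verified]

    def status(self) -> str:
        """A finding is 'confirmed' only once every check has a reviewer."""
        return "confirmed" if not self.missing_checks() else "candidate"
```

With a structure like this, a claim such as the model's XSS suggestion for `sanitize_symbol_text` would stay a `candidate` until a reviewer has signed off on all four checks.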
Performance observations
The performance side of the note was genuinely strong, which makes the tradeoff more interesting.
Model and runtime conditions
- GGUF: `Q8_0`
- model params: `29.943B`
- model size: `29.924GiB (8.584 BPW)`
- `n_ctx = 131072`
- `n_batch = 2048`
- `n_ubatch = 2048`
- `flash_attn = 1`
- `fused_moe = 1`
- `mla_attn = 3`
- GPU: `NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB`
- layer offload: `48/48 layers GPU`
- KV cache: `CUDA0 KV buffer size = 3595.52MiB`
- compute buffer: `CUDA0 compute buffer 7360.62MiB`
- host compute buffer: `CUDA_Host compute buffer 528.02MiB`
- CPU buffer: `28152.00MiB`
The logs suggest that some expert-side weights may still live on the CPU side, so this is not as simple as saying “everything is on GPU.” Even so, for a 30B-class Q8 model with a 131k context setting, the throughput is impressive.
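The size figures above can be cross-checked with simple arithmetic. The sketch below assumes that "B" means 10^9 parameters and that GiB and MiB are binary units. The function names are mine, and the calculation says nothing about how the runtime actually places tensors.

```python
GIB = 1024 ** 3  # bytes per GiB (binary unit, assumed)


def bits_per_weight(size_gib: float, params_billions: float) -> float:
    """Average stored bits per parameter for a model file."""
    total_bits = size_gib * GIB * 8
    return total_bits / (params_billions * 1e9)


def mib_to_gib(mib: float) -> float:
    """Convert a buffer size reported in MiB to GiB."""
    return mib / 1024


# Values copied from the runtime conditions listed in this note.
bpw = bits_per_weight(29.924, 29.943)
cpu_buffer_gib = mib_to_gib(28152.00)

print(f"bits per weight: {bpw:.3f}")
print(f"CPU buffer: {cpu_buffer_gib:.2f} GiB")
```

The recomputed bits-per-weight agrees with the logged 8.584 BPW. The CPU buffer comes out at roughly 27.5 GiB, which is close to the size of the whole model file and is consistent with the cautious reading above that not everything is resident on the GPU.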
Representative throughput
Some of the clearest examples in the note were:
- `prompt eval: 6194.54 ms / 9519 tokens = 1536.68 tok/s`
- `eval: 1657.83 ms / 111 tokens = 66.95 tok/s`
- `prompt eval: 949.06 ms / 125 tokens = 131.71 tok/s`
- `eval: 6837.71 ms / 232 tokens = 33.93 tok/s`
So the observed decode speed in the note lands roughly in the 34-67 tok/s range. That is fast enough to feel responsive for long-context work, especially when prompt caching is effective.
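The tok/s figures follow directly from the millisecond and token counts on each line. As a rough sketch, the helper below recomputes them. The regular expression is an assumption based only on the four lines quoted in this note and may not match the full log format of any particular runtime.

```python
import re

# Matches lines shaped like the examples quoted in this note.
# The pattern is an assumption derived from those lines only.
TIMING_RE = re.compile(
    r"^(?P<phase>prompt eval|eval):\s*"
    r"(?P<ms>[\d.]+) ms / (?P<tokens>\d+) tokens"
)


def tokens_per_second(line: str) -> tuple:
    """Return (phase, tok/s) recomputed from the ms and token fields."""
    m = TIMING_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized timing line: {line!r}")
    seconds = float(m.group("ms")) / 1000.0
    return m.group("phase"), int(m.group("tokens")) / seconds


lines = [
    "prompt eval: 6194.54 ms / 9519 tokens = 1536.68 tok/s",
    "eval: 1657.83 ms / 111 tokens = 66.95 tok/s",
    "prompt eval: 949.06 ms / 125 tokens = 131.71 tok/s",
    "eval: 6837.71 ms / 232 tokens = 33.93 tok/s",
]

rates = [tokens_per_second(line) for line in lines]
# Keep only the decode ("eval") phase to see the generation range.
decode = [rate for phase, rate in rates if phase == "eval"]
print(f"decode range: {min(decode):.2f} to {max(decode):.2f} tok/s")
```

Recomputing from the raw fields rather than trusting the printed rate is a cheap sanity check when numbers are copied between logs and notes.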
Benchmark details
The source also included a request-by-request table, which is useful because it shows how much the numbers move across workloads.
| # | PP(tok) | TG(tok) | Ctx_used | T_PP(s) | S_PP(t/s) | T_TG(s) | S_TG(t/s) | total(s) |
|---|---|---|---|---|---|---|---|---|
| 1 | 125 | 232 | 357 | 0.949 | 131.71 | 6.838 | 33.93 | 7.787 |
| 2 | 661 | 430 | 1091 | 3.295 | 200.63 | 12.852 | 33.46 | 16.147 |
| 3 | 467 | 404 | 871 | 2.555 | 182.77 | 12.006 | 33.65 | 14.561 |
| 4 | 783 | 448 | 1231 | 3.755 | 208.50 | 13.573 | 33.01 | 17.329 |
| 5 | 761 | 400 | 1161 | 3.705 | 205.38 | 12.102 | 33.05 | 15.807 |
| 6 | 916 | 410 | 1326 | 4.179 | 219.19 | 12.499 | 32.80 | 16.678 |
| 7 | 839 | 512 | 1351 | 3.996 | 209.95 | 15.790 | 32.43 | 19.786 |
| 8 | 497 | 512 | 1009 | 2.652 | 187.43 | 15.755 | 32.50 | 18.407 |
| 9 | 9519 | 111 | 9630 | 6.195 | 1536.68 | 1.658 | 66.95 | 7.852 |
| 10 | 11676 | 525 | 12201 | 8.860 | 1317.88 | 8.403 | 62.48 | 17.263 |
| 11 | 12291 | 66 | 12357 | 7.754 | 1585.20 | 0.959 | 68.84 | 8.712 |
| 12 | 532 | 551 | 1083 | 1.365 | 389.62 | 8.819 | 62.48 | 10.184 |
| 13 | 558 | 777 | 1335 | 1.408 | 396.17 | 12.473 | 62.30 | 13.881 |
| 14 | 784 | 778 | 1562 | 1.471 | 532.92 | 12.478 | 62.35 | 13.949 |
| 15 | 784 | 732 | 1516 | 1.484 | 528.15 | 11.804 | 62.01 | 13.288 |
| 16 | 739 | 1781 | 2520 | 1.502 | 491.91 | 29.067 | 61.27 | 30.570 |
| 17 | 1792 | 626 | 2418 | 1.764 | 1015.92 | 10.070 | 62.17 | 11.834 |
In the note, Ctx_used is approximated as PP+TG because the logs did not provide a separate per-request total-token field. That is a reasonable operational compromise.
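For reference, the derived columns can be reproduced from the four measured ones. This is a minimal sketch and the function and key names are hypothetical. Small differences from the table are expected because the seconds columns are rounded to three decimals, while the rates appear to come from the unrounded millisecond values.

```python
def derive_row(pp: int, tg: int, t_pp: float, t_tg: float) -> dict:
    """Derive the computed columns of one benchmark row.

    pp, tg     : prompt and generated token counts
    t_pp, t_tg : prompt and generation time in seconds
    """
    return {
        "ctx_used": pp + tg,   # approximation used in the note
        "s_pp": pp / t_pp,     # prompt throughput, tok/s
        "s_tg": tg / t_tg,     # generation throughput, tok/s
        "total": t_pp + t_tg,  # wall time for the request, seconds
    }


# Row 9 of the request-by-request table.
row9 = derive_row(9519, 111, 6.195, 1.658)
print(row9)
```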
GPU full-offload aggregate
The note also included a useful aggregate for a GPU full-offload run:
- Total tokens: `64,404`
- Prompt: `33,295`
- Generation: `31,109`
- Total time: `160.487s`
- Prompt time: `9.089s`
- Generation time: `151.304s`
- Weighted average throughput: `Prompt 3,663.2 tok/s`
- Weighted average throughput: `Generation 205.6 tok/s`
- Fastest generation: `req_id=2316 (511.3 tok/s, 247 tok / 0.484s)`
- Slowest generation: `req_id=4522 (103.2 tok/s, 1963 tok / 19.028s)`
- Fastest prompt: `req_id=0 (7009.3 tok/s, 5309 tok / 0.757s)`
- Slowest prompt: `req_id=2261 (662.5 tok/s, 104 tok / 0.157s)`
| req_id | prompt_tokens | gen_tokens | total_tokens | ctx_tokens | T_PP(s) | T_TG(s) | total(s) | TPS_PP | TPS_TG | ms/token_PP | ms/token_TG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5309 | 630 | 5939 | 5939 | 0.757 | 4.852 | 5.610 | 7009.3 | 129.8 | 0.14 | 7.70 |
| 632 | 1089 | 742 | 1831 | 1831 | 0.192 | 5.798 | 5.990 | 5659.7 | 128.0 | 0.18 | 7.81 |
| 1375 | 2369 | 325 | 2694 | 2694 | 0.416 | 2.305 | 2.721 | 5701.1 | 141.0 | 0.18 | 7.09 |
| 1701 | 756 | 400 | 1156 | 1156 | 0.175 | 3.166 | 3.342 | 4307.9 | 126.3 | 0.23 | 7.92 |
| 2102 | 444 | 70 | 514 | 514 | 0.149 | 0.550 | 0.699 | 2971.8 | 127.4 | 0.34 | 7.85 |
| 2173 | 246 | 87 | 333 | 333 | 0.139 | 0.686 | 0.825 | 1767.5 | 126.8 | 0.57 | 7.89 |
| 2261 | 104 | 54 | 158 | 158 | 0.157 | 0.424 | 0.581 | 662.5 | 127.5 | 1.51 | 7.84 |
| 2316 | 65 | 247 | 312 | 312 | 0.127 | 1.968 | 2.095 | 511.3 | 125.5 | 1.96 | 7.97 |
| 2564 | 913 | 118 | 1031 | 1031 | 0.212 | 0.942 | 1.155 | 4296.8 | 125.2 | 0.23 | 7.99 |
| 2683 | 135 | 227 | 362 | 362 | 0.133 | 1.815 | 1.948 | 1011.6 | 125.1 | 0.99 | 8.00 |
| 2911 | 1546 | 244 | 1790 | 1790 | 0.318 | 1.989 | 2.308 | 4855.5 | 122.7 | 0.21 | 8.15 |
| 3156 | 1824 | 320 | 2144 | 2144 | 0.375 | 2.656 | 3.031 | 4859.6 | 120.5 | 0.21 | 8.30 |
| 3477 | 5230 | 365 | 5595 | 5595 | 1.414 | 3.435 | 4.849 | 3698.3 | 106.3 | 0.27 | 9.41 |
| 3844 | 1531 | 401 | 1932 | 1932 | 0.553 | 3.788 | 4.340 | 2770.8 | 105.9 | 0.36 | 9.45 |
| 4246 | 779 | 275 | 1054 | 1054 | 0.403 | 2.597 | 3.001 | 1932.0 | 105.9 | 0.52 | 9.45 |
| 4522 | 1357 | 1963 | 3320 | 3320 | 0.540 | 19.028 | 19.568 | 2510.9 | 103.2 | 0.40 | 9.69 |
| 6486 | 1977 | 2048 | 4025 | 4025 | 0.631 | 20.136 | 20.767 | 3135.1 | 101.7 | 0.32 | 9.83 |
| 8535 | 2063 | 2048 | 4111 | 4111 | 0.915 | 20.204 | 21.119 | 2255.2 | 101.4 | 0.44 | 9.87 |
| 10584 | 2058 | 1879 | 3937 | 3937 | 0.928 | 18.442 | 19.370 | 2217.3 | 101.9 | 0.45 | 9.81 |
| 12464 | 1891 | 1689 | 3580 | 3580 | 0.734 | 16.605 | 17.339 | 2575.2 | 101.7 | 0.39 | 9.83 |
These are strong numbers. Purely on performance, the model is attractive.
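The "weighted average" figures in the aggregate are total tokens divided by total time, which is not the same as averaging the per-request rates. The sketch below illustrates the difference. The function names are hypothetical and the inputs are copied from the aggregate list and from two rows of the table (`req_id=0` and `req_id=2261`).

```python
def weighted_tps(tokens, seconds):
    """Throughput over a batch: total tokens divided by total time."""
    return sum(tokens) / sum(seconds)


def mean_of_rates(tokens, seconds):
    """Unweighted mean of per-request rates, shown for contrast."""
    return sum(t / s for t, s in zip(tokens, seconds)) / len(tokens)


# Aggregate totals from the note.
prompt_avg = weighted_tps([33295], [9.089])
gen_avg = weighted_tps([31109], [151.304])

# Two prompt rows from the table: a long prompt and a very short one.
tok, sec = [5309, 104], [0.757, 0.157]
weighted = weighted_tps(tok, sec)
unweighted = mean_of_rates(tok, sec)

print(f"prompt: {prompt_avg:.1f} tok/s, generation: {gen_avg:.1f} tok/s")
print(f"two-row weighted: {weighted:.1f}, unweighted: {unweighted:.1f}")
```

For the two sample rows the weighted figure is noticeably higher than the simple mean, because the short prompt contributes few tokens but counts equally in an unweighted average. That is worth keeping in mind when comparing headline numbers across runs with different prompt-length mixes.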
Why that makes the model more dangerous, not less
The important lesson here is that speed and reliability are different axes.
This model is fast enough to become part of a real workflow. It can process long prompts, produce output at usable decode rates, and benefit from cache behavior. But if it is also willing to turn incomplete evidence into strong vulnerability claims, then that speed amplifies the wrong thing. It does not just help you make progress faster. It also helps you become confidently wrong faster.
Because this is an uncensored model, there is another layer of risk: the generated content itself. Logs, transcripts, and prompt history can become sensitive artifacts. In a shared environment, the operational control problem can matter as much as the model’s accuracy.
How I would use it next time
If I wanted to review whether a Rust code path was genuinely dangerous, I would change the prompting strategy instead of asking for exploit code directly.
Bad prompt direction
`give me exploit code`
Even when the stated goal is defensive, that tends to maximize noise and risky output.
Better prompt direction
- Ask for `input source -> boundary -> sink` data flow first.
- Ask for attack preconditions and require the model to say which ones are not yet proven.
- Ask for safe minimal reproduction tests using non-dangerous synthetic input.
- Ask for false-positive explanations.
- Ask for remediation guidance that includes trust boundaries and caller-side constraints.
That will not magically make the model trustworthy, but it does reduce the chance that a plausible-sounding hallucination becomes the center of the review.
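The checklist above can be turned into a reusable prompt template so that every review starts from the same structure. This is only a sketch: the wording of each step, the `PROVEN`/`UNPROVEN` labels, and the `build_review_prompt` helper are my own assumptions, not a prompt that was tested in this note.

```python
# Each entry mirrors one item of the "better prompt direction" list.
REVIEW_STEPS = [
    "Trace the data flow as: input source -> boundary -> sink.",
    "List every attack precondition and mark each one PROVEN or UNPROVEN.",
    "Propose a minimal reproduction test that uses only harmless synthetic input.",
    "Explain why this could be a false positive.",
    "Suggest remediation, including trust boundaries and caller-side constraints.",
]


def build_review_prompt(function_name: str, source_code: str) -> str:
    """Assemble a structured review prompt for one function."""
    steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(REVIEW_STEPS, start=1)
    )
    return (
        f"Review the function `{function_name}` for security impact.\n"
        "Do not write exploit code.\n"
        f"{steps}\n\n"
        "Code under review:\n"
        f"{source_code}\n"
    )


if __name__ == "__main__":
    # Placeholder body; a real review would paste the actual source.
    print(build_review_prompt("to_rel_string", "fn to_rel_string() { /* ... */ }"))
```

Keeping the steps in a list makes it easy to see, in the transcript, which of them the model skipped or answered without evidence.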
Results and next steps
My conclusion from this note is simple:
- It is useful for attack-analysis ideation.
- It is not suitable as evidence for defensive security decisions.
- Its throughput is strong enough to make it appealing for long-context and cache-friendly workflows.
So I would keep it as a rough brainstorming tool, not as a model that gets to decide whether something is truly vulnerable. Used in the right place, it is productive. Used in the wrong place, it is a fast way to create confident mistakes.
