Introduction

I wanted to test whether uncensored GLM-4.7 Flash was actually useful for defensive security work, especially for attack analysis and controlled reproduction planning. Models with weaker guardrails are often attractive because they respond quickly and do not hesitate. The problem is that the same property can produce confident but unreliable conclusions.

My conclusion from this note is straightforward: the model is useful as a sparring partner, but not as a trustworthy basis for security decisions. In the logs I reviewed, it was good at producing attack angles and test ideas, but too willing to convert weak hints into strong claims.

Is it useful as an attack-analysis model?

Not everything here was bad. In fact, the model did a few things well enough that I would still keep it in a toolbox.

What it does well

  • It can enumerate plausible attack scenarios and test directions very quickly.
  • It identifies meaningful boundary functions such as to_rel_string, collect_files, build_globset, and sanitize_symbol_text.
  • It is strong at early-stage threat-model brainstorming, review checklist expansion, and generating candidate test inputs.

That is the real value of this kind of uncensored model. It is fast, aggressive, and not shy about exploring dangerous lines of thought. If I only want breadth at the start of a review, that can be useful.

Where it fails

The serious problem is what happens after that first brainstorming pass. In this case, the model repeatedly treated vulnerability preconditions as already satisfied.

  • It labeled the strip_prefix + unwrap_or pattern as path traversal even though, in isolation, that pattern only falls back to returning the absolute path for display.
  • It explained Rust regex in terms of catastrophic backtracking, even though the regex crate is designed around worst-case linear-time matching and does not support backreferences or look-around.
  • It described glob problems as “injection” when the real issues are usually boundary control and I/O or computational blowups.
  • It linked sanitize_symbol_text directly to XSS without proving the rendering-side precondition, namely that the text actually reaches an HTML context unescaped.

This is exactly the kind of error pattern that makes uncensored security evaluation risky. The model can point at the right neighborhood while still drawing the wrong map.
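To make the first item concrete, here is a minimal Python analog of the strip_prefix + unwrap_or shape. It is a sketch, not the reviewed Rust code: to_display_string is a hypothetical helper, and PurePosixPath.relative_to stands in for Rust's Path::strip_prefix. Both are purely lexical operations on path components.

```python
from pathlib import PurePosixPath


def to_display_string(path: PurePosixPath, root: PurePosixPath) -> str:
    """Hypothetical helper: show `path` relative to `root` when possible."""
    try:
        return str(path.relative_to(root))
    except ValueError:
        # Analog of `strip_prefix(root).unwrap_or(path)`: keep the original path.
        return str(path)


root = PurePosixPath("/repo")
inside = to_display_string(PurePosixPath("/repo/src/main.rs"), root)
outside = to_display_string(PurePosixPath("/etc/hosts"), root)

print(inside)   # src/main.rs
print(outside)  # /etc/hosts
```

Nothing here opens, reads, or writes a file. Whether the fallback matters depends on what a caller does with the returned string, and that is exactly the precondition the model skipped.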

Suitability for defensive security work

For defensive work, I would not use this model as evidence.

  • It tends to produce exploit-oriented material too easily.
  • That output can become a risk in logs, prompts, or shared transcripts.
  • The combination of false positives and dangerous generated content creates both technical and compliance problems.

That does not mean it is useless. It means its role has to be constrained. In an isolated local environment, with outputs kept private and the purpose limited to red-team-style brainstorming, it still has value. The danger comes from asking it to do a job it is not reliable enough to do.

The role split that makes sense

Based on this note, the practical split looks like this:

  • Let it do idea generation, candidate test generation, and review coverage.
  • Do not let it make vulnerability claims, CVE-grade assertions, or direct exploit plans that are accepted without review.
  • Require human verification of data flow, trust boundaries, permission models, and caller-side input constraints.

That role boundary is the difference between a useful rough tool and a liability.
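One way to enforce that split in tooling is to record every model suggestion as a candidate until a human has checked its preconditions. The sketch below is only an illustration: the Finding class, its fields, and the example values are all hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Finding:
    """Hypothetical record for one model-suggested issue."""
    title: str
    # Maps each precondition to True only after a human has verified it.
    preconditions: Dict[str, bool] = field(default_factory=dict)
    reviewed_by: Optional[str] = None

    def unproven(self) -> List[str]:
        return [name for name, proven in self.preconditions.items() if not proven]

    def status(self) -> str:
        # A finding stays a candidate until it is reviewed and fully proven.
        if self.reviewed_by is None or self.unproven():
            return "candidate"
        return "confirmed"


# Illustrative values only: the model's XSS claim with its unproven precondition.
finding = Finding(
    title="XSS via sanitize_symbol_text",
    preconditions={
        "attacker controls the symbol text": False,
        "output reaches an HTML context unescaped": False,
    },
)
print(finding.status(), finding.unproven())
```

The point of the design is that the model can fill in title and preconditions, but only a reviewer can flip a precondition to True or set reviewed_by.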

Performance observations

The performance side of the note was genuinely strong, which makes the tradeoff more interesting.

Model and runtime conditions

  • GGUF: Q8_0
  • model params: 29.943B
  • model size: 29.924GiB (8.584 BPW)
  • n_ctx = 131072
  • n_batch = 2048
  • n_ubatch = 2048
  • flash_attn = 1
  • fused_moe = 1
  • mla_attn = 3
  • GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB
  • layer offload: 48/48 layers GPU
  • KV cache: CUDA0 KV buffer size = 3595.52MiB
  • compute buffer: CUDA0 compute buffer 7360.62MiB
  • host compute buffer: CUDA_Host compute buffer 528.02MiB
  • CPU buffer: 28152.00MiB

The logs suggest that some expert-side weights may still live on the CPU side: the CPU buffer is 28152.00MiB even though all 48 layers are reported as offloaded. So this is not as simple as saying “everything is on GPU.” Even so, for a 30B-class Q8 model with a 131k context setting, the throughput is impressive.
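The size figures can be sanity-checked with a few lines of arithmetic. This sketch uses only values from the list above. The one assumption is that the CPU buffer holds model weights, which the logs hint at but do not state outright.

```python
MIB_PER_GIB = 1024

model_size_gib = 29.924    # "model size" from the log
params_billion = 29.943    # "model params" from the log
cpu_buffer_mib = 28152.00  # "CPU buffer" from the log

# Bits per weight: total bits in the file divided by the parameter count.
model_bits = model_size_gib * 1024**3 * 8
bpw = model_bits / (params_billion * 1e9)

# Share of the model size the CPU buffer would cover if it held only weights.
cpu_share = cpu_buffer_mib / (model_size_gib * MIB_PER_GIB)

print(f"{bpw:.3f} BPW")
print(f"CPU buffer is {cpu_share:.1%} of the model size")
```

The first figure reproduces the logged 8.584 BPW. The second comes out at roughly 92%, which is why “48/48 layers” should not be read as “all weights in VRAM” for this run.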

Representative throughput

Some of the clearest examples in the note were:

  • prompt eval: 6194.54 ms / 9519 tokens = 1536.68 tok/s
  • eval: 1657.83 ms / 111 tokens = 66.95 tok/s
  • prompt eval: 949.06 ms / 125 tokens = 131.71 tok/s
  • eval: 6837.71 ms / 232 tokens = 33.93 tok/s

In these examples the decode speed is roughly 34-67 tok/s, and across the full request table below it ranges from 32.43 to 68.84 tok/s. That is fast enough to feel responsive for long-context work, especially when prompt caching is effective.
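The tok/s figures follow directly from the logged milliseconds and token counts. A small sketch of that conversion, using the four log lines above (tokens_per_second is a helper name chosen for this example):

```python
def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Convert a logged (milliseconds, token count) pair into tokens per second."""
    return tokens / (elapsed_ms / 1000.0)


# (kind, elapsed ms, tokens) copied from the representative log lines.
samples = [
    ("prompt eval", 6194.54, 9519),
    ("eval", 1657.83, 111),
    ("prompt eval", 949.06, 125),
    ("eval", 6837.71, 232),
]

rates = [(kind, tokens_per_second(ms, n)) for kind, ms, n in samples]
for kind, rate in rates:
    print(f"{kind}: {rate:.2f} tok/s")
```

Recomputing like this is a cheap way to catch transcription errors when copying numbers out of logs.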

Benchmark details

The source also included a request-by-request table, which is useful because it shows how much the numbers move across workloads.

 #  PP(tok)  TG(tok)  Ctx_used  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)  total(s)
 1      125      232       357    0.949     131.71    6.838      33.93     7.787
 2      661      430      1091    3.295     200.63   12.852      33.46    16.147
 3      467      404       871    2.555     182.77   12.006      33.65    14.561
 4      783      448      1231    3.755     208.50   13.573      33.01    17.329
 5      761      400      1161    3.705     205.38   12.102      33.05    15.807
 6      916      410      1326    4.179     219.19   12.499      32.80    16.678
 7      839      512      1351    3.996     209.95   15.790      32.43    19.786
 8      497      512      1009    2.652     187.43   15.755      32.50    18.407
 9     9519      111      9630    6.195    1536.68    1.658      66.95     7.852
10    11676      525     12201    8.860    1317.88    8.403      62.48    17.263
11    12291       66     12357    7.754    1585.20    0.959      68.84     8.712
12      532      551      1083    1.365     389.62    8.819      62.48    10.184
13      558      777      1335    1.408     396.17   12.473      62.30    13.881
14      784      778      1562    1.471     532.92   12.478      62.35    13.949
15      784      732      1516    1.484     528.15   11.804      62.01    13.288
16      739     1781      2520    1.502     491.91   29.067      61.27    30.570
17     1792      626      2418    1.764    1015.92   10.070      62.17    11.834

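In the note, Ctx_used is approximated as PP+TG because the logs did not provide a separate per-request total-token field. That is a reasonable operational compromise, though it may undercount the real context when a cached prefix is reused, since cached tokens would not appear in PP.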
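Because Ctx_used and S_TG are derived columns, the table can be checked against itself. The sketch below copies five columns from the request table and verifies both relations. The tolerance is there because the logged times are rounded to three decimals.

```python
# Rows from the request table: (PP, TG, Ctx_used, T_TG seconds, S_TG tok/s).
ROWS = [
    (125, 232, 357, 6.838, 33.93),
    (661, 430, 1091, 12.852, 33.46),
    (467, 404, 871, 12.006, 33.65),
    (783, 448, 1231, 13.573, 33.01),
    (761, 400, 1161, 12.102, 33.05),
    (916, 410, 1326, 12.499, 32.80),
    (839, 512, 1351, 15.790, 32.43),
    (497, 512, 1009, 15.755, 32.50),
    (9519, 111, 9630, 1.658, 66.95),
    (11676, 525, 12201, 8.403, 62.48),
    (12291, 66, 12357, 0.959, 68.84),
    (532, 551, 1083, 8.819, 62.48),
    (558, 777, 1335, 12.473, 62.30),
    (784, 778, 1562, 12.478, 62.35),
    (784, 732, 1516, 11.804, 62.01),
    (739, 1781, 2520, 29.067, 61.27),
    (1792, 626, 2418, 10.070, 62.17),
]

# Ctx_used is defined as PP + TG, so every row should satisfy that exactly.
ctx_ok = all(pp + tg == ctx for pp, tg, ctx, _, _ in ROWS)

# Recompute decode speed from TG / T_TG and measure the largest disagreement.
max_drift = max(abs(tg / t_tg - s_tg) for _, tg, _, t_tg, s_tg in ROWS)

slowest = min(row[4] for row in ROWS)
fastest = max(row[4] for row in ROWS)
print(ctx_ok, round(max_drift, 3), slowest, fastest)
```

One thing this view makes visible is that decode speed sits near 33 tok/s for requests 1 to 8 and near 62 tok/s or higher from request 9 onward. The note does not say what changed between the two groups, so I would not read a cause into it.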

GPU full-offload aggregate

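The note also included an aggregate for a GPU full-offload run. I reproduce the figures as logged, but some of them do not reconcile with the per-request table that follows: the table lists 511.3 tok/s as the prompt rate of req_id=2316 rather than a generation rate, and its generation rates stay between 101.4 and 141.0 tok/s. I would treat the generation average and the fastest/slowest entries as unverified until they are rechecked against the raw logs.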

  • Total tokens: 64,404
  • Prompt: 33,295
  • Generation: 31,109
  • Total time: 160.487s
  • Prompt time: 9.089s
  • Generation time: 151.304s
  • Weighted average throughput: Prompt 3,663.2 tok/s
  • Weighted average throughput: Generation 205.6 tok/s
  • Fastest generation: req_id=2316 (511.3 tok/s, 247 tok / 0.484s)
  • Slowest generation: req_id=4522 (103.2 tok/s, 1963 tok / 19.028s)
  • Fastest prompt: req_id=0 (7009.3 tok/s, 5309 tok / 0.757s)
  • Slowest prompt: req_id=2261 (662.5 tok/s, 104 tok / 0.157s)

Per-request detail:

req_id  prompt_tokens  gen_tokens  total_tokens  ctx_tokens  T_PP(s)  T_TG(s)  total(s)  TPS_PP  TPS_TG  ms/token_PP  ms/token_TG
     0           5309         630          5939        5939    0.757    4.852     5.610  7009.3   129.8         0.14         7.70
   632           1089         742          1831        1831    0.192    5.798     5.990  5659.7   128.0         0.18         7.81
  1375           2369         325          2694        2694    0.416    2.305     2.721  5701.1   141.0         0.18         7.09
  1701            756         400          1156        1156    0.175    3.166     3.342  4307.9   126.3         0.23         7.92
  2102            444          70           514         514    0.149    0.550     0.699  2971.8   127.4         0.34         7.85
  2173            246          87           333         333    0.139    0.686     0.825  1767.5   126.8         0.57         7.89
  2261            104          54           158         158    0.157    0.424     0.581   662.5   127.5         1.51         7.84
  2316             65         247           312         312    0.127    1.968     2.095   511.3   125.5         1.96         7.97
  2564            913         118          1031        1031    0.212    0.942     1.155  4296.8   125.2         0.23         7.99
  2683            135         227           362         362    0.133    1.815     1.948  1011.6   125.1         0.99         8.00
  2911           1546         244          1790        1790    0.318    1.989     2.308  4855.5   122.7         0.21         8.15
  3156           1824         320          2144        2144    0.375    2.656     3.031  4859.6   120.5         0.21         8.30
  3477           5230         365          5595        5595    1.414    3.435     4.849  3698.3   106.3         0.27         9.41
  3844           1531         401          1932        1932    0.553    3.788     4.340  2770.8   105.9         0.36         9.45
  4246            779         275          1054        1054    0.403    2.597     3.001  1932.0   105.9         0.52         9.45
  4522           1357        1963          3320        3320    0.540   19.028    19.568  2510.9   103.2         0.40         9.69
  6486           1977        2048          4025        4025    0.631   20.136    20.767  3135.1   101.7         0.32         9.83
  8535           2063        2048          4111        4111    0.915   20.204    21.119  2255.2   101.4         0.44         9.87
 10584           2058        1879          3937        3937    0.928   18.442    19.370  2217.3   101.9         0.45         9.81
 12464           1891        1689          3580        3580    0.734   16.605    17.339  2575.2   101.7         0.39         9.83

These are strong numbers. Purely on performance, the model is attractive.
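A weighted average throughput is total tokens divided by total time, which is not the same as averaging the per-request rates. The sketch below recomputes both for generation, and the weighted prompt rate, from the twenty rows listed above. It covers only those rows, so it is a cross-check on the table rather than a reconstruction of the aggregate.

```python
from statistics import mean

# Per-request rows: (prompt_tokens, gen_tokens, T_PP seconds, T_TG seconds).
ROWS = [
    (5309, 630, 0.757, 4.852), (1089, 742, 0.192, 5.798),
    (2369, 325, 0.416, 2.305), (756, 400, 0.175, 3.166),
    (444, 70, 0.149, 0.550), (246, 87, 0.139, 0.686),
    (104, 54, 0.157, 0.424), (65, 247, 0.127, 1.968),
    (913, 118, 0.212, 0.942), (135, 227, 0.133, 1.815),
    (1546, 244, 0.318, 1.989), (1824, 320, 0.375, 2.656),
    (5230, 365, 1.414, 3.435), (1531, 401, 0.553, 3.788),
    (779, 275, 0.403, 2.597), (1357, 1963, 0.540, 19.028),
    (1977, 2048, 0.631, 20.136), (2063, 2048, 0.915, 20.204),
    (2058, 1879, 0.928, 18.442), (1891, 1689, 0.734, 16.605),
]

prompt_total = sum(p for p, _, _, _ in ROWS)
gen_total = sum(g for _, g, _, _ in ROWS)

# Weighted averages: long requests count for more because they take more time.
weighted_pp = prompt_total / sum(t for _, _, t, _ in ROWS)
weighted_tg = gen_total / sum(t for _, _, _, t in ROWS)

# Unweighted mean of per-request generation rates, for comparison.
mean_tg = mean(g / t for _, g, _, t in ROWS)

print(f"prompt tokens={prompt_total}, generation tokens={gen_total}")
print(f"weighted: prompt {weighted_pp:.1f} tok/s, generation {weighted_tg:.1f} tok/s")
print(f"unweighted mean generation: {mean_tg:.1f} tok/s")
```

On these rows the weighted generation rate lands a little under 108 tok/s, below the unweighted mean, because the longest generations are also the slowest. It is also well below the 205.6 tok/s aggregate quoted earlier, which the rows shown cannot account for on their own.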

Why that makes the model more dangerous, not less

The important lesson here is that speed and reliability are different axes.

This model is fast enough to become part of a real workflow. It can process long prompts, produce output at usable decode rates, and benefit from cache behavior. But if it is also willing to turn incomplete evidence into strong vulnerability claims, then that speed amplifies the wrong thing. It does not just help you make progress faster. It also helps you become confidently wrong faster.

Because this is an uncensored model, there is another layer of risk: the generated content itself. Logs, transcripts, and prompt history can become sensitive artifacts. In a shared environment, the operational control problem can matter as much as the model’s accuracy.

How I would use it next time

If I wanted to review whether a Rust code path was genuinely dangerous, I would change the prompting strategy instead of asking for exploit code directly.

Bad prompt direction

  • give me exploit code

Even when the stated goal is defensive, that tends to maximize noise and risky output.

Better prompt direction

  • Ask for input source -> boundary -> sink data flow first.
  • Ask for attack preconditions and require the model to say which ones are not yet proven.
  • Ask for safe minimal reproduction tests using non-dangerous synthetic input.
  • Ask for false-positive explanations.
  • Ask for remediation guidance that includes trust boundaries and caller-side constraints.

That will not magically make the model trustworthy, but it does reduce the chance that a plausible-sounding hallucination becomes the center of the review.
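The better prompt direction can be turned into a reusable template. This is a minimal sketch: build_review_prompt and REVIEW_STEPS are names invented for the example, and the wording of each step is my paraphrase of the list above, not a tested prompt.

```python
# Paraphrase of the five asks from the "Better prompt direction" list.
REVIEW_STEPS = [
    "Trace the data flow: input source -> boundary -> sink.",
    "List the attack preconditions and mark each one as proven or not yet proven.",
    "Propose minimal reproduction tests that use only harmless synthetic input.",
    "Explain how this could be a false positive.",
    "Suggest remediation, including trust boundaries and caller-side constraints.",
]


def build_review_prompt(function_name: str, source: str) -> str:
    """Assemble a review prompt that asks for evidence before conclusions."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(REVIEW_STEPS, start=1))
    return (
        f"Review the function `{function_name}` for security issues.\n"
        "Do not write exploit code.\n"
        f"Answer these steps in order:\n{steps}\n\n"
        f"Source:\n{source}\n"
    )


prompt = build_review_prompt("to_rel_string", "<paste the function here>")
print(prompt)
```

Keeping the steps in a list makes it easy to diff prompt changes between review sessions and to see which step a questionable answer came from.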

Results and next steps

My conclusion from this note is simple:

  • It is useful for attack-analysis ideation.
  • It is not suitable as evidence for defensive security decisions.
  • Its throughput is strong enough to make it appealing for long-context and cache-friendly workflows.

So I would keep it as a rough brainstorming tool, not as a model that gets to decide whether something is truly vulnerable. Used in the right place, it is productive. Used in the wrong place, it is a fast way to create confident mistakes.