Making ggex-bench schemas less awkward for LLMs

This post was originally published on the VaporLens Patreon.

Hey folks!

Quick follow-up on the gaming knowledge graph extraction benchmark. I changed one thing that mattered more than I expected:

LLMs do much better when the extraction schema looks like the structure they already tend to produce.

Obvious in hindsight. Less obvious after staring at RDF output for too long.

In a previous post I shared the first ggex-bench results. Models could extract useful game knowledge from long articles, but exact RDF/graph extraction was still rough. Semantic precision was often decent, semantic recall less so, and strict RDF precision and recall were low across the board.

At first I treated that as a model/pipeline problem. Maybe I needed a multi-pass extractor, a validation and repair pass, or a missing-fact follow-up step.

I played with those ideas a bit, but I don't think they are the main fix. Extra passes can find more facts, but they also push the model toward tiny low-value details and hurt precision.

So before making the pipeline more complicated, I changed the schema to fit how LLMs seem to read reviews.

That alone improved results and reduced cost.

The first schema was graph-shaped, not LLM-shaped

The original ggex-bench schema was claims-based.

From a knowledge graph point of view, it made sense: extract small, atomic, evidence-backed claim nodes. If a review praised combat, criticized the camera, compared the combat to Devil May Cry, and described a strong flow state, each point could become a separate claim.

Something like this:

ent:article1 pred:claims ent:claim1 .

ent:claim1 a ent:Praise ;
    pred:about ent:Dead_as_Disco ;
    pred:aspect ent:combat ;
    pred:sentiment ent:positive ;
    pred:evidence "Combat is spectacular and flow-state inducing." .

ent:article1 pred:claims ent:claim2 .

ent:claim2 a ent:Critique ;
    pred:about ent:Dead_as_Disco ;
    pred:aspect ent:combat ;
    pred:sentiment ent:negative ;
    pred:evidence "The camera lags slightly behind the action." .

That is a reasonable graph schema.

It is also a pretty unnatural LLM output schema.

The model has to keep deciding how to split one paragraph into database atoms:

is this one claim or two?
is the subject the game, combat, camera, or the whole gameplay loop?
is this a Praise, Critique, Observation, or mixed statement?
does the comparison belong in its own claim?
should "spectacular combat" and "flow-state inducing combat" be separate facts?
how much evidence text should be copied into each node?

Those are interpretation choices, not just formatting choices.

LLMs are not very consistent at making fine-grained graph modelling decisions across a long output. One model might create three claims. Another might create one claim with several predicates. Another might attach the camera issue to combat. Another might invent a CameraSystem node.

In theory, all of those interpretations can be reasonable readings of the article. In practice, they are a pain to clean up and align with an internal schema. And for a benchmark that checks the resulting RDF, that freedom is brutal.

The model may understand the article, but strict matching still fails if it expresses that understanding in a shape the benchmark did not expect. The old schema also burns more output tokens, because every tiny claim carries its own ID, type, evidence, etc.
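To make the strict-matching problem concrete, here is a minimal Python sketch. It assumes strict scoring is plain set overlap on exact (subject, predicate, object) triples, which may not be exactly how ggex-bench scores. The function name `strict_precision_recall` and the sample triples are illustrative.

```python
# Hypothetical sketch: strict scoring as exact triple overlap.
# Assumption: the real ggex-bench scorer may differ in detail.

Triple = tuple[str, str, str]


def strict_precision_recall(
    predicted: set[Triple], expected: set[Triple]
) -> tuple[float, float]:
    """Return (precision, recall), counting only exact triple matches."""
    matched = predicted & expected
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(expected) if expected else 0.0
    return precision, recall


# Expected graph: the camera issue is a Critique attached to combat.
expected = {
    ("ent:claim2", "a", "ent:Critique"),
    ("ent:claim2", "pred:aspect", "ent:combat"),
    ("ent:claim2", "pred:sentiment", "ent:negative"),
}

# Model output: same reading of the article, but the issue hangs off
# an invented CameraSystem node instead of combat.
predicted = {
    ("ent:claim2", "a", "ent:Critique"),
    ("ent:claim2", "pred:aspect", "ent:CameraSystem"),
    ("ent:claim2", "pred:sentiment", "ent:negative"),
}

precision, recall = strict_precision_recall(predicted, expected)
print(f"strict precision={precision:.2f} recall={recall:.2f}")
```

One modelling choice that a human reader would accept costs a third of the score here, even though nothing was misunderstood.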

The better schema follows the review

I changed the schema from fragmented claim extraction to consolidated facet assessment.

Instead of asking the model to create arbitrary claim nodes, I asked it to produce one stable assessment per important facet.

For combat, the output now looks more like this:

ent:Dead_as_Disco pred:hasAssessment ent:Dead_as_Disco_CombatAssessment .

ent:Dead_as_Disco_CombatAssessment a ent:GameplayAssessment ;
    pred:category ent:combat ;
    pred:polarity ent:mixed ;
    pred:positive ent:spectacular_combat, ent:addictive_flow_state ;
    pred:negative ent:camera_delay ;
    pred:comparedTo ent:Devil_May_Cry ;
    pred:summary "Combat is perceived as spectacular, flow-state inducing, and comparable to Devil May Cry, though slightly held back by minor camera delay." .

This still produces a knowledge graph, but the task is closer to structured review analysis:

identify the facet
list positives
list negatives
mention comparisons
decide the overall polarity
write a short summary

Game reviews are usually organized around facets anyway: combat, story, pacing, progression and so on. Reviewers rarely write in clean RDF atoms. Instead, they write something like:

combat is spectacular and flow-state inducing, almost Devil May Cry-ish, but the camera sometimes lags behind the action

The old schema asks the model to decompose that into several separate claims. The new schema lets the model keep it as one mixed combat assessment. The schema is not just simpler - it's closer to the shape the model seems to reach for when it reads a review: pros, cons, comparisons, and a short judgment for one aspect.
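As a rough sketch of what "one stable assessment per facet" could look like on the consuming side, the Python below models a single assessment and renders it in the Turtle shape shown earlier. The class name `FacetAssessment`, the helper `to_turtle`, and the node-naming rule are assumptions for illustration, not the actual ggex-bench code.

```python
from dataclasses import dataclass, field


@dataclass
class FacetAssessment:
    """One consolidated assessment for a single facet of a game."""

    game: str             # local name, e.g. "Dead_as_Disco"
    facet: str            # e.g. "combat"
    assessment_type: str  # e.g. "GameplayAssessment"
    polarity: str         # "positive", "negative" or "mixed"
    summary: str
    positives: list[str] = field(default_factory=list)
    negatives: list[str] = field(default_factory=list)
    compared_to: list[str] = field(default_factory=list)


def to_turtle(a: FacetAssessment) -> str:
    """Render the assessment as Turtle (prefix declarations omitted)."""
    # Assumed naming rule: <game>_<Facet>Assessment.
    node = f"ent:{a.game}_{a.facet.capitalize()}Assessment"
    lines = [
        f"a ent:{a.assessment_type}",
        f"pred:category ent:{a.facet}",
        f"pred:polarity ent:{a.polarity}",
    ]
    # Multi-valued fields become comma-separated object lists; empty ones are skipped.
    for predicate, values in (
        ("pred:positive", a.positives),
        ("pred:negative", a.negatives),
        ("pred:comparedTo", a.compared_to),
    ):
        if values:
            lines.append(f"{predicate} " + ", ".join(f"ent:{v}" for v in values))
    escaped = a.summary.replace("\\", "\\\\").replace('"', '\\"')
    lines.append(f'pred:summary "{escaped}"')
    body = " ;\n    ".join(lines)
    return f"ent:{a.game} pred:hasAssessment {node} .\n\n{node} {body} .\n"


combat = FacetAssessment(
    game="Dead_as_Disco",
    facet="combat",
    assessment_type="GameplayAssessment",
    polarity="mixed",
    summary="Spectacular combat, slightly held back by camera delay.",
    positives=["spectacular_combat", "addictive_flow_state"],
    negatives=["camera_delay"],
    compared_to=["Devil_May_Cry"],
)
ttl = to_turtle(combat)
print(ttl)
```

The node ID is derived from the game and facet rather than invented by the model, which removes one more free choice from the extraction step.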

Same information, less fighting the model

The original schema was asking the model to fight its own habits.

LLMs are comfortable with outputs like:

Combat:
- Positive: spectacular, flow-state inducing
- Negative: camera delay
- Compared to: Devil May Cry
- Overall: mixed but positive-leaning

They are less consistent with outputs like:

Create claim_17 as Praise about combat fun.
Create claim_18 as Observation about combat comparison.
Create claim_19 as Critique about camera behavior.
Attach each to the article.
Use exactly the expected predicates.
Do not merge or split them differently than the benchmark.

The second format works, but it is brittle.

Brittleness costs money. You pay for malformed Turtle, extra tokens, repair steps, and lower strict matching.

The goal is not to make the schema less rigorous, but to put the rigor in a place the model can hit reliably.

Don't ask the LLM to output the database schema you wish it could follow. Ask it for a structure close to what it already produces well, then make that structure useful for the database.
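One way to read that advice: let the model write the pros/cons outline it is comfortable with, and do the strict structuring in code. The sketch below assumes the model returns exactly the outline format shown earlier (a `Facet:` header followed by `- Key: value, value` lines). The function name `parse_facet_outline` is hypothetical, and a real pipeline would need more forgiving parsing.

```python
def parse_facet_outline(text: str) -> dict[str, dict[str, list[str]]]:
    """Parse 'Facet:' headers followed by '- Key: value, value' lines."""
    facets: dict[str, dict[str, list[str]]] = {}
    current = None  # fields of the facet currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            if current is None:
                continue  # field line before any facet header: ignore it
            key, _, value = line[2:].partition(":")
            # Comma-separated values become a list; blanks are dropped.
            current[key.strip().lower()] = [
                v.strip() for v in value.split(",") if v.strip()
            ]
        elif line.endswith(":"):
            # A bare 'Name:' line starts a new facet.
            current = facets.setdefault(line[:-1].strip().lower(), {})
    return facets


outline = """\
Combat:
- Positive: spectacular, flow-state inducing
- Negative: camera delay
- Compared to: Devil May Cry
- Overall: mixed but positive-leaning
"""
parsed = parse_facet_outline(outline)
print(parsed["combat"]["positive"])
```

From a dictionary like this, mapping each field onto the expected predicates is deterministic code rather than a decision the model has to repeat for every facet.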

The results

I re-ran the benchmark on the same dataset of 20 game reviews and compared the old claims-based baseline against the new consolidated assessment schema.

These numbers are extraction-only, same as last time.

Run / Model	Malformed TTLs	Semantic Precision	Semantic Recall	Strict RDF Precision	Strict RDF Recall	Extraction Cost
`gemini-3-flash-preview` baseline	0	91.80%	30.80%	28.80%	9.50%	$0.3596
`gemini-3-flash` consolidated	0	93.69%	41.40%	31.27%	14.05%	$0.2751
`deepseek-v4-flash` baseline	8	86.59%	28.67%	18.70%	7.36%	$0.0603
`deepseek-v4-flash` consolidated	1	76.90%	44.86%	26.70%	15.82%	$0.0384

The Gemini comparison is the cleanest:

semantic precision went from 91.80% -> 93.69%
semantic recall went from 30.80% -> 41.40%
strict RDF precision went from 28.80% -> 31.27%
strict RDF recall went from 9.50% -> 14.05%
extraction cost went from $0.3596 -> $0.2751
malformed outputs stayed at 0

Gemini extracted more correct facts, expressed them closer to the expected graph, and spent fewer tokens doing it.
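The absolute deltas hide how large some of these shifts are relative to the baseline. A small Python sketch that computes relative change from the Gemini rows of the table; the helper name `relative_change` is illustrative.

```python
def relative_change(before: float, after: float) -> float:
    """Change from before to after, as a fraction of before."""
    return (after - before) / before


# (baseline, consolidated) pairs copied from the Gemini rows of the table.
gemini = {
    "semantic recall": (30.80, 41.40),
    "strict RDF recall": (9.50, 14.05),
    "extraction cost": (0.3596, 0.2751),
}

for metric, (before, after) in gemini.items():
    print(f"{metric}: {relative_change(before, after):+.1%}")
```

On these numbers semantic recall rises by roughly a third relative to the baseline, strict RDF recall by almost half, and extraction cost falls by a little under a quarter.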

The DeepSeek run is messier, but probably more useful:

malformed Turtle files dropped from 8 -> 1
semantic recall went from 28.67% -> 44.86%
strict RDF precision went from 18.70% -> 26.70%
strict RDF recall went from 7.36% -> 15.82%
extraction cost went from $0.0603 -> $0.0384

Semantic precision dropped from 86.59% to 76.90%, so the new schema is not better on every metric. Looking at the output, DeepSeek mostly lost precision because it repeated the same point across multiple assessments. So the issue was less "it found too many marginal facts" and more "it sometimes attached the same fact to several facets."

The practical improvement is still big. The old DeepSeek run produced malformed RDF in 8 out of 20 cases. The new one produced only 1 malformed file, had much better recall, and cost less.

That makes the failure mode more specific. DeepSeek was not simply bad at RDF. It was bad at that RDF shape.
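The repeated-fact failure mode is easy to check for mechanically. Here is a minimal Python sketch, assuming each assessment has already been reduced to a set of fact identifiers per facet. The function name and the sample data (including the `controls` and `story` facets) are made up for illustration.

```python
from collections import defaultdict


def facts_in_multiple_facets(assessments: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each fact that appears under more than one facet to those facets."""
    seen: dict[str, list[str]] = defaultdict(list)
    for facet, facts in assessments.items():
        for fact in facts:
            seen[fact].append(facet)
    return {fact: sorted(facets) for fact, facets in seen.items() if len(facets) > 1}


# Illustrative data: the camera issue was attached to two facets.
assessments = {
    "combat": {"spectacular_combat", "camera_delay"},
    "controls": {"camera_delay"},
    "story": {"predictable_plot"},
}

duplicates = facts_in_multiple_facets(assessments)
print(duplicates)
```

A check like this could flag duplicated facts before scoring, or feed a deduplication step that keeps the fact under a single facet.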

Why this matters for cost

The new schema is cheaper because the model spends less output on schema bookkeeping.

With the old claims format, the model repeatedly emits boilerplate like:

ent:article1 pred:claims ent:claimN .
ent:claimN a ent:Praise ;
    pred:about ... ;
    pred:aspect ... ;
    pred:sentiment ... ;
    pred:evidence ... .

Over and over.

With the assessment format, related facts collapse into one object with multi-valued fields. Fewer IDs, fewer repeated predicates, fewer duplicated evidence snippets, fewer chances to drift from the expected pattern.
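A back-of-the-envelope count shows the difference. The Python sketch below counts triples for the combat example under both shapes, using triple count as a rough proxy for output tokens. It assumes the old format splits the paragraph into four claims (two praises, one comparison, one critique), which is only one of the possible splits.

```python
def claim_triples(num_claims: int) -> int:
    """Triples in the old format: every claim repeats the same six."""
    # Per claim: article link, type, about, aspect, sentiment, evidence.
    return num_claims * 6


def assessment_triples(positives: int, negatives: int, comparisons: int) -> int:
    """Triples in the new format for one facet."""
    # Once per facet: game link, type, category, polarity, summary.
    return 5 + positives + negatives + comparisons


old = claim_triples(4)
new = assessment_triples(positives=2, negatives=1, comparisons=1)
print(f"claims format: {old} triples, assessment format: {new} triples")
```

Under that assumption the same paragraph costs 24 triples in the claims format and 9 in the assessment format, before counting the duplicated evidence strings.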

For one benchmark run, the savings are small.

At VaporLens scale, small per-extraction savings add up. If this runs across many extractions, every repeated predicate and unnecessary node turns into real money.
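To put a rough number on that, here is a small Python projection. It assumes the extraction cost in the results table is the total for the 20-review run, and it uses a purely hypothetical workload of 100,000 reviews; neither the workload figure nor the function names come from the actual VaporLens pipeline.

```python
REVIEWS_PER_RUN = 20  # size of the benchmark dataset


def cost_per_review(run_cost: float) -> float:
    """Average extraction cost per review for one benchmark run."""
    return run_cost / REVIEWS_PER_RUN


def projected_savings(baseline_run_cost: float, new_run_cost: float, num_reviews: int) -> float:
    """Savings in dollars if the per-review difference holds at scale."""
    per_review = cost_per_review(baseline_run_cost) - cost_per_review(new_run_cost)
    return per_review * num_reviews


HYPOTHETICAL_REVIEWS = 100_000  # assumption, not a real VaporLens number

gemini_savings = projected_savings(0.3596, 0.2751, HYPOTHETICAL_REVIEWS)
deepseek_savings = projected_savings(0.0603, 0.0384, HYPOTHETICAL_REVIEWS)
print(f"Gemini: ${gemini_savings:.2f}, DeepSeek: ${deepseek_savings:.2f}")
```

The projection is linear, so it only holds if production articles resemble the benchmark reviews in length and density.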

The benchmark lesson

I shouldn't have treated the schema as neutral. The schema is part of the prompt. It tells the model what kind of object to produce.

If the schema looks like a familiar review-analysis object, the model can lean on patterns it already knows:

aspect-based summaries
pros and cons
mixed sentiment
comparison lists
concise explanations

If the schema looks like a low-level graph modelling exercise, the model has to spend more effort deciding how to represent each sentence. That freedom sounds useful, but it makes the extraction less stable.

So the benchmark question is not only:

which model is best at game knowledge graph extraction?

It is also:

what should a game knowledge graph schema look like if current LLMs need to fill it reliably?

That feels like the more useful question right now.

What's next

I am still working on the benchmark, although I got slightly delayed by migrating the current VaporLens code to deepseek-v4-flash.

Once the next benchmark batch is in a state I am happy with, I will share more details and results.

Cheers,
Tim
