Progress on gaming knowledge graph extraction LLM benchmark

This post was originally published on the VaporLens Patreon.

Hey folks!

Thought I'd share a short update on the gaming knowledge graph extraction LLM benchmark (or ggex-bench) I've been working on.

As I mentioned in the previous post, the idea is pretty simple: if I want VaporLens to eventually move from "a bunch of LLM-written summaries" towards proper structured game knowledge, I first need to know which models can actually extract that knowledge reliably.

So I put together a small test set of 20 gaming articles, (semi-)manually annotated the expected facts as knowledge graphs, and started running models against them.

How the benchmark works right now

The current version is intentionally very strict and very simple.

For each article, the benchmark loads the raw text and the matching handmade gold knowledge graph. If that article has not been processed before, it sends one LLM request containing the full article text plus a description of the graph schema it should use.

The model is asked to produce a Turtle/RDF knowledge graph from both inputs: the article content and the schema description. The prompt tells it to use the article itself as the main entity, attach extracted claim nodes to it, and represent normal facts as regular RDF triples.
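For illustration, the single-pass loop described above could look roughly like the sketch below. This is not the actual ggex-bench code: the directory layout, the `.txt`/`.ttl` file names, the way the schema is prepended to the article, and the `call_model` callable are all assumptions, and token accounting is left out.

```python
from pathlib import Path
from typing import Callable


def run_single_pass(
    articles_dir: Path,
    results_dir: Path,
    schema: str,
    call_model: Callable[[str], str],  # hypothetical: takes a prompt, returns raw model text
) -> list[str]:
    """Send one request per unprocessed article and save whatever comes back."""
    results_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for article_path in sorted(articles_dir.glob("*.txt")):
        out_path = results_dir / (article_path.stem + ".ttl")
        if out_path.exists():
            continue  # already processed in an earlier run, so skip it
        # One prompt: schema description plus the full article text.
        prompt = schema + "\n\n" + article_path.read_text(encoding="utf-8")
        out_path.write_text(call_model(prompt), encoding="utf-8")
        processed.append(article_path.stem)
    return processed
```

Skipping articles that already have a saved result is what makes an interrupted run cheap to restart in this sketch.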

The important part is that claims are supposed to be atomic and evidence-backed. The model is asked to extract observations, praise, critique, comparisons, recommendations, and important story/gameplay statements. It is also explicitly told to avoid making claim nodes for bare mentions, trivial metadata, screenshots, or simple lists.

For example, the gold graph for one article starts with basic article/game metadata like this:

ent:article1 a ent:Review;
    pred:reviewOf ent:Crimson_Desert;
    pred:reviewer ent:Lewis_Gordon.

ent:Crimson_Desert a ent:OpenWorldGame;
    pred:developer ent:Pearl_Abyss;
    pred:setInWorld ent:Pywel;
    pred:hasProtagonist ent:Kliff;
    pred:playerMode ent:SinglePlayer;
    pred:estimatedDurationMinHours "50";
    pred:estimatedDurationMaxHours "100";
    pred:comparedTo ent:Candy_Crush, ent:The_Witcher.

But it also captures more detailed game facts, like characters, abilities, systems, activities, locations, factions, and comparisons:

ent:Kliff a ent:Character;
    pred:memberOf ent:Greymanes;
    pred:usesWeapon ent:melee_weapons, ent:shield;
    pred:combatStyle ent:taekwondo;
    pred:canUseAbility ent:Axiom_Force, ent:Force_Palm, ent:shield_bash.

ent:combat a ent:System;
    pred:hasQuality ent:weighty;
    pred:comparedTo ent:fighting_games, ent:Devil_May_Cry.

And then the review-specific interpretation is represented as claims connected back to the article:

ent:article1 pred:claims ent:claim25.

ent:claim25 a ent:Praise;
    pred:about ent:combat;
    pred:aspect ent:fun;
    pred:sentiment ent:positive;
    pred:evidence "These steely showdowns ... are terrific fun".

ent:article1 pred:claims ent:claim33.

ent:claim33 a ent:Critique;
    pred:about ent:systems;
    pred:aspect ent:explanation_burden;
    pred:sentiment ent:negative;
    pred:evidence "Crimson Desert cannot help but sag under the weight of having to explain all these systems".

So the target is not just "what game is this article about?" It's closer to: what concrete world facts, systems, comparisons, opinions, and evidence-backed observations can be pulled out of the text in a shape that a database can use later?

And that's it. No retries. No chunking. No validation pass. No repair step. No iterative refinement.

The full prompt goes in once, and whatever Turtle/RDF comes back is what gets saved. If the model wraps the result in a markdown code block, the wrapper is stripped, but the generated graph itself is otherwise left alone. The benchmark also stores each result together with its token usage, so I can estimate extraction cost and resume interrupted runs.
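The wrapper stripping mentioned above can be sketched in a few lines. This is a minimal version under my own assumptions (a single fence around the whole response, with an optional language tag on the opening line), and `strip_code_fence` is a hypothetical name, not the benchmark's real helper.

```python
FENCE = "`" * 3  # the markdown code fence marker (three backticks)


def strip_code_fence(text: str) -> str:
    """Return the body of one wrapping markdown fence, or the text unchanged."""
    lines = text.strip().splitlines()
    wrapped = (
        len(lines) >= 2
        and lines[0].startswith(FENCE)   # opening fence, possibly followed by "turtle"
        and lines[-1].strip() == FENCE   # closing fence on its own line
    )
    if wrapped:
        return "\n".join(lines[1:-1])
    return text
```

Anything that is not fully wrapped comes back untouched, which matches the "otherwise left alone" rule.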

After that, each result is evaluated in two ways:

Semantic fact matching - an LLM judge checks whether the extracted graph captures the same facts as the gold graph at the meaning level (since LLMs can produce the same facts with a slightly different schema, which is fine).
Strict RDF matching - a deterministic RDF parser compares exact triples against the gold graph.

This gives two very different views of quality: did the model understand the article, and did it express that understanding in exactly the graph shape I expected?
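To make the strict side concrete, here is a minimal sketch of exact-triple precision and recall. It assumes both graphs have already been parsed into sets of (subject, predicate, object) string tuples; the parsing step, the `strict_score` name, and the handling of empty graphs are illustrative assumptions rather than the benchmark's real scoring code.

```python
from typing import NamedTuple

Triple = tuple[str, str, str]  # (subject, predicate, object)


class Score(NamedTuple):
    precision: float
    recall: float


def strict_score(predicted: set[Triple], gold: set[Triple]) -> Score:
    """Exact-match scoring: a triple only counts if all three parts are identical."""
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return Score(precision, recall)


gold = {
    ("ent:combat", "rdf:type", "ent:System"),
    ("ent:combat", "pred:hasQuality", "ent:weighty"),
}
predicted = {
    ("ent:combat", "rdf:type", "ent:System"),
    ("ent:combat", "pred:hasQuality", "ent:heavy"),  # same idea, different URI: no match
}
print(strict_score(predicted, gold))
```

The second predicted triple is the kind of near miss that a semantic judge might accept but exact matching rejects.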

First results

The early results are already pretty interesting.

The best run so far by semantic recall came from mistral-medium-2508, which reached 47.6% semantic recall with 71.5% semantic precision, while producing only one malformed Turtle file out of 20. That last bit matters a lot, because a model that finds good facts but outputs broken RDF half the time is not very useful in practice.

The second-strongest run was gpt-5.4-mini, with 43.6% semantic recall, zero malformed outputs, and the best strict RDF recall at 17.1%. The downside was cost: it used a frankly silly amount of output/reasoning tokens and ended up roughly 10x more expensive than mistral-medium-2508, the current best practical baseline ($2.91 vs. $0.29).

A few runs had very high precision but much lower recall. qwen3.6-plus, for example, reached 90.4% semantic precision, but only 37.7% recall. That's still interesting! It means some models are cautious and mostly extract things that are correct, but they miss a lot.

And then there are the cheap-but-chaotic runs. mimo-v2-flash was the cheapest at about 3 cents, but produced malformed RDF in 15 out of 20 cases. So, uh, not quite production-ready for this use case 😅

Here's the full table from the current run:

Model	Malformed TTLs	Semantic Precision	Semantic Recall	Strict RDF Precision	Strict RDF Recall	Tokens	Cost
mistral-medium-2508	1	71.5%	47.6%	20.5%	13.5%	222,399	$0.287894
gpt-5.4-mini	0	74.2%	43.6%	27.9%	17.1%	724,692	$2.911081
step-3.5-flash	14	77.0%	41.4%	9.0%	5.1%	486,853	$0.126699
qwen3.6-plus	6	90.4%	37.7%	19.8%	7.8%	376,547	$0.578913
deepseek-v4-pro	2	75.3%	36.0%	23.8%	11.6%	452,258	$0.351299
gemini-3.5-flash	1	89.6%	35.5%	26.2%	10.2%	409,854	$2.972166
minimax-m2.7	6	82.4%	34.1%	17.7%	7.1%	194,485	$0.147587
gemini-3.1-pro	0	90.0%	32.5%	27.3%	9.9%	312,362	$2.792984
gemini-3-flash-preview	0	91.8%	30.8%	28.8%	9.5%	199,483	$0.359609
gemma-4-31b-it	4	90.4%	30.4%	23.7%	8.1%	198,741	$0.049503
glm-5.1	7	74.2%	29.0%	19.1%	9.1%	374,653	$0.956890
deepseek-v4-flash	8	86.6%	28.7%	18.7%	7.4%	317,606	$0.060293
mimo-v2-flash	15	88.9%	27.8%	8.7%	2.9%	164,658	$0.030552
grok-4.3	6	92.8%	17.2%	22.1%	4.2%	134,666	$0.220005
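A couple of derived numbers are easy to compute from the table. The sketch below uses three rows copied from it and makes two assumptions: that the Cost column is the total for all 20 articles, and that mistral-medium-2508 is a reasonable baseline for cost comparisons.

```python
ARTICLES = 20  # size of the test set

# model -> (malformed TTLs, cost in USD), copied from the table above
runs = {
    "mistral-medium-2508": (1, 0.287894),
    "gpt-5.4-mini": (0, 2.911081),
    "mimo-v2-flash": (15, 0.030552),
}


def malformed_rate(malformed: int, total: int = ARTICLES) -> float:
    """Share of outputs that failed to parse as Turtle."""
    return malformed / total


def cost_ratio(model: str, baseline: str) -> float:
    """How many times more (or less) a run cost than the baseline run."""
    return runs[model][1] / runs[baseline][1]


for name, (malformed, cost) in runs.items():
    ratio = cost_ratio(name, "mistral-medium-2508")
    print(
        f"{name}: {malformed_rate(malformed):.0%} malformed, "
        f"${cost / ARTICLES:.4f} per article (assumed total / 20), "
        f"{ratio:.1f}x the baseline cost"
    )
```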

The important bit: exact graphs from large, dense gaming articles are still hard

One thing that stood out immediately: strict RDF scores are low across the board for this specific task.

Even the best run only reached 17.1% strict RDF recall. That sounds terrible, but it is not entirely surprising. These are long, dense gaming articles with lots of entities, systems, comparisons, subjective claims, and tiny details. Exact graph matching is also harsh: if the model captures the right fact but uses a slightly different structure, URI, relation, or literal format, the strict score says "nope".

That's why I'm tracking both semantic scores and strict RDF scores. Semantic scoring tells me whether the model understood and extracted the right meaning. Strict RDF scoring tells me whether it produced the exact machine-readable shape I expected.

For VaporLens, both matter. A graph that is semantically good but structurally messy is useful for research, but painful to normalize. A graph that is structurally perfect but misses half the important facts is also not enough.

What this tells me so far

My current read is:

the best practical baseline is not necessarily the most expensive option
some expensive runs are structurally stronger, but may be hard to justify at scale
high precision / low recall models might be useful in multi-model pipelines
malformed RDF rate matters a lot and probably deserves its own score
a single-pass extraction prompt is not enough if I want production-quality graphs

The last point is the big one. This benchmark currently tests the simplest possible pipeline, and even then the results are already useful. But the actual production system will probably need at least a validation + repair step, and maybe a second pass for missing facts.

What's next

Next, I want to clean up the benchmark output a bit and try a few pipeline variations.

The main thing I'm trying now is a multi-pass harness that can hopefully improve recall without losing too much precision. I probably won't run it across every model because that can get expensive very quickly, but I do want to test it on at least a couple of the more promising ones.

The variations I want to compare are roughly:

single-pass extraction, current baseline
extraction + a missing-fact follow-up pass that returns a new full graph
extraction + multiple missing-fact passes that return only missing facts, followed by graph merge and RDF repair if needed
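The third variation above could be structured roughly like this. It is only a sketch of the control flow: `first_pass` and `missing_pass` are hypothetical callables standing in for LLM requests, graphs are modelled as plain sets of triples, and the RDF repair step is left out entirely.

```python
from typing import Callable

Triple = tuple[str, str, str]  # (subject, predicate, object)


def multi_pass_extract(
    article: str,
    first_pass: Callable[[str], set[Triple]],
    missing_pass: Callable[[str, set[Triple]], set[Triple]],
    max_passes: int = 3,
) -> set[Triple]:
    """Run one extraction, then ask for missing facts until nothing new arrives."""
    graph = set(first_pass(article))
    for _ in range(max_passes):
        # The follow-up pass sees the current graph and should return only additions;
        # subtracting the known triples guards against repeats.
        new_triples = missing_pass(article, graph) - graph
        if not new_triples:
            break  # no new facts, so further passes would only add cost
        graph |= new_triples
    return graph
```

Capping the number of passes keeps the cost bounded, and stopping early when a pass adds nothing avoids paying for requests that cannot improve recall.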

I also want to improve the judging side. Right now the semantic score uses a single LLM judge, which is good enough for early experiments but not something I want to blindly trust forever. Judging is already the priciest part of the benchmark, but I'll still need to add at least a couple more high-end models as co-judges before I can be fully confident in the results.
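One simple way to combine several judges is a majority vote per gold fact. The sketch below is just one possible design, not a decided plan: each judge is a hypothetical callable that answers whether a gold fact is captured by the extracted graph, and a fact counts only if more than half of the judges agree.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (gold_fact, extracted_graph_text) -> captured?
Judge = Callable[[str, str], bool]


def majority_recall(
    gold_facts: Sequence[str],
    extracted_graph: str,
    judges: Sequence[Judge],
) -> float:
    """Semantic recall where a fact counts only if a strict majority of judges accept it."""
    if not gold_facts or not judges:
        return 0.0
    captured = 0
    for fact in gold_facts:
        votes = sum(1 for judge in judges if judge(fact, extracted_graph))
        if votes * 2 > len(judges):  # strict majority, so ties do not count
            captured += 1
    return captured / len(gold_facts)
```

Using an odd number of judges avoids ties; with an even number, this version treats a tie as "not captured", which is the conservative choice.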

Still, this is already enough to confirm that the knowledge graph direction is worth exploring. The models are not magically solving it, but they are good enough that a carefully designed pipeline could work.

And if that works, VaporLens can get much more interesting than just summaries and tags.

Cheers, Tim
