A conceptual overview · knowledge representation · entity resolution · graph algorithms · large language models
Abstract
Public records are usually published as flat tables: rows of transactions with
inconsistent, free-text identifiers and no explicit links between them. This shape is easy to
store but hard to reason about, because the questions people actually ask — who is connected
to whom, through what, and how strongly — are questions about relationships, not rows.
This overview describes the conceptual stack that converts such records into a
knowledge graph: a model in which real-world entities are nodes and the facts that
relate them are edges. We sketch why entity resolution is the load-bearing step, how
graph algorithms answer relationship questions that table queries cannot, how
modern language models translate natural-language intent into precise retrieval over
millions of records, and the techniques that keep all of this tractable at scale.
1. Knowledge representation: why a graph
A knowledge graph represents information as a set of
entities (people, organizations, accounts) connected by typed, directed
relationships (gave-to, paid, affiliated-with). The structure mirrors how the
facts actually join together, so a multi-hop connection that would require a chain of self-joins in a
relational table becomes a short path traversal. Crucially, the graph makes
indirect connections first-class: a path of several edges (A relates to B, B to C)
is as queryable as a direct one. This is the difference between asking "what rows contain this
name" and asking "what is connected to this entity, and how strongly."
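As a minimal sketch of this idea, the snippet below stores typed, directed edges in an adjacency index and enumerates multi-hop paths from one entity. The entity names, the EDGES list, and the helper names build_graph and paths_from are hypothetical, chosen only for illustration; a production system would more likely use a graph database.

```python
from collections import defaultdict

# Hypothetical toy facts: (source entity, relationship type, target entity).
EDGES = [
    ("Alice", "gave-to", "Committee X"),
    ("Committee X", "paid", "Vendor Y"),
    ("Vendor Y", "affiliated-with", "Org Z"),
]

def build_graph(edges):
    """Index typed, directed edges by their source node."""
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))
    return graph

def paths_from(graph, start, max_hops):
    """Return every path of 1..max_hops edges leaving `start`.

    Each path is a list of (source, relation, target) steps.
    """
    results = []
    frontier = [(start, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            # Nodes already on this path; skipping them makes cycles terminate.
            visited = {start} | {step[2] for step in path}
            for relation, target in graph.get(node, []):
                if target in visited:
                    continue
                new_path = path + [(node, relation, target)]
                results.append(new_path)
                next_frontier.append((target, new_path))
        frontier = next_frontier
    return results

graph = build_graph(EDGES)
two_hop = paths_from(graph, "Alice", max_hops=2)
```

Here two_hop holds both the direct edge from Alice and the two-edge path that continues to Vendor Y, which is the sense in which an indirect connection is as queryable as a direct one.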
2. Entity resolution: the hard, foundational problem
Real records do not come with stable identifiers. The same entity appears under many surface
forms — spelling variants, abbreviations, punctuation differences, transcription errors, and
differing free-text descriptions. Entity resolution (also called record
linkage or deduplication) is the task of deciding which of these surface forms refer to the
same underlying entity, and collapsing them into one canonical node.
Conceptually it combines several signals: deterministic rules for obvious matches,
approximate string similarity (fuzzy matching) for near-duplicates, and
semantic comparison, often via vector embeddings: numeric
representations of meaning in which similar entities sit close together in the vector space. Getting this
step right is what separates an under-counted, fragmented view from a faithful one: a single
actor scattered across dozens of variant strings is invisible to keyword search until it is
resolved back into one node.
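The first two signals can be sketched with the Python standard library alone. The example below is illustrative, not a definitive resolver: it assumes a deterministic normalization rule, a fuzzy threshold of 0.85 (an arbitrary value that would need tuning on real data), and the convention that the first-seen variant becomes the canonical name. Embedding-based comparison is omitted, and the all-pairs loop is quadratic, so a real pipeline would add blocking to limit comparisons.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Deterministic cleanup: lowercase, replace punctuation, collapse spaces."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())

def similar(a, b, threshold=0.85):
    """Exact match after normalization, else fuzzy ratio at or above threshold."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return True
    return SequenceMatcher(None, na, nb).ratio() >= threshold

def resolve(names, threshold=0.85):
    """Cluster surface forms with union-find.

    Returns {surface form: canonical form}, where the canonical form is the
    earliest variant in `names` belonging to the same cluster.
    """
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similar(names[i], names[j], threshold):
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Keep the smaller index as root so the first-seen
                    # variant names the cluster.
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    return {name: names[find(i)] for i, name in enumerate(names)}

# Hypothetical surface forms for illustration.
names = ["ACME Corp.", "Acme Corp", "Acme Corpp", "Globex LLC"]
resolved = resolve(names)
```

The returned mapping doubles as an audit trail: every merge decision is visible as a surface form pointing at its canonical node.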
3. Reasoning with graph algorithms
Once data is a graph, a body of well-studied graph algorithms answers
relationship questions directly. Shortest-path and reachability search trace how
one entity connects to another through intermediaries. Centrality measures rank
which nodes are structurally important — the hubs through which influence or value flows.
Community detection (clustering) exposes densely connected groups that behave as a
bloc, even when no single member stands out. Flow propagation models how a
quantity spreads through the network, splitting at each node, to estimate where it ultimately
lands. These are general techniques: the same algorithms illuminate financial networks, supply
chains, citation graphs, and social structures alike.
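Three of these techniques are small enough to sketch directly. The code below assumes a toy weighted, directed graph (the EDGES data and all function names are hypothetical) and shows breadth-first shortest path, a simple degree centrality, and a proportional flow propagation that assumes the graph has no cycles. Community detection is left out because a faithful example would not fit in a short snippet.

```python
from collections import deque, defaultdict

# Hypothetical weighted, directed edges: (source, target, amount).
EDGES = [
    ("A", "B", 60.0),
    ("A", "C", 40.0),
    ("B", "D", 30.0),
    ("B", "E", 30.0),
    ("C", "D", 40.0),
]

def adjacency(edges):
    """Build {source: {target: total amount}} from edge rows."""
    out = defaultdict(dict)
    for source, target, amount in edges:
        out[source][target] = out[source].get(target, 0.0) + amount
    return out

def shortest_path(out, start, goal):
    """Breadth-first search: a fewest-hop path from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in out.get(node, {}):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

def degree_centrality(edges):
    """Count the distinct neighbors of each node, ignoring edge direction."""
    neighbors = defaultdict(set)
    for source, target, _ in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}

def propagate(out, start, amount):
    """Split `amount` at each node in proportion to outgoing edge weights.

    Assumes an acyclic graph; returns how much lands at each sink node.
    """
    landed = defaultdict(float)
    stack = [(start, amount)]
    while stack:
        node, value = stack.pop()
        targets = out.get(node)
        if not targets:
            landed[node] += value  # no outgoing edges: the value stops here
            continue
        total = sum(targets.values())
        for target, weight in targets.items():
            stack.append((target, value * weight / total))
    return dict(landed)

out = adjacency(EDGES)
path = shortest_path(out, "A", "E")
centrality = degree_centrality(EDGES)
landed = propagate(out, "A", 100.0)
```

In this toy graph, B has the most distinct neighbors, and 100 units injected at A split between the sinks D and E according to the edge weights along the way.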
4. Language models as an interface to retrieval
A large language model (LLM) is a system trained to predict and generate
text; its practical power here is translation of intent. A user asks a question in
ordinary language; the model maps that intent onto precise, structured operations — which entity
to locate, which relationship to traverse, which analysis to run — and then narrates the result.
This pattern, often called tool use and closely related to retrieval-augmented generation,
helps keep the model grounded: rather than producing figures from memory, it orchestrates queries against the
authoritative data and reports what comes back. The heavy lifting of scanning millions of records
is done by indexes and the database; the model decides what to ask and explains
what the result means, which can shorten work that would otherwise take hours of manual cross-referencing.
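The division of labor can be sketched without calling any real model. In the example below, the model's output is a hard-coded JSON string standing in for what an LLM might emit; the PAYMENTS data, the tool names total_paid and recipients, and the JSON shape with "tool" and "arguments" keys are all assumptions made for illustration, not any particular vendor's API. The point is that the figure in the answer comes from the data, while the model only chooses which operation to run.

```python
import json

# Hypothetical in-memory stand-in for the authoritative data store.
PAYMENTS = [
    {"source": "Acme Corp", "target": "Committee X", "amount": 500.0},
    {"source": "Acme Corp", "target": "Committee X", "amount": 250.0},
    {"source": "Acme Corp", "target": "Vendor Y", "amount": 100.0},
]

def total_paid(source, target):
    """Sum recorded payments from source to target."""
    return sum(p["amount"] for p in PAYMENTS
               if p["source"] == source and p["target"] == target)

def recipients(source):
    """List distinct recipients of payments from source, sorted by name."""
    return sorted({p["target"] for p in PAYMENTS if p["source"] == source})

# Registry of the only operations a model is allowed to request.
TOOLS = {"total_paid": total_paid, "recipients": recipients}

def run_tool_call(raw_call):
    """Validate and execute one structured request produced by a model."""
    call = json.loads(raw_call)
    name = call.get("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](**call.get("arguments", {}))

# Hard-coded stand-in for what a model might emit for the question
# "How much did Acme Corp pay Committee X?"
model_output = (
    '{"tool": "total_paid", '
    '"arguments": {"source": "Acme Corp", "target": "Committee X"}}'
)
answer = run_tool_call(model_output)
```

Restricting execution to a fixed registry means an unexpected request fails loudly instead of running arbitrary code, and the returned value can be handed back to the model for narration.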
5. Managing complexity at scale
A graph of millions of nodes cannot be drawn or queried naively. Several techniques keep it
tractable. Indexing and pre-aggregation turn relationship lookups that would scan
entire tables into near-instant seeks. Level-of-detail rendering and lazy loading
fetch only what is in view and only when needed, so an interface can sit atop a graph far larger
than any screen. Bounded, ranked expansion ensures that when an entity has thousands
of connections, the most significant ones are surfaced first rather than an arbitrary slice.
Together these let a person explore an enormous network interactively — zooming from a single
actor to a whole community and back — without ever loading the whole thing at once.
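Pre-aggregation and bounded, ranked expansion can be illustrated together. This is a minimal sketch under stated assumptions: edges arrive as (source, target, amount) rows, "most significant" is taken to mean largest total amount, and the names build_index and expand are hypothetical.

```python
import heapq
from collections import defaultdict

def build_index(edges):
    """Pre-aggregate (source, target, amount) rows into per-source totals."""
    index = defaultdict(lambda: defaultdict(float))
    for source, target, amount in edges:
        index[source][target] += amount
    return index

def expand(index, node, limit=3):
    """Return at most `limit` neighbors of `node`, largest total first."""
    neighbors = index.get(node, {})
    return heapq.nlargest(limit, neighbors.items(), key=lambda item: item[1])

# Hypothetical rows for illustration.
rows = [
    ("Hub", "n1", 5.0),
    ("Hub", "n2", 50.0),
    ("Hub", "n1", 10.0),
    ("Hub", "n3", 1.0),
    ("Hub", "n4", 20.0),
]
index = build_index(rows)
top_two = expand(index, "Hub", limit=2)
```

After the one-time indexing pass, a lookup touches only the neighbors of the requested node, and heapq.nlargest avoids fully sorting a neighbor list that may hold thousands of entries when only a handful will be shown.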
6. Provenance and verifiability
Analysis is only trustworthy if it can be checked. The principle of provenance
holds that every derived figure should decompose back to the primary records that produced it.
In a knowledge graph this means an aggregated edge is never a dead end: it can always be expanded
into the individual source transactions behind it. Resolution decisions — which surface forms were
merged into one entity — should likewise be inspectable rather than hidden. This makes computational
findings defensible: a claim is not "the system says so" but "here are the underlying
filings, see for yourself."
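One way to honor this principle, sketched here under assumptions, is to have every aggregated edge carry the identifiers of the records that produced it. The record fields ("id", "source", "target", "amount") and the names AggregatedEdge, aggregate, and explain are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AggregatedEdge:
    source: str
    target: str
    record_ids: list = field(default_factory=list)
    total: float = 0.0

def aggregate(records):
    """Group records by (source, target), keeping each contributing record id."""
    edges = {}
    for record in records:
        key = (record["source"], record["target"])
        edge = edges.setdefault(key, AggregatedEdge(*key))
        edge.record_ids.append(record["id"])
        edge.total += record["amount"]
    return edges

def explain(edge, records):
    """Expand an aggregated edge back into its source records."""
    by_id = {record["id"]: record for record in records}
    return [by_id[record_id] for record_id in edge.record_ids]

# Hypothetical source records for illustration.
records = [
    {"id": "r1", "source": "A", "target": "B", "amount": 100.0},
    {"id": "r2", "source": "A", "target": "B", "amount": 50.0},
    {"id": "r3", "source": "A", "target": "C", "amount": 25.0},
]
edges = aggregate(records)
evidence = explain(edges[("A", "B")], records)
```

Because the total is always recomputable from the records returned by explain, a reader can check any aggregated figure against its sources.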
Further reading (concepts)
- Knowledge representation & knowledge graphs — entities, relations, and graph data models.
- Entity resolution / record linkage — deterministic rules, fuzzy matching, and embedding-based similarity.
- Graph algorithms — shortest paths, centrality, community detection, and flow propagation.
- Large language models & tool use — natural-language interfaces and retrieval-augmented generation.
- Scalable graph systems — indexing, level-of-detail, and interactive exploration of large networks.