February 26, 2026
Making Financial Data LLM-Friendly
Extending the data warehouse with SEC EDGAR fundamentals, normalizing XBRL taxonomy chaos, and laying the groundwork for a token-efficient API that agents can actually use.
After getting the price data warehouse running, I started working on the dashboard that would consume it. The idea was to wire up a chatbot that could answer questions about companies by pulling from my own API. Pretty quickly, two things became obvious. First, stock prices alone are not enough to say anything useful about a company. I needed fundamentals: revenue, earnings, balance data. Second, the way financial data is typically available to LLM agents is remarkably wasteful. These two problems turned out to fit neatly into the same solution, so I went back to the data API to extend it.
The Token Problem
SEC EDGAR is the canonical source for US public company financials. It has a JSON API that returns pre-parsed XBRL data, which is convenient. But the responses look like this:
{
"cik": 320193,
"entityName": "Apple Inc.",
"facts": {
"us-gaap": {
"RevenueFromContractWithCustomerExcludingAssessedTax": {
"label": "Revenue from Contract with Customer, Excluding Assessed Tax",
"units": {
"USD": [
{"end": "2023-09-30", "val": 383285000000, "accn": "0000320193-23-000106", "fy": 2023, "fp": "FY", "form": "10-K", "filed": "2023-11-03"},
{"end": "2023-07-01", "val": 81797000000, "accn": "0000320193-23-000077", "fy": 2023, "fp": "Q3", "form": "10-Q", "filed": "2023-08-04"}
]
}
}
}
}
}
That's Apple's revenue. One metric, one company. The full companyfacts response contains thousands of entries across hundreds of XBRL tags. If an agent consumes this directly, it burns tokens on accession numbers, redundant labels, and a structure designed for regulatory compliance, not for reasoning.
In addition to that, different companies use different XBRL tags for the same concept. Apple reports revenue as RevenueFromContractWithCustomerExcludingAssessedTax. Microsoft might use Revenues. I've also seen SalesRevenueNet. An agent comparing revenue across companies has to know all of these variants, or it will miss data or double-count.
The data is public, free, and comprehensive, but it is locked behind a layer of complexity that makes it hard to use for automated consumption without preprocessing. So the plan became: ingest data from EDGAR, normalize the messy XBRL taxonomy into canonical metrics, and eventually serve everything in compact, token-efficient responses. The first two parts are done as I write this. The third will come.
Talking to the SEC
EDGAR is strict about how you use its API: the SEC requires a custom User-Agent header with your contact info and enforces a rate limit of 10 requests per second. The HTTP client I built handles this with a simple throttle:
_rate_lock = threading.Lock()
_last_request_time: float = 0.0
def _throttle() -> None:
global _last_request_time
with _rate_lock:
now = time.monotonic()
elapsed = now - _last_request_time
if elapsed < 0.1:
time.sleep(0.1 - elapsed)
_last_request_time = time.monotonic()
A lock ensures at least 100ms between requests. The client is synchronous by design; the SEC calls are wrapped with asyncio.to_thread at the FastAPI layer.
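For completeness, here is roughly how a synchronous, throttled call gets exposed to the async layer. This is a self-contained sketch, not the actual client: `_fetch_sync` and `fetch` are placeholder names, and the throttle is reproduced from above.

```python
import asyncio
import threading
import time

_rate_lock = threading.Lock()
_last_request_time: float = 0.0

def _throttle() -> None:
    # Sleep just enough to keep at least 100 ms between requests.
    global _last_request_time
    with _rate_lock:
        elapsed = time.monotonic() - _last_request_time
        if elapsed < 0.1:
            time.sleep(0.1 - elapsed)
        _last_request_time = time.monotonic()

def _fetch_sync(url: str) -> str:
    # Stand-in for the real HTTP call; the point is that it blocks.
    _throttle()
    return f"fetched {url}"

async def fetch(url: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so the event loop is never blocked by the sleep above.
    return await asyncio.to_thread(_fetch_sync, url)
```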
The CIK Problem
EDGAR identifies companies by their CIK (Central Index Key), not by ticker. Before fetching Apple's financials, you need to know that AAPL maps to CIK 0000320193. The SEC publishes a ticker-to-CIK mapping file, but downloading it on every request would be wasteful.
The solution is a two-tier cache: an in-memory dictionary for the current process, backed by a local JSON file with a 7-day TTL. On first use, the client downloads the full mapping, writes it to .cache/, and keeps it in memory. After a week, it refreshes.
_CACHE_MAX_AGE_SECONDS = 7 * 24 * 3600 # 7 days
def _load_cik_cache_sync() -> dict[str, str]:
global _cik_cache, _cache_loaded_at
if _cik_cache:
return _cik_cache
cache_path = _get_cache_path()
if cache_path.exists():
age = time.time() - cache_path.stat().st_mtime
if age < _CACHE_MAX_AGE_SECONDS:
with open(cache_path) as f:
raw = json.load(f)
_cik_cache = {
str(entry["ticker"]).upper(): str(entry["cik_str"]).zfill(10)
for entry in raw.values()
}
return _cik_cache
# ... fetch from SEC if cache missing/expired
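The refresh path (elided above) boils down to downloading the SEC's company_tickers.json mapping file and rebuilding the dictionary. A sketch of the parsing step, using a trimmed sample payload in place of the real download:

```python
import json

# Trimmed sample in the shape of the SEC's company_tickers.json file
# (the real file lists every registered ticker).
SAMPLE = json.loads("""
{
  "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
  "1": {"cik_str": 789019, "ticker": "MSFT", "title": "Microsoft Corp"}
}
""")

def build_cik_map(raw: dict) -> dict[str, str]:
    # Uppercase the ticker, zero-pad the CIK to the 10 digits EDGAR expects.
    return {
        str(entry["ticker"]).upper(): str(entry["cik_str"]).zfill(10)
        for entry in raw.values()
    }
```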
Taxonomy
This part got a bit tricky.
When a company files a 10-K or 10-Q with the SEC, they report their financials using XBRL tags from the US-GAAP taxonomy. The taxonomy is huge, and companies have flexibility in which tags they use. Revenue alone has at least five common variants:
- RevenueFromContractWithCustomerExcludingAssessedTax
- Revenues
- SalesRevenueNet
- SalesRevenueGoodsNet
- RevenueFromContractWithCustomerIncludingAssessedTax
They all mean roughly the same thing, but a system that needs to compare Apple's revenue with Microsoft's needs to know that. Multiply this across 16 core metrics (revenue, net income, EPS, total assets, total liabilities, equity, operating income, gross profit, cost of revenue, the three cash flow categories, capex, total debt, shares outstanding) and the mapping work becomes non-trivial.
The approach is a priority-ordered taxonomy dictionary:
TAXONOMY: dict[str, list[str]] = {
"revenue": [
"RevenueFromContractWithCustomerExcludingAssessedTax",
"Revenues",
"SalesRevenueNet",
],
"net_income": [
"NetIncomeLoss",
"ProfitLoss",
],
"total_assets": ["Assets"],
"eps_basic": ["EarningsPerShareBasic"],
# ... 16 canonical metrics total
}
The order within each list matters. The first tag is the preferred one. If a company reports revenue under both Revenues and RevenueFromContractWithCustomerExcludingAssessedTax, the system picks the first match. This gets compiled into a reverse lookup at import time:
_TAG_TO_METRIC: dict[str, tuple[str, int]] = {}
for _metric, _tags in TAXONOMY.items():
for _priority, _tag in enumerate(_tags):
_TAG_TO_METRIC[_tag] = (_metric, _priority)
The normalization function iterates through every XBRL concept in a company's facts, checks if it maps to a canonical metric, and keeps only the best match per (metric, end_date, period) combination:
def normalize_company_facts(
company_id: int, facts_json: dict
) -> list[FinancialFact]:
us_gaap = facts_json.get("facts", {}).get("us-gaap", {})
best: dict[tuple[str, str, str], tuple[int, FinancialFact]] = {}
for tag, concept_data in us_gaap.items():
lookup = _TAG_TO_METRIC.get(tag)
if lookup is None:
continue
metric, priority = lookup
for unit_key, entries in concept_data.get("units", {}).items():
for entry in entries:
if entry.get("form") not in {"10-K", "10-Q"}:
continue
dedup_key = (metric, entry["end"], entry.get("fp", ""))
existing = best.get(dedup_key)
if existing is not None and existing[0] <= priority:
continue
fact = FinancialFact(
company_id=company_id,
metric=metric,
value=float(entry["val"]),
# ...
)
best[dedup_key] = (priority, fact)
return [fact for _, fact in best.values()]
The best dictionary is what prevents double-counting: for each unique combination of metric, date, and period type, it keeps only the fact from the highest-priority tag that has data. If Apple reports revenue under two different tags for the same quarter, only the preferred one survives.
The Pipeline
With the client and the taxonomy in place, the ingestion pipeline is a single async function that orchestrates the full flow: resolve the CIK, fetch the raw facts, normalize them, and upsert into the database.
async def ingest_company_financials(db: Session, ticker: str) -> dict:
company = repository.get_company_by_ticker(db, ticker)
if company is None:
raise EdgarClientError(f"Company {ticker} not found in database")
cik = await edgar_client.resolve_cik(ticker)
if company.cik is None:
repository.update_company_cik(db, company.id, cik)
db.refresh(company)
facts_json = await edgar_client.fetch_company_facts(cik)
facts = taxonomy.normalize_company_facts(company.id, facts_json)
inserted, updated = repository.upsert_financial_facts(db, facts)
return {
"ticker": ticker,
"cik": cik,
"inserted": inserted,
"updated": updated,
}
The upsert logic ensures idempotency. Running the pipeline twice for the same company updates existing records instead of creating duplicates. This matters because EDGAR data can get amended when a company restates its financials.
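The actual upsert goes through SQLModel, but the core logic reduces to keying on the unique constraint. A simplified in-memory sketch (Fact and upsert_facts are stand-ins, not the real repository code):

```python
from dataclasses import dataclass

@dataclass
class Fact:
    # Minimal stand-in for FinancialFact: the unique-key fields plus value.
    company_id: int
    metric: str
    end_date: str
    period_type: str
    value: float

def upsert_facts(store: dict, facts: list[Fact]) -> tuple[int, int]:
    # Key on (company_id, metric, end_date, period_type): new keys insert,
    # existing keys overwrite, so re-running the pipeline never duplicates.
    inserted = updated = 0
    for fact in facts:
        key = (fact.company_id, fact.metric, fact.end_date, fact.period_type)
        if key in store:
            updated += 1
        else:
            inserted += 1
        store[key] = fact
    return inserted, updated
```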
The taxonomy gets seeded on application startup. When the FastAPI app boots, it populates the TaxonomyMapping table so there's always a reference of which XBRL tags map to which canonical metrics:
@asynccontextmanager
async def lifespan(_app: FastAPI):
SQLModel.metadata.create_all(engine)
with Session(engine) as db:
count = seed_taxonomy(db)
logger.info("Seeded %d taxonomy mappings", count)
yield
Data Model
The new models follow the same SQLModel pattern from the previous post. The FinancialFact table is the core addition:
class FinancialFact(SQLModel, table=True):
__table_args__ = (
UniqueConstraint(
"company_id", "metric", "end_date", "period_type",
name="uq_company_metric_end_period",
),
)
id: int | None = Field(default=None, primary_key=True)
company_id: int = Field(foreign_key="company.id", index=True)
metric: str = Field(index=True)
value: float
unit: str
end_date: date
period_type: str # e.g. "Q1", "FY"
filing_date: date | None = None
source: str | None = None
The unique constraint on (company_id, metric, end_date, period_type) is what makes the upsert logic safe. You can't accidentally have two "revenue" entries for Apple's Q3 2023.
And TaxonomyMapping provides the reference layer:
class TaxonomyMapping(SQLModel, table=True):
id: int | None = Field(default=None, primary_key=True)
xbrl_tag: str = Field(unique=True, index=True)
metric: str = Field(index=True)
description: str | None = None
Seeing It Work
Once a company is onboarded and its financials are ingested, both price data and fundamentals are queryable from the same API:
POST /companies/AAPL/financials
{
"ticker": "AAPL",
"cik": "0000320193",
"inserted": 847,
"updated": 0
}
847 normalized financial facts from a single call: revenue, net income, EPS, balance sheet items, cash flows, all extracted from the raw XBRL and stored in a clean schema. Querying is then instant:
GET /companies/AAPL/financials?metric=revenue&period_type=FY
This returns Apple's annual revenue across every year they've filed electronically, sorted and clean. No accession numbers, no redundant tags.
What Comes Next
The EDGAR integration handles the data problem. The next step is the serving problem. LLMs turn out to be good at generating ideas for this kind of thing, so I already have a concrete list of features I want to build next.
Right now, if an agent queries Apple's key metrics, it gets back full FinancialFact objects with descriptive field names. An agent doesn't need "period_type": "FY" when "pt": "FY" communicates the same thing in fewer tokens. Across hundreds of facts, this adds up.
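A rough sketch of what that key compaction could look like (SHORT_KEYS and compact are provisional names; none of this is built yet):

```python
# Provisional long-key -> short-key map for the planned "minimal" format.
SHORT_KEYS = {
    "metric": "m",
    "value": "v",
    "end_date": "d",
    "period_type": "pt",
}

def compact(fact: dict) -> dict:
    # Rename the fields an agent needs, drop everything else
    # (accession numbers, source labels, and other filing metadata).
    return {SHORT_KEYS[k]: v for k, v in fact.items() if k in SHORT_KEYS}
```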
The plan is a set of endpoints with a ?format= parameter:
- minimal: shortest possible keys, numbers only. What agents would use.
- standard: readable keys with units. For debugging.
- verbose: full field names and descriptions. For humans.
Along with field selection (?fields=rev,ni,eps) so agents request only what they need, and pre-computed ratios (P/E, gross margin, debt-to-equity) derived from stored facts plus current price data. An agent should not have to compute gross margin itself when the API already has the inputs.
There's also a comparison endpoint in the design (/compare?tickers=AAPL,MSFT,GOOG&metrics=pe,rev,ni) that returns a matrix-style response. When an LLM compares three companies, a structured table is cheaper, in tokens, than three separate queries.
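The matrix response could be assembled with something like this (a sketch of the planned shape, with None marking metrics a company did not report):

```python
def compare(
    data: dict[str, dict[str, float]],
    tickers: list[str],
    metrics: list[str],
) -> dict:
    # One row per metric, one column per ticker; None where a value is
    # missing, so the table shape is always predictable for the agent.
    return {
        "tickers": tickers,
        "rows": {
            metric: [data.get(t, {}).get(metric) for t in tickers]
            for metric in metrics
        },
    }
```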
None of this is built yet. But the hard part, getting the messy EDGAR data into a clean relational schema, is done. What remains is API design.
Where Things Stand
The test suite has 49 tests covering the EDGAR client, taxonomy normalization, the ingestion pipeline, and all API endpoints. The architecture still follows the same layered pattern from the previous post, with a new ingestion layer (EDGAR) sitting alongside Yahoo Finance. EDGAR provides the fundamentals, Yahoo provides the prices. Both feed into the same persistence layer, and the API serves them through a single interface.
This was not in the original roadmap. But while building the dashboard, it became clear that the data backend needed to serve more than just prices, and that how it serves them matters as much as what it serves. The token-efficient API layer is next, and then it's back to the dashboard, with a much more useful backend behind it... and integrated agents ;)
(Source code on GitHub.)