Hi HN, I built SDF (Structured Data Format), an open protocol that sits between web content and AI agents.
The problem: Every agent that consumes a web page independently fetches HTML, strips boilerplate, extracts entities, and classifies content. A typical page is ~89KB of HTML (~73K tokens). When 100 agents consume the same URL, this extraction happens 100 times with inconsistent results.
What SDF does: convert each page once into a schema-validated JSON document (~750 tokens) containing entities, claims, relationships, summaries, and type-specific structured data. Agents consume the pre-extracted representation directly.
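To make that concrete, here is a rough sketch of the shape such a document might take. The field names below are illustrative guesses based on the description above, not the actual schema; the real JSON Schemas live in the GitHub repo.

    # Illustrative only: field names are guesses, not the published SDF schema.
    import json

    sdf_doc = {
        "sdf_version": "0.1",                      # hypothetical version field
        "source_url": "https://example.com/post",
        "content_type": "news_article",            # one of the content types
        "summary": "One-paragraph gist of the page.",
        "entities": [
            {"name": "Acme Corp", "type": "organization"},
            {"name": "Jane Doe", "type": "person"},
        ],
        "claims": [
            {"text": "Acme Corp raised $10M in January.", "confidence": 0.9},
        ],
        "relationships": [
            {"subject": "Jane Doe", "predicate": "ceo_of", "object": "Acme Corp"},
        ],
        "structured_data": {"published": "2024-01-01"},  # type-specific fields
    }

    # An agent reads ~750 tokens of this instead of ~73K tokens of raw HTML.
    print(json.dumps(sdf_doc, indent=2))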
Results from production deployment (2,335 documents, 10 content types):
99.2% token reduction from HTML
90% extraction accuracy with fine-tuned 1.5B + 3B model cascade
4.1x faster than monolithic 14B baseline
Downstream experiment: general-purpose 7B model scores 0.739 accuracy from SDF vs 0.352 from raw markdown (p < 0.05)
The pipeline runs locally on consumer hardware (dual RTX 3090 Ti). Fine-tuned models are open on HuggingFace (sdfprotocol/sdf-classify, sdfprotocol/sdf-extract). Protocol spec and JSON schemas are on GitHub.
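If you want to poke at the cascade, something like this should be close, assuming both checkpoints load as ordinary causal LMs through transformers. The prompt wording and generation settings here are my shorthand, not the real input/output contract; the model cards spell that out.

    # Sketch of the classify -> extract cascade. Prompts and settings are
    # assumptions; see the HuggingFace model cards for the actual usage.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def run(model_id, prompt, max_new_tokens=512):
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tok.decode(out[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)

    page_text = "...cleaned page text..."

    # Stage 1: classify the content type.
    content_type = run("sdfprotocol/sdf-classify",
                       "Classify this page:\n" + page_text)

    # Stage 2: extract the type-specific SDF fields.
    print(run("sdfprotocol/sdf-extract",
              "Content type: " + content_type + "\nExtract SDF JSON:\n" + page_text))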
Protocol spec + schemas: https://github.com/sdfprotocol/sdf
Whitepaper: https://doi.org/10.5281/zenodo.18559223
Models: https://huggingface.co/sdfprotocol

Happy to answer questions about the design decisions, the type system, or the evaluation methodology.
I wonder if people will eventually surf ad-free by sniffing out these files. Easy to parse (maybe even easier than the actual article itself) and no ads or otherwise unrelated distractions.
What do you mean? I just wanted to share something I'm working on, so I'm trying to understand what you mean by ads.
Not your ads. I'm saying that if a site that has ads also has these files, you could get the gist of the article by reading these files instead of going to the ad-laden page itself.
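For illustration, something like this (entirely hypothetical: the path convention and the field names aren't from any published spec) would pull the gist without ever loading the page:

    # Hypothetical: assumes the site exposes an SDF document at a made-up path.
    import json, urllib.request

    def read_gist(article_url):
        sdf_url = article_url.rstrip("/") + ".sdf.json"   # invented convention
        with urllib.request.urlopen(sdf_url) as resp:
            doc = json.loads(resp.read())
        return doc.get("summary", "")                      # no ads, no boilerplate

    print(read_gist("https://example.com/some-article"))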
That actually is a great idea: we could add ad detection and extract only the relevant information. Thanks @ksaj
That's a step further than I was thinking, but I most definitely like the direction.