In the rush to build AI agents, many engineers overlook the most critical step: Data Preprocessing.
They scrape a website, get a massive JSON object or raw HTML string, and dump it directly into their Vector Database or LLM context window.
This is a mistake.
Raw HTML is full of “token junk”: <div class="w-full flex items-center">, scripts, and tracking pixels. None of this carries semantic meaning. It distracts the model and burns through your API budget.
The solution is Markdown. It preserves semantic structure (headers, lists, tables) while stripping away the noise.
The Efficiency Gap: HTML vs. Markdown
Let’s look at a simple comparison of a product page snippet.
Raw HTML (218 Tokens):
<div class="product-card">
<h2 class="text-xl font-bold">Wireless Headphones</h2>
<div class="rating">
<span class="star">★</span> 4.5/5
</div>
<p class="desc">Noise cancelling, 20h battery life.</p>
<a href="/buy" class="btn btn-primary">Buy Now</a>
</div>
Markdown (24 Tokens):
## Wireless Headphones
**Rating:** 4.5/5
Noise cancelling, 20h battery life.
[Buy Now](/buy)
Result: An 89% reduction in tokens. When you are processing millions of pages for RAG, this difference can save thousands of dollars.
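The arithmetic behind that figure is easy to reproduce. The sketch below takes the two token counts quoted in the comparison and computes the reduction. The page count and the per-token price are purely hypothetical placeholders, so substitute your own numbers.

```python
def token_reduction(tokens_before: int, tokens_after: int) -> float:
    """Return the percentage of tokens removed by the conversion."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return (tokens_before - tokens_after) / tokens_before * 100


def estimated_savings(pages: int, tokens_before: int, tokens_after: int,
                      usd_per_million_tokens: float) -> float:
    """Estimate dollars saved across `pages` documents of similar size."""
    tokens_saved = (tokens_before - tokens_after) * pages
    return tokens_saved / 1_000_000 * usd_per_million_tokens


# Token counts taken from the HTML vs. Markdown comparison.
print(f"{token_reduction(218, 24):.0f}% fewer tokens")

# Hypothetical workload: 5 million pages at an assumed $10 per million tokens.
print(f"${estimated_savings(5_000_000, 218, 24, 10.0):,.2f} saved")
```

Real savings depend on your tokenizer, your model's pricing, and how much markup each page carries, so treat this as a back-of-the-envelope estimate.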
The Challenge: Dynamic Content
You might think, “I’ll just use a library like Turndown (or BeautifulSoup with some custom conversion code) to convert HTML to Markdown.”
That works for static blogs. But modern web apps (React, Vue, Next.js) generate content dynamically via JavaScript. If you just fetch the HTML source code, you often get an empty div.
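To see why, consider what a plain HTTP fetch of a client-rendered page typically returns. The sketch below uses a made-up application shell (the markup and the `visible_text` helper are illustrative, not taken from any real site) and extracts the text a static parser would find.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical response body for a page that renders itself in the browser.
SPA_SHELL = """
<html>
  <head><script src="/static/bundle.js"></script></head>
  <body><div id="root"></div></body>
</html>
"""

# The product details only exist after bundle.js runs, so nothing is found.
print(repr(visible_text(SPA_SHELL)))
```

Any HTML-to-Markdown converter fed this shell has nothing to convert, no matter how good its conversion rules are.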
This is where SearchCans Reader API shines. It acts as a headless browser that:
- Executes JavaScript to fully render the page.
- Cleans the DOM (removes ads, navbars, footers).
- Converts to Markdown automatically.
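To make the cleaning and conversion steps concrete, here is a minimal sketch that applies them to HTML that has already been rendered. It is not how the Reader API is implemented. The tag lists, the `HtmlToMarkdown` class, and the sample markup are all assumptions chosen for illustration, and it handles only headings, list items, paragraphs, and links.

```python
from html.parser import HTMLParser

# Subtrees treated as noise in this sketch (an assumed, incomplete list).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}
PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- "}
BLOCK_TAGS = {"div", "p", "ul", "ol", "section", "article"} | set(PREFIXES)


class HtmlToMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []       # finished Markdown blocks
        self.fragments = []   # text pieces of the block being built
        self.prefix = ""      # "## ", "- ", ... for the current block
        self.skip_depth = 0   # > 0 while inside a noise subtree
        self.href = None      # target of the link currently open, if any
        self.link_text = []

    def flush(self):
        text = " ".join(self.fragments)
        if text:
            self.lines.append(self.prefix + text)
        self.fragments, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in BLOCK_TAGS:
            self.flush()
            self.prefix = PREFIXES.get(tag, "")
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.link_text = []

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "a" and self.href is not None:
            self.fragments.append(f"[{' '.join(self.link_text)}]({self.href})")
            self.href = None
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if self.skip_depth or not text:
            return
        target = self.link_text if self.href is not None else self.fragments
        target.append(text)


def html_to_markdown(html: str) -> str:
    parser = HtmlToMarkdown()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.lines)


# Hypothetical rendered markup, modelled on the product card shown earlier.
RENDERED_HTML = """
<nav><a href="/">Home</a></nav>
<div class="product-card">
  <h2 class="text-xl font-bold">Wireless Headphones</h2>
  <div class="rating">4.5/5</div>
  <p class="desc">Noise cancelling, 20h battery life.</p>
  <a href="/buy" class="btn btn-primary">Buy Now</a>
</div>
<script>trackPageView();</script>
<footer>Copyright notice</footer>
"""

print(html_to_markdown(RENDERED_HTML))
```

Even this toy version needs decisions about which tags count as noise and how blocks are separated, and it still cannot execute JavaScript, which is the part a headless browser adds.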
Implementation: Building a Cleaning Pipeline
Here is how to automate this process using Python. We will transform a raw API response into clean Markdown ready for embedding.
Step 1: Fetch and Convert
Instead of writing complex regex to parse HTML, we just call the Reader API.
import requests

def clean_url_data(target_url):
    # We use the Reader API with 'use_browser=true'
    # This ensures we get the rendered state, not just source code
    api_url = "https://www.searchcans.com/api/url"
    headers = {"Authorization": "Bearer YOUR_SEARCHCANS_KEY"}
    params = {
        "url": target_url,
        "use_browser": "true"
    }
    response = requests.get(api_url, headers=headers, params=params)
    # The API returns pure Markdown
    return response.text
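Calls to a rendering service can fail transiently, so it may be worth wrapping the fetch in a retry. Below is a minimal sketch of a retry helper with exponential backoff. `with_retries` and its parameters are hypothetical names introduced here, not part of any SearchCans SDK, and the demo uses a stand-in fetcher rather than a real network call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # in real code, narrow this to network errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


# Demo with a stand-in fetcher that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "# Page title"


# The no-op sleep keeps the demo instant; drop it to get real delays.
result = with_retries(flaky_fetch, attempts=3, sleep=lambda _: None)
print(result, calls["count"])
```

With the fetch function above, usage would look like `with_retries(lambda: clean_url_data(url))`. For HTTP error statuses to trigger a retry, the fetch function would also need to raise on failure, for example by calling `response.raise_for_status()` before returning.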
Step 2: Post-Processing (Optional)
Sometimes you want to clean the Markdown even further—for example, removing images if your LLM is text-only.
import re

def remove_images(markdown_text):
    # Regex to remove ![alt text](image-url) patterns
    return re.sub(r'!\[.*?\]\(.*?\)', '', markdown_text)

# Usage
raw_markdown = clean_url_data("https://www.example.com/product")
clean_text = remove_images(raw_markdown)
print(f"--- Ready for Vector DB ---\n{clean_text[:200]}")
Why Not JSON?
We often see developers trying to convert scraped data into JSON structures like {"title": "...", "body": "..."}.
While JSON is great for code, LLMs are trained on text. They understand the flow of a document better when it looks like a document. Markdown mimics the natural reading order and hierarchy that models like GPT-4 favor.
By standardizing your pipeline on Markdown, you ensure:
- Higher Retrieval Accuracy: Headers (##) naturally chunk content for vector search.
- Lower Costs: Fewer tokens per document.
- Better Debugging: You can read the data yourself.
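The chunking point can be sketched in a few lines. This is one simple approach, assuming that splitting at second-level headers suits your documents. `chunk_by_headers` is a hypothetical helper, and it does not account for `##` lines that appear inside fenced code blocks.

```python
import re


def chunk_by_headers(markdown_text: str, level: int = 2) -> list[str]:
    """Split Markdown into chunks that each start at a header of `level`."""
    # Zero-width match at the start of every line that begins with e.g. "## ".
    pattern = re.compile(rf"^(?=#{{{level}}} )", flags=re.MULTILINE)
    chunks = [part.strip() for part in pattern.split(markdown_text)]
    return [chunk for chunk in chunks if chunk]


# Made-up document for illustration.
doc = (
    "# Headphones\nIntro text.\n\n"
    "## Specs\n- 20h battery\n\n"
    "## Reviews\nGreat sound.\n### Detail\nMore."
)

for chunk in chunk_by_headers(doc):
    print("--- chunk ---")
    print(chunk)
```

Each chunk keeps its own header, so the embedded text carries the section title along with the body. Deeper headers such as `###` stay inside their parent chunk.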
Conclusion
Data cleaning isn’t just about removing bad characters; it’s about reformatting for intelligence.
SearchCans Reader API abstracts away the complexity of headless browsers and HTML parsing, giving you a clean, standardized Markdown stream. It is the plumbing your AI agent deserves.
Resources
Related Topics:
- Building Advanced RAG with Real-Time Data - How to use this Markdown in RAG.
- SERP API vs Web Scraping Comparison - Choosing the right tools.
Get Started:
- Free Trial - Get 100 free credits
- API Documentation - See the use_browser parameter
- Pricing - Cost-effective data cleaning
- Playground - Convert a URL to Markdown now
SearchCans provides real-time data for AI agents. Start building now →