RAG Data Ingestion: Build a Python Vector ETL Pipeline with SearchCans

Garbage In, Garbage Out. Learn how to build a robust RAG ingestion pipeline that cleans web data using the SearchCans Reader API before embedding it into your Vector DB.

In the rush to build GenAI applications, engineers often obsess over the model (“Should I use GPT-4o or Claude 3.5?”) or the database (“Pinecone vs. Qdrant?”).

But they often ignore the most critical factor: data quality.

The “Garbage In, Garbage Out” principle is ruthless in RAG (Retrieval-Augmented Generation). If you scrape a webpage and embed the navigation bar, footer links, and cookie banners, your Vector Database becomes polluted. Your expensive LLM will retrieve irrelevant chunks (“Privacy Policy” or “Contact Us”) instead of the core knowledge.

To fix this, you need a Vector ETL (Extract, Transform, Load) Pipeline.

In this guide, we will build a robust ingestion pipeline using the SearchCans Reader API as the transformation layer. We will turn messy URLs into high-quality, embedding-ready vectors.

The Anatomy of a Vector ETL Pipeline

A production-grade RAG pipeline has four distinct stages:

  1. Extract (Scrape): Fetch the raw HTML from the target URL.
  2. Transform (Clean): (Crucial Step) Strip noise, format tables, and convert to Markdown.
  3. Chunk: Split the clean text into overlapping segments.
  4. Load (Embed & Upsert): Generate vectors and store them in the DB.

Most tutorials suggest basic tools like BeautifulSoup for Step 2. This is brittle: websites change, and hand-writing a parser for every site doesn't scale. We need an API that generalizes the cleanup.

Why SearchCans for the “Clean” Step?

The SearchCans Reader API is designed specifically for LLM ingestion. It uses a headless browser to render dynamic JavaScript content and then applies intelligent parsing to return Semantic Markdown.

  • vs. Raw HTML: Reduces token usage by ~70% and removes non-semantic tags (see the token-count sketch below).
  • vs. Text Only: Preserves structure like headers (#) and tables, which are vital for accurate chunking.
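
To see what that saving means in practice, you can compare token counts for the raw HTML and the cleaned Markdown of the same page. The sketch below assumes the tiktoken package is installed; the placeholder strings stand in for real page content.

import tiktoken

# cl100k_base is the encoding used by OpenAI's current embedding models
enc = tiktoken.get_encoding("cl100k_base")

def token_count(text):
    return len(enc.encode(text))

raw_html = "<html>...</html>"        # raw page source (placeholder)
clean_md = "# Title\n\nBody text"    # SearchCans Reader output (placeholder)

savings = 1 - token_count(clean_md) / token_count(raw_html)
print(f"Token savings: {savings:.0%}")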

Tutorial: Building the Pipeline in Python

Let’s write a script that takes a URL, cleans it with SearchCans, and prepares it for a Vector DB (like Qdrant).

Prerequisites

pip install requests openai qdrant-client

Step 1: The “Extract & Clean” Function

We use SearchCans to combine extraction and cleaning into one API call.

import requests

def fetch_clean_content(url):
    print(f"📥 Ingesting: {url}")
    
    # SearchCans Reader API Endpoint
    api_url = "https://www.searchcans.com/api/url"
    api_key = "YOUR_SEARCHCANS_KEY"
    
    headers = {"Authorization": f"Bearer {api_key}"}
    
    # 'b=true' handles dynamic JS sites
    params = {
        "url": url,
        "b": "true", 
        "w": 2000
    }
    
    try:
        resp = requests.get(api_url, headers=headers, params=params, timeout=60)
        resp.raise_for_status()  # surface non-200 responses instead of silently parsing an error body
        data = resp.json()
        
        # The 'markdown' field contains the cleaned, main content
        return data.get("markdown", "")
        
    except Exception as e:
        print(f"Error fetching data: {e}")
        return None

Step 2: The “Chunk” Function

Now that we have clean Markdown, we can split it. Markdown headers make excellent natural break points.

def chunk_text(text, chunk_size=500, overlap=50):
    """
    A simple character-based splitter for demonstration.
    In production, use LangChain's RecursiveCharacterTextSplitter.
    """
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks
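
To actually use those header break points, a header-aware splitter beats a plain character split. Here is a sketch assuming the langchain-text-splitters package: it splits by Markdown section first, then caps each section at the chunk size.

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
char_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

def chunk_markdown(markdown):
    # Split into header-scoped Documents, then enforce the size limit
    sections = md_splitter.split_text(markdown)
    return char_splitter.split_documents(sections)

Note that these splitters return Document objects (text plus header metadata), so pass chunk.page_content to the embedding step.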

Step 3: The “Embed & Upsert” Function

With clean chunks in hand, we generate an embedding for each one and upsert the vectors into Qdrant.

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Initialize clients
openai_client = OpenAI(api_key="YOUR_OPENAI_KEY")
qdrant_client = QdrantClient(url="http://localhost:6333")

def embed_and_store(chunks, collection_name="knowledge_base"):
    print(f"🧠 Embedding {len(chunks)} chunks...")
    
    points = []
    for idx, chunk in enumerate(chunks):
        # Generate embedding (the endpoint also accepts a list of inputs; batch in production to cut request overhead)
        response = openai_client.embeddings.create(
            input=chunk,
            model="text-embedding-3-small"
        )
        vector = response.data[0].embedding
        
        # Create point for Qdrant
        point = PointStruct(
            id=idx,
            vector=vector,
            payload={"text": chunk}
        )
        points.append(point)
    
    # Upsert to Qdrant
    qdrant_client.upsert(
        collection_name=collection_name,
        points=points
    )
    
    print("�?Data successfully upserted to Vector DB.")

Step 4: Run the Pipeline

if __name__ == "__main__":
    target_url = "https://en.wikipedia.org/wiki/Retrieval-augmented_generation"
    
    # 1. Clean
    clean_markdown = fetch_clean_content(target_url)
    
    if clean_markdown:
        print(f"📄 Cleaned Data Size: {len(clean_markdown)} chars")
        
        # 2. Chunk
        chunks = chunk_text(clean_markdown)
        print(f"📦 Created {len(chunks)} chunks")
        
        # 3. Store
        embed_and_store(chunks)

Production Enhancements

Metadata Enrichment

Add metadata to improve retrieval:

from datetime import datetime

point = PointStruct(
    id=idx,
    vector=vector,
    payload={
        "text": chunk,
        "source_url": target_url,
        "chunk_index": idx,
        "timestamp": datetime.now().isoformat()
    }
)
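
The payoff comes at query time: metadata lets you filter retrieval, for example restricting results to a single source URL. A sketch reusing the clients from Step 3 (the query text is illustrative; newer qdrant-client versions also offer query_points):

from qdrant_client.models import Filter, FieldCondition, MatchValue

query_vector = openai_client.embeddings.create(
    input="What is RAG?",
    model="text-embedding-3-small"
).data[0].embedding

hits = qdrant_client.search(
    collection_name="knowledge_base",
    query_vector=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="source_url", match=MatchValue(value=target_url))]
    ),
    limit=5,
)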

Batch Processing

Process multiple URLs concurrently (the async fetch helper is sketched after this block):

import asyncio
import aiohttp

async def process_urls_batch(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_clean_content_async(session, url) for url in urls]
        contents = await asyncio.gather(*tasks)
        
        all_chunks = []
        for content in contents:
            if content:
                all_chunks.extend(chunk_text(content))
        
        embed_and_store(all_chunks)
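
The fetch_clean_content_async helper referenced above is the aiohttp counterpart of Step 1. A sketch, assuming the same endpoint and parameters (aiohttp expects string query values):

async def fetch_clean_content_async(session, url):
    api_url = "https://www.searchcans.com/api/url"
    headers = {"Authorization": "Bearer YOUR_SEARCHCANS_KEY"}
    params = {"url": url, "b": "true", "w": "2000"}
    
    try:
        async with session.get(api_url, headers=headers, params=params) as resp:
            data = await resp.json()
            return data.get("markdown", "")
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None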

Common Pitfalls to Avoid

  1. Embedding Raw HTML: Wastes ~70% of your token budget on markup
  2. No Deduplication: Embedding the same content multiple times (a simple content-hash guard is sketched after this list)
  3. Poor Chunking: Splitting mid-sentence or mid-paragraph
  4. Ignoring Metadata: Making it impossible to trace sources
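
For pitfall 2, a content hash is usually enough to catch exact repeats before you pay for embeddings. A minimal sketch:

import hashlib

def dedupe_chunks(chunks):
    seen = set()
    unique = []
    for chunk in chunks:
        # Normalize lightly so trivial whitespace/case differences still match
        digest = hashlib.sha256(chunk.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

Call it between chunking and embedding: chunks = dedupe_chunks(chunks).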

Monitoring Data Quality

Track ingestion metrics:

def calculate_quality_score(markdown_text):
    # Simple heuristics
    has_headers = '##' in markdown_text
    has_paragraphs = '\n\n' in markdown_text
    low_html_tags = markdown_text.count('<') < 10
    
    score = sum([has_headers, has_paragraphs, low_html_tags])
    return score / 3.0

# Use before embedding
quality = calculate_quality_score(clean_markdown)
if quality < 0.6:
    print("⚠️ Low quality content detected. Review before embedding.")

Conclusion

Building a RAG system without a proper cleaning layer is like building a house on a swamp. No matter how expensive your materials (LLMs) are, the foundation will sink.

By integrating SearchCans into your Vector ETL pipeline, you automate the hardest part of data engineering: turning the messy web into clean, structured knowledge.

