File Extraction API
Convert PDF, DOCX, XLSX, PPTX and other documents to clean, structured Markdown with one POST request. Headings, tables, lists, and code blocks are all preserved — built for RAG pipelines, LLM ingestion, and document automation workflows.
PDF to Markdown: Live Output Demo
Adobe PDF sample
# PDF BOOKMARK SAMPLE
**Sample Date:** May 2001
**Prepared by:** Accelio Present Applied Technology
**Features Demonstrated:**
- Primary bookmarks in a PDF file.
- Secondary bookmarks in a PDF file.
### Overview
This sample consists of a simple form containing four distinct
fields. The data file contains eight separate records.
By default, the data file will produce a PDF file containing
eight separate pages. The selective use of the bookmark file
will produce the same PDF with a separate pane containing
bookmarks.
### Sample Files
| Filename | Description |
|---------------------|---------------------------------|
| ap_bookmark.IFD | The template design. |
| ap_bookmark.mdf | Template targeted for PDF output|
| ap_bookmark.dat | Sample data file in DAT format. |
| ap_bookmark.bmk | Sample bookmark file. |
| ap_bookmark.pdf | Sample PDF output. |
Source: c4611_sample_explain.pdf, an official Adobe sample document, parsed via the File Extraction API.
Supported Formats
API Endpoint
https://www.searchcans.com/api/v1/url
Code Examples
curl -X POST "https://www.searchcans.com/api/v1/url" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "t": "url",
    "s": "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    "file": 1
  }'
Example Response
{
  "code": 0,
  "data": {
    "id": "q_2061257715480530944_065651eb-1b1d-44ee-8496-c12e5b220cea",
    "markdown": "",
    "title": "",
    "description": "",
    "fileUrl": "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    "fileMarkdown": "# **PDF BOOKMARK SAMPLE**\n\n**Sample Date:** May 2001\n\n**Prepared by:** Accelio Present Applied Technology\n\n**Created and Tested Using:** - Accelio Present Central 5.4\n\n - Accelio Present Output Designer 5.4\n\n**Features Demonstrated:** - Primary bookmarks i… (721 chars total)"
  },
  "msg": "success",
  "requestId": null
}
Request Body (JSON)
| Parameter | Type | Description | Example |
|---|---|---|---|
| t | string | API type. Always "url" for the File Extraction API. | "url" |
| s | string | URL pointing to the document to extract (PDF, DOCX, DOC, XLSX, XLS, PPTX, EPUB, etc.). | "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf" |
| file | integer | Set to 1 to trigger file parsing. The URL must point directly to a document file. | 1 |
| proxy | integer | Proxy tier for auth-protected or CDN-gated file URLs: 0 = none (default), 1 = Shared Pool (+2 credits), 2 = Datacenter (+5 credits), 3 = Residential (+10 credits). | 0 |
| d | integer | Request timeout in milliseconds (default: 30000). Use 60000+ for large documents. | 60000 |
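The parameters above can be assembled into a request from Python as well as curl. The following is a minimal sketch using only the standard library. The helper names `build_extraction_request` and `extract`, their argument names, and the extra five seconds of client-side timeout are illustrative assumptions, not part of an official SDK; the validation simply mirrors the values listed in the table.

```python
import json
import urllib.request

ENDPOINT = "https://www.searchcans.com/api/v1/url"


def build_extraction_request(api_key, file_url, proxy=0, timeout_ms=30000):
    """Build (but do not send) a POST request for the File Extraction API."""
    if proxy not in (0, 1, 2, 3):
        raise ValueError("proxy must be 0, 1, 2, or 3")
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    payload = {
        "t": "url",        # API type, always "url"
        "s": file_url,     # direct URL of the document to extract
        "file": 1,         # trigger file parsing
        "proxy": proxy,    # proxy tier, 0 = none
        "d": timeout_ms,   # request timeout in milliseconds
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract(api_key, file_url, proxy=0, timeout_ms=30000):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_extraction_request(api_key, file_url, proxy, timeout_ms)
    # Assumption: give the HTTP client slightly longer than the API timeout.
    client_timeout = timeout_ms / 1000 + 5
    with urllib.request.urlopen(request, timeout=client_timeout) as response:
        return json.load(response)
```

Calling `extract("YOUR_API_KEY", document_url, timeout_ms=60000)` would perform the same request as the curl example, with the longer timeout suggested for large documents.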
API Response Fields
The File Extraction API returns document content as structured Markdown, preserving all headings, tables, lists, and code blocks from the source file.
fileMarkdown
Full document content as clean Markdown. Headings, tables, lists, and inline code are preserved from the original file structure.
fileUrl
The original document URL that was submitted in the request — echoed back for correlation in batch processing pipelines.
id
Unique request ID for tracing and support. Format: q_{uid}_{uuid}, for example q_2061257715480530944_065651eb-1b1d-44ee-8496-c12e5b220cea.
title / description
Document metadata when available from the file's internal properties. May be empty for documents without embedded metadata.
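The fields described above can be read out of the decoded response with a small helper. This is a sketch under two assumptions: that `code` equal to 0 signals success (as in the example response), and that empty `title` or `description` strings mean the metadata is absent. `ExtractedDocument` and `parse_extraction_response` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedDocument:
    request_id: str
    file_url: str
    markdown: str
    title: Optional[str]
    description: Optional[str]


def parse_extraction_response(response: dict) -> ExtractedDocument:
    """Turn a decoded API response into an ExtractedDocument."""
    # Assumption: any non-zero code is treated as a failure.
    if response.get("code") != 0:
        raise RuntimeError(f"extraction failed: {response.get('msg')!r}")
    data = response.get("data") or {}
    return ExtractedDocument(
        request_id=data.get("id", ""),
        file_url=data.get("fileUrl", ""),
        markdown=data.get("fileMarkdown", ""),
        # Empty strings are normalised to None for missing metadata.
        title=data.get("title") or None,
        description=data.get("description") or None,
    )
```

Keeping `file_url` and `request_id` on the parsed object makes it straightforward to correlate results with inputs when many documents are processed in a batch.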
Document Parsing for RAG Pipelines
The File Extraction API is built for LLM ingestion workflows. Pass the returned fileMarkdown directly into a vector store or chunking pipeline — no post-processing needed.
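Because headings are preserved in fileMarkdown, one simple way to feed it into a chunking pipeline is to split at heading lines. The sketch below is one possible approach, not a method prescribed by the API; `chunk_by_heading` is a hypothetical helper, and it makes the simplifying assumption that every line starting with one to six `#` characters followed by whitespace is a heading.

```python
import re

# ATX-style Markdown heading: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_heading(markdown):
    """Split Markdown into chunks, starting a new chunk at each heading line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A heading closes the chunk collected so far and starts a new one.
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only blank lines.
    return [chunk for chunk in chunks if chunk]
```

Each chunk keeps its heading as the first line, which gives an embedding model some context about the section. A line beginning with `#` inside a code block would also be treated as a heading here, so documents with embedded code may need a more careful splitter.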
For documents hosted behind authentication (AWS S3 presigned URLs, SharePoint, CDN-gated files), use the proxy parameter to route the request through the datacenter ("proxy": 2) or residential ("proxy": 3) tier. Set "d": 60000 or higher for large documents to avoid timeout errors.
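Since higher proxy tiers cost more credits, one option is to start without a proxy and escalate only when a request fails. The following sketch illustrates that idea; `fetch_with_escalation` and the caller-supplied `send` function are hypothetical. It assumes a non-zero `code` means the request failed, and the page does not say whether failed attempts consume credits, so check that before relying on this pattern.

```python
def fetch_with_escalation(send, file_url, tiers=(0, 2, 3), timeout_ms=60000):
    """Try each proxy tier in order until one request succeeds.

    `send` is a caller-supplied function that takes the JSON payload (a dict),
    performs the POST, and returns the decoded response dict.
    """
    last_response = None
    for tier in tiers:
        payload = {
            "t": "url",
            "s": file_url,
            "file": 1,
            "proxy": tier,    # 0 = none, 2 = datacenter, 3 = residential
            "d": timeout_ms,  # longer timeout for large documents
        }
        last_response = send(payload)
        # Assumption: code == 0 signals success, as in the example response.
        if last_response.get("code") == 0:
            return tier, last_response
    raise RuntimeError(f"all proxy tiers failed; last response: {last_response!r}")
```

Passing the transport in as `send` keeps the escalation logic independent of the HTTP client and makes it easy to exercise with a stub.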
File Extraction vs Reader API
| Feature | File Extraction API | Reader API |
|---|---|---|
| Input | Document file URL (PDF, DOCX, XLSX…) | Any web page URL |
| Output field | fileMarkdown | markdown |
| JS rendering | Not required | Optional via mode=1 |
| Key parameter | file: 1 | (default behavior) |
| Same endpoint? | Yes — both use POST /api/v1/url | |
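Because both APIs share one endpoint, client code has to pick the right output field based on what it requested. A small sketch of that choice, following the table above; `select_markdown` is a hypothetical helper, and it assumes the response carries the same `data` object shape shown in the example response.

```python
def select_markdown(payload, response):
    """Return the Markdown field that matches the request that was sent."""
    # file: 1 means File Extraction, so the content is in fileMarkdown;
    # otherwise the request was a Reader API call and the content is in markdown.
    field = "fileMarkdown" if payload.get("file") == 1 else "markdown"
    data = response.get("data") or {}
    return data.get(field, "")
```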
Start Building Free
Start extracting documents immediately. 100 free credits on sign up.