Build a scalable document ingestion pipeline for our RAG system.

## Document Sources

{{document_sources}}

## Document Types

{{document_types}}

## Volume Expectations

{{volume_expectations}}

Implement a complete pipeline:

```python
class IngestionPipeline:
    async def ingest(self, source: DocumentSource) -> IngestionResult:
        """
        1. Document extraction (PDF, DOCX, HTML, etc.)
        2. Content cleaning and normalization
        3. Chunking with metadata preservation
        4. Embedding generation (batched)
        5. Vector store upsert
        6. Index refresh
        """
        pass
```

Include:

- Parallel processing architecture
- Error handling and retry logic
- Progress tracking and resumability
- Deduplication strategy
- Incremental update support
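As a point of reference for step 3 (chunking with metadata preservation), here is a minimal sketch. The `Chunk` dataclass, the `chunk_text` helper, and the window/overlap parameters are illustrative assumptions, not part of the prompt itself:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_text(text: str, metadata: dict, size: int = 200, overlap: int = 50) -> list[Chunk]:
    """Split text into overlapping fixed-size windows, copying the
    source document's metadata onto every chunk so provenance survives
    embedding and vector-store upsert."""
    chunks = []
    step = size - overlap
    for i, start in enumerate(range(0, max(len(text) - overlap, 1), step)):
        piece = text[start:start + size]
        # Each chunk carries the parent metadata plus its own position.
        chunks.append(Chunk(piece, {**metadata, "chunk_index": i, "char_start": start}))
    return chunks
```

Carrying `chunk_index` and `char_start` in the metadata lets the dedup and incremental-update stages address individual chunks without re-reading the source document.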
Document Ingestion Pipeline
Build a scalable document ingestion pipeline with extraction, chunking, embedding generation, and vector storage with parallel processing and error recovery.
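The error-recovery requirement can be sketched as exponential backoff with jitter around any async stage (extraction, embedding, upsert). `with_retries` and its parameters are an illustrative assumption, not an API the prompt mandates:

```python
import asyncio
import random

async def with_retries(coro_factory, max_attempts: int = 3, base_delay: float = 0.01):
    """Run an async operation, retrying transient failures with
    exponential backoff plus jitter. Re-raises on the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts:
                raise
            # Double the delay each attempt; jitter avoids thundering herds.
            delay = base_delay * 2 ** (attempt - 1) * (1 + random.random())
            await asyncio.sleep(delay)
```

A factory (`lambda: flaky()`) rather than a coroutine object is passed in so each retry gets a fresh awaitable.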
Category: Coding

Use Cases: Document ingestion, Data pipeline, RAG infrastructure

Works Best With: claude-sonnet-4-20250514, gpt-4o