# Document Deduplication Pipeline

Build a document deduplication pipeline for our RAG corpus.

## Corpus Statistics

{{corpus_stats}}

## Deduplication Goals

{{dedup_goals}}

## Resource Constraints

{{resource_constraints}}

Implement deduplication:

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Document:
    """Minimal stand-in for the corpus record type; real documents may carry more metadata."""
    doc_id: str
    text: str


class DocumentDeduplicator:
    def find_exact_duplicates(self, documents: List[Document]) -> List[Set[str]]:
        """Hash-based exact matching."""
        pass

    def find_near_duplicates(self, documents: List[Document], threshold: float) -> List[Set[str]]:
        """MinHash/LSH for near-duplicates."""
        pass

    def find_semantic_duplicates(self, documents: List[Document], threshold: float) -> List[Set[str]]:
        """Embedding-based similarity."""
        pass

    def merge_strategy(self, duplicate_set: Set[str]) -> Document:
        """Select best representative or merge."""
        pass
```

Include:

- Scalable implementation (100M+ docs)
- Quality preservation
- Incremental deduplication
- Audit trail
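## Example Sketches

The prompt leaves each method to the implementer; the sketches below are illustrative, not the implementation the prompt expects. For `find_exact_duplicates`, a minimal approach hashes normalized text and groups identical digests. This sketch assumes documents arrive as a `doc_id -> text` mapping (an assumption of the sketch, not part of the prompt):

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set


def find_exact_duplicates(docs: Dict[str, str]) -> List[Set[str]]:
    """Group documents whose normalized text hashes to the same digest.

    Normalization here is a simple lowercase/whitespace collapse;
    the normalization rule is the main quality knob in practice.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].add(doc_id)
    # Only groups with more than one member are duplicate sets.
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    corpus = {
        "a": "Hello   world",
        "b": "hello world",
        "c": "Something else entirely",
    }
    print(find_exact_duplicates(corpus))  # [{'a', 'b'}]
```

SHA-256 over normalized text makes accidental collisions negligible, and the digest table streams well over 100M+ documents since only hashes, not texts, are kept in memory.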
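For `find_near_duplicates`, a common route is MinHash signatures bucketed with locality-sensitive hashing, which estimates Jaccard similarity over word shingles without pairwise comparison. This sketch assumes the third-party `datasketch` library; the shingle size, `num_perm`, and threshold are all tuning parameters:

```python
from typing import Dict, List, Set

from datasketch import MinHash, MinHashLSH  # third-party: pip install datasketch


def shingles(text: str, k: int = 3) -> Set[str]:
    """k-word shingles; k and the tokenizer are tuning knobs."""
    tokens = text.lower().split()
    return {" ".join(tokens[i : i + k]) for i in range(max(1, len(tokens) - k + 1))}


def find_near_duplicates(docs: Dict[str, str], threshold: float = 0.8) -> List[Set[str]]:
    """Bucket documents by estimated Jaccard similarity via MinHash + LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    sketches: Dict[str, MinHash] = {}
    for doc_id, text in docs.items():
        m = MinHash(num_perm=128)
        for s in shingles(text):
            m.update(s.encode("utf-8"))
        sketches[doc_id] = m
        lsh.insert(doc_id, m)

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for doc_id, m in sketches.items():
        if doc_id in seen:
            continue
        candidates = set(lsh.query(m))  # candidate set includes doc_id itself
        if len(candidates) > 1:
            groups.append(candidates)
            seen |= candidates
    return groups
```

LSH keeps candidate lookup near O(1) per document, which is what makes the 100M+ scale requirement plausible; a production pass would take the transitive closure of candidate pairs (e.g. with union-find) rather than trusting each query bucket as a final group.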
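For `find_semantic_duplicates`, cosine similarity over document embeddings catches paraphrases that hashing misses. This sketch assumes embeddings are precomputed by whatever encoder the pipeline uses and does a brute-force O(n^2) comparison; at 100M+ documents an approximate-nearest-neighbor index (e.g. FAISS) would replace the pairwise loop:

```python
from typing import Dict, List, Set

import numpy as np


def find_semantic_duplicates(
    embeddings: Dict[str, np.ndarray], threshold: float = 0.95
) -> List[Set[str]]:
    """Group documents whose embedding cosine similarity exceeds `threshold`."""
    ids = list(embeddings)
    mat = np.stack([embeddings[i] for i in ids])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit vectors
    sims = mat @ mat.T  # cosine similarity matrix

    # Union-find so that near-duplicate pairs merge into connected groups.
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            if sims[a, b] >= threshold:
                parent[find(ids[a])] = find(ids[b])

    groups: Dict[str, Set[str]] = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return [g for g in groups.values() if len(g) > 1]
```

A simple `merge_strategy` over such a group might keep the longest or most recently updated member and write the discarded ids to the audit trail; that record is also what lets incremental runs skip documents already collapsed in an earlier pass.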
## Details

A scalable document deduplication pipeline prompt using hash, MinHash/LSH, and embedding-based methods with merge strategies.

- Category: Coding
- Use cases: Deduplication, Corpus cleaning, Data quality
- Works best with: claude-sonnet-4-20250514, gpt-4o