# Rate Limiter for LLM APIs

Build a production-grade distributed rate limiter for LLM APIs with token buckets, priority queuing, and burst handling.

Implement a production-grade rate limiter for LLM API calls.

## Rate Limits
{{rate_limits}}

## Traffic Pattern
{{traffic_pattern}}

## Priority Tiers
{{priority_tiers}}

```python
class LLMRateLimiter:
    """
    Implement:
    - Token bucket algorithm
    - Per-model rate limits
    - Priority-based queuing
    - Burst handling
    - Distributed coordination
    """
    
    async def acquire(self, model: str, tokens: int, priority: int) -> bool:
        """Try to reserve `tokens` for `model`; return True if granted."""
        pass
    
    async def wait_for_capacity(self, model: str, tokens: int) -> float:
        """Returns estimated wait time"""
        pass
```
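A minimal single-process sketch of the token bucket core can anchor the spec. The class below is illustrative only (names like `TokenBucket`, `capacity`, and `refill_rate` are assumptions, not part of the interface above); it covers `acquire` and `wait_for_capacity` for one bucket, leaving per-model limits, priority queuing, and distributed coordination to the full implementation.

```python
import asyncio
import time


class TokenBucket:
    """Single-process token bucket (illustrative sketch, not the full spec)."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens held at once (burst size)
        self.refill_rate = refill_rate  # tokens replenished per second
        self.tokens = capacity
        self.updated_at = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        # Lazily add tokens accrued since the last call, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated_at) * self.refill_rate)
        self.updated_at = now

    async def acquire(self, tokens: int) -> bool:
        """Take `tokens` if available; refuse (without blocking) otherwise."""
        async with self._lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    async def wait_for_capacity(self, tokens: int) -> float:
        """Estimated seconds until `tokens` could be granted."""
        async with self._lock:
            self._refill()
            if tokens > self.capacity:
                return float("inf")  # request can never be satisfied
            return max(0.0, (tokens - self.tokens) / self.refill_rate)
```

Refilling lazily on each call, rather than with a background task, keeps the bucket cheap and avoids timer drift; `time.monotonic()` is used so wall-clock adjustments cannot corrupt the token count.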

Include:
- Redis-based distributed implementation (see the sketch after this list)
- Graceful degradation under pressure
- Metrics and alerting
- Client-side retry guidance
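For the distributed piece, one common shape is to keep each bucket's state in a Redis hash and update it atomically with a Lua script, so concurrent workers cannot double-spend tokens. Below is a minimal sketch, assuming redis-py's asyncio client (`redis.asyncio`); the key scheme (`ratelimit:{model}`) and the one-hour TTL are illustrative assumptions, not requirements.

```python
import time

import redis.asyncio as redis

# Atomic check-and-take: refill from elapsed time, then grant or refuse.
# KEYS[1] = bucket key; ARGV = capacity, refill rate, requested tokens, now.
LUA_ACQUIRE = """
local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local requested = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now
tokens = math.min(capacity, tokens + math.max(0, now - ts) * rate)
local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], 3600)  -- evict idle buckets after an hour
return allowed
"""


class RedisTokenBucket:
    """Distributed token bucket sketch; priority queuing and metrics omitted."""

    def __init__(self, client: redis.Redis, capacity: float, refill_rate: float):
        self.client = client
        self.capacity = capacity
        self.refill_rate = refill_rate

    async def acquire(self, model: str, tokens: int) -> bool:
        allowed = await self.client.eval(
            LUA_ACQUIRE, 1, f"ratelimit:{model}",
            self.capacity, self.refill_rate, tokens, time.time(),
        )
        return allowed == 1
```

Because the refill and the take happen inside one script, the read-modify-write is atomic across all workers. Passing the timestamp in from the client keeps the script deterministic, at the cost of trusting client clocks to be roughly in sync.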

## Details

- Category: Coding
- Use Cases: Rate limiting, API management, Traffic control
- Works Best With: claude-sonnet-4-20250514, gpt-4o