Understand how LLM Resayil enforces rate limits and quotas to ensure fair usage and service stability. Learn how to handle rate limit responses and implement backoff strategies.
LLM Resayil applies rate limits to prevent abuse and ensure fair access for all users. Limits are enforced per user ID per minute via Laravel RateLimiter.
Limits are calculated in UTC time. Your quota resets at the beginning of each minute. When you exceed a limit, you'll receive a 429 Too Many Requests response and must wait for the quota to reset before retrying.
Note: Admin users automatically bypass rate limits. Contact support if you need higher limits for legitimate use cases.
Your rate limits depend on your subscription tier. Here is a breakdown of limits for each tier:
| Tier | Requests/Min | Requests/Day | Max Tokens/Request |
|---|---|---|---|
| Basic | 10 | — | — |
| Pro | 30 | — | — |
| Enterprise | 60 | — | — |
When you exceed your rate limit, you'll receive an HTTP 429 Too Many Requests response.
The response includes a retry_after field indicating how many seconds to wait before retrying:
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 30
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1741305600

{
  "error": {
    "message": "Rate limit exceeded",
    "code": 429
  },
  "retry_after": 45
}
```
These HTTP headers are returned on every response so you can monitor usage and avoid hitting the limit proactively:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Your maximum requests allowed per minute (e.g., 30 on the Pro tier) |
| X-RateLimit-Remaining | Requests remaining in the current minute window |
| X-RateLimit-Reset | Unix timestamp when the quota window resets |
| retry_after | Seconds to wait before retrying (in the JSON response body) |
Monitor X-RateLimit-Remaining on every response to implement proactive client-side throttling:
```javascript
const response = await fetch(apiUrl, options);
const remaining = parseInt(response.headers.get('X-RateLimit-Remaining'), 10);
const limit = parseInt(response.headers.get('X-RateLimit-Limit'), 10);

if (remaining < limit * 0.2) {
  console.warn(`Only ${remaining} requests remaining in this window!`);
}
```
When you receive a 429 response, use the retry_after value from the JSON body to determine how long to wait. Never retry immediately; always wait at least the specified number of seconds.
```python
import time

import requests


def call_with_retry(url, payload, headers, max_retries=5):
    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)
        if response.status_code == 200:
            return response.json()
        if response.status_code == 429:
            data = response.json()
            retry_after = data.get("retry_after", 60)
            print(f"Rate limited. Waiting {retry_after}s (attempt {attempt + 1})")
            time.sleep(retry_after)
            continue
        response.raise_for_status()
    raise Exception("Max retries exceeded")
```
When rate limited, retry with exponential backoff plus jitter rather than retrying immediately. Backing off spreads retries over time, and the random jitter staggers retries across clients so they don't all hit the API at the same instant (the thundering herd problem):
```python
import random
import time

import requests


def make_api_call_with_backoff(api_url, data, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.post(api_url, json=data, timeout=10)
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429:
                # Prefer retry_after from the body; fall back to exponential backoff
                body = response.json()
                wait_time = body.get("retry_after") or ((2 ** attempt) + random.uniform(0, 1))
                print(f"Rate limited. Waiting {wait_time:.1f} seconds...")
                time.sleep(wait_time)
            else:
                response.raise_for_status()
        except Exception:
            if attempt < max_retries - 1:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
            else:
                raise
    raise Exception("Max retries exceeded")
```
Combine multiple requests into a single API call when possible. This counts as one request toward your quota while processing more data.
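As an illustration of this idea, the sketch below groups items into fixed-size batches so each group maps to a single API call. The `chunk` and `send_batch` names are illustrative, not part of the LLM Resayil API; check your API's actual batching support for the real request shape.

```python
def chunk(items, batch_size):
    """Split items into batches; each batch will become ONE API request."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batched(prompts, send_batch, batch_size=10):
    """Process all prompts, where send_batch(batch) performs one API call.

    With batch_size=10, 100 prompts cost 10 requests toward your quota
    instead of 100.
    """
    results = []
    for batch in chunk(prompts, batch_size):
        results.extend(send_batch(batch))  # one request per batch
    return results
```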
Do not rely solely on server-side limiting. Implement client-side throttling to stay below 80% of your quota:
```javascript
// Limit to 80% of max requests to stay safe
const MAX_SAFE_RATE = 0.8;
const maxRequestsPerMinute = 30; // Your tier limit (Pro)
const safeRate = maxRequestsPerMinute * MAX_SAFE_RATE; // 24 req/min
const delayBetweenRequests = 60000 / safeRate; // 2500ms
```
Cache API responses to avoid repeated requests for the same queries. This dramatically reduces API usage and keeps you well within rate limits.
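A minimal in-memory cache with a time-to-live (TTL) is often enough; identical queries within the TTL window return the stored result and cost zero quota. This is a sketch, not part of the API client; for production you might prefer a shared cache such as Redis.

```python
import time


class TTLCache:
    """Tiny in-memory cache: repeated queries within ttl_seconds skip the API."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired; caller should re-fetch
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())


def cached_call(cache, query, fetch):
    """fetch(query) hits the API; cache hits avoid the request entirely."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    result = fetch(query)
    cache.set(query, result)
    return result
```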
Spread requests over time rather than sending them all at once. This prevents burst limit violations while maintaining consistent throughput.
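One way to sketch this pacing (a client-side helper, not something the API provides) is to enforce a minimum interval between sends, derived from your per-minute budget:

```python
import time


class RequestPacer:
    """Enforce a minimum gap between requests to smooth out bursts."""

    def __init__(self, requests_per_minute):
        self.min_interval = 60.0 / requests_per_minute
        self._last_sent = None

    def wait(self):
        """Sleep just long enough to stay under the per-minute rate."""
        now = time.monotonic()
        if self._last_sent is not None:
            elapsed = now - self._last_sent
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self._last_sent = time.monotonic()
```

Call `pacer.wait()` immediately before each API request; a burst of calls is then automatically spread out at a steady rate instead of hitting the limit all at once.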
Set up alerts when X-RateLimit-Remaining drops below 20% of your limit.
This gives you early warning to take action before you start receiving 429 errors.
If your application legitimately needs higher rate limits, upgrade your subscription tier or contact support about enterprise options.