version: "1.0.0"
name: twilio-reliability-patterns
description: >
  Handle rate limits, retries, and failures when building on Twilio at scale. Covers 429 exponential backoff with jitter, per-number throughput limits, StatusCallback resilience, thin-receiver pattern, and fallback chains. Use this skill whenever sending messages or making calls at volume, or when building production-grade Twilio integrations.
Overview
Twilio enforces per-resource rate limits. At scale, 429 errors are expected behavior — not bugs. This skill teaches the patterns that prevent production failures: exponential backoff, throughput management, and resilient callback handling.
429 concurrency errors are not well documented — implement exponential backoff with ±10% jitter.
Prerequisites
- A working Twilio integration (any product)
- Understanding of your expected volume (messages/sec, calls/sec)
- StatusCallback URLs configured (see twilio-messaging-services and twilio-sms-send-message)
Key Patterns
1. Exponential Backoff with Jitter
When you receive a 429 (Too Many Requests), wait and retry. Naive fixed-interval retry creates thundering herds. Use exponential backoff with randomized jitter.
Python
    import random
    import time

    from twilio.base.exceptions import TwilioRestException


    def send_with_backoff(client, to, body, messaging_service_sid, max_retries=5):
        for attempt in range(max_retries):
            try:
                return client.messages.create(
                    to=to,
                    body=body,
                    messaging_service_sid=messaging_service_sid,
                    status_callback="https://yourapp.com/status",
                )
            except TwilioRestException as e:
                if e.status != 429:
                    raise  # Non-429 errors: don't retry, investigate
                # Exponential backoff: 100ms, 200ms, 400ms, 800ms, 1600ms
                base_delay = 0.1 * (2 ** attempt)
                # Add ±10% jitter to prevent thundering herd
                jitter = base_delay * 0.1 * (2 * random.random() - 1)
                delay = min(base_delay + jitter, 30)  # cap at 30 seconds
                time.sleep(delay)
        raise Exception(f"Failed after {max_retries} retries")
Node.js
    async function sendWithBackoff(client, to, body, messagingServiceSid, maxRetries = 5) {
      for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
          return await client.messages.create({
            to,
            body,
            messagingServiceSid,
            statusCallback: "https://yourapp.com/status",
          });
        } catch (err) {
          if (err.status === 429) {
            // Exponential backoff: 100ms, 200ms, 400ms, 800ms, 1600ms
            const baseDelay = 100 * Math.pow(2, attempt);
            // Add ±10% jitter
            const jitter = baseDelay * 0.1 * (2 * Math.random() - 1);
            const delay = Math.min(baseDelay + jitter, 30000); // cap at 30s
            await new Promise(r => setTimeout(r, delay));
          } else {
            throw err; // Non-429: don't retry
          }
        }
      }
      throw new Error(`Failed after ${maxRetries} retries`);
    }
Parameters:
- Initial delay: 100ms
- Multiplier: 2x per attempt
- Jitter: ±10% of base delay (randomized)
- Max delay: 30 seconds
- Max retries: 5 (base delays run from 100 ms to 1.6 s, about 3.1 seconds of waiting in total)
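The parameters above can be expressed as a small pure function, which makes the schedule easy to inspect without calling Twilio. This is a minimal sketch: `backoff_delay` and its argument names are hypothetical and not part of the Twilio SDK, and the defaults simply mirror the list above.

```python
import random


def backoff_delay(attempt, initial=0.1, multiplier=2.0, jitter=0.1,
                  max_delay=30.0, rng=random.random):
    """Return the sleep time in seconds before retry number `attempt` (0-based).

    Hypothetical helper; defaults mirror the parameter list above.
    """
    base = initial * (multiplier ** attempt)
    # rng() is in [0, 1), so the offset lies within +/- (jitter * base)
    offset = base * jitter * (2 * rng() - 1)
    return min(base + offset, max_delay)


def schedule(max_retries=5):
    """Return one sampled delay per attempt, in seconds."""
    return [backoff_delay(attempt) for attempt in range(max_retries)]
```

Passing a fixed `rng` (for example `lambda: 0.5`, which yields zero jitter) keeps the function deterministic in unit tests.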
2. Per-Number Throughput Limits
These limits are not prominently documented:
| Number type | SMS throughput | Voice throughput | Notes |
|---|---|---|---|
| Local (long code) | ~1 SMS/sec | 1 concurrent call | Lowest cost, lowest throughput |
| Toll-free | ~3 SMS/sec | - | Faster verification (3-5 days) |
| Short code | 10-100 SMS/sec | - | Highest throughput, 8-12 week provisioning, expensive |
| Messaging Service (pool) | Sum of all numbers in pool | - | Multiply throughput by adding numbers |
Throughput opacity: Sending velocity and queue depth are opaque — there is no dashboard showing messages per second. Use Messaging Services to multiply throughput by pooling numbers. A pool of 10 long codes = ~10 SMS/sec.
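Because there is no dashboard for this, a rough estimate can be computed from the table. The sketch below assumes the approximate per-number rates listed above and treats pool throughput as a simple sum; the function names are hypothetical, and short codes are left out because their rate varies between 10 and 100 SMS/sec.

```python
import math

# Approximate per-number SMS rates from the table above (messages/second).
SMS_RATE = {"long_code": 1.0, "toll_free": 3.0}


def pool_throughput(numbers):
    """Estimate pool throughput as the sum of per-number rates.

    `numbers` maps a number type to how many of that type are in the pool.
    """
    return sum(SMS_RATE[kind] * count for kind, count in numbers.items())


def estimated_send_seconds(recipient_count, numbers):
    """Lower bound on the time needed to send `recipient_count` messages."""
    return math.ceil(recipient_count / pool_throughput(numbers))
```

For example, `estimated_send_seconds(50000, {"long_code": 10})` gives 5000 seconds, so a 50,000-recipient send through ten long codes takes well over an hour at best.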
3. Bulk Send Pattern
For sending to large lists, use a rate-limited dispatch loop:
Python
    import asyncio
    from collections import deque


    async def bulk_send(client, recipients, body, messaging_service_sid, rate_per_second=10):
        """Send to a list of recipients with rate limiting and backoff."""
        queue = deque(recipients)
        results = []
        while queue:
            batch = [queue.popleft() for _ in range(min(rate_per_second, len(queue)))]
            for recipient in batch:
                try:
                    # send_with_backoff blocks (time.sleep), so run it off the event loop
                    msg = await asyncio.to_thread(
                        send_with_backoff, client, recipient, body, messaging_service_sid
                    )
                    results.append({"to": recipient, "sid": msg.sid, "status": "sent"})
                except Exception as e:
                    results.append({"to": recipient, "error": str(e), "status": "failed"})
            if queue:  # Don't sleep after last batch
                await asyncio.sleep(1)  # 1 second between batches
        return results
Key: Set rate_per_second based on your number pool size, not your desired speed. Sending faster than your pool supports just generates 429s.
Compliance: Before bulk sending, verify recipient consent (opt-in records), respect quiet hours, and implement maximum batch size limits. Monitor for anomalous send patterns that could indicate abuse.
4. StatusCallback Resilience
At scale, StatusCallbacks create their own load problem.
The math: 50 concurrent calls × 6 status events per call = 300 webhook invocations, which can arrive in bursts when calls start and end together. Twilio Functions allow 30 concurrent executions per service.
Thin-receiver pattern — receive, queue, respond immediately:
Node.js (Express)
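    const express = require("express");
    const { Queue, Worker } = require("bullmq");

    const app = express();
    app.use(express.urlencoded({ extended: false })); // Twilio sends form-encoded bodies

    // Assumption: Redis on localhost; adjust for your environment
    const connection = { host: "localhost", port: 6379 };
    const statusQueue = new Queue("twilio-status", { connection });

    // Thin receiver: accept callback, queue it, respond 200 immediately
    app.post("/status", async (req, res) => {
      await statusQueue.add("status-event", {
        callSid: req.body.CallSid,
        callStatus: req.body.CallStatus,
        timestamp: Date.now(),
      });
      res.sendStatus(200); // Respond FAST; Twilio will retry on timeout
    });

    // Process asynchronously
    const worker = new Worker("twilio-status", async (job) => {
      const { callSid, callStatus } = job.data;
      await updateDatabase(callSid, callStatus); // updateDatabase: your own persistence function
    }, { connection });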
Python (Flask + Celery)
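    from celery import Celery
    from flask import Flask, request

    app = Flask(__name__)
    # Assumption: Redis broker on localhost; adjust for your environment
    celery = Celery(__name__, broker="redis://localhost:6379/0")


    @celery.task
    def process_status(call_sid, call_status):
        update_database(call_sid, call_status)  # your own persistence function


    @app.route("/status", methods=["POST"])
    def status_callback():
        # Queue for async processing
        process_status.delay(
            call_sid=request.form["CallSid"],
            call_status=request.form["CallStatus"],
        )
        return "", 200  # Respond FAST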
Idempotency key: Use {CallSid}-{CallStatus} as a composite key. Twilio retries on timeout, which can cause duplicate callbacks. Deduplicate before processing.
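One way to apply the composite key is a small duplicate filter that the worker consults before touching the database. The sketch below is in-memory only and purely illustrative: `CallbackDeduplicator`, its TTL default, and the injectable clock are assumptions, and a shared store would be needed once several worker processes handle callbacks.

```python
import time


class CallbackDeduplicator:
    """In-memory duplicate filter keyed on {CallSid}-{CallStatus}.

    Hypothetical helper for illustration only.
    """

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # key -> time the key was first seen

    def is_new(self, call_sid, call_status):
        """Return True the first time a (CallSid, CallStatus) pair is seen."""
        now = self._clock()
        # Drop expired keys so the dict does not grow without bound
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        key = f"{call_sid}-{call_status}"
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

A worker would call `is_new(call_sid, call_status)` first and skip the database update when it returns `False`. Rebuilding the dict on every call is simple but linear in the number of keys, which is acceptable for a sketch and worth replacing at higher volumes.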
5. Fallback Chains
When delivery on one channel fails, escalate to the next:
Python
    from xml.sax.saxutils import escape


    def send_with_fallback(client, to, message, messaging_service_sid):
        """Try SMS → Voice → Email fallback chain."""
        # Try SMS first
        try:
            msg = client.messages.create(
                to=to,
                body=message,
                messaging_service_sid=messaging_service_sid,
                status_callback="https://yourapp.com/status",
            )
            # Wait for delivery confirmation via StatusCallback
            # If undelivered after timeout, fall through to voice
            return {"channel": "sms", "sid": msg.sid}
        except Exception:
            pass  # SMS failed, try voice

        # Fallback to voice
        try:
            call = client.calls.create(
                to=to,
                from_="+15551234567",
                # Escape the message so characters like & or < cannot break the TwiML
                twiml=f"<Response><Say>{escape(message)}</Say></Response>",
                status_callback="https://yourapp.com/call-status",
            )
            return {"channel": "voice", "sid": call.sid}
        except Exception:
            pass  # Voice failed, try email

        # Last resort: email
        # Use SendGrid (see twilio-sendgrid-email)
        return {"channel": "email", "status": "queued"}
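The chain above only falls through when the API call itself raises. Escalating on an undelivered message requires a decision made later, from the StatusCallback. A minimal sketch of that decision follows; the function name and the 60-second wait budget are assumptions, while `delivered`, `undelivered`, and `failed` are Twilio message status values.

```python
# Twilio message statuses that mean the SMS will not arrive
TERMINAL_FAILURES = {"undelivered", "failed"}


def should_escalate_sms(message_status, seconds_since_send, timeout_seconds=60):
    """Decide whether to move from SMS to the next channel.

    Hypothetical helper: `timeout_seconds` is an assumed wait budget.
    """
    if message_status == "delivered":
        return False
    if message_status in TERMINAL_FAILURES:
        return True
    # Still queued, sending, or sent: escalate only once the wait budget is spent
    return seconds_since_send >= timeout_seconds
```

The status worker could call this for each SMS callback, plus once from a timer when the budget expires, and place the voice call when it returns `True`.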
6. Voice Concurrency Limits
| Resource | Default limit | Notes |
|---|---|---|
| Concurrent calls per account | 1 (trial) / variable (paid) | Request increase via support |
| Calls per second (CPS) | 1 CPS (default) | Increase via support for outbound campaigns |
| Conference participants | 250 per conference | |
| Twilio Functions concurrent | 30 per service | Use thin-receiver pattern above |
For outbound campaigns, request CPS increase before launch — not during.
7. Webhook Timeout Handling
Twilio expects a response within 15 seconds for both voice and messaging webhooks. If your endpoint doesn't respond in time:
- Voice: Twilio hangs up or falls back to voiceFallbackUrl
- Messaging: Twilio retries the callback
Always configure fallback URLs:
    # On phone number configuration
    number = client.incoming_phone_numbers(phone_sid).update(
        voice_url="https://yourapp.com/voice",
        voice_fallback_url="https://yourapp.com/voice-fallback",  # backup endpoint
        sms_url="https://yourapp.com/sms",
        sms_fallback_url="https://yourapp.com/sms-fallback",
    )
Monitoring Checklist
Set up these alerts before going to production:
| Metric | Alert threshold | How to track |
|---|---|---|
| 429 error rate | > 5% of requests | Count 429s in your backoff handler |
| Delivery failure rate | > 2% of messages | StatusCallback failed/undelivered events |
| Webhook response time | > 5 seconds p95 | Your APM tool (DataDog, New Relic) |
| Queue depth | Growing over 5 minutes | Your message queue metrics |
| Concurrent calls | > 80% of limit | Twilio Usage API or Event Streams |
Twilio's built-in alerting systems are under-used — end-users often discover issues before developers do. Configure StatusCallbacks + Event Streams for delivery failure alerts on every integration.
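The first row of the checklist (429 error rate above 5%) can be tracked directly in the backoff handler. The following is a sketch under assumptions: `RateLimitMonitor`, the 5-minute window, and the injectable clock are all hypothetical, and a production setup would more likely emit a counter to an existing metrics system.

```python
import time
from collections import deque


class RateLimitMonitor:
    """Track the share of requests answered with 429 over a sliding window."""

    def __init__(self, window_seconds=300, threshold=0.05, clock=time.monotonic):
        self._window = window_seconds
        self._threshold = threshold
        self._clock = clock
        self._events = deque()  # (timestamp, was_429) pairs, oldest first

    def record(self, status_code):
        """Record the HTTP status of one Twilio API response."""
        self._events.append((self._clock(), status_code == 429))

    def error_rate(self):
        """Return the fraction of 429s among requests inside the window."""
        cutoff = self._clock() - self._window
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if not self._events:
            return 0.0
        return sum(1 for _, hit in self._events if hit) / len(self._events)

    def should_alert(self):
        """True when the 429 rate exceeds the threshold (5% by default)."""
        return self.error_rate() > self._threshold
```

Calling `record(429)` in the retry branch and `record(201)` on success is enough to feed it; `should_alert()` can then be polled by whatever alerting mechanism is already in place.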
CANNOT
- Cannot avoid 429 errors on any Twilio API — Backoff patterns apply to all APIs (Messaging, Voice, Verify, Lookup)
- Cannot increase per-number throughput — Add more numbers via Messaging Services instead
- Cannot configure StatusCallback retry behavior — Twilio retries on timeout automatically; not configurable
- Cannot exceed Twilio Functions limits — 30 concurrent executions/service, 10-second timeout, 256 MB memory
- Cannot use a native Twilio rate limiting API — You must implement rate limiting in your application
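Since rate limiting has to live in the application, a token bucket is one common way to implement it. The sketch below is not a Twilio API; `TokenBucket` and its parameters are hypothetical, and it is not thread-safe as written.

```python
import time


class TokenBucket:
    """Client-side rate limiter: allows `rate_per_second` sends on average."""

    def __init__(self, rate_per_second, capacity=None, clock=time.monotonic):
        self._rate = float(rate_per_second)
        # Burst size defaults to one second's worth of tokens
        self._capacity = float(capacity if capacity is not None else rate_per_second)
        self._clock = clock
        self._tokens = self._capacity
        self._last = clock()

    def try_acquire(self):
        """Take one token if available; return False when the caller should wait."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A sender would call `try_acquire()` before each API request and sleep briefly when it returns `False`. Setting `rate_per_second` to the pool's estimated throughput keeps requests under the limit instead of relying on 429s as the signal.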
Next Steps
- Messaging at scale: twilio-messaging-services
- Monitor delivery: twilio-sms-send-message (StatusCallbacks)
- Debug failures: twilio-debugging-observability
- Compliance for bulk sends: twilio-compliance-traffic