Wialon API Rate Limits and Resilient Sync Design
Rate limits are not an edge case. They are part of normal operation at scale. A sync design that ignores this will eventually miss business-critical events.
Understand Wialon rate limit behavior
The Wialon Remote API does not publish a formal rate limit specification the way cloud APIs from AWS or Google do. Instead, throttling is enforced per session and per account, and the behavior differs between Wialon Hosting and Wialon Local. On Hosting, the shared infrastructure means your account competes with others on the same server. During peak hours — typically 08:00 to 10:00 local time, when thousands of dashboards refresh simultaneously — response times degrade and requests start returning error code 6 (server error) more frequently.
Some endpoints are significantly heavier than others. Report execution (report/exec_report) is the most expensive operation because it triggers server-side aggregation across potentially millions of messages. Message loading (messages/load_interval) for large time windows is the second heaviest. Lightweight operations like core/search_items with small flag sets or unit/get_pos are comparatively cheap. A pipeline that treats all endpoints equally — applying the same polling interval and concurrency to report execution as to position queries — will exhaust its budget on heavy calls and starve the lightweight ones.
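One way to encode that difference is a per-endpoint policy table the scheduler consults before dispatching a call. The weights, intervals, and concurrency caps below are illustrative assumptions to be tuned against observed behavior, not values published by Wialon:

```python
# Illustrative scheduling policy: heavier endpoints get a higher weight,
# a longer minimum interval between calls, and a lower concurrency cap.
# None of these numbers come from Wialon documentation; tune per deployment.
ENDPOINT_POLICY = {
    "report/exec_report":     {"weight": 10, "min_interval_s": 300, "max_concurrent": 1},
    "messages/load_interval": {"weight": 5,  "min_interval_s": 120, "max_concurrent": 2},
    "core/search_items":      {"weight": 1,  "min_interval_s": 30,  "max_concurrent": 5},
    "unit/get_pos":           {"weight": 1,  "min_interval_s": 10,  "max_concurrent": 5},
}

def can_dispatch(endpoint: str, weight_spent_this_minute: int, weight_budget_per_minute: int = 60) -> bool:
    """Return True if dispatching this endpoint now stays within the per-minute weight budget."""
    policy = ENDPOINT_POLICY.get(endpoint, {"weight": 1})
    return weight_spent_this_minute + policy["weight"] <= weight_budget_per_minute
```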
On Wialon Local deployments, you have more control because the server is dedicated. However, resource limits still exist — the hardware has finite CPU and memory, and concurrent report executions can spike server load above the point where the web interface becomes unresponsive for dispatchers. Coordinate with the Wialon Local administrator to establish acceptable API load thresholds. A practical guideline is to limit concurrent report executions to 2 per server and total concurrent API sessions to 10.
Queue first, request second
The core architectural pattern for resilient Wialon sync is to decouple request generation from request execution. Every component that needs data from Wialon — the position poller, the trip extractor, the report generator, the geofence updater — publishes a request to a queue rather than calling the API directly. A pool of worker processes pulls from the queue and executes requests against the API within a controlled concurrency budget.
Implement priority lanes in the queue. Real-time requests (current vehicle positions for a live map) get a high-priority lane with reserved worker capacity. Batch requests (historical message extraction, report generation) get a normal-priority lane that yields to real-time traffic. This prevents a large batch job from blocking live dashboard updates. A simple two-lane model with 2 workers reserved for high-priority and 4 workers for normal-priority covers most deployment sizes.
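A minimal sketch of that two-lane model, using in-process queues and threads for brevity; a production deployment would back the lanes with the persistent queue recommended below, and execute_wialon_request is a hypothetical wrapper around the actual API call:

```python
import queue
import threading

def execute_wialon_request(req: dict) -> None:
    ...  # hypothetical wrapper: session handling, backoff, and circuit breaker live here

high_q: queue.Queue = queue.Queue()    # real-time lane: live positions for the map
normal_q: queue.Queue = queue.Queue()  # batch lane: history extraction, report generation

def reserved_worker() -> None:
    """Reserved capacity: serves only the high-priority lane."""
    while True:
        execute_wialon_request(high_q.get())

def shared_worker() -> None:
    """Drains the high-priority lane first, so batch traffic yields to real-time traffic."""
    while True:
        try:
            req = high_q.get_nowait()
        except queue.Empty:
            try:
                req = normal_q.get(timeout=1)
            except queue.Empty:
                continue
        execute_wialon_request(req)

for _ in range(2):   # 2 workers reserved for real-time requests
    threading.Thread(target=reserved_worker, daemon=True).start()
for _ in range(4):   # 4 workers for batch requests
    threading.Thread(target=shared_worker, daemon=True).start()
```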
Calculate your available request budget per interval. If you observe that Wialon reliably handles 60 requests per minute per session without degradation, and you have 3 active sessions, your budget is 180 requests per minute. Your queue depth monitoring should alert when the pending request count exceeds what your budget can drain in 5 minutes. If the queue depth grows faster than workers can drain it, you have a backpressure signal that should throttle request generation, not add more workers — more workers means more concurrent API load, which triggers more throttling.
- Use a persistent queue (PostgreSQL-backed or Redis) rather than in-memory — if the worker process crashes, queued requests survive.
- Track request latency per endpoint type to detect Wialon-side degradation before it causes errors.
- Log queue depth, drain rate, and worker utilization every minute for capacity planning.
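Putting numbers on the budget arithmetic above, a periodic backpressure check can be as small as the sketch below. The 60 requests per minute per session figure is an observed value, not a published limit, and pause_request_generation is a hypothetical hook into the components that enqueue work:

```python
REQUESTS_PER_MINUTE_PER_SESSION = 60   # observed capacity, not a documented Wialon limit
ACTIVE_SESSIONS = 3
DRAIN_WINDOW_MINUTES = 5

budget_per_minute = REQUESTS_PER_MINUTE_PER_SESSION * ACTIVE_SESSIONS   # 180 requests/minute
max_healthy_depth = budget_per_minute * DRAIN_WINDOW_MINUTES            # 900 pending requests

def check_backpressure(pending_requests: int, alert, pause_request_generation) -> None:
    """Throttle the producers when the queue cannot drain within the window; never add workers."""
    if pending_requests > max_healthy_depth:
        alert(f"queue depth {pending_requests} exceeds the {DRAIN_WINDOW_MINUTES}-minute drain capacity of {max_healthy_depth}")
        pause_request_generation()
```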
Use a token-aware backoff strategy
Generic exponential backoff — retry after 1s, 2s, 4s, 8s — is insufficient because it does not distinguish between error classes. A permanent error from Wialon (invalid parameters, access denied, invalid session) will never succeed on retry with the same parameters. Retrying it wastes budget and delays other queued requests. A transient error (server error, timeout) is likely to succeed after a short wait. Classify every error response and apply the appropriate strategy.
For invalid session errors (error code 1), the correct action is not retry — it is re-authentication. Obtain a new session token via token/login, update the worker's session context, and replay the failed request. If re-authentication itself fails, the token may have been revoked — escalate to the dead-letter queue. For access denied errors (error code 7), the request will never succeed without configuration changes — route directly to the dead-letter queue with a notification to the integration admin.
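A sketch of that classification step. Error codes 1 and 7 follow the descriptions above; treating code 4 (invalid input) as permanent and everything unrecognized as transient is an assumption you should adjust to your own error logs:

```python
from enum import Enum

class RetryAction(Enum):
    RETRY_WITH_BACKOFF = "retry"            # transient: wait and try again
    REAUTHENTICATE_AND_REPLAY = "reauth"    # refresh the session token, then replay
    DEAD_LETTER = "dlq"                     # permanent: will not succeed without changes

def classify_wialon_error(error_code: int | None, is_timeout: bool = False) -> RetryAction:
    """Map a Wialon error response to a retry action."""
    if is_timeout:
        return RetryAction.RETRY_WITH_BACKOFF
    if error_code == 1:                      # invalid session: re-authenticate via token/login
        return RetryAction.REAUTHENTICATE_AND_REPLAY
    if error_code in (4, 7):                 # invalid input / access denied: assumed permanent
        return RetryAction.DEAD_LETTER
    return RetryAction.RETRY_WITH_BACKOFF    # unknown or server-side errors: assume transient
```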
For transient server errors, apply exponential backoff with jitter: base delay of 1 second, multiplied by 2 on each retry, with a random jitter of 0 to 30% of the delay, capped at 60 seconds. After 5 retries, route to the dead-letter queue. The jitter is critical — without it, when Wialon recovers from a brief outage, all workers that backed off simultaneously will retry at the same moment, creating a thundering herd that triggers another round of throttling.
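The delay calculation itself is a few lines; this sketch uses the parameters above (1-second base, factor of 2, 0 to 30% jitter, 60-second cap):

```python
import random

BASE_DELAY_S = 1.0
MAX_DELAY_S = 60.0
MAX_RETRIES = 5   # after this, the request goes to the dead-letter queue

def backoff_delay(attempt: int) -> float:
    """Exponential backoff with jitter; attempt is 0-based."""
    delay = BASE_DELAY_S * (2 ** attempt)
    jittered = delay * (1 + random.uniform(0.0, 0.30))   # spread retries to avoid a thundering herd
    return min(jittered, MAX_DELAY_S)
```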
Track retry metrics per error class. If invalid session errors spike, you have an authentication lifecycle problem. If server errors spike during specific hours, you are hitting capacity limits and need to redistribute your extraction schedule. If access denied errors appear for specific units, a Wialon administrator changed permissions without notifying the integration team.
Implement a circuit breaker
A circuit breaker prevents your pipeline from hammering a degraded Wialon instance with requests that will fail, wasting both your API budget and Wialon server resources. The pattern has three states: closed (normal operation, requests pass through), open (Wialon is down, requests are immediately rejected without calling the API), and half-open (Wialon might be recovering, allow a single probe request to test).
Transition from closed to open when the failure rate exceeds a threshold within a time window. A practical setting is 50% failure rate over a 60-second sliding window with a minimum of 10 requests. When the breaker opens, all queued requests are held rather than executed. Workers log the circuit break event and begin a recovery timer. After 30 seconds, the breaker moves to half-open and allows one probe request through. If the probe succeeds, the breaker closes and normal processing resumes. If the probe fails, the breaker reopens and the recovery timer resets with doubled duration (30s, 60s, 120s, capped at 5 minutes).
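A compact, single-threaded sketch of that state machine (locking and metrics omitted); the thresholds mirror the values above:

```python
import time

class WialonCircuitBreaker:
    """Closed / open / half-open breaker: a 50% failure rate over a 60 s window with at
    least 10 requests trips it; the recovery timer starts at 30 s and doubles to a 5-minute cap."""

    def __init__(self) -> None:
        self.state = "closed"
        self.window: list[tuple[float, bool]] = []   # (timestamp, success) pairs
        self.recovery_s = 30
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open" and time.time() - self.opened_at >= self.recovery_s:
            self.state = "half-open"
            return True                               # single probe request
        return False                                  # open, or half-open with a probe in flight

    def record(self, success: bool) -> None:
        now = time.time()
        if self.state == "half-open":
            if success:
                self.state, self.recovery_s = "closed", 30           # probe succeeded: resume
            else:
                self.state, self.opened_at = "open", now              # probe failed: reopen
                self.recovery_s = min(self.recovery_s * 2, 300)       # double the recovery timer
            return
        self.window = [(t, ok) for t, ok in self.window if now - t < 60] + [(now, success)]
        failures = sum(1 for _, ok in self.window if not ok)
        if self.state == "closed" and len(self.window) >= 10 and failures / len(self.window) >= 0.5:
            self.state, self.opened_at = "open", now
            self.window.clear()
```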
Circuit breakers prevent cascade failures in multi-service architectures. If your pipeline also writes to an ERP, a message queue, and a notification service, a degraded Wialon instance should not cause connection pool exhaustion in your database or queue overflow in your message broker. The breaker isolates the Wialon dependency so that other pipeline components continue operating normally and resume Wialon sync when the API recovers.
Expose circuit breaker state in your monitoring dashboard. A breaker that trips frequently (more than twice per day) indicates a systemic problem — either your load is too high for the Wialon infrastructure, or the Wialon server has an underlying health issue. A breaker that never trips might mean your threshold is too generous and you are absorbing degradation without reacting.
Design idempotent sync operations
When retries are a normal part of operation — and they will be — every sync operation must produce the same result regardless of how many times it executes. A message extraction that inserts raw payloads into staging must use ON CONFLICT DO NOTHING keyed on the source_hash. A trip record upsert must use ON CONFLICT (unit_id, trip_start_time) DO UPDATE to set fields to the latest values. Without idempotency, every retry creates duplicate records that cascade through every downstream report and dashboard.
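A sketch of both writes, assuming PostgreSQL staging tables named raw_messages and trips with the columns shown (all names are illustrative) and a DB-API connection such as psycopg2:

```python
# Table and column names are illustrative; conn is an open DB-API connection (e.g. psycopg2).
STAGE_MESSAGE_SQL = """
    INSERT INTO raw_messages (source_hash, unit_id, message_time, payload)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (source_hash) DO NOTHING
"""

UPSERT_TRIP_SQL = """
    INSERT INTO trips (unit_id, trip_start_time, trip_end_time, distance_km)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (unit_id, trip_start_time) DO UPDATE
        SET trip_end_time = EXCLUDED.trip_end_time,
            distance_km   = EXCLUDED.distance_km
"""

def stage_message(conn, source_hash: str, unit_id: int, message_time: int, payload: str) -> None:
    """Safe to call any number of times for the same message; replays become no-ops."""
    with conn.cursor() as cur:
        cur.execute(STAGE_MESSAGE_SQL, (source_hash, unit_id, message_time, payload))
    conn.commit()
```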
Wialon message IDs are not globally unique across all time — they are sequential per unit and can wrap. The reliable deduplication key for messages is a composite of unit_id, message_time (Unix timestamp), and message_type. For trips extracted via reports, use unit_id and trip_start_time. For events from notifications, use the notification_id and trigger_time. Document your deduplication key for each entity type in a schema reference so that future developers do not accidentally create a different key that introduces duplicates.
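One way to materialize the message key as the source_hash used in staging; hashing the composite is a convenience for a fixed-width unique column, and a composite unique constraint on the three columns works equally well:

```python
import hashlib

def message_dedup_key(unit_id: int, message_time: int, message_type: str) -> str:
    """Deterministic deduplication key: unit_id + message_time (Unix seconds) + message_type."""
    raw = f"{unit_id}:{message_time}:{message_type}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```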
The distinction between exactly-once and at-least-once semantics matters here. True exactly-once delivery is impractical in a distributed pipeline. Instead, design for at-least-once delivery with idempotent consumers. Your extraction layer may deliver the same message twice. Your staging layer deduplicates it. Your transformation layer processes it once. This approach is simpler to build, easier to debug, and more resilient than attempting exactly-once guarantees across Wialon API, your queue, and your database.
Build a dead-letter queue for unprocessable records
Some records will fail permanently: a message references a unit that was deleted from Wialon, a sensor reading contains a value outside the physically possible range, a required field is null because the tracker firmware has a bug. These records should not loop through retries forever. After exhausting the retry budget, route them to a dead-letter queue (DLQ) — a separate table or queue where they await manual review.
Classify failures into three categories. Transient failures (network timeout, session expired, server overloaded) get retried with backoff. Permanent failures (access denied, invalid parameters, schema violation) go directly to the DLQ on first occurrence. Unknown failures (unexpected error codes, unparseable responses) get a small retry budget (2 attempts) and then route to the DLQ. This classification prevents wasting retry capacity on records that will never succeed while giving genuinely transient failures a fair chance to recover.
The DLQ needs a review and reprocessing workflow. Assign a daily review to a team member: check DLQ depth, inspect the oldest 10 records, classify root causes. Common root causes cluster — 80% of DLQ entries often share the same failure mode (e.g., a specific tracker model sending malformed CAN data). Fix the root cause, then reprocess the DLQ batch. Track DLQ metrics: depth over time, age of oldest message, arrival rate, and reprocessing success rate. A growing DLQ depth is an early indicator of a systemic problem that retries alone cannot solve.
- Set a DLQ retention period (30 days is typical) after which unresolved records are archived and an incident report is generated.
- Include the full request context in the DLQ entry: original payload, error response, retry count, failure classification, and timestamp of each attempt (see the sketch after this list).
- Alert when DLQ depth exceeds 100 records or the oldest message is more than 48 hours old.
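A DLQ entry carrying that context might be modeled as below; the field names are illustrative and map directly onto the columns of a dead-letter table or the keys of a JSON document:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DeadLetterEntry:
    """Everything a reviewer needs to diagnose, classify, and reprocess a failed request."""
    request_endpoint: str                    # e.g. "messages/load_interval"
    request_params: dict                     # original request payload
    error_response: dict                     # last raw error body returned by Wialon
    failure_class: str                       # "transient" | "permanent" | "unknown"
    retry_count: int
    attempt_timestamps: list[datetime] = field(default_factory=list)
    first_seen: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```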
Observe lag, not only error count
The most dangerous failure mode in a sync pipeline is silent delay. Error-rate dashboards show zero errors, all health checks pass, no alerts fire — but the data in your database is 45 minutes old because extraction slowed down and nobody noticed. Dispatchers make decisions based on stale positions. Reports show incomplete daily totals. The pipeline is technically healthy but operationally useless.
Ingestion lag is the primary health metric for any Wialon sync pipeline. Define it as the difference between the current wall clock time and the timestamp of the most recently ingested message per unit. If a vehicle last reported at 14:32 and it is now 14:35, the lag is 3 minutes. If the lag for any unit exceeds your freshness SLO — typically 10 minutes for real-time dashboards, 60 minutes for batch analytics — trigger an alert. This catches problems that error-rate monitoring misses entirely: slow API responses, queue buildup, transformation bottlenecks, and database write contention.
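Measured against the illustrative raw_messages staging table from earlier, per-unit lag is a single query; the 10-minute default below is the real-time SLO named above, and conn is again a DB-API connection:

```python
LAG_SQL = """
    SELECT unit_id,
           EXTRACT(EPOCH FROM (now() - to_timestamp(MAX(message_time)))) / 60 AS lag_minutes
    FROM raw_messages
    GROUP BY unit_id
    HAVING EXTRACT(EPOCH FROM (now() - to_timestamp(MAX(message_time)))) / 60 > %s
"""

def units_breaching_freshness_slo(conn, slo_minutes: float = 10.0) -> list[tuple[int, float]]:
    """Return (unit_id, lag_minutes) for every unit whose newest ingested message is older than the SLO."""
    with conn.cursor() as cur:
        cur.execute(LAG_SQL, (slo_minutes,))
        return cur.fetchall()
```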
Design your monitoring dashboard with four panels: ingestion lag distribution (histogram of lag across all units, updated every minute), error rate by category (authentication, validation, transformation, network, as a stacked time series), queue depth and drain rate (line chart showing pending vs processed requests), and daily reconciliation gap count (bar chart, updated after nightly reconciliation). This dashboard should be the first thing the on-call engineer checks during any data quality complaint.
Set freshness SLOs collaboratively with the operations team. Ask them: how old can vehicle position data be before it causes a wrong dispatching decision? The answer defines your real-time lag SLO. Ask: how late can the daily fuel report be before it misses the morning review meeting? The answer defines your batch lag SLO. These SLOs drive your alerting thresholds, your extraction schedule, and your infrastructure sizing. Without them, you are guessing at performance requirements.