Part 05 of 10
Before: 930 polls/sec — all characters, blind. After: ~4 polls every 5s — only when cache expires.

The Scheduler Was Broken All Along

930 polls per second on the UI thread. Nobody noticed for years.

Engineers · 12 min read

For engineers. This is the root cause of the crash from Part 1.

With the god object tamed and the codebase split into layers, I could finally isolate the scheduler — the component that fetches character data from CCP's ESI API. This is the thing that crashed the app at sixty characters. I needed to understand exactly how it worked before I could fix it.

I built a diagnostic tool. 221 lines of code: a TCP server on port 5555 that broadcasts structured JSON events in real time. Developer-only, never shipped. I turned it on, added characters, and watched.

Within sixty seconds, it was clear: the scheduler wasn't just buggy. It was architecturally broken.

01

930 Polls Per Second

Here's what the old scheduler did:

  1. A timer fires every one second on the UI thread
  2. It broadcasts to 35 subscribers
  3. Each of 31 data monitors per character checks "is it time to fetch yet?"
  4. If yes, it fires an HTTP request that blocks the calling thread until the UI processes the response

With thirty characters: 930 monitors, each running every second. Plus another 30 data orchestrators doing the same thing. That's 960 method calls per second on the UI thread — the same thread responsible for drawing the screen.

96% of those calls returned "not yet" and did nothing useful. But they still consumed CPU time on the thread that needed to be free to keep the application responsive.

With sixty characters: 1,860 monitor checks plus 60 orchestrators — over 1,900 calls per second. The UI thread couldn't keep up. The screen froze. The app crashed.

02

The Thundering Herd

On startup, every monitor was set to "force update." On the first timer tick, approximately 240 HTTP requests fired simultaneously. But the connection limit was ten. The remaining 230 queued up. Each response blocked a thread waiting for the UI to process it. The UI was busy running the next round of 960 checks. The system deadlocked.

This is why the crash happened during startup. Not after the characters loaded — during the initial fetch.

03

More Problems

The diagnostic stream revealed issue after issue:

A new HTTP connection for every request. The code created a fresh connection for every single API call, then disposed it. A closed socket lingers in the TIME_WAIT state for up to four minutes after disposal. With hundreds of requests, the operating system ran out of sockets.

No proactive rate limiting. The only protection was reactive — it waited until CCP's API returned a "slow down" response, then stopped everything. No per-character awareness. No budget tracking.

Cache headers ignored on multi-page responses. Page one of an asset list used the cache header to avoid unnecessary downloads. Pages two through fifty did not. Large asset lists downloaded in full every time.

Timer callbacks with no error handling. Three critical timer handlers had no try-catch. Any unhandled error in the async chain crashed the entire application instantly.

Each of these would be a serious bug on its own. Together, they created a system that worked for small loads and collapsed at scale.
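The multi-page cache bug has a simple fix: keep a cache validator per (endpoint, page), not just for page one. A sketch, assuming a hypothetical fetch_page that takes a stored ETag and returns (status, etag, body) — the names are illustrative, not EVEMon's API:

```python
# Cache validators kept per (endpoint, page) — not just page 1
page_cache = {}

def get_page(fetch_page, endpoint, page):
    etag, body = page_cache.get((endpoint, page), (None, None))
    status, new_etag, new_body = fetch_page(endpoint, page, etag)
    if status == 304:
        return body                            # unchanged: skip the download entirely
    page_cache[(endpoint, page)] = (new_etag, new_body)
    return new_body
```

With this shape, pages two through fifty get the same 304 short-circuit page one always had.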

04

The New Scheduler

The replacement is 973 lines across seven files. The core idea: instead of checking every character every second, maintain a priority queue sorted by when each piece of data actually expires. Sleep until the next job is due. Process it. Sleep again.

Zero wasted polling.

How it works: The UI thread sends commands into a thread-safe queue — "register this character," "this tab is now visible," "force refresh." The scheduler runs on its own background thread, processes commands, and maintains a priority queue. When a job is due, it fetches the data, processes the response, and schedules the next fetch based on the cache expiry header CCP returns.
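That command-queue-plus-priority-queue shape can be sketched in a few lines. This is illustrative Python, not EVEMon's C# code; every name here is hypothetical:

```python
import heapq
import queue

class ExpiryScheduler:
    """Sketch of an expiry-driven scheduler loop (illustrative only)."""

    def __init__(self, fetch):
        self.commands = queue.Queue()  # UI thread pushes commands here, never blocks
        self.jobs = []                 # min-heap of (due_time, job_name)
        self.fetch = fetch             # fetch(job) -> seconds until that data expires

    def submit(self, job_name, due=0.0):
        self.commands.put((due, job_name))   # e.g. "register this character"

    def run_once(self, now):
        """One turn of the loop; returns how long to sleep before the next turn."""
        while not self.commands.empty():     # drain pending UI commands first
            heapq.heappush(self.jobs, self.commands.get())
        if not self.jobs:
            return None                      # idle: block until a command arrives
        due, job_name = self.jobs[0]
        if due > now:
            return due - now                 # sleep until the next job; zero polling
        heapq.heappop(self.jobs)
        expires_in = self.fetch(job_name)    # HTTP fetch on this background thread
        # Reschedule from the server's cache-expiry header, not a fixed interval
        heapq.heappush(self.jobs, (now + expires_in, job_name))
        return 0.0
```

The key property: between jobs the thread sleeps for exactly the gap the heap reports, so no cycles are spent asking "is it time yet?"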

Concurrency: Twenty simultaneous HTTP connections, gated by a semaphore. Enough to be fast, not enough to trigger rate limits.
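The gate itself is nothing more than a semaphore around the request call. A minimal sketch — the limit of twenty comes from the text above, everything else is hypothetical:

```python
import threading

ESI_GATE = threading.BoundedSemaphore(20)   # at most 20 requests in flight

def gated_fetch(do_request):
    # Extra callers queue here instead of stampeding the API;
    # only background workers wait, never the UI thread.
    with ESI_GATE:
        return do_request()
```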

Per-character rate limiting: Each character has a token bucket — 150 tokens per fifteen-minute window with a 10% safety margin. When a character's budget is spent, its jobs are deferred until the budget refills. One character's rate limit never affects another.
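A token bucket with those numbers — 150 tokens per fifteen-minute window, minus a 10% safety margin — can be sketched like this (Python for illustration; the class name and the steady-refill policy are assumptions):

```python
class TokenBucket:
    """Per-character request budget: 150 tokens / 15 min, 10% margin held back."""

    def __init__(self, capacity=150, window=900.0, margin=0.10):
        self.capacity = capacity * (1 - margin)   # 135 usable tokens
        self.tokens = self.capacity
        self.rate = self.capacity / window        # steady refill across the window
        self.last = 0.0

    def try_acquire(self, now):
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False        # budget spent: defer this character's jobs
```

Because each character owns its own bucket, one character draining its budget defers only its own jobs.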

Startup phasing: Instead of the thundering herd, characters are phased in over several seconds:

| Phase | What fetches | When |
| --- | --- | --- |
| 1 | Skills and queue for the visible character | Immediately |
| 2 | Implants and attributes for all characters | Staggered over 2 seconds |
| 3 | Market orders, contracts, industry, mail | Staggered over 3 seconds |
| 4 | Everything else | Staggered over 5 seconds |

For a hundred characters, the full startup takes about six seconds instead of trying to fire 3,100 requests in the first tick.
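One way to express the phasing in code — a sketch only: the phase windows come from the table above, but the endpoint matchers and the even-spacing policy are assumptions:

```python
import heapq

# Phase plan: (predicate over (character, endpoint), stagger window in seconds)
PHASES = [
    (lambda c, e: c == "visible" and e in ("Skills", "SkillQueue"), 0.0),
    (lambda c, e: e in ("Implants", "Attributes"), 2.0),
    (lambda c, e: e in ("MarketOrders", "Contracts", "Industry", "Mail"), 3.0),
    (lambda c, e: True, 5.0),   # everything else
]

def phase_in(characters, endpoints, start=0.0):
    """Give every (character, endpoint) job a due time spread over its phase window."""
    buckets = [[] for _ in PHASES]
    for c in characters:
        for e in endpoints:
            i = next(i for i, (match, _) in enumerate(PHASES) if match(c, e))
            buckets[i].append((c, e))
    jobs = []
    for (_, window), bucket in zip(PHASES, buckets):
        for k, (c, e) in enumerate(bucket):
            # Space each phase's jobs evenly across its stagger window
            spread = window * k / max(len(bucket) - 1, 1)
            jobs.append((start + spread, c, e))
    heapq.heapify(jobs)
    return jobs   # a ready-made priority queue for the scheduler loop
```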

Tab awareness: When you switch to a different character's tab, the scheduler promotes that character's jobs to high priority and demotes the previous one. The character you're looking at always gets fresh data first.
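Promotion amounts to re-keying a character's queued jobs. A lazy-invalidation sketch — names are hypothetical, and duplicate fetches after rapid tab switching are possible but harmless for idempotent reads:

```python
import heapq

HIGH, NORMAL = 0, 1   # lower value = served first

class PriorityJobs:
    def __init__(self):
        self.heap = []       # (priority, due, character, endpoint)
        self.priority = {}   # character -> current priority

    def add(self, character, endpoint, due):
        pri = self.priority.get(character, NORMAL)
        heapq.heappush(self.heap, (pri, due, character, endpoint))

    def focus(self, character):
        """Called on tab switch: promote this character, demote the rest."""
        for c in list(self.priority):
            self.priority[c] = NORMAL
        self.priority[character] = HIGH
        # Re-push any entry whose recorded priority no longer matches;
        # the stale copies are skipped when popped.
        for pri, due, c, e in list(self.heap):
            want = self.priority.get(c, NORMAL)
            if pri != want:
                heapq.heappush(self.heap, (want, due, c, e))

    def pop(self):
        while self.heap:
            pri, due, c, e = heapq.heappop(self.heap)
            if self.priority.get(c, NORMAL) == pri:   # skip stale duplicates
                return c, e, due
        return None
```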

Auth isolation: If one character's ESI token expires, only that character is affected. No other characters pause. When the user re-authenticates, that character's jobs resume automatically.

05

The Diagnostic Stream

FIG 5.3 — Live view of the diagnostic stream: nc $WINHOST 5555 | jq '.msg'. A real-time TCP JSON stream on port 5555 — 221 lines that caught bugs the codebase carried for years. Debug builds only, never included in any release.

The tool that revealed all of this — TcpJsonLoggerProvider — stayed in the codebase as a developer tool. Connect from any terminal and watch the scheduler in real-time:

{"ts":"...","tag":"FETCH","msg":"char47/Skills → 200 OK (180ms), tokens=142/150"}
{"ts":"...","tag":"FETCH","msg":"char12/Assets → 304 Not Modified (23ms)"}
{"ts":"...","tag":"WARN","msg":"char12 budget exhausted — deferring 3 jobs"}

Every ESI request, every cache hit, every rate limit decision — visible in real time. The tool that cost 221 lines to build caught bugs the original codebase carried for years.
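The idea — a TCP fan-out of JSON log lines — fits in a few dozen lines in any language. A Python sketch of the shape (the real TcpJsonLoggerProvider is part of the C# codebase; this is not its implementation):

```python
import json
import socket
import threading
import time

class TcpJsonLogger:
    """Broadcast structured log events to any connected TCP client (debug only)."""

    def __init__(self, port=5555):
        self.clients = []
        self.lock = threading.Lock()
        self.server = socket.create_server(("127.0.0.1", port))
        threading.Thread(target=self._accept_loop, daemon=True).start()

    def _accept_loop(self):
        while True:
            conn, _ = self.server.accept()
            with self.lock:
                self.clients.append(conn)   # new subscriber, e.g. `nc host 5555`

    def log(self, tag, msg):
        line = json.dumps({"ts": time.time(), "tag": tag, "msg": msg}) + "\n"
        with self.lock:
            for conn in list(self.clients):
                try:
                    conn.sendall(line.encode())
                except OSError:
                    self.clients.remove(conn)   # drop disconnected clients
```

A production version would write through a bounded background queue so a slow reader can't stall the app; the sketch keeps only the fan-out shape.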

06

The Crash Is Fixed

After the scheduler rewrite, I loaded a hundred characters. No crash. No freeze. No thundering herd. The UI stayed responsive throughout startup. Characters loaded in phases, data streamed in over a few seconds, and the priority queue kept everything orderly.

The problem that started this entire journey — a crash at sixty characters — was solved. But by the time I got here, the architecture was clean enough to do something nobody had done with EVEMon in twenty years: build a new UI.

Previous: Part 4 — Surgery on a Beating Heart

Next: Part 6 — 101,000 Lines and Zero ViewModels