Nov 28, 2025

Why WiFi Wouldn't Reconnect: The Race Condition in My Disconnect Handler

Calling esp_wifi_stop() triggers a disconnect event. My handler reacted to it by immediately trying to reconnect, leaving WiFi in a half-stopped state.

The symptom was simple. WiFi would connect fine the first time. But after calling wifi_disconnect() to save power, the next wake cycle couldn’t reconnect. The connection would just hang forever.

Turns out esp_wifi_stop() triggers WIFI_EVENT_STA_DISCONNECTED, which my event handler interpreted as “connection lost, retry immediately.” The retry would start while WiFi was still shutting down, leaving everything in a confused state.

The Race Condition

Here’s what the original event handler looked like:

static void event_handler(void *arg, esp_event_base_t event_base,
                          int32_t event_id, void *event_data)
{
    if (event_base == WIFI_EVENT && event_id == WIFI_EVENT_STA_DISCONNECTED) {
        s_connected = false;
        if (s_retry_num < CONFIG_WIFI_MAXIMUM_RETRY) {
            esp_wifi_connect();
            s_retry_num++;
            ESP_LOGI(TAG, "Retrying connection (%d/%d)", s_retry_num, CONFIG_WIFI_MAXIMUM_RETRY);
        }
    }
}

Looks reasonable, right? If we get disconnected, try to reconnect. That’s what retry logic is supposed to do.

The problem is when you call wifi_disconnect():

void wifi_disconnect(void)
{
    if (s_initialized) {
        if (s_connected) {
            esp_wifi_disconnect();  // Triggers WIFI_EVENT_STA_DISCONNECTED
        }
        esp_wifi_stop();  // Also triggers WIFI_EVENT_STA_DISCONNECTED
    }
}

Both esp_wifi_disconnect() and esp_wifi_stop() can fire the disconnect event when the station is connected. My event handler would see that event and immediately call esp_wifi_connect() to retry, even though the WiFi subsystem was in the middle of shutting down and not ready to start a new connection.

The next wake cycle would call wifi_connect(), but the WiFi subsystem was in a weird state from the interrupted shutdown. Connections would time out. Sometimes it would work. Sometimes it wouldn’t. Classic race condition behavior.
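To make the sequence concrete, here is a minimal host-side sketch of the failure. It does not call ESP-IDF: sim_wifi_stop(), sim_wifi_connect(), and sim_event_handler() are hypothetical stand-ins that only record whether a connect attempt arrives while the simulated stop is still in progress. It also assumes the handler runs in the middle of the stop call, which is the timing described above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the ESP-IDF calls. They only record what happened. */
enum sim_event { SIM_STA_DISCONNECTED };

static bool s_connected = true;
static bool sim_radio_stopping = false;    /* true while the simulated stop runs */
static int  sim_connects_during_stop = 0;  /* connect attempts issued mid-shutdown */

static void sim_wifi_connect(void)
{
    if (sim_radio_stopping) {
        sim_connects_during_stop++;        /* this is the problematic call */
    }
}

/* Same logic as the original handler: every disconnect triggers a retry. */
static void sim_event_handler(enum sim_event ev)
{
    if (ev == SIM_STA_DISCONNECTED) {
        s_connected = false;
        sim_wifi_connect();
    }
}

static void sim_wifi_stop(void)
{
    sim_radio_stopping = true;
    sim_event_handler(SIM_STA_DISCONNECTED);  /* the stop itself raises the event */
    sim_radio_stopping = false;
}

/* Runs the unguarded shutdown and returns how many connects landed mid-stop. */
int sim_unguarded_shutdown(void)
{
    s_connected = true;
    sim_connects_during_stop = 0;
    sim_wifi_stop();
    return sim_connects_during_stop;
}
```

In this model, sim_unguarded_shutdown() returns 1: the handler issued one connect attempt while the stop was still running.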

The Fix: A Stopping Flag

The solution is to tell the event handler when a disconnect is intentional:

static bool s_stopping = false;  // Flag to track intentional disconnect

static void event_handler(void *arg, esp_event_base_t event_base,
                          int32_t event_id, void *event_data)
{
    if (event_base == WIFI_EVENT && event_id == WIFI_EVENT_STA_DISCONNECTED) {
        s_connected = false;
        if (s_stopping) {
            // Intentional disconnect - don't retry
            ESP_LOGI(TAG, "WiFi stopped intentionally");
            return;
        }
        if (s_retry_num < CONFIG_WIFI_MAXIMUM_RETRY) {
            esp_wifi_connect();
            s_retry_num++;
            ESP_LOGI(TAG, "Retrying connection (%d/%d)", s_retry_num, CONFIG_WIFI_MAXIMUM_RETRY);
        }
    }
}

And in wifi_disconnect():

void wifi_disconnect(void)
{
    if (s_initialized) {
        s_stopping = true;  // Signal event handler to not retry
        if (s_connected) {
            esp_wifi_disconnect();  // Disconnect from AP first
        }
        esp_wifi_stop();
        s_stopping = false;
        s_connected = false;
        ESP_LOGI(TAG, "WiFi disconnected");
    }
}

Now the flow is clean:

  1. Set s_stopping = true before disconnecting
  2. Disconnect events fire, but the event handler sees the flag and skips the retry
  3. WiFi shuts down cleanly
  4. Clear the flag after shutdown completes
  5. Next wake cycle can connect normally

No more race condition. No more hanging connections.
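The same flow can be checked off-target with a small host-side sketch. Everything prefixed fix_ is a hypothetical stand-in rather than an ESP-IDF call. The sketch only counts how many reconnect attempts the guarded handler makes for an intentional shutdown versus an unexpected drop.

```c
#include <stdbool.h>

static bool s_stopping = false;   /* set while an intentional shutdown is in progress */
static bool s_connected = false;
static int  fix_connect_calls = 0;

/* Hypothetical stand-in for esp_wifi_connect(): counts calls only. */
static void fix_wifi_connect(void)
{
    fix_connect_calls++;
}

/* Guarded handler logic: skip the retry when the disconnect was requested. */
static void fix_on_disconnected(void)
{
    s_connected = false;
    if (s_stopping) {
        return;
    }
    fix_wifi_connect();
}

/* Intentional shutdown: the flag is raised before the event is delivered. */
int fix_intentional_shutdown(void)
{
    fix_connect_calls = 0;
    s_connected = true;
    s_stopping = true;
    fix_on_disconnected();        /* simulated event raised by the stop */
    s_stopping = false;
    return fix_connect_calls;
}

/* Unexpected drop: no flag is set, so the handler should retry once. */
int fix_unexpected_drop(void)
{
    fix_connect_calls = 0;
    s_connected = true;
    fix_on_disconnected();        /* simulated event raised by a lost AP */
    return fix_connect_calls;
}
```

Here fix_intentional_shutdown() returns 0 and fix_unexpected_drop() returns 1, so the retry logic still works for real connection losses.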

The Subtlety of Event-Driven Code

This is the kind of bug that’s obvious in hindsight but brutal to debug in the moment. The event handler was doing exactly what I told it to do: retry on disconnect. The problem was I hadn’t thought about all the ways a disconnect event could be triggered.

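Event-driven programming means thinking about event ordering and timing. When you call an API function that triggers an event, your event handler can run before the rest of your shutdown sequence has finished, and possibly before the function itself returns. In ESP-IDF, handlers on the default event loop are dispatched from the event loop’s own task, so the exact timing depends on scheduling. If your event handler calls another API function that depends on the first one being done, you get a race.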

The fix isn’t to eliminate events or callbacks. It’s to add state that disambiguates different event triggers. The s_stopping flag tells the handler “this disconnect is intentional, don’t try to reconnect.”
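One timing detail is worth sketching. If the handler is dispatched from a separate event loop task, a disconnect event could in principle be delivered after wifi_disconnect() has already cleared the flag. The host-side model below is a sketch under that assumption, not a description of how ESP-IDF always behaves. It queues events and dispatches them only after the simulated stop returns, then compares clearing the flag in the caller with clearing it when a stop event arrives. ESP-IDF does define WIFI_EVENT_STA_STOP, but the q_ names, the queue, and the event order used here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum q_event { Q_STA_DISCONNECTED, Q_STA_STOP };

#define Q_CAPACITY 8
static enum q_event q_pending[Q_CAPACITY];  /* events waiting for dispatch */
static size_t q_count = 0;

static bool s_stopping = false;
static int  q_retries = 0;                  /* reconnect attempts made by the handler */

static void q_post(enum q_event ev)
{
    if (q_count < Q_CAPACITY) {
        q_pending[q_count++] = ev;
    }
}

static void q_handler(enum q_event ev)
{
    if (ev == Q_STA_STOP) {
        s_stopping = false;                 /* shutdown finished: re-enable retries */
        return;
    }
    if (ev == Q_STA_DISCONNECTED && !s_stopping) {
        q_retries++;                        /* would call esp_wifi_connect() */
    }
}

static void q_dispatch_all(void)
{
    for (size_t i = 0; i < q_count; i++) {
        q_handler(q_pending[i]);
    }
    q_count = 0;
}

/* clear_in_caller = true:  clear the flag right after the simulated stop returns.
 * clear_in_caller = false: leave the flag for the handler to clear on Q_STA_STOP. */
int q_shutdown_retries(bool clear_in_caller)
{
    q_count = 0;
    q_retries = 0;
    s_stopping = true;
    q_post(Q_STA_DISCONNECTED);             /* simulated stop posts both events */
    q_post(Q_STA_STOP);
    if (clear_in_caller) {
        s_stopping = false;
    }
    q_dispatch_all();                       /* handler runs only after "stop" returned */
    return q_retries;
}
```

In this worst-case model, q_shutdown_retries(true) returns 1 and q_shutdown_retries(false) returns 0. If the handler on your target always runs before esp_wifi_stop() returns, both variants behave the same. Clearing the flag on the stop event simply removes the dependency on that timing.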

Other Approaches I Considered

Option 1: Unregister the event handler before stopping

This would work but adds complexity. You’d need to save the handler instances and unregister/re-register around every disconnect. Plus you’d lose all events during shutdown, which might hide real problems.

Option 2: Check WiFi state before retrying

You could query the WiFi subsystem state before calling esp_wifi_connect(). But that check is racy too: the state might change between your check and your connect call.
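The check-then-act problem can be shown with a deterministic host-side sketch. The toc_ names are hypothetical, and toc_other_task_runs() stands in for whatever code changes the WiFi state between the check and the connect call. No ESP-IDF state query is modeled here.

```c
#include <stdbool.h>

enum toc_state { TOC_STARTED, TOC_STOPPING };

static enum toc_state toc_current = TOC_STARTED;
static int toc_bad_connects = 0;   /* connects issued while not in TOC_STARTED */

/* Hypothetical hook: another task begins a shutdown between check and act. */
static void toc_other_task_runs(void)
{
    toc_current = TOC_STOPPING;
}

/* Hypothetical stand-in for the connect call: records connects in a bad state. */
static void toc_connect(void)
{
    if (toc_current != TOC_STARTED) {
        toc_bad_connects++;
    }
}

/* Checks the state, optionally gets "preempted", then connects anyway. */
int toc_check_then_connect(bool preempted)
{
    toc_current = TOC_STARTED;
    toc_bad_connects = 0;
    if (toc_current == TOC_STARTED) {   /* the check passes */
        if (preempted) {
            toc_other_task_runs();      /* state changes after the check */
        }
        toc_connect();                  /* acts on a possibly stale answer */
    }
    return toc_bad_connects;
}
```

toc_check_then_connect(false) returns 0, while toc_check_then_connect(true) returns 1: the check passed, yet the connect still landed during a shutdown.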

Option 3: Use a mutex or semaphore

Overkill for this case. The flag is simpler and more explicit about intent.

The flag approach is the cleanest. It’s a handful of lines (the declaration, the check in the handler, and the set and clear in wifi_disconnect()) and makes the intent obvious to anyone reading the code.

Testing the Fix

After adding the flag, I tested the wake-sleep-wake cycle repeatedly. Every time, WiFi would connect on the first try after waking from deep sleep. The timeouts disappeared. The hanging connections disappeared.

The logs looked clean:

I (1234) wifi: Connecting to MySSID...
I (2456) wifi: Connected, IP: 192.168.1.42
...
I (5678) wifi: WiFi stopped intentionally
I (5678) wifi: WiFi disconnected
...
[deep sleep]
...
I (1234) wifi: Connecting to MySSID...
I (2456) wifi: Connected, IP: 192.168.1.42

No retries on intentional disconnects. Clean shutdown. Clean reconnect.

The Lesson

When you call an API function that triggers events your own handlers listen for, think about whether your handler should react to that event differently than to events from external triggers.

A disconnect caused by the AP dropping out is different from a disconnect you initiated. A connection lost during use is different from a connection stopped before sleep. The event is the same, but the response should be different.

A simple boolean flag can disambiguate these cases. It’s not elegant, but it’s explicit and debuggable.

And explicit beats clever when you’re dealing with timing-sensitive event handlers.
