Solving an embedded mystery in ten easy steps

2017-08-31

Sometimes there are bugs that cause mysterious effects that seems to defy all logic, but since we're dealing with computers we know that there's always a logical explanation for everything that happens. If you're running on a desktop computer with a few gigs of code running at once, finding the explanation is sometimes too time consuming, but on embedded devices you can usually figure out what's going on - one step at a time.

At work we're working on an embedded project based on a Cortex M3 ARM microcontroller (NXP's LPC1763) running FreeRTOS. In this project we don't have any external memory, and the internal 64kB has been a problematic constraint for a while. I've had stack overflows cause weird behaviour, memory corruption and leaks. Par for the course when developing C code.

Recently I updated to a newer version of our BSP to be able to run the code on my Mac using the POSIX implementation of the BSP. After getting that up and running it was time to update things on the ARM side, and this is where I ran into ”interesting behaviour”. "Interesting" is rarely a good thing when working on embedded projects...

I was getting stuck in task switches that didn't finish, hard faults in places that made no sense, ending up in bsp_epic_fail() or just spinning around in a single task without switching to the others.

This was likely due to either uninitialised memory, or stack overflows. There's a helpful comment in the FreeRTOS code that says ”if you get stuck here, you probably ran out of stack somewhere”. Time to get investigating!

Step 1: Things are zero

Looking around a bit in the debugger while stopped in one of the impossible states, I noticed that things were zero that should have been non-zero after init.

Step 2: They’re set properly on boot

Double-checked that init is in fact getting called, and that values are set as expected during startup.

Step 3: BSP init code is clearing RAM twice

That means something is overwriting the values, but who? A watchpoint set on one of the variables in question showed that it was the BSP init code that cleared the values. Now, it is actually supposed to do this - but only on boot. There's a portion of RAM called BSS where the linker places variables that are supposed to start life as zeroes.

Step 4: Nothing is calling the init code; broken callback or function pointer?

Now we know what code is causing havoc, but how did we end up there? The CPU doesn't jump to code on its own initiative, so something has to tell it to go there. What could that be? Nothing in the code referenced the init function, so it had to be either a random value that happened to point to this code, or a NULL pointer that somehow ended up there.

Step 5: Serial processing code is calling init code

About here is when my colleague got involved, and we proceeded to determine what called the init code. I suspected that some interrupt handler might be uninitialised, since the errors seemed to occur after a few ms, but not acting exactly the same every time.

We divided and conquered the problem by setting breakpoints at suspected locations and stepping until we crashed, narrowing it down a bit at a time. Soon we knew that the serial data task was the cause, due to a pointer being NULL that couldn't possibly be NULL.

Step 6: Because a pointer that cannot be NULL actually is

Here's the offending code, notice the missing NULL check after allocating the channel.

channel_t *channel = channel_new(...);

while ( true )
{
    if (bsp_uart_is_data_pending(UPSTREAM_UART))
    {
        uint8_t b;

        while (bsp_uart_read(DSS_UART, &b, 1) == 1)
        {
            serial_channel_add_data(channel, b, bsp_timer_start());
        }
    }
}

The reason for the missing NULL check is that we know that channel_new() can't fail at this point; we've allocated space for as many channels as we'll ever use. But still we get a NULL.

Step 7: It is NULL because something else that should be zero isn't

Looking into the channel_new() code we see that there is indeed no way we could end up with a NULL, because we know that .in_use is false on boot (all uninitialised global variables are set to zero in C).

static channel_t g_channels[NUM_CHANNELS];

channel_t *channel_new(...)
{
    channel_t *channel = NULL;

    for (int i=0; i<NUM_CHANNELS; i++)
    {
        channel_t *ch = &g_channels[i];
        if (ch->in_use)
            continue;

        channel = ch;
    }

    if (!channel)
        return NULL;

    ...
}

But a breakpoint in channel_new() showed that when the function is called, g_channels isn’t zeroed out - it’s filled with random data. But that’s not possible, the BSP init code zeroes out everything in BSS - it’s because it's good at doing that we’re here looking for a bug!

Step 8: It isn't zero because it isn't covered by BSS init code

Looks like we need to have a look at what the RAM init code is actually doing:

extern char chipZeroStart;         // RAM, start of zeroed statics
extern char chipZeroEnd;           // end of same, +1

// Clear uninitialized statics (.bss section)
for (char *target=&chipZeroStart; target<&chipZeroEnd;) {
    *target++ = 0;
}

Besides being a bit inefficient due to writing a byte at a time instead of writing words, it looks like it’s doing what it’s supposed to. But how does it know where the BSS section starts and ends? The chipZeroStart/End variables are provided by the linker, after it has placed the relevant variables. The linker script looks like this:

SECTIONS {
    ...

    .bss : {
      chipZeroStart = .;
      *(.bss .bss.*)                 /* zero-filled static data (non-const) */
      . = ALIGN(4);
      chipZeroEnd = .;
      }
    }

It sets the chipZeroStart variable to the address before the BSS segment, adds all the sections named .bss*, aligns to word size and then puts the current address into chipZeroEnd . We know that our variables should go into .bss.something sections, because we’re using the -fdata-sections parameter when running gcc.

Side note: linker scripts are insanely easy to get wrong, and the most fragile piece of code in any project.

Step 9: It isn't covered because it isn't in a section starting with ".bss"

Even though we know that things set up properly, we’ll just check the linker map file to make sure that – hey, g_channels is in section COMMON. And COMMON is placed after chipZeroEnd ! This means that g_channels isn’t being cleared by the init code. Which means that channel_new() is returning NULL, which causes the serial task to call the init code.

Step 10: Tweaked linker script, everything ok

We added COMMON to the sections added to BSS, which made sure that g_channels gets cleared and that channel_new() succeeds. There’s a flag you can give to gcc, -fno-common, that will tell it to not place things in COMMON, but we still needed the link script fix because a couple of variables from our stdlib were also in COMMON.

Tying up some loose ends

But we’re forgetting something - how did we end up in the init code at all? There were no references to it from the serial task code, so it shouldn’t be possible. To understand this, we need to know a little something about memory on embedded systems.

On ARM microcontrollers there's usually nothing special about NULL, its just a pointer that points to address zero, and that’s usually the beginning of Flash, where your code is. On the Cortex M series an interrupt vector is placed at address zero, which contains a bunch of function pointers:

Addr  Value
   0  initial stack pointer
   4  init function
   8  ...

When running things on a larger machine with an MMU you’re probably going to have a special ”not touching allowed” page mapped to NULL, that will cause a segfault if you try to even read from it. Not so on small embedded devices. Here the read will succeed and your code will continue running. In our case, this is how we managed to get the address to the init code. Notice that the second value above is a pointer to the init function. Now look at our serial handling code:

struct channel_t {
    uint32_t something;
    void (*add_data)(void *, uint8_t, uint32_t);
    ...
}

serial_receive_result_t serial_channel_add_data(channel_t *ch, uint8_t byte, uint32_t timestamp)
{
    return ch->add_data(ch, byte, timestamp);
}

The serial task calls serial_channel_add_data() with ch set to NULL. Then serial_channel_add_data() calls the function pointer in channel, which is offset from the base address of the channel by 4 bytes. So if channel is NULL, it treats NULL+4 as a function pointer - and this is where the pointer to our init code is.

Mystery solved; as usual it’s all due to a series of minor details being wrong.