STM32F4 – the first taste of speed

By Peter Harrison | October 26, 2011

The recently announced STM32F4 series of processors using the ARM Cortex M4 are very attractive. High speeds, large memory space and a floating point unit are among the obvious benefits although there are many other architectural changes in the ST chips. This evening, I managed to get my STM32F4-Discovery board doing something…

Not much, to be fair, but at least it works and I can write programs for it. The STM32F4 Discovery has an ST-Link V2 programming interface on board, so all you need to get code onto it is a USB lead and some messing about on your computer. I started off with a Windows laptop since drivers were definitely available for that. Not that they were that easy to find.

On the ST site, a bit of careful searching got me the Windows USB driver for the ST-Link V2 and a utility program to let me see if it was connected. If you start here

you should find links to the utility program and the USB drivers. Install these before you do anything else. That done, you can connect the discovery board and run the utility. Once connected, it will allow you to run and halt the processor, examine memory and upload/download code. I had none to use so it was time to try creating some.

My weapon of choice for STM32 processors is Rowley’s Crossworks package. It has some great strengths although, like many IDEs these days, it is not always easy to see how to do stuff. Full support for the ST-Link V2 is in the next release, expected in a few days. I found their most recent build through the support forum and installed it. At first, it would not connect to the board but, this being Windows, I just had to unplug the USB connector and plug it back in to correct that problem. Possibly the ST utility had not released some resource or other.

While speed is often good, a look at the datasheet for the STM32F4 soon reveals a penalty for getting stuff done quickly. The current consumption of the chip rises in proportion to the clock speed. At the rated 168MHz, it would be about 90mA. That may be a bit high for my little micromouse so I shall probably end up running it at the more usual 64MHz where the current is down to about 30mA – much the same as the STM32F10x devices. That said, the dsPIC that I currently use will burn anything up to 180mA at 20MIPS so I shall still be well ahead even at full speed.
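As a rough illustration of that trade-off, the sketch below scales the quoted 168MHz figure linearly with clock frequency. It assumes consumption is strictly proportional to clock speed with no fixed offset, which is a simplification, and the function name is made up for this example. The datasheet tables remain the real reference.

```c
#include <assert.h>

/* Rough estimate only: assumes supply current is strictly proportional
   to clock frequency, with no static offset (a simplifying assumption). */
double estimate_current_ma(double ref_ma, double ref_mhz, double target_mhz)
{
    assert(ref_mhz > 0.0);
    return ref_ma * (target_mhz / ref_mhz);
}
```

Calling estimate_current_ma(90.0, 168.0, 64.0) gives a little over 34mA, which is in the same region as the 30mA or so read from the datasheet.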

So, if clock speed isn’t going to be necessarily much better, what about the floating point unit? I wrote the simplest program I could to do a bit of testing of the comparative performance of the FPU and how much of an advantage it might give compared to using integers or software floating point on the STM32F103.

Essentially the same code fragment was used with an STM32F407VG and an STM32F103RE chip. The only difference was in the header file included to specify the device. Under Crossworks, it was necessary to set the GCC target to EABI and, for the STM32F4 chip, enable the hardware FPU. To keep things simple, the compiler was also told to treat doubles as floats to restrict everything to the 32-bit float format. Without that, many of the built-in functions assume a double as the argument(s) and return value. Both processors were running at 8MHz and no configuration of the clocks was done. The compiler had all optimisation turned off. The idea was only to compare the number of processor cycles needed for a few simple operations.
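An alternative to the doubles-as-floats setting would be to keep everything single precision in the source, using an f suffix on literals and the float versions of the library functions such as sqrtf() and sinf(). The sketch below is not the code used for the timings. It only reports the size of the result type of a few expressions to show where a double creeps in, and the helper names are invented for the example.

```c
#include <math.h>
#include <stddef.h>

/* Each helper returns the size in bytes of an expression's result type.
   sizeof does not evaluate its operand, so no maths call is made. */

/* sqrt() takes and returns double, even when handed a float. */
size_t size_of_sqrt_result(void)   { float y = 9.99f; return sizeof(sqrt(y)); }

/* sqrtf() is the single precision version and returns float. */
size_t size_of_sqrtf_result(void)  { float y = 9.99f; return sizeof(sqrtf(y)); }

/* An unsuffixed literal is a double, so the product is promoted to double. */
size_t size_of_mixed_product(void) { float x = 123.456f; return sizeof(x * 9.99); }

/* With the f suffix the whole expression stays in float. */
size_t size_of_float_product(void) { float x = 123.456f; return sizeof(x * 9.99f); }
```

On a part with a single precision FPU, anything that ends up as a double would normally fall back to software routines, which is why the distinction matters.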

Here is the code run on the STM32F103RE.

#include "stm32f10x.h"
#include "float.h"
#include "math.h"

int main(void){
  float fX, fY, fZ;
  long lX, lY, lZ;
  lX = 123L;        //    1 cycle
  lY = 456L;        //    1 cycle
  lZ = lX * lY;     //    6 cycles
  fX = 123.456;     //    3 cycles
  fY = 9.99;        //    3 cycles
  fZ = fX * fY;     //   41 cycles
  fZ = sqrt(fY);    //  624 cycles
  fZ = sin(1.23);   // 1017 cycles
}

Next to each line is the number of clock cycles taken.
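The counts here were read from the Crossworks debugger. Where that is not available, one common alternative on Cortex-M3 and M4 parts is the cycle counter in the core's DWT unit. The following is only a sketch: it assumes the part implements the counter, the helper names are invented for the example, and the registers are passed in as pointers so that the logic can be checked on a PC.

```c
#include <stdint.h>

/* Register addresses as defined by the ARMv7-M architecture. */
#define DEMCR_ADDR       0xE000EDFCu   /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR    0xE0001000u   /* DWT control register */
#define DWT_CYCCNT_ADDR  0xE0001004u   /* DWT cycle count register */

#define DEMCR_TRCENA     (1u << 24)    /* enables the DWT unit */
#define DWT_CYCCNTENA    (1u << 0)     /* enables the cycle counter */

/* Turn on the DWT unit, zero the counter and start it running. */
void cycle_counter_start(volatile uint32_t *demcr,
                         volatile uint32_t *dwt_ctrl,
                         volatile uint32_t *cyccnt)
{
    *demcr |= DEMCR_TRCENA;
    *cyccnt = 0;
    *dwt_ctrl |= DWT_CYCCNTENA;
}

/* Unsigned subtraction gives the right answer even if the 32-bit
   counter wrapped once between the two readings. */
uint32_t cycles_between(uint32_t start, uint32_t end)
{
    return end - start;
}

/* On the target (not runnable on a PC) the use would look like:
 *
 *   volatile uint32_t *cyccnt = (volatile uint32_t *)DWT_CYCCNT_ADDR;
 *   cycle_counter_start((volatile uint32_t *)DEMCR_ADDR,
 *                       (volatile uint32_t *)DWT_CTRL_ADDR, cyccnt);
 *   uint32_t t0 = *cyccnt;
 *   fZ = fX * fY;
 *   uint32_t t1 = *cyccnt;
 *   uint32_t taken = cycles_between(t0, t1);
 */
```

Counts taken this way include the cost of reading the counter itself, so they will not necessarily match the debugger's figures exactly.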

Here is the same code run on the STM32F407VG:

#include "stm32f4xx.h"
#include "float.h"
#include "math.h"

int main(void){
  float fX, fY, fZ;
  long lX, lY, lZ;
  lX = 123L;        //    2 cycles
  lY = 456L;        //    2 cycles
  lZ = lX * lY;     //    5 cycles
  fX = 123.456;     //    3 cycles
  fY = 9.99;        //    3 cycles
  fZ = fX * fY;     //    6 cycles
  fZ = sqrt(fY);    //   20 cycles
  fZ = sin(1.23);   //  124 cycles
}

I have no idea why the integer assignments should take 2 cycles in the second case. Perhaps I will look into that later. Meantime, a floating point multiply is 6 times faster using the FPU and about the same speed as a fixed point multiply. Not bad. Given that a typical fixed point multiply will need various adjustments to get the fractional parts back where they belong, it looks as if it will be very easy to do all the mouse control code in floating point using the new processor. A 20 cycle sqrt() is not too shabby either. Even complex operations like sin and log may be feasible, although there is so much memory that it may still be worth considering look-up tables for some problems.
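To show what those fixed point adjustments look like, here is a minimal sketch of a multiply in a signed 8.24 format. The format, type name and helper names are choices made for this example, not anything taken from the mouse code.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t q8_24_t;     /* 8 integer bits (including sign), 24 fractional bits */
#define Q_FRAC_BITS 24

/* Convert a small whole number to 8.24 format. */
q8_24_t q_from_int(int v)
{
    assert(v > -128 && v < 128);
    return (q8_24_t)v * (1 << Q_FRAC_BITS);
}

/* Multiply two 8.24 values. The raw product has 48 fractional bits,
   so it needs a 64-bit intermediate and then a shift right by 24 to put
   the binary point back where it belongs. There is no overflow check:
   the true result must fit in the range -128 to just under 128.
   Right-shifting a negative value is implementation-defined in C;
   GCC does an arithmetic shift, which is what is assumed here. */
q8_24_t q_mul(q8_24_t a, q8_24_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (q8_24_t)(wide >> Q_FRAC_BITS);
}
```

The widening, the shift and the range worry all disappear with the FPU, where the same job is just fZ = fX * fY. Note that 123.456 times 9.99 would not even fit in this particular format.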

That was all very encouraging. Good hardware and software compatibility should mean that it is very easy to upgrade to the STM32F4 device. Although I have yet to investigate, I am told that peripheral pin remapping is also much more flexible so that is another advantage.

13 thoughts on “STM32F4 – the first taste of speed”

  1. G Bulmer

    Pete, single precision ‘float’ operations exist in the maths library, which might avoid the need to use “weird” compiler flags.

    For example, single precision sin() is sinf(), and sqrt() is sqrtf().

  2. Peter Harrison Post author

    Indeed. But I am lazy and also keep forgetting to write 2.34f to specify a float. I shall try to be more tidy.

  3. Peter Harrison Post author

    Lest you think this only works on Windows, I just ran all the stuff above on my Mac. Worked just fine. On the Mac, no drivers are required for the ST-LINK/V2. In Crossworks, you will have to make sure that the target interface type is set to SWD.

    To use the FPU, the compiler needs to be told to use the hardware. This is a project setting in Crossworks (arm_fp_abi) which presumably adds a suitable switch to the compiler command line. Since Crossworks uses the GCC tools, this should work in a similar way for other toolsets.

  4. G Bulmer

    Pete, you wrote: “Meantime, a floating point multiply is 6 times faster using the FPU …”

    Assuming the “number of cycles” annotation is correct, then a floating point multiply (which includes loads and stores) is almost 7x faster on the FPU than in software:
    fZ = fX * fY; // 41 cycles - STM32F103RE
    fZ = fX * fY; // 6 cycles - STM32F407VG
    I assume both of those have a chunk of load and store, so if the values were in registers, the difference would be much greater.

    also “… and only 4 times longer than a fixed point multiply”, but the integer multiply:
    lZ = lX*lY; // 6 cycles - STM32F103RE
    is exactly the same number of cycles as the floating point multiply, and that integer multiply doesn’t have the ‘adjustment’ that a fixed point multiply would need (e.g. the result from a fixed point 8.24 x 8.24 needs to be adjusted). So, assuming the cycle counts are correct, a single precision FPU multiply is faster (in cycles) than a fixed-point multiply.

  5. Peter Harrison Post author

    How right you are. The code samples are a poor demonstration of relative speed and will be heavily dependent upon optimisations, register re-use, pipelining and all the other tricks.

    I must have been looking at the sqrt() function times when I wrote the bit about fixed point vs floating point. I have edited the original text now.
    Thanks for pointing that out.

  6. Peter Harrison Post author

    A quick test toggling pins indicates that a sequence of sqrt() function calls with volatile local variables can execute in about 24 cycles each on the processor. That is under 350ns at 72MHz or 3 million square roots per second. It takes about 20% longer with global variables. That is enough pointless timing I think. It is fast.

  7. Dave


    Has anyone gotten the ST-Link on the F4 Discovery board to work properly in uVision (v4.23)? In my case, it will not show the contents of floating point registers. I disabled the ST-Link and patched in a uLINK2 and the FPU registers display properly. Note – these are with the latest ST drivers (as of 2012/02/09). Thank you for any information you may provide!


  8. Azriel

    Are there any good sites for tutorials, or better insight into how to develop on the STM32 platform? I just ordered an educational board as well as the ST-Link V2 and I would like to do some research and learning before it arrives. I will also be developing on a Mac.

  9. Florian Augustin

    I’m also coding with Crossworks IDE and the STM32F4Discovery.
    I have a few problems getting the FPU unit running so it would be a great convenience for me if you could send me this small project as an email or something!

    Thank you very much if it’s possible!

  10. Peter Harrison Post author

    I will see what I can do but I am afraid it will probably not be soon as I have quite a lot on at the moment.

    The key part is configuring Crossworks to use the correct floating point setup I think. Check out the other comments.

  11. Florian Augustin

    Would be great, even for the integer operations I need more cycle time. I’m also running at 8 MHz and tried everything with flash, SRAM etc… Thank you!

  12. Absurdev

    I’m a bit curious, how did you manage to get the number of cycles for each line? Is it possible with the ST-LINK V2 debugger?

  13. Peter Harrison Post author

    I don’t think it is a function of the ST-LINK V2 as such. When I run the debugger in Crossworks, it gives me a cycle count. I don’t know how it gets that.
