Instruction | Width | Cycles
movw | 32-bit | 1 cycle
movt | 32-bit | 1 cycle
mov.w | 32-bit | 1 cycle
mul.w | 32-bit | 1 cycle
udiv | 32-bit | 2 cycles
ldrh | 16-bit | 2 cycles
ldr | 16-bit | 2 cycles
str | 16-bit | 2 cycles
b | 16-bit | 2 cycles
cmp | 16-bit | 1 cycle
Test method:
We toggle one GPIO and take the GPIO low period with no extra code as the default (baseline) time. We then insert 50 ARM instructions in the middle of the GPIO low period, measure the new GPIO low time, and subtract the default GPIO low time. Dividing the result by the AHB clock period (1/20 MHz = 50 ns) gives the number of cycles taken by the 50 ARM instructions.
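The conversion from a scope reading to cycles can be sketched in C as below. This is only a host-side illustration of the arithmetic described above; the function name `elapsed_cycles` and its parameters are made up for this example and are not part of the original test project.

```c
#include <assert.h>

/* Convert a measured GPIO-low time into CPU cycles.
 * total_ns:    GPIO-low time with the test instructions inserted
 * baseline_ns: default GPIO-low time with nothing inserted
 * ahb_hz:      AHB clock in Hz (assumed equal to the core clock)
 * Hypothetical helper, named here for illustration only. */
double elapsed_cycles(double total_ns, double baseline_ns, double ahb_hz)
{
    assert(ahb_hz > 0.0);
    assert(total_ns >= baseline_ns);

    double period_ns = 1e9 / ahb_hz;             /* 50 ns at 20 MHz */
    return (total_ns - baseline_ns) / period_ns; /* cycles for the whole run */
}
```

For the 20 MHz "udiv" run in Flash, `elapsed_cycles(5448, 368, 20e6)` evaluates to 101.6, which is listed below as 101.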
AHB=20 MHz, Project release build configuration.
1) Code running in Flash, Default GPIO low period is 368ns.
a) Instruction only
i) 32-bit width, 2 cycles
(1) 50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles: (5448-368) / 50 = 101 (100 cycles in theory)
ii) 32-bit width, 1 cycle
(1) 50 “movw” instructions
__asm("movw r2, #41440");
Total cycles: (2888-368) / 50 = 50 (50 cycles in theory)
iii) 16-bit width, 2 cycles
iv) 16-bit width, 1 cycle
(1) 50 “cmp” instructions
__asm("cmp r2, r3");
Total cycles: (2868-368) / 50 = 50 (50 cycles in theory)
b) Instruction with data access (0x20003000).
i) 32-bit width, 2 cycles
ii) 32-bit width, 1 cycle
iii) 16-bit width, 2 cycles
Generally, load-store instructions take two cycles for the first access and one cycle for each additional access. Stores with immediate offsets take one cycle.
(1) 50 “str” instructions
__asm("str r2, [r7, #8]");
Total cycles: (2961-368) / 50 = 51 (51 cycles in theory)
(2) 50 “ldrh” instructions
__asm("ldrh r3, [r7, #0]");
Total cycles: (2966-368) / 50 = 51 (51 cycles in theory)
iv) 16-bit width, 1 cycle
2) Code running in SRAM, Default GPIO low period is 668ns.
a) Instruction only
i) 32-bit width, 2 cycles
(1) 50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles: (5708-668) / 50 = 100 (100 cycles in theory)
ii) 32-bit width, 1 cycle
(1) 50 “movw” instructions
__asm("movw r2, #41440");
Total cycles: (5708-668) / 50 = 100 (50 cycles in theory)
iii) 16-bit width, 2 cycles
iv) 16-bit width, 1 cycle
(1) 50 “cmp” instructions
__asm("cmp r2, r3");
Total cycles: (3185-668) / 50 = 50 (50 cycles in theory)
b) Instruction with data access (0x20003000).
i) 32-bit width, 2 cycles
ii) 32-bit width, 1 cycle
iii) 16-bit width, 2 cycles
Generally, load-store instructions take two cycles for the first access and one cycle for each additional access. Stores with immediate offsets take one cycle.
(1) 50 “str” instructions
__asm("str r2, [r7, #8]");
Total cycles: (7304-668) / 50 = 132 (?? cycles in theory)
(2) 50 “ldrh” instructions
__asm("ldrh r3, [r7, #0]");
Total cycles: (7005-668) / 50 = 126 (?? cycles in theory)
iv) 16-bit width, 1 cycle
The SRAM results are not what we expected, so let's analyze them. We found the following passage in the ARM Cortex-M3 Technical Reference Manual.
14.5.6. Pipelined instruction fetches
To provide a clean timing interface on the
System bus, instruction and vector fetch requests to this bus are registered.
This results in an additional cycle of latency because instructions fetched
from the System bus take two cycles. This also means that back-to-back
instruction fetches from the System bus are not possible.
Note:
Instruction fetch requests to the ICode bus are
not registered. Performance critical code must run from the ICode interface.
From the above, we know that when code runs from SRAM over the System bus, each instruction fetch needs two cycles. But why do some of the results still match our expectation? Let's go through them one by one.
1. “udiv”, 32-bit width, 2-cycle instruction.
We know the ARM Cortex-M3 has a 3-stage pipeline: fetch -> decode -> execute. The slowest stage determines how many cycles each instruction takes overall. The execute stage of “udiv” takes two cycles, so increasing the fetch stage to two cycles does not affect the final result, and the measurement matches our calculation.
2. “movw”, 32-bit width, 1-cycle instruction.
Following the explanation above, the fetch stage now takes 2 cycles, so the final result becomes twice what we calculated. I reconstructed the pipeline model on this basis and it seems reasonable.
3. “cmp”, 16-bit width, 1-cycle instruction.
All Thumb instructions are halfword aligned in memory, so two Thumb instructions are fetched at a time. For sequential code, an instruction fetch is performed every second cycle.
Here is the wording in the spec: "The PFU fetches instructions from the memory system that can supply one word each cycle. The PFU buffers up to three word fetches in its FIFO, which means that it can buffer up to three Thumb-2 instructions or six Thumb instructions."
So each fetch takes 2 cycles to read two Thumb instructions from SRAM into the PFU, while the processor takes one Thumb instruction from the PFU every clock, so the result makes sense.
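The reasoning in points 1 to 3 can be condensed into a small model: in steady state each instruction costs the larger of its execute time and the fetch time for the bytes it occupies. The C sketch below is an assumption-laden simplification written for this post (the function `model_cycles` and its parameters are hypothetical, and it ignores data accesses, branches, and pipeline fill), not something taken from the TRM.

```c
#include <assert.h>

/* Simplified steady-state model (an assumption for illustration):
 * one instruction completes every max(execute, fetch) cycles.
 * count:            number of identical back-to-back instructions
 * width_bits:       instruction encoding width, 16 or 32
 * exec_cycles:      cycles spent in the execute stage
 * word_fetch_cycles: cycles to fetch one 32-bit word
 *                    (1 on ICode, 2 on the System bus per the TRM text above) */
double model_cycles(int count, int width_bits, int exec_cycles, int word_fetch_cycles)
{
    assert(width_bits == 16 || width_bits == 32);

    /* One 32-bit word holds one 32-bit or two 16-bit instructions,
     * so a 16-bit instruction pays half of a word fetch. */
    double fetch = (double)word_fetch_cycles * width_bits / 32.0;
    double per_insn = (fetch > exec_cycles) ? fetch : (double)exec_cycles;

    return count * per_insn;
}
```

With a two-cycle word fetch, this gives 100 cycles for 50 "udiv", 100 for 50 "movw", and 50 for 50 "cmp", in line with the SRAM measurements; with a one-cycle word fetch it gives the theoretical 100, 50, and 50.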
4. “str” and “ldrh”, 16-bit width, 2-cycle instructions with data access.
We know that when the setup is not a Harvard architecture, memory cannot be accessed for data at the same time as code is fetched for execution. Even so, I failed to work out how the numbers we measured come about.
AHB=80 MHz, Project release build configuration.
1) Code running in Flash, Default GPIO low period is 120ns.
a) Instruction only
i) 32-bit width, 2 cycles
(1) 50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles: (1370-120) / 12.5 = 100 (100 cycles in theory)
ii) 32-bit width, 1 cycle
(1) 50 “movw” instructions
__asm("movw r2, #41440");
Total cycles: (1370-120) / 12.5 = 100 (50 cycles in theory)
iii) 16-bit width, 2 cycles
iv) 16-bit width, 1 cycle
(1) 50 “cmp” instructions
__asm("cmp r2, r3");
Total cycles: (761-120) / 12.5 = 51 (50 cycles in theory)
b) Instruction with data access (0x20003000).
i) 32-bit width, 2 cycles
ii) 32-bit width, 1 cycle
iii) 16-bit width, 2 cycles
Generally, load-store instructions take two cycles for the first access and one cycle for each additional access. Stores with immediate offsets take one cycle.
(1) 50 “str” instructions
__asm("str r2, [r7, #8]");
Total cycles: (757-120) / 12.5 = 51 (51 cycles in theory)
(2) 50 “ldrh” instructions
__asm("ldrh r3, [r7, #0]");
Total cycles: (781-120) / 12.5 = 53 (51 cycles in theory)
iv) 16-bit width, 1 cycle
AHB=60 MHz, Project release build configuration.
1) Code running in Flash, Default GPIO low period is 130ns.
a) Instruction only
i) 32-bit width, 2 cycles
(1) 50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles: (1803-130) / 16.7 = 100 (100 cycles in theory)
ii) 32-bit width, 1 cycle
(1) 50 “movw” instructions
__asm("movw r2, #41440");
Total cycles: (1378-130) / 16.7 = 75 (50 cycles in theory)
AHB=50 MHz, Project release build configuration.
1) Code running in Flash, Default GPIO low period is 160ns.
a) Instruction only
i) 32-bit width, 2 cycles
(1) 50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles: (2153-160) / 20 = 100 (100 cycles in theory)
ii) 32-bit width, 1 cycle
(1) 50 “movw” instructions
__asm("movw r2, #41440");
Total cycles: (1640-160) / 20 = 74 (50 cycles in theory)
(2) 50 “mul.w” instructions
__asm("mul.w r3, r2, r3");
Total cycles: (1648-160) / 20 = 74 (50 cycles in theory)
iii) 16-bit width, 1 cycle
(1) 50 “cmp” instructions
__asm("cmp r2, r3");
Total cycles: (1144-160) / 20 = 49 (50 cycles in theory)
I have no explanation for the results of the 32-bit width, 1-cycle instructions when running from Flash at the higher AHB clocks…
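To make the open question easier to see, the Flash "movw" totals reported above can be normalized to average cycles per instruction at each AHB clock. The C sketch below only restates numbers already in this post; the names `flash_movw`, `cycles_per_insn`, and `print_flash_movw` are made up for this example, and the code does not attempt to explain the trend.

```c
#include <assert.h>
#include <stdio.h>

/* Totals for 50 back-to-back "movw" instructions in Flash, as reported above. */
struct sample {
    int ahb_mhz;       /* AHB clock in MHz */
    int total_cycles;  /* measured cycles for the whole run of 50 */
};

const struct sample flash_movw[] = {
    {20, 50}, {50, 74}, {60, 75}, {80, 100},
};

/* Average cycles per instruction for a run of `count` identical instructions. */
double cycles_per_insn(int total_cycles, int count)
{
    assert(count > 0);
    return (double)total_cycles / count;
}

/* Print one line per clock setting. */
void print_flash_movw(void)
{
    size_t n = sizeof flash_movw / sizeof flash_movw[0];
    for (size_t i = 0; i < n; i++) {
        printf("%2d MHz: %.2f cycles per movw\n",
               flash_movw[i].ahb_mhz,
               cycles_per_insn(flash_movw[i].total_cycles, 50));
    }
}
```

Calling `print_flash_movw()` lists 1.00, 1.48, 1.50, and 2.00 cycles per "movw" at 20, 50, 60, and 80 MHz, while the "udiv" and "cmp" runs stay close to their theoretical values at every clock.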