Today I debugged why the armv5te code does not work. I finally found that Multi still assembles the following code; it seems Multi does not recognize #if 0.
; mov v1, #(1<<(COL_SHIFT-1))
; smlabt v2, ip, a4, v1 ;/* v2 = W4*col + (1<<(COL_SHIFT-1)) */
; smlabb v1, ip, a4, v1 ;/* v1 = W4*col + (1<<(COL_SHIFT-1)) */
; ldr a4, [a1, #(16*4)]
It works after commenting out the code. I tested the performance against simple_idct_arm.s: the put and add functions are 90% faster. However, when I put the armv5te code into the whole project, performance only improves by 7%. Why does this happen? The test code runs in OCRAM while the project code runs in SDRAM, so there are probably many memory access stalls. I might need to consider moving decode_slice MB into OCRAM.
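
A rough sanity check on those two numbers, using Amdahl's law. This is only a sketch under assumptions: it reads "90%" as a 1.9x speedup of the put and add functions and "7%" as a 1.07x speedup of the whole project, and it assumes the 1.9x carries over unchanged when running from SDRAM (which the memory stalls may well prevent). The function names are made up for illustration.

```python
def implied_fraction(local_speedup, overall_speedup):
    """Amdahl's law solved for f, the share of runtime spent in the optimized code.

    overall = 1 / ((1 - f) + f / local)  =>  f = (1 - 1/overall) / (1 - 1/local)
    """
    return (1 - 1 / overall_speedup) / (1 - 1 / local_speedup)


def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when only `fraction` of runtime gets faster."""
    return 1 / ((1 - fraction) + fraction / local_speedup)


# Assumed readings of the measurements above: 1.9x local, 1.07x overall.
f = implied_fraction(1.9, 1.07)
print(f"implied IDCT share of runtime: {f:.1%}")
```

Under these assumptions the IDCT functions would account for roughly 14% of total runtime. If profiling shows their share is much larger than that, the missing gain is more likely lost to SDRAM stalls, which would support moving the hot code into OCRAM.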