Product group : Digital ICs
Product Sub-group : Digital Signal Processors
Code Placement of GSM/GPRS/EDGE Digital Baseband Processors
In many embedded systems, and especially in mobile handsets, the most important resources other than processor execution time are the buses and external memory interfaces of a device. The ability to profile an application on the basis of its bandwidth consumption when using instruction caches is therefore highly desirable. The method used to do this is complementary to a conventional cycle-based profile, and provides a means of optimising an application in a device with limited external memory interfaces. We find that, in a device with a constrained external interface, optimising the program-fetch bandwidth of an application that uses caches has the beneficial side-effect of reducing the application's execution time. As an example, this article profiles an implementation of the H.264 video decoder running on a Blackfin DSP, which is integrated on the AD6900, an advanced low-power mobile-handset SoC.
01/09/2006
Reference: 11000

In recent years, caches have become commonplace in DSPs and in embedded systems. Before caches were available, the development of embedded software required detailed management of code across all the available physical memory resources, such as on-chip SRAM, off-chip SRAM, SDRAM, and flash. This meant that a software designer had to locate all time-critical code modules according to heuristics based on run-time profiling. For instance, when a particular time-critical module is required, such as a data equaliser in a communications system, its code must be moved from slow off-chip storage to fast on-chip SRAM prior to execution. This pre-loading stage ensures that the module executes quickly and within predefined execution-time bounds.
However, this process is time-consuming, is a source of run-time system bugs, and can be exceedingly complex for all but relatively small systems. The multiplicity of wireless standards places a large burden on what already constitutes a highly complex hardware and software system, making explicit code placement impossible.
At the other end of the spectrum, if we were to design a system with an instruction cache without being attentive to the placement of the program, we would likely pay a relatively large overhead in the time required for the cache to fetch program code. In addition, the bandwidth demanded of the memory that stores the program would increase. This would be the case if, for example, we placed the entire program in external flash.
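To make the cost of careless placement concrete, here is a minimal sketch of a toy instruction-cache model that counts line fills for a stream of fetch addresses. It is not a model of the AD6900: the direct-mapped organisation, the 16 kByte cache size, the 32-byte line size, the function addresses, and the names `count_line_fills` and `loop_trace` are all assumptions chosen for illustration. Addresses listed in `resident_ranges` stand for code placed in on-chip SRAM, which never goes through the cache.

```python
def count_line_fills(fetch_addresses, cache_bytes=16 * 1024, line_bytes=32,
                     resident_ranges=()):
    """Count instruction-cache line fills for a sequence of fetch addresses.

    Toy direct-mapped model (an assumption, not the real hardware).
    Addresses inside resident_ranges are treated as code held in on-chip
    SRAM and therefore never cause a line fill.
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines            # tag currently held by each cache line
    fills = 0
    for addr in fetch_addresses:
        if any(lo <= addr < hi for lo, hi in resident_ranges):
            continue                     # served from on-chip SRAM
        line_addr = addr // line_bytes
        index = line_addr % num_lines
        tag = line_addr // num_lines
        if tags[index] != tag:           # miss: line fetched from external memory
            tags[index] = tag
            fills += 1
    return fills


def loop_trace(start, size, step=4):
    """Addresses fetched by one pass through a straight-line code region."""
    return list(range(start, start + size, step))


# Two hypothetical 1 kByte functions whose addresses map to the same cache
# lines (they are exactly one cache size apart), called alternately 10 times.
trace = []
for _ in range(10):
    trace += loop_trace(0x0000, 1024) + loop_trace(0x4000, 1024)

fills_all_cached = count_line_fills(trace)
fills_one_resident = count_line_fills(trace, resident_ranges=[(0x0000, 0x0400)])
print(fills_all_cached, fills_one_resident)
```

In this contrived trace the two functions evict each other on every call, so every call refetches all of its lines. Marking one of them as resident in on-chip SRAM removes its fills and also stops it from evicting the other, which is the effect the rest of the article exploits.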

AD6900 architectural description

For context, it is useful to briefly describe the AD6900 digital baseband (DBB) processor, shown in Figure 1. At the top right is the Blackfin sub-system, which consists of the Blackfin core, L1 code and data memories (configurable as caches or SRAM), L2 memory, the Blackfin DMA controller (referred to as the DSPDMA), and a set of DSP peripherals generally involved in acquiring and processing GSM data. The Blackfin sub-system is connected to the System Bus Interface Unit (SBIU), a multi-port crossbar providing parallel connections between the DSP and its L1 memories and the rest of the system. At the lower right is the ARM926EJ-S sub-system. The third level of on-chip memory in the system is called System RAM and is accessible by both the Blackfin and the ARM. External memory is accessed through an external bus controller, an SDRAM controller, and a NAND flash controller.

H.264 video codec

The H.264 video decoder, also referred to as MPEG-4 Part 10/AVC, is driving many of the new applications on the wireless handset. The H.264 decoder is used here as an example of a method of placing code in the AD6900 system because it has a high code-fetch-bandwidth requirement. As with the tools for many other DSPs, the Blackfin VDSP tools allow the capture of this type of profile information. Using this data, a code developer may choose to focus on and optimise the functions that consume the largest proportion of DSP cycles. In fact, in this implementation, these two functions, as well as several other highly cycle-intensive functions, have been optimised in assembly language. Overall, approximately 10% of the code base has been optimised in assembly, while 90% remains in C.
An important aspect of this profile is that it also contains information about the number of instruction-cache lines that are required on a per-function basis. For example, we see that the function _decode_residual, in addition to being the most cycle-consuming function, also causes the instruction cache to fetch a relatively large number of cache lines. This function alone generates 10.5% of the total number of cache-line fills required by the H.264 video decoder. Similarly, other functions, which may not be the most cycle-intensive in terms of DSP load, also require a relatively large number of instruction-cache fills. Table 1 highlights the three functions that require the largest proportion of line fills, which together account for almost 40% of the total. In general, we find that DSP cycle consumption and cache-line fills are not correlated: many functions with a small DSP load may require a relatively large number of cache fills, while other functions with a high DSP load may require very few line fills.
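The two views of the profile described above can be sketched as a small per-function record that is ranked once by cycles and once by line fills. This is only an illustration of the idea, not the format produced by the VDSP tools. Apart from the name `_decode_residual` and its 10.5% share of line fills, every function name and every number below is invented for the example and does not come from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionProfile:
    name: str
    cycles: int       # DSP cycles attributed to the function
    line_fills: int   # instruction-cache line fills caused by the function


def rank_by(profiles, key):
    """Return the profiles sorted in descending order of the given field."""
    return sorted(profiles, key=lambda p: getattr(p, key), reverse=True)


def shares(profiles, key):
    """Return each function's percentage share of the given field."""
    total = sum(getattr(p, key) for p in profiles)
    return {p.name: 100.0 * getattr(p, key) / total for p in profiles}


# Invented figures: only the 10.5% line-fill share of _decode_residual
# is taken from the article; the _hypo_* functions are placeholders.
profiles = [
    FunctionProfile("_decode_residual", cycles=400, line_fills=105),
    FunctionProfile("_hypo_small_but_scattered", cycles=50, line_fills=300),
    FunctionProfile("_hypo_tight_kernel", cycles=350, line_fills=20),
    FunctionProfile("_hypo_control_path", cycles=200, line_fills=575),
]

by_cycles = [p.name for p in rank_by(profiles, "cycles")]
by_fills = [p.name for p in rank_by(profiles, "line_fills")]
fill_share = shares(profiles, "line_fills")

print("ranked by cycles:    ", by_cycles)
print("ranked by line fills:", by_fills)
```

The two rankings differ: a function with a small cycle count can sit near the top of the line-fill ranking, and a tight, cycle-heavy kernel can sit at the bottom. That is the lack of correlation the profile is meant to expose.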
This information is useful because when the complete application code is placed in lower-level memory, such as off-chip flash memory, code fetch requires considerable bandwidth in the form of instruction-cache line fills. This code bandwidth contends for bus resources with the video and temporary data and tends to lower the efficiency of the decoder. Approximately 50% of the off-chip memory bandwidth required by the complete decoder is program fetch, while the remaining 50% is video data, table look-ups, and state information.
By placing the three functions highlighted in Table 1 in a memory that is closer to the DSP, such as L1 program memory, it is possible to reduce the off-chip code-fetch bandwidth by approximately 40%. More generally, Table 1 shows the profile sorted by cache-line fills in descending order. We see that some functions that do not constitute a high DSP load may still require a relatively large number of cache-line fills.
Using the profile presented in Table 1, the functions that consume the largest proportion of cache-line fills are selected and placed in L1 DSP memory. The number of functions depends on the requirements of other parts of the overall system and the amount of available L1 memory. In this particular example, 16 kBytes of L1 DSP memory is available, and using it to hold the top functions reduces the program-fetch bandwidth by 54%. As a consequence of this reduction in bandwidth, the overall cycle time of the H.264 decoder improves by 10%. The program size of the H.264 decoder is approximately 125 kBytes, so re-linking to move about 13% of the code into on-chip memory is enough to improve cycle performance as well.
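The selection step described above can be sketched as a simple greedy pass: take functions in descending order of line fills and keep each one that still fits in the L1 budget. This is one plausible reading of the procedure, not the authors' tool. The function names, sizes, and fill counts are hypothetical, and the model assumes that a function placed in L1 memory causes no off-chip line fills at all. The last lines redo the article's arithmetic, assuming the 54% reduction applies to the program-fetch half of the off-chip traffic and that data traffic is unchanged.

```python
def select_for_l1(functions, budget_bytes):
    """Greedy placement: highest line-fill count first, skip what does not fit.

    functions: list of (name, size_bytes, line_fills) tuples.
    Returns the chosen names and the number of bytes used.
    """
    chosen, used = [], 0
    for name, size, _fills in sorted(functions, key=lambda f: f[2], reverse=True):
        if used + size <= budget_bytes:
            chosen.append(name)
            used += size
    return chosen, used


def fill_reduction_percent(functions, chosen):
    """Percentage of line fills removed if the chosen functions cause none."""
    total = sum(fills for _name, _size, fills in functions)
    saved = sum(fills for name, _size, fills in functions if name in chosen)
    return 100.0 * saved / total


# Hypothetical profile entries: (name, code size in bytes, line fills).
functions = [
    ("_hypo_a", 6 * 1024, 300),
    ("_hypo_b", 12 * 1024, 250),
    ("_hypo_c", 8 * 1024, 200),
    ("_hypo_d", 2 * 1024, 150),
    ("_hypo_e", 4 * 1024, 100),
]

chosen, used = select_for_l1(functions, budget_bytes=16 * 1024)
reduction = fill_reduction_percent(functions, chosen)
print(chosen, used, reduction)

# Arithmetic on the figures quoted in the article.
code_fraction_percent = 100.0 * 16 / 125        # share of code moved on chip
total_offchip_reduction_percent = 50.0 * 0.54   # program fetch is ~50% of traffic
print(code_fraction_percent, total_offchip_reduction_percent)
```

Under these assumptions, 16 kBytes out of roughly 125 kBytes is 12.8% of the code, which the article rounds to 13%, and a 54% cut in program fetch corresponds to about a 27% cut in total off-chip bandwidth. A greedy pass by raw fill count is not optimal in general; ranking by line fills per byte of code is a common alternative when the budget is tight.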


Analog Devices Ltd.

Unit 3, Horizon Business Park 1, Brooklands Rd
KT13 0TJ Weybridge - United Kingdom -Surrey
tel: +44 01 932 358530
