While DES is often criticized for its relatively short 56-bit key, a simple algorithm variant called triple-DES (3DES) extends the key to 168 bits by iterating the DES algorithm three times with different keys. Even though DES has been around since 1977, there is still no known attack on 3DES that is much faster than exhaustive search. Consequently, the 3DES encryption algorithm is used for both the Secure Shell tools (SSH) and the Internet Protocol for Security (IPSEC). Both of these security applications require high-speed encryption and decryption and both are widely implemented in embedded systems such as IP routers and set-top boxes. General-purpose microprocessors are too slow to encrypt or decrypt high-speed digital-media and Internet-packet streams. However, tailoring a configurable microprocessor's ISA by adding only four instructions can improve DES algorithm performance by 40x to 80x. The new instructions add about 5,000 gates to the processor core and do not reduce the processor's maximum clock rate. The peak throughput of the improved DES algorithm on a 150-MHz Xtensa processor is 378Mbit/s.The DES algorithm, described in Federal Information Processing Standard Publication46, dated January 15, 1977, involves many operations, which can be simplified in the diagram of Figure 1. The function IP() performs an initial fixed permutation on the 64-bit input data (plaintext), returning two 32-bit words called R and L. The eight byte-level parity bits in the original key are discarded, producing a 56 bit key that's split into two 28-bit subkeys C and D by function PC1(). Function PC2() computes 48-bit subkey Ki from C and D by compressing them after they are modified by bit-wise rotation. The encryption iteration repeats 16 times. Iterations 0, 1, 8, and 15 rotate the subkeys by 1 bit and the remaining iterations rotate by 2 bits.Function E() takes 32-bit value R and returns a 48-bit value by duplicating 16 bits. Complex function F() takes an exclusive-OR R of E(R) with the iteration's subkey and returns a 32-bit value that is permuted and then exclusive-ORed with 32-bit value L. This computation defines the R value for the next iteration. Function F() partitions the 48-bit subkey input into eight 6-bit values that are used as indexes into 8 different 64x4-bit lookup tables to assemble a 32-bit quantity. The FP() function performs a final fixed permutation on the R and L values and returns the 64-bit output data (ciphertext).Firmware DES implementations run slowly on general-purpose RISC processors due to the algorithm's numerous bit permutations. However, these permutations are easy to implement in hardware and they execute quickly because each permutation corresponds to simple wire scrambling (no gates required). Likewise, the DES algorithm's 28-bit rotations are inefficient on general-purpose, 32-bit RISC microprocessors, which lack 28-bit rotation instructions. However, these rotations are easy to perform in custom hardware. Similarly, the bit-packing, unpacking, and table-lookups required for F() are slow in firmware but are easy to implement in hardware and execute quickly. Tailoring a microprocessor's ISA to improve algorithm performance essentially adds DES -specific hardware to the processor, which gives the processor the execution speed of DES hardware while retaining the processor's firmware programmability. Three new processor registers and four new instructions are needed to accelerate the DES code. The three new registers are:C at 28 bits: Stores the C value for the DES instruction.D at 28 bits: Stores the D value for the DES instruction.DATA at 64 bits: Stores the data word for encryption or decryption.The assembly language descriptions of the custom DES instructions are:STKEY, as, at: Move 64 bits from (as, at) to the C and D registersSTDATA, as, at: Move 64 bits from (as, at) to the DATA registerLDDATA, ar, imm1: Move 32 bits of DATA to arimm1=0 to return the low 32-bits of DATAimm1=1 to return the high 32-bits of DATADES, imm2: Perform one iteration of the DES inner-loopimm2 = 0 for encryption step with rotate by 1imm2 = 1 for encryption step with rotate by 2imm2 = 2 for decryption step with rotate by 1imm2 = 3 for decryption step with rotate by 2When comparing the cycle counts for each DES block size, it comes out that adding custom DES registers and instructions reduces the code-and-data memory footprint from 5560 bytes to only 156 bytes. Note that there is an interesting trade-off here. ISA tailoring adds 5,000 gates to the processor but saves 45,216 bits of RAM. Hence, the net silicon cost for the processor with the DES extensions is somewhat less than 5,000 gates yet the resulting DES algorithm runs between 40 and 80 times faster than the tuned implementation running on the un-augmented processor.The custom DES instructions and registers are added to the processor core using the HDL-like TIE (Tensilica Instruction Extension) language. Instruction and register extensions described in TIE are fed to the automated Xtensa Processor Generator which then produces both the RTL description for the extended processor and a customized tool suite including a compiler, assembler, linker, debugger, and instruction-set simulator. Processor generation takes about an hour. The custom-processor RTL can be used in an SOC design just like any processor core.
New crypto on the block: AES
In 1997, the US National Institute of Standards and Technology (NIST) held a public contest to design the Advanced Encryption Standard (AES) cipher, a successor to the aging DES cipher. NIST chose the Rijndael block cipher for AES from amongst several candidates. The Rjindael cipher handles variable block lengths and key lengths of 128, 192, or 256 bits (The default block length is 128 bits). The AES cipher encrypts plaintext to ciphertext on a block-by-block basis. Figure 2 shows a 128-bit example.The plaintext block is conceptually stored and operated upon in a 16-byte state array. Plaintext bytes are stored in the upper left square of the 4x4-byte array, moving downward in the left column and then proceeding to the next columns until the lower right square is filled. The AES cipher performs transformations on rows or columns of the state array. Like the DES cipher, the AES cipher is based on a secret key. The same encryption key is used to decrypt ciphertext. Prior to encryption, the 128-bit AES key is expanded to 10 additional 128-bit keys. The original key along with the 10 expanded keys is called the key schedule.AES encryption consists of four transformations:SubBytes -performs byte-wise substitutions for each byte in the state array using a fixed table.ShiftRows -rearranges the bytes within the state array.MixColumns -treats each column of the state array as a polynomial and performs a Galois field multiplication with a fixed 4x4 matrix.AddRoundKey -performs byte-wise Galois field additions using bytes from the key schedule.Porting the AES cipher to a general-purpose RISC processor is simple but the performance of such an implementation is usually unacceptable. Consequently, SOC designers will generally rule out firmware-only implementations except when bandwidth requirements are low. Hardware AES encryption engines can be designed as stand-alone hardware blocks or as coprocessors using combinational logic and a state machine to schedule the e transformations. The benefits of this approach are high performance and power efficiency. However, hardware-based acceleration complicates an SOC's design and adds to the verification burden. Further, hardware-based acceleration has no flexibility once the chip has been fabricated. The chip must be respun if any changes are needed. ISA tailoring, as described above in the DES example, can transform a configurable processor into a fast, efficient AES engine by adding AES-specific registers and instructions to the processor's ISA. A 32-bit processor has a 32-bit general-purpose register file but the AES algorithm performs transformations on a 128-bit state array. Consequently, a 128-bit register file better fits AES transformations by allowing instructions to operate on all of a state array's 128 bits concurrently. Combined with wide, 128-bit buses, the wide register file also allows the state array to be transferred to and from memory with one memory transaction, rather than four. Xtensa processor can be configured with such wide buses and the TIE language allows the processors to be extended with custom register files as wide as 1024 bits. The AES engine exploits this capability by adding a 128-bit wide register file named AES.It's not necessary to write assembly language to use the added AES register file. A C compiler that supports the AES register file as a custom data type is automatically generated with the processor RTL. Just like C's built-in data types (int, short, char, etc.), the extended C compiler allocates program variables of the custom AES data type to the custom AES register file.The AES algorithm employs four transformations. The AES ShiftRow transformation rearranges the order of bytes within one row the state array. Custom TIE extensions can significantly accelerate this transformation, taking advantage of the extended processor's 128-bit data path to perform all of a row's byte transpositions in parallel using a custom 1-cycle instruction instead of performing this transformation byte by byte.The AES SubBytes transformation performs a byte-wise substitution for each byte in the state array using a fixed 256-byte table. Firmware-only implementations require a byte load and a table lookup followed by a store for each byte in the state array. The table lookup consists of a load operation using an effective address computed by adding the loaded byte value to the table's base address. Consequently, this transformation normally requires 4 operations per byte for a total of 64 operations when implemented completely in firmware. A custom SubBytes instruction executes all 64 operations in one cycle using using TIE tables, which are "hard-wired" inside the processor's execution unit. TIE tables are internal to the processor's execution unit and are accessible in fractions of a clock cycle. A custom TIE instruction can access several TIE tables simultaneously.The AES AddRoundKey transformation performs a byte-wise Galois field addition using bytes from the key schedule. A Galois field addition is equivalent to performing a simple bitwise XOR operation on all 128 bits in the state array with all 128 bits in the key schedule. The firmware-only C implementation requires four loads to read the state array, four loads to read the 128-bit key schedule, four 32-bit XOR operations, and four stores to write out the results. Therefore, a firmware-based version of this transform requires 16 operations. A custom 2-cycle TIE instruction achieves significantly better performance by loading the 128-bit key schedule from memory and performing the bit-wise XOR with the state array held in the AES register.The AES MixColumns transformation is a permutation of the state array performed on each of the four columns in the state array. Each 4-byte column is multiplied with a constant 4x4 matrix and the result of the matrix multiplication becomes a transformed column in the state array. Of all the AES cipher transforms, this AES transformation consumes the largest number of processor cycles when implemented with firmware. When compiled, the firmware-only version of the MixColumns transform consists of approximately 250 individual operations consisting of byte loads and stores, 32-bit XOR operations, comparisons, and shifts. A custom TIE instruction can perform all of these operations in one clock cycle.Writing firmware that takes advantage of custom TIE instructions is straightforward. It is not necessary to write in assembly code because the C compiler that is automatically generated with the processor RTL supports all of the processor's custom TIE extensions. The C compiler allocates variables of the 128-bit AES type to the AES register file and the custom TIE AES transformation instructions are automatically available through C intrinsic functions.
More than one block at a time
The AES TIE optimizations increase performance by taking advantage of the microscopic parallelism inherent in the algorithm's transformations. No transform calculation depends upon the results of other calculations in the state array so the custom TIE instructions can perform all bytes transformations in parallel. The AES cipher also has macroscopic parallelism; Encryption of one AES block doesn't depend on previously encrypted blocks. It's possible to design an AES engine that encodes multiple blocks simultaneously rather than sequentially. The AES engine described above encrypts blocks at a rate of 16 cycles per block. An AES engine that exploits the algorithm's macroscopic parallelism could deliver even more performance by widening the AES register file and instruction data paths in 128-bit multiples. For example, the AES register file width and simple changes to the custom TIE instructions could encrypt two consecutive blocks simultaneously, doubling performance. However, doubling the performance also doubles the area required to support the custom hardware created from the TIE descriptions so there is a tradeoff between cost and performance.AES encryption on the TIE enhanced processor requires approximately 60 cycles per block. Custom TIE extensions can increase performance by more than 80x relative to running the AES cipher in firmware on an unmodified Xtensa core but add only 35K gates to the processor core. Fabricated in a 130nm process technology, the described AES engine could achieve a maximum operating frequency of more than 250MHz, making the engine capable of encrypting 4.15 million plaintext blocks/second (530Mbit/s). By exploiting macroscopic parallelism, the encryption rate can be further boosted to more than 18 million blocks/second (more than 2Gbit/s).19167a.tif19167b.tif