Abstract. Simultaneous multithreading (put simply, the sharing of the execution resources of a superscalar processor between multiple execution threads) has recently become widespread via its introduction (under the name "Hyper-Threading") into Intel Pentium 4 processors. In this implementation, for reasons of efficiency and economy of processor area, the sharing of processor resources between threads extends beyond the execution units; of particular concern is that the threads share access to the memory caches. We demonstrate that this shared access to memory caches provides not only an easily used high bandwidth covert channel between threads, but also permits a malicious thread (operating, in theory, with limited privileges) to monitor the execution of another thread, allowing in many cases for theft of cryptographic keys. Finally, we provide some suggestions to processor designers, operating system vendors, and the authors of cryptographic software, of how this attack could be mitigated or eliminated entirely.

Key words and phrases. Side channels, simultaneous multithreading, caching.

1. Introduction

As integrated circuit fabrication technologies have improved, providing not only faster transistors but smaller transistors, processor designers have been met with two critical challenges. First, memory latencies have increased dramatically in relative terms; and second, while it is easy to spend extra transistors on building additional execution units, many programs have fairly limited instruction-level parallelism, which limits the extent to which additional execution resources can be utilized. Caches provide a partial solution to the first problem, while out-of-order execution provides a partial solution to the second.

In 1995, simultaneous multithreading was revived¹ in order to combat these two difficulties [12]. Where out-of-order execution allows instructions to be reordered (subject to maintaining architectural semantics) within a narrow window of perhaps a hundred instructions, simultaneous multithreading allows instructions to be reordered across threads; that is, rather than having the operating system perform context switches between two threads, it can schedule both threads simultaneously on the same processor, and instructions will be interleaved, dramatically increasing the utilization of existing execution resources.

¹ Simultaneous multithreading had existed since at least 1974 in theory [10], even if it had not yet been shown to be practically feasible.

On the 2.8 GHz Intel Pentium 4 with Hyper-Threading processor, with which the remainder of this paper is concerned², the two threads being executed on each processor share more than merely the execution units; of particular concern to us, they share access to the memory caches [8]. Caches have already been demonstrated to be cryptographically dangerous: many implementations of AES [9] are subject to timing attacks arising from the non-constancy of S-box lookup timings [1].

However, having caches shared between threads provides a vastly more dangerous avenue of attack.

2. Covert communication via paging

To see how shared caches can create a cryptographic side-channel, we first step back for a moment to a simpler problem, that of covert channels [7], and to one of the classic examples of such a channel: virtual memory paging. Consider two processes, known as the Trojan process and the Spy process, operating at different privilege levels on a multilevel secure system, but both with access to some large reference file (naturally, on a multilevel secure system this access would necessarily be read-only). The Trojan process now reads a subset of pages in this reference file, resulting in page faults which load the selected pages from disk into memory. Once this is complete (or even in the middle of this operation) the Spy process reads every page of the reference file and measures the time taken for each memory access.

Attempts to read pages which have been previously read by the Trojan process will complete very quickly, while those pages which have not already been read will incur the (easily measurable) cost of a disk access. In this manner, the Trojan process can repeatedly communicate one bit of information to the Spy process in the time it takes for a page to be loaded from disk into memory, up to a total number of bits equal to the size (in pages) of the shared reference file.

² We examine the 2.8 GHz Intel Pentium 4 with Hyper-Threading processor for reasons of availability, but expect that the results in this paper will apply equally to all processors with the same simultaneous multithreading and memory cache design.

CACHE MISSING FOR FUN AND PROFIT

If the two processes do not share any reference file, this approach will not work, but instead an opposite approach may be taken: instead of faulting pages into memory, the Trojan process can fault pages out of memory.

Assume that the Trojan and Spy processes each have an address space of more than half of the available system memory and the operating system uses a least-recently-used page eviction strategy. To transmit a "one" bit, the Trojan process reads its entire address space; to transmit a "zero" bit, the Trojan process spins for the same amount of time while only accessing a single page of memory. The Spy process now repeatedly measures the amount of time needed to read its entire address space. If the Trojan process was sending a "one" bit, then the operating system will have evicted pages owned by the Spy process from memory, and the necessary disk activity when those pages are accessed will provide an easily measurable time difference.
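The eviction-based channel just described can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not taken from the paper: memory is modeled as a strict LRU list of 100 page frames, each process owns 60 pages (more than half of memory), and a count of page faults stands in for the Spy's timing measurement. The names `LRUMemory`, `trojan_send`, `spy_receive`, and `transmit` are hypothetical.

```python
from collections import OrderedDict


class LRUMemory:
    """Toy model of physical memory with least-recently-used page eviction."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.resident = OrderedDict()  # (owner, page) keys, oldest first

    def access(self, owner, page):
        """Touch a page; return True if the access caused a page fault."""
        key = (owner, page)
        if key in self.resident:
            self.resident.move_to_end(key)  # mark as most recently used
            return False
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used page
        self.resident[key] = None
        return True


def trojan_send(mem, bit, pages):
    """Send one bit: sweep the whole address space for 1, touch one page for 0."""
    for page in (range(pages) if bit else [0]):
        mem.access("trojan", page)


def spy_receive(mem, pages, threshold):
    """Sweep the Spy's address space; many faults imply the Trojan sent a 1."""
    faults = sum(mem.access("spy", page) for page in range(pages))
    return 1 if faults > threshold else 0


def transmit(bits, memory_pages=100, pages_per_process=60):
    """Simulate sending a bit sequence from the Trojan to the Spy."""
    mem = LRUMemory(memory_pages)
    spy_receive(mem, pages_per_process, threshold=0)  # warm-up: load Spy pages
    received = []
    for bit in bits:
        trojan_send(mem, bit, pages_per_process)
        received.append(spy_receive(mem, pages_per_process, pages_per_process // 2))
    return received
```

Calling `transmit([1, 0, 1, 1, 0, 0, 1])` returns the same bit sequence in this model. On a real system the Spy would time its sweep rather than count faults, and noise from other processes would make the threshold a matter of calibration.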

While this covert channel has far lower bandwidth than the previous channel (it operates at a fraction of a bit per second, compared to a few hundred bits per second), it demonstrates how a shared cache can be used as a covert channel, even if the two communicating processes do not have shared access to any potentially cached data.

3. L1 cache missing

The L1 data cache in the Pentium 4 consists of 128 cache lines of 64 bytes each, organized into 32 4-way associative sets. This cache is completely shared between the two execution threads; as such, each of the 32 cache sets behaves in the same manner as the paging system discussed in the previous section: the threads cannot communicate by loading data into the cache, since no data is shared between the two threads³, but they can communicate via a timing channel by forcing each other's