TR-97-2


C. Zhang, X. Zhang, and Y. Yan  
"Two fast and high-associativity cache schemes" 

IEEE Micro, Vol. 17, No. 5, September/October 1997, pp. 40-49.
 
Abstract
--------

A traditional implementation of a set-associative cache has the
disadvantage of a longer access cycle time than that of a
direct-mapped cache. Several methods have been proposed for
implementing associativity in non-traditional ways, but most of them
achieve an associativity of only about two, and those with higher
associativity still have longer access cycle times or suffer from
larger average access times. In this paper, we first systematically
address the implementation issues of associativity and evaluate
existing implementation methods. We then propose two schemes for
implementing higher associativity: the Sequential Multi-Column Cache,
an extension of the Column-Associative Cache, and the Parallel
Multi-Column Cache. To achieve the same access cycle time as a
direct-mapped cache, both schemes organize the cache's data memory
into a single bank. We use a multiple-MRU-block technique to increase
the first-hit ratio and thus reduce the average access time. The
Parallel Multi-Column Cache performs tag checking in parallel, while
the Sequential Multi-Column Cache searches the places in a set
sequentially, using index information to filter out unnecessary
probes. At an associativity of 4, both schemes achieve the low miss
rate of a 4-way set-associative cache. Our simulation results using
ATUM traces show that both schemes effectively reduce the average
access time: with a 4K-byte cache and a miss penalty of 20 cycles,
they improve average access time over a direct-mapped cache by 9.8%
and 10.8% on average, respectively, with average relative
improvements of 130% and 153% over the Column-Associative Cache. The
improvement of the Sequential Multi-Column Cache in average access
time reaches 22.4% when the associativity is 8 and the miss penalty
increases to 100 cycles. Both schemes are effective for small and
large caches alike (1K bytes to 128K bytes).
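The sequential search with index-information filtering described in the
abstract can be illustrated with a toy model. The sketch below is a
simplifying assumption, not the paper's exact design: the address split,
set size, and swap policy are invented for illustration, and the
multiple-MRU-block technique (which raises the first-hit ratio) is
omitted. A single direct-mapped bank holds all blocks; an address maps to
a "major" location within a set, and per-set index information records
which other ways may hold blocks for that major location, so the
sequential search probes only those ways instead of the whole set.

```python
class SeqMultiColumnCache:
    """Toy sketch of a sequentially searched multi-column cache."""

    def __init__(self, num_sets=16, assoc=4):
        self.num_sets, self.assoc = num_sets, assoc
        # tags[s][w] = (tag, major_way) stored in way w of set s, or None
        self.tags = [[None] * assoc for _ in range(num_sets)]
        # selected[s][m] = ways of set s (besides m) that may hold blocks
        # whose major location is m -- the "index information" that
        # filters out unnecessary sequential probes
        self.selected = [[set() for _ in range(assoc)]
                         for _ in range(num_sets)]

    def _split(self, addr):
        m = addr % self.assoc                      # major way in the set
        s = (addr // self.assoc) % self.num_sets   # set index
        tag = addr // (self.assoc * self.num_sets)
        return s, m, tag

    def access(self, addr):
        """Return the probe count on a hit; on a miss, fill and return -1."""
        s, m, tag = self._split(addr)
        if self.tags[s][m] == (tag, m):            # first probe: major slot
            return 1
        probes = 1
        for w in sorted(self.selected[s][m]):      # filtered extra probes
            probes += 1
            if self.tags[s][w] == (tag, m):
                return probes
        self._fill(s, m, tag)
        return -1

    def _fill(self, s, m, tag):
        victim = self.tags[s][m]
        self.tags[s][m] = (tag, m)                 # new block takes major slot
        if victim is not None:
            vm = victim[1]
            self.selected[s][vm].discard(m)
            w = (m + 1) % self.assoc               # toy choice of swap target
            evicted = self.tags[s][w]
            if evicted is not None:
                self.selected[s][evicted[1]].discard(w)
            self.tags[s][w] = victim               # victim stays in the set
            if vm != w:
                self.selected[s][vm].add(w)
```

With two addresses that share a major location, the second fill displaces
the first block into another way of the set; a later access to the first
address misses on the first probe but, guided by the index information,
hits on the second probe rather than searching every way.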