CSI is a switched fabric and a natural fit for cache coherent non-uniform memory architectures (ccNUMA). However, simply recycling Intel’s existing MESI protocol and grafting it onto a ccNUMA system is far from efficient. The MESI protocol complements Intel’s older bus-based architecture and elegantly enforces coherency. But in a ccNUMA system, the MESI protocol would send many redundant messages between different nodes, often with unnecessarily high latency. In particular, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. However, the requesting processor only needs a single copy of the data, so the system is wasting a bit of bandwidth.
Intel's solution to this issue is rather elegant. They adapted the standard MESI protocol to include an additional state, the Forwarding (F) state, and changed the role of the Shared (S) state. In the MESIF protocol, only a single instance of a cache line may be in the F state, and that instance is the only one that may be duplicated [3]. Other caches may hold the data, but it will be in the Shared state and cannot be copied. In other words, the cache line in the F state is used to respond to any read requests, while the S state cache lines are now silent. This makes the line in the F state a first among equals when responding to snoop requests. Designating a single cache line to respond to requests substantially reduces coherency traffic when multiple copies of the data exist.
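The difference in snoop traffic can be illustrated with a small Python sketch. It is a toy model, not Intel's implementation: the function name `data_responders` and the rule that every Shared copy answers under MESI are assumptions taken from the worst case described above, where every location holding the line might respond with the data.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"
    F = "Forwarding"


def data_responders(peer_states, protocol):
    """Return the indices of peer caches that answer a read snoop with data.

    Assumption (toy model): under MESI every valid copy, including Shared
    ones, may reply; under MESIF Shared copies stay silent and only the
    single Forwarding copy (or an M/E copy) replies.
    """
    if protocol == "MESI":
        supplying = {State.M, State.E, State.S}
    elif protocol == "MESIF":
        supplying = {State.M, State.E, State.F}
    else:
        raise ValueError(f"unknown protocol: {protocol}")
    return [i for i, s in enumerate(peer_states) if s in supplying]


# Two peers hold a clean copy of the line, a third does not.
mesi_replies = data_responders([State.S, State.S, State.I], "MESI")
mesif_replies = data_responders([State.F, State.S, State.I], "MESIF")
print(len(mesi_replies), len(mesif_replies))
```

With two clean copies in the system, the MESI model produces two data responses and the MESIF model produces one.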
When a cache line in the F state is copied, the F state migrates to the newer copy, while the older one drops back to S. This has two advantages over pinning the F state to the original copy of the cache line. First, because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. In essence, this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.
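The migration rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function `serve_read` and the node names are hypothetical, only the F, S and I states of a single clean line are tracked, and the case where no other copy exists (which a real protocol would handle with the Exclusive state) is left out.

```python
def serve_read(states, requester):
    """Serve a read of one clean cache line and return who supplied the data.

    `states` maps a node name to "F", "S" or "I" and is updated in place.
    Assumption: M and E are not modelled, so a requester that is served
    from memory is simply given the F state.
    """
    if states[requester] != "I":
        return requester  # local hit, no snoop response needed

    # At most one node may hold the line in the F state.
    forwarder = next((n for n, s in states.items() if s == "F"), None)
    if forwarder is None:
        source = "memory"  # S copies stay silent, so memory supplies the data
    else:
        states[forwarder] = "S"  # the older copy drops back to Shared
        source = forwarder
    states[requester] = "F"  # the newest copy becomes the forwarder
    return source


line = {"A": "F", "B": "S", "C": "I", "D": "I"}
first = serve_read(line, "C")   # served by A, F moves to C
second = serve_read(line, "D")  # served by C, F moves to D
print(first, second, line)
```

Successive reads are served by different nodes (A, then C), which is the bandwidth-spreading effect described above, and the F state always sits with the most recently filled copy.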
Figure 4 demonstrates the advantages of MESIF over the traditional MESI protocol, reducing two responses to a single response (acknowledgements are not shown). Note that a peer node is simply a node in the system that contains a cache.
In general, MESIF is a significant step forward for Intel’s coherency protocol. However, there is at least one optimization which Intel did not pursue: the Owned state that is used in the MOESI protocol (found in the AMD Opteron). The O state is used to share dirty cache lines (i.e. lines that have been written to, where memory holds older, stale data), without writing back to memory.
Specifically, if a dirty cache line is in the M (modified) state, then another processor can request a copy. The dirty cache line switches to the Owned state, and a duplicate copy is made in the S state. As a result, any cache line in the O state must be written back to memory before it can be evicted, and the S state no longer implies that the cache line is clean. In comparison, a system using MESIF or MESI would change the cache line to the F or S state, copy it to the requesting cache and write the data back to memory; the O state avoids the write back, saving some bandwidth. It is unclear why Intel avoided using the O state in the newer coherency protocol for CSI. Perhaps the architects decided that the performance gain was too small to justify the additional complexity.
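The trade-off can be summarized as a short Python sketch of what happens when a read hits a line held in the M state. The function names are hypothetical, and the MESIF outcome (old holder in S, requester in F) is an assumption based on the rule that the F state moves to the newest copy.

```python
def share_dirty_line(protocol):
    """Model a remote read that hits a line held in the M state.

    Returns (old_holder_state, requester_state, memory_writebacks).
    """
    if protocol == "MOESI":
        # The holder keeps responsibility for the dirty data; memory stays stale.
        return ("O", "S", 0)
    if protocol == "MESIF":
        # Assumption: F goes to the newest copy, and the line is cleaned first.
        return ("S", "F", 1)
    if protocol == "MESI":
        return ("S", "S", 1)
    raise ValueError(f"unknown protocol: {protocol}")


def needs_writeback_on_evict(state):
    """Only the copy responsible for dirty data (M or O) writes back on eviction."""
    return state in {"M", "O"}


for name in ("MESI", "MESIF", "MOESI"):
    holder, requester, writebacks = share_dirty_line(name)
    print(name, holder, requester, writebacks, needs_writeback_on_evict(holder))
```

In this model MOESI defers the write back until the Owned copy is evicted, while MESI and MESIF pay for it at the moment the line is shared.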
Table 3 summarizes the different protocols and states for the MESI, MOESI and MESIF cache coherency protocols.