ECI Cache Coherence Protocol

Note: This page tracks the ongoing effort to define the ECI CC protocol in a way that matches the collected traces

ECI Transaction control

Definition of ECI transactions


An ECI transaction is the set of all ECI messages passed between the nodes, together with all internal operations performed by each node involved in the transaction, that aim either to satisfy a memory request or to maintain the caches (for example, invalidation/eviction transactions). Taken from [2].


Implementation of transactions


_Note: The actual implementation of the transactions is not described in the ECI specification [2], and [1] describes it only in a simplified manner.
[6] has the most detailed description available of how messages are linked to transactions, and [7] implements this in C code. However, that simulator does not take the L2 cache structure of the ThunderX into account, i.e. it treats the entire L2 cache of a node as one single TAD. The following description is thus based on a combination of the CPU reference manual for ThunderX ([4]), the technical specification of Octeon 3 ([5]), and all previous efforts made by the Enzian team that I could find._



The required metadata for transactions is stored in in-flight buffers. One ThunderX CN88XX node has 256 such buffers, distributed equally among its 8 Tag and Data Units (TADs), i.e. 32 buffers per TAD. [4]


  • _(info) The reference manual at this point only covers the cache coherence model on a single node. It is possible that more buffers are available for CCPI, but I couldn't find any documentation on that._

Note that a transaction may require more than one in-flight buffer: one at the home node and one at the requester node.


On ECI messages, different identifiers are used depending on whether the requester is the home node (HReqID, 6-bit identifier) or a remote node (RReqID, 5-bit identifier). [2][3]


  • (warning) 5 bits would make sense to select one of the 32 buffers associated with a TAD. But then, how are the home transactions stored? Do they use a different set of in-flight buffers?
  • Looking at the trace, the HReqID is always in the range [32,63] when the message is a forward, but it can be anywhere in [0,63] otherwise. This has led me to the idea that there may be 32 additional in-flight buffers that are only used for CCPI and are reserved for forwarding remote requests. This idea has been verified empirically against the memory test traces.
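
The buffer arithmetic and the observed HReqID ranges can be written down as a small consistency check. This is only a sketch of the hypothesis above: the split into a lower range [0,31] and a forward range [32,63] is the empirical observation from the traces, not something documented in [2] or [4], and the function name hreqid_consistent is made up for illustration.

```python
BUFFERS_PER_NODE = 256  # ThunderX CN88XX, per [4]
TADS_PER_NODE = 8
BUFFERS_PER_TAD = BUFFERS_PER_NODE // TADS_PER_NODE  # 32, i.e. 5 bits

RREQID_BITS = 5  # remote requester ID width, per [2][3]
HREQID_BITS = 6  # home requester ID width, per [2][3]


def hreqid_consistent(is_forward: bool, hreqid: int) -> bool:
    """Check one message against the HReqID ranges observed in the traces.

    Assumption (empirical, not from the spec): forwards only use the
    upper half [32,63], all other messages may use the full range [0,63].
    """
    if not 0 <= hreqid < (1 << HREQID_BITS):
        return False  # does not fit into 6 bits at all
    if is_forward:
        return hreqid >= BUFFERS_PER_TAD
    return True
```

Running such a check over every message of a trace is one way to confirm or refute the idea of 32 extra forward buffers on other traces.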


In the ECI CC model described in [1], the metadata per transaction consists of the opcode (Command) and the memory address. (Is this sufficient? What about the node ID of the requester?)



In the ECI CC model simulated in [7], a simplified view of the L2 cache is presented. The metadata contains only the memory address, and there is no notion of TAD IDs or node IDs in the code.


ECI Transaction Identifiers

Note: Mostly my own interpretation but it seems to align with the traces.

Each ECI message in an Enzian system is part of exactly one transaction, according to the above definition. However, there is no global identifier which represents a transaction uniquely.

A combination of identifiers is used to track transactions across the Enzian system. To keep the mapping from messages to in-flight buffers, the following logic applies.

  • The node within the Enzian system is determined like this:
    • For a request, the sender holds the Remote Transaction meta info.
    • In case of a forward, the Home Transaction meta info is at the sender (= home node) and the Remote Transaction meta info is stored at the requester node, which is given by an explicit field (rnode) on forward messages.
    • For responses, the receiver node holds the transaction meta info (Home or Remote).
  • To find the correct TAD within a node, there are two cases: if the address is part of the message, it implicitly defines the TAD; if no address is contained in the message, an explicit field is added (named rtad in [3] or [ReqUnit] in [2][5]).

  • The in-flight buffer within a TAD is finally determined by the Requester ID, introduced for exactly this reason. It is always stored explicitly in the field(s) RReqID and/or HReqID (depending on whether the requester is a remote or home node w.r.t. the cache line)
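
The three lookup steps above (node, TAD, buffer) can be sketched as follows. This is my reading of the rules, not code from [7]. Several things are assumptions for illustration: the message is a dict with integer values, the keys "kind" and "address" are hypothetical names, a forward is assumed to carry both HReqID and RReqID, the same TAD index is assumed to apply at the home and the requester node, and the address-to-TAD mapping is passed in as a function because the actual mode is unknown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class BufferKey:
    node: int    # node that holds the transaction meta info
    tad: int     # TAD within that node
    req_id: int  # RReqID or HReqID, selects the buffer within the TAD


def buffer_keys(msg: Dict, sender: int, receiver: int,
                addr_to_tad: Callable[[int], int]) -> List[BufferKey]:
    """Return the in-flight buffer(s) a message refers to."""
    # TAD: implied by the address if present, else the explicit rtad field.
    tad = addr_to_tad(msg["address"]) if "address" in msg else msg["rtad"]
    kind = msg["kind"]
    if kind == "request":
        # The sender holds the Remote Transaction meta info.
        return [BufferKey(sender, tad, msg["RReqID"])]
    if kind == "forward":
        # Home meta info at the sender, remote meta info at rnode.
        return [BufferKey(sender, tad, msg["HReqID"]),
                BufferKey(msg["rnode"], tad, msg["RReqID"])]
    if kind == "response":
        # The receiver holds the meta info (Home or Remote).
        req_id = msg["RReqID"] if "RReqID" in msg else msg["HReqID"]
        return [BufferKey(receiver, tad, req_id)]
    raise ValueError(f"unknown message kind: {kind!r}")


def placeholder_tad(address: int) -> int:
    # NOT the real ThunderX mapping, only a stand-in for the example.
    return (address >> 7) & 0x7


keys = buffer_keys(
    {"kind": "forward", "address": 0x100, "rnode": 1,
     "HReqID": 40, "RReqID": 3},
    sender=0, receiver=1, addr_to_tad=placeholder_tad)
```

Using a frozen dataclass makes the key hashable, so it can index a dictionary of open in-flight buffers directly.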

Mapping addresses to TADs


The reference manual [4] describes two methods of distributing physical addresses to TADs: aliased and unaliased. The register field L2C_CTL[DISIDXALIAS] selects between them.


At the moment, I do not know which mode the L2 controllers were in when the traces were collected. Probably the reset value, but it's not documented what that would be.

A Rust command-line tool (and library) automating the translation of addresses to cache-set indices, TAD indices, and quad-group indices is available on GitLab for both modes.

In-flight buffer usage

A first step towards completely mapping ECI messages to ECI transactions is to find out during which time span each in-flight buffer is in use, and by which transaction.

Generally, a transaction starts with a request and ends with a response. But beyond traditional requests and responses, there are also eviction messages (VICx).


VICx messages never require an answer [5], but they may be used to answer forward requests. The only example of this seems to be ECI_MRSP_VICDHI, which is a merged message representing HAKI + VICD.


Insights of trace analysis

I have put the tools I currently use to analyse the traces here

Also, I will occasionally add output of my analysis here if I think there is a chance other people could make use of it.

Message occurrence

Looking at the BDK memory test traces, the most common ECI messages are:

  1. ECI_MRSP_PEMD (31.52%)
    • Response from owning node [5]
  2. ECI_MRSP_VICD (26.48%)
    • Remote L2 evicting line, changing from owned (O or E) to invalid (I) [5]
    • Probably sent to Home each time a remote-owned cache line must be evicted (e.g. when another cache line is loaded in its place)
    • With the Owned state disabled (ROWNED_MODE = false), this message seems to be spawned for each remote request on dirty cache lines (Modified state).
  3. ECI_MREQ_RLDX (16.91%)
    • Load allocating into Requester L2 as E [5]
    • This is basically a request for read + write access
  4. ECI_MREQ_RLDD (14.51%)
    • Remote Load Data [5]
    • Caching [5]
  5. ECI_MFWD_FEVX_EH (5.26%)
    • Forward for when home is evicting cache line in its RTG [5]
    • Home sends this to remotes, probably only when they are holding copies of the cache line
  6. ECI_MRSP_VICDHI (4.96%)
    • Response to FEVX_2H - effectively a combination of VICD+HAKI [5]
  7. ECI_MRSP_HAKI (0.32%)
    • To Home Ack [5]
    • This is essentially a negative acknowledge for a forward. It is sent when the state was (I) at the time the forward arrived.

These seven commands cover the entire analysed memory test trace (1GB_2) which captured 6'291'723 total messages.
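
A ranking like the one above can be reproduced with a few lines of Python. This is a minimal sketch, not the tool I actually used; it assumes the trace has already been decoded into one JSON object per line with a "Cmd" key, as in the trace excerpts further down this page. The function name command_shares is made up for illustration.

```python
import json
from collections import Counter


def command_shares(lines):
    """Count ECI commands and return (command, count, percent) tuples,
    most frequent first. Blank lines are skipped."""
    counts = Counter(json.loads(line)["Cmd"]
                     for line in lines if line.strip())
    total = sum(counts.values())
    return [(cmd, n, 100.0 * n / total)
            for cmd, n in counts.most_common()]


# Tiny synthetic input, only to show the call; not real trace data.
demo = ['{"Cmd": "ECI_MRSP_PEMD"}'] * 3 + ['{"Cmd": "ECI_MRSP_VICD"}']
shares = command_shares(demo)
```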

For a full list of commands in the trace, see here. Only a single message at the start of the trace (per direction) is unparsable, presumably because it has been cut off.

From the observed transitions, I conclude that the Owned state has not been used (ROWNED_MODE = false) in the memory test traces, which greatly reduces the number of distinct message commands.

Transaction recreation

With all my assumptions from above, it's possible to group the ECI messages of a trace into transitions of requests / responses.

These are the transitions observed on one of the BDK memory test traces (xxx_1GB_2.bin). The most common transition is a simple eviction message. Another big chunk is made up of requests and their corresponding responses. The remainder of the matched transitions are eviction forwards which are then acknowledged by the remote. Interestingly, in the captured trace, the answer to such eviction forwards always implies that the cache line has already been in Invalid state anyway.

  • ECI_MRSP_VICD (1660000 = 41.91%)
  • ECI_MREQ_RLDX => ECI_MRSP_PEMD (1060000 = 26.76%)
  • ECI_MREQ_RLDD => ECI_MRSP_PEMD (910000 = 22.97%)
  • ECI_MFWD_FEVX_EH => ECI_MRSP_VICDHI (310000 = 7.83%)
  • ECI_MFWD_FEVX_EH => ECI_MRSP_HAKI (21000 = 0.53%)

These transitions cover as much as feasible of the trace. The only unmatched messages are at the beginning (108 responses for which we have seen no request in the trace).
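
The grouping itself can be sketched as a single pass over the trace that keeps a table of open in-flight buffers. This is a simplified illustration under my assumptions from above, not the analysis tool itself: each message is reduced to a (command, key) pair, where the key is any hashable value identifying the in-flight buffer (for example node, TAD and requester ID), and only the commands seen in the memory test trace are classified. The function name recreate is made up.

```python
from collections import Counter

REQUESTS = {"ECI_MREQ_RLDX", "ECI_MREQ_RLDD"}
FORWARDS = {"ECI_MFWD_FEVX_EH"}
STANDALONE = {"ECI_MRSP_VICD"}  # evictions that need no answer


def recreate(messages):
    """Group (cmd, key) pairs into request => response patterns.

    Returns a Counter of patterns and the number of responses for
    which no opening request or forward was seen.
    """
    open_buffers = {}   # key -> command that opened the buffer
    patterns = Counter()
    unmatched = 0
    for cmd, key in messages:
        if cmd in STANDALONE:
            patterns[cmd] += 1
        elif cmd in REQUESTS or cmd in FORWARDS:
            open_buffers[key] = cmd  # a reused key overwrites the old entry
        else:
            opener = open_buffers.pop(key, None)
            if opener is None:
                unmatched += 1  # e.g. request was before the trace start
            else:
                patterns[f"{opener} => {cmd}"] += 1
    return patterns, unmatched


# Synthetic example, not real trace data.
demo = [
    ("ECI_MRSP_PEMD", ("n1", 0, 7)),      # response without request
    ("ECI_MREQ_RLDD", ("n1", 2, 3)),
    ("ECI_MRSP_VICD", ("n1", 5, 0)),
    ("ECI_MRSP_PEMD", ("n1", 2, 3)),
    ("ECI_MFWD_FEVX_EH", ("n0", 2, 40)),
    ("ECI_MRSP_HAKI", ("n0", 2, 40)),
]
patterns, unmatched = recreate(demo)
```

Responses that arrive before their request was captured end up in the unmatched count, which corresponds to the unmatched responses at the beginning of the trace.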

GINV/GSYNC/GSDN

These messages are used in a GlobalSync operation (ThunderX HRM, 2.9.3). A GlobalSync operation is used to make sure that all cores see exactly the same data, by ensuring that all previous cache invalidating messages have been executed and write buffers have been flushed.

In the freebsd_ssh_traffic_4lane_bidir_10G trace, those three make up 99% of the traffic and they always appear together. The first two are sent as a request and the third is sent back as a response.

Example from trace:

  1. A->B {"start": 805990.560000, "end": 805990.560000, "Cmd": "ECI_MREQ_GINV", "el": "0x1", "RReqID": "0x0", "subop": "0xa", "ns": "0x1", "vmid": "0x0", "asid": "0x0", "rtad": "0x4", "ppvid": "0x14", "payloadLength": "0", "payload": ""}
  2. A->B {"start": 806033.440000, "end": 806033.440000, "Cmd": "ECI_MREQ_GSYNC", "RReqID": "0x0", "rtad": "0x4", "ppvid": "0x14", "payloadLength": "0", "payload": ""}
  3. B->A {"start": 806143.360000, "end": 806143.360000, "Cmd": "ECI_MRSP_GSDN", "ns": "0x0", "rtad": "0x4", "ppvid": "0x14", "payloadLength": "0", "payload": ""}
  4. A->B {"start": 816153.120000, "end": 816153.120000, "Cmd": "ECI_MREQ_GINV", "el": "0x1", "RReqID": "0x0", "subop": "0xa", "ns": "0x1", "vmid": "0x0", "asid": "0x0", "rtad": "0x6", "ppvid": "0x2e", "payloadLength": "0", "payload": ""}
  5. A->B {"start": 816196.000000, "end": 816196.000000, "Cmd": "ECI_MREQ_GSYNC", "RReqID": "0x0", "rtad": "0x6", "ppvid": "0x2e", "payloadLength": "0", "payload": ""}
  6. B->A {"start": 816305.920000, "end": 816305.920000, "Cmd": "ECI_MRSP_GSDN", "ns": "0x0", "rtad": "0x6", "ppvid": "0x2e", "payloadLength": "0", "payload": ""}

We can see that there are matching rtad and ppvid fields in requests and responses.
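
Matching on these two fields is enough to group the messages of the example into GlobalSync operations. The following is a sketch under the assumption that (rtad, ppvid) identifies one outstanding operation at a time; the spec does not confirm this, and the message direction is ignored here. The function name group_global_syncs is made up, and the input is reduced to the three fields that matter.

```python
def group_global_syncs(messages):
    """Group GINV/GSYNC requests with the GSDN response that carries
    the same (rtad, ppvid) pair. Returns one command list per group."""
    pending = {}    # (rtad, ppvid) -> commands seen so far
    completed = []
    for msg in messages:
        key = (msg["rtad"], msg["ppvid"])
        if msg["Cmd"] == "ECI_MRSP_GSDN":
            if key in pending:
                completed.append(pending.pop(key) + [msg["Cmd"]])
        elif msg["Cmd"] in ("ECI_MREQ_GINV", "ECI_MREQ_GSYNC"):
            pending.setdefault(key, []).append(msg["Cmd"])
    return completed


# The six messages from the example above, reduced to three fields.
trace = [
    {"Cmd": "ECI_MREQ_GINV", "rtad": "0x4", "ppvid": "0x14"},
    {"Cmd": "ECI_MREQ_GSYNC", "rtad": "0x4", "ppvid": "0x14"},
    {"Cmd": "ECI_MRSP_GSDN", "rtad": "0x4", "ppvid": "0x14"},
    {"Cmd": "ECI_MREQ_GINV", "rtad": "0x6", "ppvid": "0x2e"},
    {"Cmd": "ECI_MREQ_GSYNC", "rtad": "0x6", "ppvid": "0x2e"},
    {"Cmd": "ECI_MRSP_GSDN", "rtad": "0x6", "ppvid": "0x2e"},
]
groups = group_global_syncs(trace)
```

For the example, this yields two groups, each consisting of GINV, GSYNC and GSDN in that order.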

The rtad field is commonly used by memory responses to route to the proper TAD, but requests usually use an address instead. I don't know how to find out which cache line this operates on, or whether it operates on cache lines at all.


The ppvid field is 6 bits wide and defined in the OCI header as follows:
typedef struct packed {
    logic [2:0] pp;
    logic [2:0] bus;
} oci_ppvid_t;


It describes the physical core number which a message originated from and to which a response should be returned. Apart from these memory operations, the field is only used for IO requests.
Types that contain this field:

  • oci_mreq_bcst_t (GINV/GSYNC)
  • oci_mrsp_bcst_t (GSDN)
  • oci_ireq_byte_t (IOBLD/IOBST/IOBSTA/IOBSTP/IOBSTPA/IAADD/IASET/IACLR/IASWP/IACAS/SLILD/SLIST)
  • oci_ireq_dma_t (IOBADDR/IOBADDRA/LMTST/LMTSTA)
  • oci_irsp_ack_t (IOBACK)
  • oci_irsp_rsp_t (IOBRSP/SLIRSP)
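
To read the ppvid values in the traces, the field can be split into its two members. This sketch assumes the usual SystemVerilog layout of a packed struct, where the first declared member (pp) occupies the most significant bits. How pp and bus together translate into a core number is not covered here. The function name decode_ppvid is made up.

```python
def decode_ppvid(ppvid: int) -> dict:
    """Split a 6-bit ppvid into the pp and bus members of oci_ppvid_t.

    Assumed layout: pp = bits [5:3], bus = bits [2:0].
    """
    if not 0 <= ppvid < 64:
        raise ValueError("ppvid must fit into 6 bits")
    return {"pp": (ppvid >> 3) & 0x7, "bus": ppvid & 0x7}


# The two ppvid values from the GlobalSync example on this page.
first = decode_ppvid(int("0x14", 16))
second = decode_ppvid(int("0x2e", 16))
```

Under this layout, 0x14 decodes to pp = 2, bus = 4 and 0x2e decodes to pp = 5, bus = 6.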

Sources
