ECI - Enzian Coherent Interconnect
- Purpose
- Description
- Documentation
Purpose
Enzian Coherent Interconnect is a module that provides an information exchange channel between the Processing System (PS), the Cavium ThunderX CPU, and the Programmable Logic (PL), the Xilinx FPGA. This document describes the physical and logical implementation of this channel.
Description
Physical
Notes:
- The EBB brings the CCPI interface out on Amphenol XCede backplane connectors (4 pair, 8 column).
- The interface runs at 10.3125 Gbps, using the 10GBASE-KR (backplane Ethernet) channel specification (Annex 69B).
- The channel is specified from 50MHz to 15GHz, with the real requirements up to 6GHz.
- The full link is 24 bidirectional lanes (48 differential pairs, and 96 signal wires).
- Scales in blocks of 4 lanes.
- Scales down to 2.5Gbps (according to the board firmware).
- There are three XCede connectors, each carrying 2 blocks (they're about half populated, and it is possible to escape route single-sided).
- The FMC+ connector on the VCU118 provides 24 serial lanes, while the FMC provides 8.
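The lane figures above can be turned into a quick raw-rate estimate. This is only a sketch: it multiplies the per-lane line rate by the lane count, per direction, and ignores any line-coding or protocol overhead, which these notes do not quantify. The names are illustrative.

```c
/* Raw line rate per direction, before any encoding or protocol overhead. */
#define LANE_RATE_GBPS  10.3125 /* per-lane rate from the notes above */
#define LANES_PER_BLOCK 4       /* the link scales in blocks of 4 lanes */

static double raw_link_rate_gbps(unsigned blocks)
{
    return blocks * LANES_PER_BLOCK * LANE_RATE_GBPS;
}
```

With all 6 blocks (24 lanes) this gives 247.5 Gbps per direction; a single block gives 41.25 Gbps.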
Digital
We have Verilog sources from Cavium for the CCPI protocol. Cavium have suggested that we would be better off using them as a reference for the protocol than trying to synthesise them directly.
Details of our attempts to bring up CCPI on an FPGA are under Enzian/CCPIBringup.
Metaframes
Frames
OCX defines data blocks of 64 bytes. Each block consists of seven 64-bit data words and one 64-bit control word. The block formats are the same on the receive and transmit sides. There are three block formats: IDLE, SYNC, and data block (CRED_*), where the data block type distinguishes between the high and low VC groups.
In the control word, vc_n encodes the VC number of the n-th data word in the block, e.g. vc_2 = 0xa means that data2 belongs to VC 10.
constant btype = {
    IDLE    = 0b111;
    CRED_LO = 0b100;
    CRED_HI = 0b101;
    SYNC    = 0b110;
};

datatype blk_ctrl {
    btype   3;
    ack     1;
    credits 8;
    vc_0    4;
    vc_1    4;
    vc_2    4;
    vc_3    4;
    vc_4    4;
    vc_5    4;
    vc_6    4;
    crc     24;
};

datatype block_format {
    data0   64;
    data1   64;
    data2   64;
    data3   64;
    data4   64;
    data5   64;
    data6   64;
    control blk_ctrl;
};
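As an illustration of the control word layout, the following C sketch extracts each field from a 64-bit control word. It assumes the fields are packed MSB first in the order listed in blk_ctrl (btype in bits 63:61, ack in bit 60, and so on), which agrees with the packet-matching code on ctrl[63:60] but has not been confirmed for the position of the individual vc_n fields. All function and enum names are hypothetical.

```c
#include <stdint.h>

/* Block types as listed in the btype constant. */
enum ocx_btype {
    OCX_CRED_LO = 0x4, /* 0b100 */
    OCX_CRED_HI = 0x5, /* 0b101 */
    OCX_SYNC    = 0x6, /* 0b110 */
    OCX_IDLE    = 0x7  /* 0b111 */
};

/* Assumed layout (MSB first): btype[63:61] ack[60] credits[59:52]
 * vc_0[51:48] ... vc_6[27:24] crc[23:0]. */
static inline unsigned ctrl_btype(uint64_t ctrl)   { return (unsigned)(ctrl >> 61) & 0x7; }
static inline unsigned ctrl_ack(uint64_t ctrl)     { return (unsigned)(ctrl >> 60) & 0x1; }
static inline unsigned ctrl_credits(uint64_t ctrl) { return (unsigned)(ctrl >> 52) & 0xff; }
static inline uint32_t ctrl_crc(uint64_t ctrl)     { return (uint32_t)(ctrl & 0xffffff); }

/* VC number of data word n (0..6); idle blocks set every VC field to 0xf. */
static inline unsigned ctrl_vc(uint64_t ctrl, unsigned n)
{
    return (unsigned)(ctrl >> (48 - 4 * n)) & 0xf;
}
```

For example, a control word with vc_2 = 0xa makes ctrl_vc(ctrl, 2) return 10, meaning data2 belongs to VC 10.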
Data Block Format
The following block formats are taken from the Verilog code. The documentation seems to differ.
Note, the data words can be encrypted.
Data0 | Data |
Data1 | Data |
Data2 | Data |
Data3 | Data |
Data4 | Data |
Data5 | Data |
Data6 | Data |
Control | CRED_HI / CRED_LO (3) | ACK (1) | CREDITS (8) | blk_vc[6] (6x4) | vc_x (4) | CRC (24) |
CRED_HI / CRED_LO define whether the credits belong to the hi (0-7) or the (8-11) VCs. There are only these two groups; there are no per-VC credits.
blk_vc stores the VCs for the first three 128-bit words, that is, six 4-bit fields for data words 0 to 5. For example, blk_vc[0] = 0x3 means that the first data word belongs to VC 3.
blk_vc <= (!rrst_n || black_hole || force_idle) ? {24{1'b1}} :
          (!frc__stall && val_x && eob)         ? {blk_vc[15:0], vc_x, 4'hf} :
          (!frc__stall && val_x)                ? {blk_vc[15:0], vc_x, byp__blk_vc1} :
          (!frc__stall && eob)                  ? {blk_vc[15:0], byp__blk_vc1, 4'hf} :
          (!frc__stall)                         ? {blk_vc[15:0], byp__blk_vc1, byp__blk_vc0} :
                                                  blk_vc;

vc_x <= (black_hole)   ? 4'd15 :
        (frc__stall)   ? vc_x :
        (!blk_adv)     ? vc_x :
        (val_x && eob) ? byp__blk_vc1 :
                         byp__blk_vc0;
vc_x stores the VC for the last data word, which is taken from one of the two inputs to this module (byp__blk_vc0 or byp__blk_vc1).
It seems a bit weird that, in the case val_x && eob, vc_x appears multiple times.
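To make the register update easier to follow, here is a small software model of the two Verilog assignments. It is a sketch, not a verified equivalent of the hardware: the struct and function names are made up, and it assumes both registers are updated from their current values in the same clock edge, as non-blocking assignments do.

```c
#include <stdbool.h>
#include <stdint.h>

struct vc_regs {
    uint32_t blk_vc; /* 24 bits: six 4-bit VC fields */
    uint8_t  vc_x;   /* 4 bits */
};

struct vc_inputs {
    bool rrst_n, black_hole, force_idle, frc_stall, val_x, eob, blk_adv;
    uint8_t byp_vc0, byp_vc1; /* byp__blk_vc0 / byp__blk_vc1, 4 bits each */
};

/* Computes the register values after one clock edge from the current ones. */
static struct vc_regs vc_regs_next(struct vc_regs cur, struct vc_inputs in)
{
    struct vc_regs nxt = cur;
    uint32_t keep = (cur.blk_vc & 0xffff) << 8; /* blk_vc[15:0] moves up two fields */
    uint32_t v0 = in.byp_vc0 & 0xf, v1 = in.byp_vc1 & 0xf, vx = cur.vc_x & 0xf;

    if (!in.rrst_n || in.black_hole || in.force_idle)
        nxt.blk_vc = 0xffffff;
    else if (!in.frc_stall && in.val_x && in.eob)
        nxt.blk_vc = keep | (vx << 4) | 0xf;
    else if (!in.frc_stall && in.val_x)
        nxt.blk_vc = keep | (vx << 4) | v1;
    else if (!in.frc_stall && in.eob)
        nxt.blk_vc = keep | (v1 << 4) | 0xf;
    else if (!in.frc_stall)
        nxt.blk_vc = keep | (v1 << 4) | v0;
    /* otherwise (stalled): blk_vc holds its value */

    if (in.black_hole)
        nxt.vc_x = 0xf;
    else if (in.frc_stall || !in.blk_adv)
        nxt.vc_x = cur.vc_x;
    else if (in.val_x && in.eob)
        nxt.vc_x = (uint8_t)v1;
    else
        nxt.vc_x = (uint8_t)v0;
    return nxt;
}
```

Stepping the model with a few input combinations shows how each update shifts two new 4-bit fields into blk_vc.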
Idle Block Format
There is no credit return, and all VC fields are 0xf. ACKs are allowed, and the data words are unused. IDLE blocks are generated when the state machine is in the RUN state. They neither return credits nor increment TXSEQ. An IDLE block is issued when the idle_blk signal has been set.
Data0 | Unused |
Data1 | Unused |
Data2 | Unused |
Data3 | Unused |
Data4 | Unused |
Data5 | Unused |
Data6 | Unused |
Control | IDLE | ACK | 8'b0 | retry_ptr (8) | RXSEQ | 12'b0 | CRC |
retry_ptr is the number of retries, and rx_seq is incremented for each received block.
Sync Block Format
Sync blocks are issued when the link state machine is in the INIT or RETRY state.
Note, the Verilog code does not seem to transmit the data fields.
Init:
Data0 | Unused |
Data1 | Unused |
Data2 | Unused |
Data3 | Unused |
Data4 | Unused |
Data5 | Unused |
Data6 | Unused |
Control | SYNC (3) | ACK (1) | 7'b0 | SM_REQ (1) | 28'b0 | CRC (24) |
ACK is set when the InitAck state is reached. SM_REQ is set when the SM should transition from IREQ to IACK.
Retry:
Data0 | Unused |
Data1 | Unused |
Data2 | Unused |
Data3 | Unused |
Data4 | Unused |
Data5 | Unused |
Data6 | Unused |
Control | SYNC (3) | ACK (1) | 7'b1 | SM_REQ (1), retry_ptr (8), rx_seq (8) | 12'b0 | CRC (24) |
SM_REQ will be TRUE if the SM is in the IREQ state or in the RREQ state, and FALSE if it's in the IACK, RUN, RACK or RPLY state.
retry_ptr is the number of retries, and rx_seq is incremented for each received block.
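The IDLE and SYNC control words can be assembled as in the following C sketch. It assumes the fields are packed MSB first in the order shown in the tables, that 7'b1 means the 7-bit value 1, and that RXSEQ in the IDLE block is 8 bits wide like rx_seq in the retry SYNC block. The CRC field is left at zero because the CRC computation is not described here. All names are hypothetical.

```c
#include <stdint.h>

#define OCX_BTYPE_SYNC 0x6u /* 0b110 */
#define OCX_BTYPE_IDLE 0x7u /* 0b111 */

/* Common header: btype in [63:61], ack in [60]; CRC in [23:0] stays zero. */
static uint64_t ctrl_header(unsigned btype, unsigned ack)
{
    return ((uint64_t)(btype & 0x7) << 61) | ((uint64_t)(ack & 0x1) << 60);
}

/* retry_ptr in [51:44], rx_seq in [43:36]; used by IDLE and retry SYNC. */
static uint64_t ctrl_retry_fields(uint8_t retry_ptr, uint8_t rx_seq)
{
    return ((uint64_t)retry_ptr << 44) | ((uint64_t)rx_seq << 36);
}

static uint64_t ctrl_idle(unsigned ack, uint8_t retry_ptr, uint8_t rx_seq)
{
    return ctrl_header(OCX_BTYPE_IDLE, ack) | ctrl_retry_fields(retry_ptr, rx_seq);
}

/* Init SYNC: 7'b0 in [59:53], SM_REQ in [52], then 28 zero bits. */
static uint64_t ctrl_sync_init(unsigned ack, unsigned sm_req)
{
    return ctrl_header(OCX_BTYPE_SYNC, ack) | ((uint64_t)(sm_req & 0x1) << 52);
}

/* Retry SYNC: 7'b1 in [59:53] sets bit 53, followed by SM_REQ and the retry fields. */
static uint64_t ctrl_sync_retry(unsigned ack, unsigned sm_req,
                                uint8_t retry_ptr, uint8_t rx_seq)
{
    return ctrl_header(OCX_BTYPE_SYNC, ack) | (1ULL << 53)
         | ((uint64_t)(sm_req & 0x1) << 52)
         | ctrl_retry_fields(retry_ptr, rx_seq);
}
```

Under these assumptions, an init SYNC with SM_REQ set and no ACK encodes as 0xC010000000000000 before the CRC is filled in.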
Matching of Packets
switch (ctrl[63:60]) {      /* btype (3 bits) followed by the ACK bit */
    case 0xf:               /* 0b111, ACK  */
    case 0xe:               /* 0b111, !ACK */
        IDLE; break;
    case 0xd:               /* 0b110, ACK  */
    case 0xc:               /* 0b110, !ACK */
        SYNC; break;
    case 0x9:               /* 0b100, ACK  */
    case 0x8:               /* 0b100, !ACK */
        CRED_LO; break;
    case 0xb:               /* 0b101, ACK  */
    case 0xa:               /* 0b101, !ACK */
        CRED_HI; break;
    default:
        INVALID; break;
}
The Link State Machine
The generation of the blocks above is controlled by the link state machine. There are six states in total. The state machine is event-based and reacts to the messages it receives.
IREQ
This is the initial state, entered from RESET or on an RREQ timeout. The state begins by generating SYNC blocks.
On Transition:
- send INIT_REQ message: SYNC. Assert SM_REQ=True.
On INIT_ACK_rx:
- transition to IACK state
IACK
On ERROR_rx:
- transition to RREQ state.
On INIT_REQ_rx:
- send INIT_ACK message
On not ERROR_rx && not INIT_REQ_rx && not INIT_ACK_rx
- transition to RUN state
RUN
In this state, data is transferred, credits are returned, and received blocks are acknowledged.
Note: on transition IACK => RUN there are no TX VC Credits available. Therefore, the logic generates packets to return the 64 TX VC credits to the partner.
On Transition from IACK:
- generate packet to return 64 TX VC Credits to partner
On ERROR_Rx:
- send RETRY_REQ message
- transition to RREQ state
On RETRY_REQ_Rx :
- send RETRY_ACK message
- transition to RACK state
On INIT_REQ_rx:
- transition to IREQ state
RREQ
On Transition
- set timer = 0
On TimerTick :
- if (timer < 2^24) send RETRY_REQ message
else reset statemachine && transition to IREQ state
On RETRY_REQ_rx :
- send RETRY_ACK message
On RETRY_ACK_rx:
- transition to RACK state
On INIT_REQ_rx:
- transition to IREQ state
RACK
On RETRY_REQ_rx :
- send RETRY_ACK message
On RETRY_ACK_rx && deasserted && blocks available:
- transition to REPLAY state
On RETRY_ACK_rx && no blocks
- transition to RUN state
On INIT_REQ_rx:
- transition to IREQ state
REPLAY
On BlockAvailable:
- retransmit
On NoMoreBlocks:
- transition to RUN state
On ERROR_rx:
- transition to RREQ state
On INIT_REQ_rx:
- transition to IREQ state
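The transitions listed for the six states can be summarised as a pure next-state function. This is a sketch based only on the rules above: it models state changes and leaves out the messages sent on each event, it ignores the unexplained "deasserted" condition in RACK, and it treats the replay state (called RPLY in the SYNC section) as ST_REPLAY. The event names, including EV_TIMEOUT, EV_REPLAY_DONE, and EV_OTHER, are hypothetical.

```c
#include <stdbool.h>

enum link_state { ST_IREQ, ST_IACK, ST_RUN, ST_RREQ, ST_RACK, ST_REPLAY };

enum link_event {
    EV_INIT_REQ_RX, EV_INIT_ACK_RX, EV_RETRY_REQ_RX, EV_RETRY_ACK_RX,
    EV_ERROR_RX,
    EV_TIMEOUT,     /* RREQ timer reached 2^24 */
    EV_REPLAY_DONE, /* no more blocks to retransmit */
    EV_OTHER        /* any other received block */
};

static enum link_state link_next(enum link_state s, enum link_event e,
                                 bool blocks_pending)
{
    /* INIT_REQ_rx restarts initialization from RUN, RREQ, RACK and replay. */
    if (e == EV_INIT_REQ_RX && s != ST_IREQ && s != ST_IACK)
        return ST_IREQ;

    switch (s) {
    case ST_IREQ:
        return e == EV_INIT_ACK_RX ? ST_IACK : ST_IREQ;
    case ST_IACK:
        if (e == EV_ERROR_RX) return ST_RREQ;
        if (e == EV_INIT_REQ_RX || e == EV_INIT_ACK_RX) return ST_IACK;
        return ST_RUN; /* anything else moves the link to RUN */
    case ST_RUN:
        if (e == EV_ERROR_RX) return ST_RREQ;
        if (e == EV_RETRY_REQ_RX) return ST_RACK;
        return ST_RUN;
    case ST_RREQ:
        if (e == EV_TIMEOUT) return ST_IREQ;
        if (e == EV_RETRY_ACK_RX) return ST_RACK;
        return ST_RREQ;
    case ST_RACK:
        if (e == EV_RETRY_ACK_RX) return blocks_pending ? ST_REPLAY : ST_RUN;
        return ST_RACK;
    case ST_REPLAY:
        if (e == EV_ERROR_RX) return ST_RREQ;
        if (e == EV_REPLAY_DONE) return ST_RUN;
        return ST_REPLAY;
    }
    return s;
}
```

Feeding the function INIT_ACK_rx and then any ordinary block takes an endpoint from IREQ through IACK to RUN.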
Expected Initialization Sequence
Seq | Endpoint 0 | Message E0->E1 | Message E1->E0 | Endpoint 1 |
1 | rst -> IREQ | | | rst -> IREQ |
2 | IREQ | INIT_REQ_tx => | <= INIT_REQ_tx | IREQ |
3 | IREQ | INIT_ACK_tx => | <= INIT_ACK_tx | IREQ |
4 | IREQ -> IACK | | | IREQ -> IACK |
5 | IACK | SYNC => | <= SYNC | IACK |
6 | IACK -> RUN | | | IACK -> RUN |
7 | RUN | CREDITS => | <= CREDITS | RUN |
8 | RUN | DATA / IDLE => | <= DATA / IDLE | RUN |
Credits and Credit returns
Note: This paragraph is outdated and does not agree entirely with the findings in the ThunderX to ThunderX traces. But I will leave it here until we have full confirmation of the new results.
At initialization the credits are cleared and need to be returned by the link partner. Max credits:
VCs 0-5: 256; VCs 6-11: 33, VC 12: 32, VC 13: no credits are used
Credits are returned through single bits in the credits field of a block. Depending on whether the block is of type CRED_HI or CRED_LO, the bits in the field correspond to VCs 7-0 or VCs 12-8. Each set bit corresponds to 8 credits.
On init, the links are assigned the following credits:
MOC_LINK_CREDITS = 7'd32;
CO_LINK_CREDITS  = 8'd32;
CD_LINK_CREDITS  = 9'd256;
Every sent word decreases credits by 1. A full memory message decreases credits by 17 (header word + 16 payload words).
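The accounting described in this (possibly outdated) paragraph can be sketched in C as follows. The sketch assumes that bit i of the credits field belongs to VC base_vc + i and lets the caller choose the base VC, since which block type maps to which VC group should be checked against the traces. It does not enforce the maximum credit counts and does not special-case VC 13. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_VCS         14
#define CREDITS_PER_BIT 8  /* one set bit in the credits field returns 8 credits */
#define FULL_MSG_WORDS  17 /* header word + 16 payload words */

/* Per-VC credit counters; all zero after initialization. */
struct vc_credits { uint16_t avail[NUM_VCS]; };

/* Apply a credits field; bit i is assumed to belong to VC (base_vc + i). */
static void credits_return(struct vc_credits *c, unsigned base_vc, uint8_t field)
{
    for (unsigned i = 0; i < 8 && base_vc + i < NUM_VCS; i++)
        if (field & (1u << i))
            c->avail[base_vc + i] += CREDITS_PER_BIT;
}

/* Consume one credit per word; refuse if the VC lacks credits. */
static bool credits_consume(struct vc_credits *c, unsigned vc, unsigned words)
{
    if (vc >= NUM_VCS || c->avail[vc] < words)
        return false;
    c->avail[vc] -= (uint16_t)words;
    return true;
}
```

Under these assumptions a VC needs three credit returns (24 credits) before it can send one full memory message of 17 words, which leaves 7 credits.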
Virtual channels
Virtual channels | Description |
0 | I/O requests |
1 | I/O responses |
2 | Memory requests with data, odd addresses |
3 | Memory requests with data, even addresses |
4 | Memory responses with data, odd addresses |
5 | Memory responses with data, even addresses |
6 | Memory requests without data, odd addresses |
7 | Memory requests without data, even addresses |
8 | Memory forwards |
9 | Memory forwards |
10 | Memory responses without data, odd addresses |
11 | Memory responses without data, even addresses |
12 | Interrupt messages ??? |
13 | Link discovery messages |
ECI messages reference.pdf eci_eci_machine_readable_eci_decode.asl.txt
Notes
Cache line layout, offsets in bytes:
Cache line | |||
31 - 0 bytes | 63 - 32 bytes | 95 - 64 bytes | 127 - 96 bytes |
Sub-cache line 0 | Sub-cache line 1 | Sub-cache line 2 | Sub-cache line 3 |
- fillo - fill offset - the sub-cache line position that should be transmitted first
- ns - a non-secure access, normally should be set to 1
- A - aliased index of the cache line
Addressing:
Physical Address | |
39:7 | 6:0 |
cache line index | offset within the cache line |
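The address split in the table can be written as a few helper functions. This is a minimal sketch with illustrative names; the sub-cache line helper follows the 32-byte layout shown in the cache line table above.

```c
#include <stdint.h>

#define CL_OFFSET_BITS 7  /* PA[6:0]: offset within the 128-byte cache line */
#define CL_INDEX_BITS  33 /* PA[39:7]: cache line index */
#define SUBLINE_BYTES  32 /* four sub-cache lines per cache line */

static inline uint64_t pa_cache_line_index(uint64_t pa)
{
    return (pa >> CL_OFFSET_BITS) & ((1ULL << CL_INDEX_BITS) - 1);
}

static inline unsigned pa_offset(uint64_t pa)
{
    return (unsigned)(pa & ((1u << CL_OFFSET_BITS) - 1));
}

/* Index (0..3) of the sub-cache line that contains the addressed byte. */
static inline unsigned pa_subline(uint64_t pa)
{
    return pa_offset(pa) / SUBLINE_BYTES;
}
```

For example, physical address 0x1234 falls in cache line 0x24 at offset 0x34, which lies in sub-cache line 1.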
Aliasing functions:
#include <stdint.h>

/* The cache line index is 33 bits wide (PA[39:7]). C has no such type,
   so it is held in a 64-bit integer here. */
typedef uint64_t uint33_t;

uint33_t cache_line_index_PA_to_IPA(uint33_t cache_line_index_pa)
{
    uint33_t cache_line_index_ipa;

    cache_line_index_ipa = cache_line_index_pa;
    cache_line_index_ipa ^= (cache_line_index_pa >> 13) & 0x00ff;
    cache_line_index_ipa ^= (cache_line_index_pa >> 5) & 0x1f07;
    cache_line_index_ipa ^= (cache_line_index_pa >> 2) & 0x0018;
    return cache_line_index_ipa;
}

uint33_t cache_line_index_IPA_to_PA(uint33_t cache_line_index_ipa)
{
    uint33_t cache_line_index_pa;

    cache_line_index_pa = cache_line_index_ipa;
    cache_line_index_pa ^= (cache_line_index_ipa >> 18) & 0x0007;
    cache_line_index_pa ^= (cache_line_index_ipa >> 15) & 0x0018;
    cache_line_index_pa ^= (cache_line_index_ipa >> 13) & 0x00ff;
    cache_line_index_pa ^= (cache_line_index_ipa >> 5) & 0x1f07;
    cache_line_index_pa ^= (cache_line_index_ipa >> 2) & 0x0018;
    return cache_line_index_pa;
}
Link discovery
Messages in VC 13 are used to discover the physical layout. A value written to the link data register, OCX_TLK(<local link number>)_LNK_DATA, is sent to the other node and placed in OCX_RLK(<remote link number>)_LNK_DATA. By sending known data over a known local link, we can find the remote link we are connected to: read all the remote RLK registers and find the one that holds our known data.
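The search step of this discovery procedure might look like the following C sketch. How the remote OCX_RLK(..)_LNK_DATA registers are actually read is not covered here, so the sketch takes a hypothetical accessor callback; the array-backed reader is only a stand-in for trying the function out, and the number of links is a parameter rather than a fixed value.

```c
#include <stdint.h>

/* Hypothetical accessor: returns OCX_RLK(link)_LNK_DATA of the remote node. */
typedef uint64_t (*rlk_read_fn)(unsigned link, void *ctx);

/* Returns the remote link whose LNK_DATA register holds the token, or -1. */
static int find_remote_link(rlk_read_fn read_rlk, void *ctx,
                            unsigned num_links, uint64_t token)
{
    for (unsigned link = 0; link < num_links; link++)
        if (read_rlk(link, ctx) == token)
            return (int)link;
    return -1; /* token not seen on any remote link */
}

/* Stand-in reader backed by an array of register values, for experiments. */
static uint64_t rlk_read_from_array(unsigned link, void *ctx)
{
    return ((const uint64_t *)ctx)[link];
}
```

The token should be a value that is unlikely to be present in the registers already, so that a match identifies the link unambiguously.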
Cache coherency protocol
- eci_specification_ECI_Specification.pdf
- Efforts to match requests/responses of the collected traces (Started in October 2019)
Transaction examples
Message field description
Field | Description |
A | Address/Index |
be | Big-endian ? |
did | Device ID |
dmask | Dirty bitmask of sub-cache lines, 0 - clean, 1 - dirty |
el | Exception level ? |
fillo | Fill offset, offset of the first transmitted sub-cache line |
flid | |
ns | Non-secure access, 0 - secure, 1 - non-secure |
nxm | Non-existent memory |
ppvid | Core number |
szoff | Size and offset of an accessed word within a cache line |
Sub-cache line layout and dmask and fillo fields
Words 0 - 3 (A) | Words 4 - 7 (B) | Words 8 - 11 (C) | Words 12 - 15 (D) |
dmask bit 0 | dmask bit 1 | dmask bit 2 | dmask bit 3 |
fillo 0 | fillo 1 | fillo 2 | fillo 3 |
Examples
dmask | fillo | Order |
0xf | 0x0 | ABCD |
0xf | 0x2 | CDAB |
0x1 | 0x0 | A |
0x5 | 0x0 | AC |
0xb | 0x3 | DAB |
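The examples in the table are consistent with a simple rule: start at the sub-cache line selected by fillo, wrap around, and skip every sub-cache line whose dmask bit is clear. The following C sketch implements that inferred rule; the function name is illustrative.

```c
#include <stddef.h>

/* Writes the transmit order as letters A-D into out (at least 5 bytes,
 * NUL-terminated) and returns the number of sub-cache lines sent. */
static size_t subline_order(unsigned dmask, unsigned fillo, char out[5])
{
    size_t n = 0;
    for (unsigned i = 0; i < 4; i++) {
        unsigned idx = (fillo + i) & 0x3; /* start at fillo, wrap around */
        if (dmask & (1u << idx))          /* skip sub-cache lines not in dmask */
            out[n++] = (char)('A' + idx);
    }
    out[n] = '\0';
    return n;
}
```

Calling subline_order(0xb, 0x3, buf) fills buf with "DAB", matching the last row of the table.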
I/O transactions
The SLI transactions are issued by the Switch Logic Interface, when the registers OCX_PP_WR_DATA, OCX_PP_RD_DATA and OCX_PP_CMD are used. The I/O Bridge transactions are issued when the I/O space is directly accessed by the CPU load/store instructions.
SLILD - Switch Logic Interface Load
Load a 64-bit word from physical address 0x901400000000.
Request:
VC | Raw packet | Description |
0 | 0xe0018880000000be | Cmd = ECI_IREQ_SLILD, ppvid = 0, flid = 0, el = 3, ns = 1, be = 0, did = 1, A = 0x80000000, szoff = 0x0e |
Response:
VC | Raw packet | Description |
1 | 0x1000000000000000 | Cmd = ECI_IRSP_SLIRSP, ppvid = 0, flid = 0, nxm = 0, size = 0 |
1 | 0x0000000000000000 | Read value |
Switch Logic Interface Store
Store a 64-bit word to physical address 0x901400000000.
Request:
VC | Raw packet | Description |
0 | 0xe8018880000000be | Cmd = ECI_IREQ_SLIST, ppvid = 0, flid = 0, el = 3, ns = 1, be = 0, did = 1, A = 0x80000000, szoff = 0x0e |
0 | 0x0000000000000000 | Value to store |
No response.
I/O Bridge Load
Load a 64-bit word from physical address 0x901400000000.
Request:
VC | Raw packet | Description |
0 | 0x00018880000000be | Cmd = ECI_IREQ_IOBLD, ppvid = 0, flid = 0, el = 3, ns = 1, be = 0, did = 1, A = 0x80000000, szoff = 0x0e |
Response:
VC | Raw packet | Description |
1 | 0x0000000000000000 | Cmd = ECI_IRSP_IOBRSP, ppvid = 0, flid = 0, nxm = 0, size = 0 |
1 | 0x0000000000000000 | Read value |
I/O Bridge Store
Store a 64-bit word to physical address 0x901400000000.
Request:
VC | Raw packet | Description |
0 | 0x10018880000000be | Cmd = ECI_IREQ_IOBST, ppvid = 0, flid = 0, el = 3, ns = 1, be = 0, did = 1, A = 0x80000000, szoff = 0x0e |
0 | 0x0000000000000000 | Value to store |
No response.
I/O Bridge Store with Acknowledge
Store a 64-bit word to physical address 0x901400000000.
Request:
VC | Raw packet | Description |
0 | 0x18018880000000be | Cmd = ECI_IREQ_IOBSTA, ppvid = 0, flid = 0, el = 3, ns = 1, be = 0, did = 1, A = 0x80000000, szoff = 0x0e |
0 | 0x0000000000000000 | Value to store |
Response:
VC | Raw packet | Description |
1 | 0x0800000000000000 | Cmd = ECI_IRSP_IOBACK, ppvid = 0, flid = 0, nxm = 0, size = 0 |
Memory transactions
Read transient (non-caching)
Request:
VC | Raw packet | Description |
7 | 1003e00400000000 | Cmd = ECI_MREQ_RLDT, A = 0x8000000, dmask = 0xf, fillo = 0, ns = 1 |
Response:
VC | Raw packet | Description |
5 | 4c03e00400000000 | Cmd = ECI_MRSP_PSHA, A = 0x8000000, dirty = 0, dmask = 0xf, fillo = 0, nxm = 1, RReqID = 0 |
5 | 0000000000000000 | Payload 1 |
5 | 0000000000000000 | Payload 2 |
5 | 0000000000000000 | Payload 3 |
5 | 0000000000000000 | Payload 4 |
5 | 0000000000000000 | Payload 5 |
5 | 0000000000000000 | Payload 6 |
5 | 0000000000000000 | Payload 7 |
5 | 0000000000000000 | Payload 8 |
5 | 0000000000000000 | Payload 9 |
5 | 0000000000000000 | Payload 10 |
5 | 0000000000000000 | Payload 11 |
5 | 0000000000000000 | Payload 12 |
5 | 0000000000000000 | Payload 13 |
5 | 0000000000000000 | Payload 14 |
5 | 0000000000000000 | Payload 15 |
5 | 0000000000000000 | Payload 16 |
Write transient (non-caching)
Request:
VC | Raw packet | Description |
3 | 4003e00400000000 | Cmd = ECI_MREQ_RSTT, dmask = 0xf, A = 0x8000000, ns = 1, fillo = 0 |
3 | 0000000000000000 | Payload 1 |
3 | 0000000000000000 | Payload 2 |
3 | 0000000000000000 | Payload 3 |
3 | 0000000000000000 | Payload 4 |
3 | 0000000000000000 | Payload 5 |
3 | 0000000000000000 | Payload 6 |
3 | 0000000000000000 | Payload 7 |
3 | 0000000000000000 | Payload 8 |
3 | 0000000000000000 | Payload 9 |
3 | 0000000000000000 | Payload 10 |
3 | 0000000000000000 | Payload 11 |
3 | 0000000000000000 | Payload 12 |
3 | 0000000000000000 | Payload 13 |
3 | 0000000000000000 | Payload 14 |
3 | 0000000000000000 | Payload 15 |
3 | 0000000000000000 | Payload 16 |
Response:
VC | Raw packet | Description |
11 | 5400200400000000 | Cmd = ECI_MRSP_PEMD, nxm = 1, fillo = 0, A = 0x8000000, dirty = 0, dmask = 0xf, RReqID = 0 |
Observed MOESI state machine and messages
Notes:
- Each state describes the state on the home node (left state) and on the remote node (right state)
- HR - Home Read
- HW - Home Write
- HE - Home Eviction
- RR - Remote Read
- RW - Remote Write
- RE - Remote Eviction
- Left arrow - a message sent from a remote node to a home node
- Right arrow - a message sent from a home node to a remote node
GlobalSync transaction
Link Discovery transaction
A request is sent when a store to the local register OCX_TLK(<local link number>)_LNK_DATA is executed. The value written is sent as lkdata. The recipient receives the request and puts the lkdata value into the OCX_RLK(<remote link number>)_LNK_DATA register on the remote CPU.
Request:
VC | Raw packet | Description |
13 | 0x80055e6800000000 | Cmd = ECI_MDLD_LNKD, lkdata = 0x00abcd00000000 |
No response.