
Ideas for improving chain replication #42

Open
Aran85 opened this issue Mar 2, 2025 · 6 comments

Comments


Aran85 commented Mar 2, 2025

After reading the source code, I noticed that the current chain replication has the following characteristics:

  1. Control and data are separated: the predecessor (including the client) uses IBV_WR_SEND to send control messages to the successor, and the successor uses IBV_WR_RDMA_READ to pull the data from the predecessor.
  2. Each chain node commits locally and forwards to the successor serially.
  3. The local chunk commit is copy-on-write (COW), which may incur a read-modify-write (RMW).

Given these characteristics, the write-path latency can be fairly long. Has the following model been considered?

  1. The client fans out data-pull requests (IBV_WR_SEND) to all nodes on the chain.
  2. Control messages still flow through chain replication.
  3. A chain node can pull the data ahead of time (and even commit the chunk locally via COW in advance), or wait for the write control message and then decide whether to pull the data from the client or from its predecessor.
     The goal is to let successors pull data early and reduce latency.
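
A back-of-the-envelope latency model may make the expected gain concrete. The sketch below is not 3FS code; the replica count, per-hop control latency, data-pull time, and commit time are all made-up assumptions, and it only models the overlap the proposal aims for.

```cpp
// Rough latency model comparing the current serial chain write with the
// proposed fan-out data prefetch. Not 3FS code: all timing constants and
// the overlap assumptions below are illustrative guesses only.
#include <cstdio>

int main() {
    const int    replicas     = 3;     // chain length (assumed)
    const double ctrl_hop_us  = 5.0;   // IBV_WR_SEND control message per hop
    const double data_pull_us = 80.0;  // IBV_WR_RDMA_READ of one chunk
    const double commit_us    = 40.0;  // local COW commit (may include an RMW)

    // Current flow: each node pulls data from its predecessor, commits
    // locally, and only then forwards control to its successor, so the
    // per-hop costs add up along the chain.
    double chain_us = replicas * (ctrl_hop_us + data_pull_us + commit_us);

    // Proposed flow: the client fans out data-pull requests to every node
    // up front, so the data transfers (and possibly the COW commits) run
    // in parallel; only the small control messages still traverse the chain.
    double fanout_us = replicas * ctrl_hop_us   // control still chained
                     + data_pull_us             // pulls overlap across nodes
                     + commit_us;               // last node's local commit

    std::printf("serial chain write : %.1f us\n", chain_us);
    std::printf("fan-out prefetch   : %.1f us\n", fanout_us);
    return 0;
}
```

Under these assumed numbers the serial chain costs roughly 3 × (5 + 80 + 40) = 375 us, while the fan-out prefetch costs about 15 + 80 + 40 = 135 us; the real numbers obviously depend on chunk size and the network.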
@wangyibin-gh

Indeed. If the replication chain is very long, it is advisable to use multicast to rapidly propagate the written data to all nodes in the chain. The data can then be written in parallel, which can significantly reduce write latency.


myjfm commented Mar 3, 2025

For high-throughput scenarios, star (fan-out) writes consume a lot of client uplink bandwidth. Checkpoint writes and KV cache writes are both large-I/O workloads where throughput matters most.
For small-file scenarios, star writes do have an advantage in terms of reducing I/O latency.
My personal view is that a parallel file system probably cares more about the former.


Aran85 commented Mar 3, 2025

> For high-throughput scenarios, star (fan-out) writes consume a lot of client uplink bandwidth. Checkpoint writes and KV cache writes are both large-I/O workloads where throughput matters most. For small-file scenarios, star writes do have an advantage in terms of reducing I/O latency. My personal view is that a parallel file system probably cares more about the former.

Thanks for joining the discussion. Chain replication gets good throughput because traffic is balanced across nodes during writes. But for storage in AI scenarios, the read pressure is heavier than the write pressure, so isn't the client's uplink fairly lightly loaded to begin with? In that case star writes would actually spread out the data nodes' uplink pressure: a data node's uplink, which previously had to serve both client reads and reads from its successor, would only serve client reads. So when reads and writes run concurrently, star writes might actually give better traffic balance.
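
To put rough numbers on that argument, the sketch below compares per-link egress under chain vs. star writes, assuming a replication factor of 3, a client write rate of 1 GB/s, and 4 GB/s of reads served by each storage node; none of these figures come from 3FS.

```cpp
// Rough per-link egress under chain vs. star (fan-out) writes, assuming the
// data is always pulled via RDMA READ from the sender's uplink.
// All rates are illustrative assumptions, not measurements.
#include <cstdio>

int main() {
    const double R  = 3.0;  // replication factor (assumed)
    const double W  = 1.0;  // client write rate in GB/s (assumed)
    const double Rd = 4.0;  // read rate each storage node serves, GB/s (assumed)

    // Chain: the head pulls W from the client; every non-tail node's uplink
    // additionally serves W to its successor on top of client reads.
    double chain_client_uplink = W;
    double chain_node_uplink   = W + Rd;

    // Star: every replica pulls directly from the client, so the client
    // uplink carries R * W while storage node uplinks only serve reads.
    double star_client_uplink = R * W;
    double star_node_uplink   = Rd;

    std::printf("chain  client uplink %.1f GB/s, node uplink %.1f GB/s\n",
                chain_client_uplink, chain_node_uplink);
    std::printf("star   client uplink %.1f GB/s, node uplink %.1f GB/s\n",
                star_client_uplink, star_node_uplink);
    return 0;
}
```

With a read-heavy mix (reads much larger than writes), the extra R × W on the client uplink may be affordable, while each non-tail node's uplink sheds W of forwarding traffic.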

Another question: in a prefill/decode (PD) disaggregation setup, are KV cache writes always large I/Os as well?

@wangyibin-gh

> For high-throughput scenarios, star (fan-out) writes consume a lot of client uplink bandwidth. Checkpoint writes and KV cache writes are both large-I/O workloads where throughput matters most. For small-file scenarios, star writes do have an advantage in terms of reducing I/O latency. My personal view is that a parallel file system probably cares more about the former.

AI training is mostly small random reads, plus periodic sequential checkpoint writes. For AI inference, the KV cache generated by user prompts and responses generally consists of small I/Os, from a few hundred KBs to a few megabytes, and is typically a write-once-read-many workload.
So it would be better to have hints from the application logic telling 3FS how the data will be read and written, so that 3FS can behave differently during chain replication.
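
One way such hints could look is a per-file or per-write access-pattern flag that the replication layer consults when choosing how to propagate data. The enum and function below are purely hypothetical; 3FS exposes no such interface, and this is only a sketch of the idea.

```cpp
// Hypothetical access-pattern hint an application could attach to a file or
// write request. None of these names exist in 3FS; this is only a sketch.
#include <cstdio>

enum class AccessHint {
    kCheckpointSequentialWrite,  // large sequential writes, throughput-bound
    kKVCacheWriteOnceReadMany,   // MB-sized writes, latency-sensitive
    kTrainingRandomRead,         // small random reads, rarely written
};

enum class ReplicationMode { kChain, kFanOut };

// The replication layer could use the hint to pick a data-propagation mode.
ReplicationMode chooseMode(AccessHint hint) {
    switch (hint) {
        case AccessHint::kKVCacheWriteOnceReadMany:
            return ReplicationMode::kFanOut;  // favor low write latency
        default:
            return ReplicationMode::kChain;   // favor balanced throughput
    }
}

int main() {
    auto mode = chooseMode(AccessHint::kKVCacheWriteOnceReadMany);
    std::printf("kvcache -> %s\n",
                mode == ReplicationMode::kFanOut ? "fan-out" : "chain");
    return 0;
}
```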


myjfm commented Mar 3, 2025

> Thanks for joining the discussion. Chain replication gets good throughput because traffic is balanced across nodes during writes. But for storage in AI scenarios, the read pressure is heavier than the write pressure, so isn't the client's uplink fairly lightly loaded to begin with? In that case star writes would actually spread out the data nodes' uplink pressure: a data node's uplink, which previously had to serve both client reads and reads from its successor, would only serve client reads. So when reads and writes run concurrently, star writes might actually give better traffic balance.
>
> Another question: in a prefill/decode (PD) disaggregation setup, are KV cache writes always large I/Os as well?

KV cache entries are generally quite large: the key/value for a single token should be at least on the order of hundreds of KB (depending on n_head & d_head), and with batches accumulated in memory and flushed as a stream, the I/Os reaching the file system layer should all be at the MB level.
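
For a concrete sense of scale, the per-token KV footprint is roughly 2 (K and V) × n_layers × n_head × d_head × bytes per element; the model shape in the sketch below is an assumed example, not any specific model.

```cpp
// Back-of-the-envelope KV cache size per token: 2 (K and V) * n_layers *
// n_head * d_head * bytes per element. The model shape is an assumed example.
#include <cstdio>

int main() {
    const long n_layers  = 32;
    const long n_head    = 32;
    const long d_head    = 128;
    const long elem_size = 2;  // fp16/bf16 bytes per element

    long per_token = 2 * n_layers * n_head * d_head * elem_size;  // bytes
    std::printf("KV per token      : %ld KB\n", per_token / 1024);
    std::printf("flush of 8 tokens : %ld MB\n", per_token * 8 / (1024 * 1024));
    return 0;
}
```

With these assumed dimensions a single token takes 512 KB, so even a small batch of tokens flushed together already reaches several MB at the file system layer.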

Besides the storage-plane traffic, the host side also carries compute-plane traffic: while checkpoints and KV cache are being written, the compute side generates heavy inter-GPU communication due to DP/TP/PP, which is also high-throughput and even more latency-sensitive.


myjfm commented Mar 3, 2025

> AI training is mostly small random reads, plus periodic sequential checkpoint writes. For AI inference, the KV cache generated by user prompts and responses generally consists of small I/Os, from a few hundred KBs to a few megabytes, and is typically a write-once-read-many workload. So it would be better to have hints from the application logic telling 3FS how the data will be read and written, so that 3FS can behave differently during chain replication.

Being able to switch between chain replication and star replication flexibly at runtime would certainly be nice, but it also adds complexity, so it probably comes down to the return on investment. If EC (erasure coding) is to be supported, chain replication is probably not very feasible; star replication or Y-shaped replication would likely be considered first.
