chain replication改进思路 #42
Indeed. If the replication chain is very long, it is advisable to use multicast to rapidly propagate the written data to all nodes in the chain. The data can then be written in parallel, which can significantly reduce write latency.
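A minimal latency sketch of the point above. This is illustrative pseudocode, not 3FS code; the function names and timing parameters are assumptions chosen to show why serial chain propagation accumulates transfer time per hop while multicast/star propagation does not.

```python
# Hypothetical latency model: serial chain propagation vs. parallel
# (multicast/star) propagation for a single replicated write.
# All numbers below are illustrative assumptions, not measurements.

def chain_write_latency(num_replicas: int, transfer_ms: float, ack_ms: float) -> float:
    """Chain replication: the payload hops node by node, so the
    transfer time is paid once per replica before the tail acks."""
    return num_replicas * transfer_ms + ack_ms

def star_write_latency(num_replicas: int, transfer_ms: float, ack_ms: float) -> float:
    """Star/multicast: the sender pushes to all replicas concurrently;
    latency is bounded by the slowest single transfer, not the chain length."""
    return transfer_ms + ack_ms

if __name__ == "__main__":
    n, transfer, ack = 3, 10.0, 1.0   # 3 replicas, 10 ms per hop, 1 ms ack
    print(chain_write_latency(n, transfer, ack))  # 31.0
    print(star_write_latency(n, transfer, ack))   # 11.0
```

Note the hidden assumption: star writes are only this fast if the sender's uplink can carry all copies concurrently. When the client's uplink is the bottleneck, its effective transfer time scales with the replica count, which is exactly the objection raised in the next comment.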
For high-throughput scenarios, star writes put heavy pressure on the client's bandwidth. Checkpoint writes and KV-cache writes are both large-I/O workloads where throughput is what matters.
Thanks for joining the discussion. Chain replication achieves good throughput because write traffic is balanced across the nodes. But for storage serving AI workloads, read pressure outweighs write pressure, so the client's upstream traffic is not under much pressure to begin with. Star writes would instead spread out the data nodes' upstream load: a data node's uplink, which under chain replication serves both client reads and successor reads, would then serve client reads only. So when reads and writes run concurrently, star replication may actually give better overall traffic balance. A follow-up question: in prefill/decode (PD) disaggregation scenarios, are KV-cache writes also large I/Os?
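The uplink argument above can be made concrete with back-of-envelope accounting. This is a hedged sketch with assumed, illustrative bandwidth figures (a read-heavy 8 Gbps read / 1 Gbps write workload, 3 replicas); none of these numbers come from the thread.

```python
# Per-uplink traffic accounting: chain vs. star replication under a
# read-heavy workload. All bandwidth figures are illustrative assumptions.

def chain_node_uplink(read_gbps: float, write_gbps: float) -> float:
    # In a chain, a non-tail node's uplink carries client reads PLUS
    # the forwarding of every write to its successor.
    return read_gbps + write_gbps

def star_node_uplink(read_gbps: float, write_gbps: float) -> float:
    # With star writes, the client fans the data out itself, so a data
    # node's uplink carries reads only.
    return read_gbps

def star_client_uplink(write_gbps: float, replicas: int) -> float:
    # The cost moves to the client: it uploads one copy per replica.
    return write_gbps * replicas

if __name__ == "__main__":
    reads, writes, r = 8.0, 1.0, 3    # read-heavy: 8 Gbps reads, 1 Gbps writes
    print(chain_node_uplink(reads, writes))   # 9.0  Gbps per data node
    print(star_node_uplink(reads, writes))    # 8.0  Gbps per data node
    print(star_client_uplink(writes, r))      # 3.0  Gbps on the client
```

Under these assumed numbers, star replication shifts 1 Gbps of uplink load off each data node at the cost of tripling the client's upload, which is the trade-off being debated: acceptable when reads dominate, painful for write-heavy checkpoint traffic.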
It's basically small random read operations during AI training, along with periodic sequential checkpoint writes. For AI inference, the KV cache introduced by user prompts and responses is, I think, generally made up of small I/Os ranging from a few hundred KB to a few megabytes, and it is typically a write-once-read-many scenario.
KV cache entries are generally fairly large: the key/value for a single token should be at least in the hundreds of KB (depending on n_head & d_head), and with batching in memory and streaming flushes, the I/Os that reach the filesystem layer should be MB-sized. On the host side, besides storage-plane traffic there is also compute-plane traffic: while checkpoints and KV cache are being written, the compute side also generates heavy inter-GPU communication due to DP/TP/PP, which is both high-throughput and even more latency-sensitive.
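The "hundreds of KB per token" estimate above follows from simple arithmetic. A minimal sketch, using roughly LLaMA-7B-like dimensions as an assumed example (the thread does not name a model):

```python
# Rough per-token KV-cache footprint. Each transformer layer stores one
# key and one value vector of n_head * d_head elements per token.
# The model dimensions below are illustrative assumptions.

def kv_bytes_per_token(n_layers: int, n_head: int, d_head: int,
                       bytes_per_elem: int) -> int:
    # factor 2 = one key vector + one value vector per layer
    return 2 * n_layers * n_head * d_head * bytes_per_elem

if __name__ == "__main__":
    # e.g. 32 layers, 32 heads, head dim 128, fp16 (2 bytes/element)
    size = kv_bytes_per_token(n_layers=32, n_head=32, d_head=128, bytes_per_elem=2)
    print(size // 1024, "KB")  # 512 KB per token
```

So even a modest model lands at roughly half a megabyte per token in fp16, consistent with the claim that batched flushes reach the filesystem as MB-sized I/Os.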
Being able to switch flexibly between chain and star replication at runtime would certainly be nice, but it also adds complexity, so it probably comes down to the cost/benefit ratio. If EC (erasure coding) is to be supported, chain replication is probably not very feasible; star replication or Y-shaped replication might deserve higher priority.
After studying the source code, I found that the current chain replication has the following two characteristics:
Given these characteristics, the write-path latency may be rather long. Has the following model been considered:
The goal is to let the successor pre-fetch the data ahead of time, reducing latency.