Chunhai Zhang

Results 7 issues of Chunhai Zhang

Is there a plan for tfjob to support RDMA SR-IOV without hostNetwork, using https://github.com/Mellanox/k8s-rdma-sriov-dev-plugin?

lifecycle/stale

I want to mount an SEFS in two containers at the same time, but the second container reported an error: ```sh [2021-07-21T06:36:56.909Z][DEBUG][T1][#30][····Open] openat: fs_path: FsPath { Absolute("/etc/image_config.json") }, flags: 0o2000000,...

question

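I would like to use OSX to forward a custom cross-site service, for example by having fate-serving reuse OSX for cross-site communication. Is this supported? If so, how should I configure it or modify the code? Thanks!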

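When system load is high, fateflow may temporarily be unable to reach eggroll. When an Exception is caught here, the SessionRecord is deleted immediately, which may leave the corresponding eggroll egg_pair processes unable to exit. Could a retry mechanism be added? https://github.com/FederatedAI/FATE/blob/87dd4f63869b995b6bef3d49b1b7d1cb346806ec/python/fate_arch/session/_session.py#L408 @zhihuiwan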
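The retry idea requested above could look roughly like the following. This is a minimal sketch, not the FATE code: `call_with_retry`, its parameters, and the `action` callable are hypothetical names introduced for illustration, and the real stop call and exception types in `_session.py` may differ.

```python
import time


def call_with_retry(action, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call action(); on failure, retry with exponential backoff.

    Hypothetical helper. Returns True if any attempt succeeded,
    False if every attempt raised.
    """
    for attempt in range(retries):
        try:
            action()
            return True
        except Exception:
            # Back off before the next attempt, but not after the last one.
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))
    return False
```

With a helper like this, the session record would be deleted only when the call returns `True`, so a transient connectivity failure does not orphan the remote processes.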

bug

https://github.com/FederatedAI/eggroll/blob/da7969f5fa330489fc2a0da3aabacb70916d6987/python/eggroll/roll_pair/roll_pair.py#L64C1-L66C100 ```python self.in_memory_output = RollPairConfKeys.EGGROLL_ROLLPAIR_IN_MEMORY_OUTPUT.get_with(session.get_all_options()) if not self.default_store_type: raise ValueError(f'in_memory_output "{self.in_memory_output}" not found for roll pair') ``` The condition checks the wrong attribute; it should be changed to: ```python self.in_memory_output = RollPairConfKeys.EGGROLL_ROLLPAIR_IN_MEMORY_OUTPUT.get_with(session.get_all_options()) if not self.in_memory_output: raise ValueError(f'in_memory_output "{self.in_memory_output}" not found...

reason=bug

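https://github.com/FederatedAI/FATE-Flow/blob/c8167883fbfc69afdcfedbdded2f400f8f7b289c/python/fate_flow/utils/xthread.py#L69 Under certain exceptional conditions this code blocks forever, so the thread can never exit and the thread count keeps growing. I suggest adding a timeout parameter and catching the Empty exception. @zhihuiwan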
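The suggested fix could be sketched as follows. This is an illustration using only the standard library, not the actual `xthread.py` code: the `worker` function, the `stop_event` flag, and the doubling of each item are assumptions made for the example.

```python
import queue
import threading


def worker(tasks, stop_event, results, poll_timeout=0.1):
    """Process items from tasks until stop_event is set.

    Hypothetical worker loop: get() uses a timeout, so the thread
    wakes up periodically and can exit instead of blocking forever.
    """
    while not stop_event.is_set():
        try:
            item = tasks.get(timeout=poll_timeout)
        except queue.Empty:
            continue  # nothing arrived in time; re-check stop_event
        results.append(item * 2)  # placeholder for the real work
        tasks.task_done()
```

Because `queue.Queue.get` raises `queue.Empty` when the timeout expires, the loop returns to the `stop_event` check at least once per `poll_timeout` seconds.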

I am only interested in the metadata and not concerned with lineage. Lineage is also being ingested very slowly. So, could an option be added to disable the ingestion of lineage...