
Throw oom error

Open lixinqi opened this issue 3 years ago • 1 comments

Propagate errors such as VM OOM to the Python layer via the last_error mechanism.
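As a rough illustration of the idea only (not OneFlow's actual implementation), a "last error" slot can be sketched in pure Python: a worker thread records the failure instead of aborting, and the calling thread re-raises it as an ordinary exception. Every name here (`LastErrorSlot`, `worker_allocate`, `allocate`, the `limit` parameter) is hypothetical.

```python
import threading


class LastErrorSlot:
    """Hypothetical holder for the most recent error seen on a worker thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._error = None

    def record(self, exc):
        # Called on the worker thread: remember the error instead of aborting.
        with self._lock:
            self._error = exc

    def raise_if_set(self):
        # Called on the caller's thread: take the stored error and re-raise it.
        with self._lock:
            exc, self._error = self._error, None
        if exc is not None:
            raise exc


def worker_allocate(slot, nbytes, limit):
    # Stand-in for the VM thread; `limit` is a made-up memory budget.
    if nbytes > limit:
        slot.record(
            RuntimeError(
                f"can't allocate memory: you tried to allocate {nbytes} bytes."
            )
        )


def allocate(nbytes, limit=1 << 30):
    slot = LastErrorSlot()
    worker = threading.Thread(target=worker_allocate, args=(slot, nbytes, limit))
    worker.start()
    worker.join()
    slot.raise_if_set()  # surfaces the worker's error as a Python exception
    return nbytes
```

With this shape, a failed allocation on the worker shows up to the caller as a catchable `RuntimeError` rather than a process abort.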

lixinqi · Jul 29 '22 16:07

For the following script:

# filename: a.py
import oneflow as flow

print(flow.ones((1024, 1024, 1024, 1024)))

Running it no longer aborts the process; it raises an exception instead:

Traceback (most recent call last):
  File "/home/lixinqi/a.py", line 3, in <module>
    print(flow.ones((1024, 1024, 1024, 1024)))
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor.py", line 54, in _str
    return self.__repr__()
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor.py", line 58, in _repr
    return tensor_str._gen_tensor_str(self)
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 365, in _gen_tensor_str
    return _gen_tensor_str_template(tensor, False)
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 352, in _gen_tensor_str_template
    tensor_str = _tensor_str(tensor, indent)
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 276, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 311, in get_summarized_data
    return flow.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 311, in <listcomp>
    return flow.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 311, in get_summarized_data
    return flow.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 311, in <listcomp>
    return flow.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 311, in get_summarized_data
    return flow.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 311, in <listcomp>
    return flow.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/lixinqi/oneflow/python/oneflow/framework/tensor_str.py", line 302, in get_summarized_data
    (self[: PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems :])
RuntimeError: can't allocate memory: you tried to allocate 4398046511104 bytes.
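The byte count in the message can be sanity-checked by hand. A small sketch, assuming the tensor uses a 4-byte element type such as 32-bit float (the element size is an assumption, not something stated in the traceback):

```python
import math

shape = (1024, 1024, 1024, 1024)
bytes_per_element = 4  # assumption: 32-bit float elements

num_elements = math.prod(shape)               # 1024**4 == 2**40 elements
requested_bytes = num_elements * bytes_per_element
requested_tib = requested_bytes / 2**40       # convert bytes to TiB

print(requested_bytes)  # 4398046511104
print(requested_tib)    # 4.0
```

That is 4 TiB for a single tensor, which matches the 4398046511104 bytes reported above.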

One drawback remains: the Python stack trace is too long, which makes the error harder to understand.
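For illustration, the standard library can already shorten a deep trace like this when formatting it. The sketch below uses `traceback.format_exception` with a negative `limit`, which keeps only the innermost frames; `recurse` and `format_short` are hypothetical helpers, and this is not necessarily how OneFlow addresses the problem.

```python
import traceback


def recurse(depth):
    # Stand-in for the nested get_summarized_data calls in the trace above.
    if depth == 0:
        raise RuntimeError("can't allocate memory")
    return recurse(depth - 1)


def format_short(exc, keep=2):
    # A negative limit keeps only the last `keep` (innermost) stack entries.
    return "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__, limit=-keep)
    )


try:
    recurse(10)
except RuntimeError as exc:
    full_text = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    short_text = format_short(exc)

print(short_text)
```

The shortened text still ends with the `RuntimeError` line but shows only the two frames closest to the failure.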

lixinqi · Jul 29 '22 16:07

Jianhao's PR does a better job of solving the exception stack display problem: https://github.com/Oneflow-Inc/oneflow/pull/8937

lixinqi · Sep 06 '22 03:09