
TempResultsManager deletes results prematurely if multiple top-level variables point to the same DataBag

Open ggevay opened this issue 9 years ago • 2 comments

For example, the following code fails with Flink:

var v = DataBag()
val r = v
v = DataBag()
r

The problem is that the TempResultsManager garbage-collects the temp result of line 1 after it executes line 3, but line 4 then looks for the deleted file.

(A real-life example of similar code is the inner loop of KMeans, where the last line is similar to line 2 here. If the solution = ... line used centroids as a TempSource rather than from the closure, the problem would occur there.)

A solution would be to translate the val r = v line into a TempSource and an immediate TempSink.
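To make the failure and the proposed fix concrete, here is a minimal sketch in plain Scala. It does not use Emma's actual classes: `TempResult`, `ToyTempResultsManager`, and `AliasDemo` are hypothetical names, and the sketch assumes the manager deletes a variable's previous temp result as soon as that variable is reassigned. The `copy` method stands in for "a TempSource and an immediate TempSink", i.e. the alias gets its own materialized result.

```scala
import scala.collection.mutable

// Hypothetical handle to a materialized temp result (not Emma's actual type).
final case class TempResult(id: Int)

// Toy manager, assumed for illustration only: it deletes the previous temp
// result of a variable as soon as that variable is reassigned.
final class ToyTempResultsManager {
  private val files = mutable.Set.empty[Int] // ids of results still stored
  private val current = mutable.Map.empty[String, TempResult]
  private var nextId = 0

  // Models `name = DataBag()`: materialize a new result, delete the old one.
  def assign(name: String): TempResult = {
    current.get(name).foreach(old => files -= old.id)
    nextId += 1
    val fresh = TempResult(nextId)
    files += fresh.id
    current(name) = fresh
    fresh
  }

  // Models a TempSource followed by an immediate TempSink: read `src` and
  // materialize it again under `name`, so `name` owns an independent result.
  def copy(name: String, src: TempResult): TempResult = {
    require(exists(src), s"temp result ${src.id} was already deleted")
    assign(name)
  }

  def exists(r: TempResult): Boolean = files.contains(r.id)
}

object AliasDemo {
  // Plain aliasing, as in the failing program above.
  def aliased(): Boolean = {
    val m = new ToyTempResultsManager
    val v1 = m.assign("v") // var v = DataBag()
    val r = v1             // val r = v  (the manager never sees this alias)
    m.assign("v")          // v = DataBag()  (deletes the first result)
    m.exists(r)            // r  (looks for the deleted result)
  }

  // Same program, with the alias translated into a copy.
  def copied(): Boolean = {
    val m = new ToyTempResultsManager
    val v1 = m.assign("v")
    val r = m.copy("r", v1) // val r = v, as TempSource + TempSink
    m.assign("v")
    m.exists(r)
  }
}
```

In this toy model `AliasDemo.aliased()` returns `false` (the result behind `r` is gone), while `AliasDemo.copied()` returns `true`. The copy costs an extra materialization; tracking aliases (for example by reference counting) would be an alternative, but that is not what the issue proposes.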

I guess we don't want to fix this for the old backend, but we can close this issue once the backend for the new IR is done and the problem doesn't occur there.

ggevay avatar Sep 06 '16 17:09 ggevay

Is this still relevant? What happens with temp results in FlinkDataSet currently?

joroKr21 avatar Mar 24 '17 17:03 joroKr21

The current upstream does not garbage collect temp results, but it should, and it makes sense to keep track of this issue in order to avoid it.

aalexandrov avatar Mar 24 '17 17:03 aalexandrov