cpython icon indicating copy to clipboard operation
cpython copied to clipboard

gh-117174: Add a new route in linecache to fetch interactive source code

Open pablogsal opened this issue 1 year ago • 36 comments

Using leads to confusion when linecache it's populated with entries as the inspect module heavily relies on it. An example:

./python -c "import inspect;x=eval('lambda x: x');print(inspect.getsource(x))"

The code above should trigger an exception because we should not be able to get the source of x (it's dynamically generated) but if we use as name for interactive code it will return gives the full string.

  • Issue: gh-117174

pablogsal avatar Apr 03 '24 10:04 pablogsal

This is a breaking change right? Maybe we should do the same thing as the interactive input - to secretly convert the file name back to <string> in the traceback. We should also be more careful about the name because that would be used in the future and we need to keep it for a long time. Unlike the REPL commands, this is something the users are more likely to see. This also introduces some backward-incompatibility in pdb. Basically the file name of the code object changed and that's public.

gaogaotiantian avatar Apr 03 '24 17:04 gaogaotiantian

This is a breaking change right?

Breaking how? This doesn't affect any visible thing that's guaranteed unless I am missing something.

We should also be more careful about the name because that would be used in the future and we need to keep it for a long time.

Not sure I follow, there is no stable API that guarantees this name anywhere in a way that you can use it. This is just the name of the code object that's get created when you use Python with -c. That's absolutely not a public API and it's not stable.

Basically the file name of the code object changed and that's public.

Again, not sure how you are arriving to. this conclusion. There is absolutely no guarantee on how the interactive code objects are named

pablogsal avatar Apr 03 '24 17:04 pablogsal

I have added a translation layer similarly as we do for the REPL so it looks a bit better in tracebacks

pablogsal avatar Apr 03 '24 17:04 pablogsal

Breaking how? This doesn't affect any visible thing that's guaranteed unless I am missing something.

First of all, the tests are failing.

I don't think "backward-compatibility" is only about the things we "guarantee", it's also about the behavior that has been there for a long time, even without documentation.

This change breaks the CPython test suite and I bet it will break many other library tests.

Making sure the traceback behaves as before is a very helpful step, but there might be code in the market that relies on the name of the code object. pdb will show a different stack I believe.

I'm not saying that this is the wrong path, simply suggesting that this might break other people's code because it changes a behavior that has been there for a long time. Yes I'm aware that this basically applies to any code change we do for CPython, but maybe we could give it some extra thought and compensate as we can (and like to). The translation in traceback for example, is very useful in my opinion.

gaogaotiantian avatar Apr 03 '24 18:04 gaogaotiantian

I don't think "backward-compatibility" is only about the things we "guarantee", it's also about the behavior that has been there for a long time, even without documentation.

I understand, but we cannot just lift every observable behaviour to something we must preserve. Just take into account when we balance our options.

In general I agree with you, and you have an excellent point that if the test suite is breaking in more than one place, that's a reason for concern.

But in any case our approach here is quite limited: we can basically rename the string and fix whatever we can (but there will be places like pdb that we cannot "fix") but we won't be able to change the code object, so thing may break in other ways. Another way it's to close the issue as won't fix or to deactivate the feature for -c (which is in any case not trivial as we don't distinguish down the line between -c and REPL).

pablogsal avatar Apr 03 '24 18:04 pablogsal

@lysnikolaou thoughts?

pablogsal avatar Apr 03 '24 18:04 pablogsal

I agree that we need to balance the new changes and the backward compatibility issue, otherwise we won't be able to move forward.

First of all, I think the current solution, where we change the name of the code name string + a translator in traceback is acceptable. That's simple and straightfoward.

I do have an alternative proposal which could solve this issue in a larger scope.

We can somehow expand the capability for linecache, so that when the file name starts with <, it requires a code object to confirm the corresponding text lines. This will not break any existing code using linecache, and only adds some extra logic for filename starting with < - it will basically do another layer of mapping for each code object, which is hashable and can be the key of the dict.

When we register a code object with linecache._register_code, it iterates through all the children code objects (using co_consts) and register them each. When we need to extract the source code of a dynamically generated code object, we need to explicity pass the code object itself - which is easily accessible in traceback.

So, instead of hacking the existing "filename" based linecache, we create a competely new path for dynamically generated code, which will introduce 0 backward compatibility issue.

And, we unlocked the potential to give source to all dynamically generated code, not only the ones in REPL and -c.

This will add some overhead when we register the code with linecache, but the majority of dynamically generated code is small and I don't think it'll make a big difference to the users.

gaogaotiantian avatar Apr 03 '24 19:04 gaogaotiantian

This is a tricky one. In principle, I agree that not every behavior can be promoted to something we must preserve. This appears to be one of those cases though, where this is something we should not be preserving, but enough test suites will fail that we'll get complaints about it. And, since this is the case, it might make sense for us to put in the effort to preserve the filename.

I don't understand enough about the linecache module to be able to immediately see if @gaogaotiantian's approach will work, and what the trade-offs will be, but I'd be happy to review a PR where that approach is implemented and I can dive in a bit deeper to figure it out.

lysnikolaou avatar Apr 04 '24 15:04 lysnikolaou

In gist, we ask linecache for the source code when we are printing traceback, and the argument is the filename and the line number. Because we have fully control over the traceback and linecache, we can make traceback to pass filename, line number and the code object if the filename starts with <. In linecache, when the filename starts with <, a code object is required otherwise it will return '', which is the existing behavior for all files starting with <.

In this way, we are fully backward-compatible, all the existing user code should work because the behavior does not change at all. We get source code in the traceback for anything we'd like (REPL, -c and even more in the future), because we can explicitly pass the code object. In the future, we can make the registration public and the user can register the source code by themselves.

If @pablogsal is convinced that this approach has a chance, I can do the draft implementation.

gaogaotiantian avatar Apr 04 '24 16:04 gaogaotiantian

The problem is that it's not just the traceback: it's the code object. Most test here fail because either pdb or tracemalloc are getting the wrong name because of the different name and those use different traceback machinery or directly check the code object. You approach will still break people that don't use the standard one or that inspect the code objects and expect a different thing, so it's not fully backwards compatible.

pablogsal avatar Apr 04 '24 16:04 pablogsal

In linecache, when the filename starts with <, a code object is required otherwise it will return ''

Also, just for me to understand correctly, isn't this a breaking change? If someone is registering by hand something starting with < now suddenly we will do something different, no? What am I missing?

pablogsal avatar Apr 04 '24 16:04 pablogsal

This will add some overhead when we register the code with linecache, but the majority of dynamically generated code is small and I don't think it'll make a big difference to the users.

I agree that most of the cases are like this but we don't know if this is always true to the point that it doesn't matter. This is python we are talking about and users generate a lot of edge cases

pablogsal avatar Apr 04 '24 16:04 pablogsal

The problem is that it's not just the traceback: it's the code object. Most test here fail because either pdb or tracemalloc are getting the wrong name because of the different name and those use different traceback machinery or directly check the code object. You approach will still break people that don't use the standard one or that inspect the code objects and expect a different thing, so it's not fully backwards compatible.

With my proposal, we do not change the name of the code object - which is exactly why the method is more backward compatible compared to the current one. The filename <string> is preserved for -c (it could even be preserved for REPL inputs if that matters). What I'm proposing is to use the actual code object as a key to search for source files, and that's unique.

Also, just for me to understand correctly, isn't this a breaking change? If someone is registering by hand something starting with < now suddenly we will do something different, no? What am I missing?

In my mind, a breaking change is something that breaks the existing code. There's no public way for the users to "register" anything now. If the user opt-in the new feature, then the behavior will be different - that's not breaking change, that's just new. If the user programs as before, everything will be the same.

I agree that most of the cases are like this but we don't know if this is always true to the point that it doesn't matter. This is python we are talking about and users generate a lot of edge cases

Yes, but the time to search through the code objects will always be significantly shorter than compiling them. If they chose to compile a huge file and create a lot of the code objects dynamically, the time cost is already significant. The worst case complexity for searching code objects is not worse than compiling them.

gaogaotiantian avatar Apr 04 '24 17:04 gaogaotiantian

Ok, I think I misunderstood your proposal because I clearly had the wrong idea of how it would work.

But then, what's the difference between this and having an "alternate" storage for line cache that it's only internal and only the interpreter can use and it's only retrieved from traceback machinery? Let's say the traceback machinery first checks in the normal line cache and if that fails it checks in the "secret storage" and with -c and interactive mode we put it there. In this way we don't change code objects, this doesn't interfere with regular users and we get what we need.

pablogsal avatar Apr 05 '24 09:04 pablogsal

Yes this is basically having an alternate storage in linecache. We don't need to change the code objects in any case, they are only used as keys.

If we have a reliable source for dynamically generated code, why keep it a secret? pdb definitely wants it, other debuggers might as well. This is a useful feature for the users. I even think we should make the source code registeration public, but I'm okay for only making it internally available for now.

What's the advantage to combine it with existing interface? Well, first of all, they are doing the exact same thing from high level - to get source code. They can share all the existing code like cache mechanism. More importantly, the funcion call can simply be modifed from

linecache.getline(filename, lineno)

to

linecache.getline(filename, lineno, code=code)

When filename does not start with <, we ignore the code argument. The change on the calling side is minimal, they don't need to do something like

line = linecache.getline(filename, lineno)
if not line:
    line = linecache.secret_getline(filename, lineno, code=code)

The existing user code won't be interfered as they will behave as before. The user can opt-in the feature to get dynamically generated code - I believe that was asked for before.

gaogaotiantian avatar Apr 05 '24 17:04 gaogaotiantian

If we have a reliable source for dynamically generated code, why keep it a secret?

Because that's a new feature and we are considering a bugfix. (I'm on my phone now, I can answer in more detail later but this is basically why my intent is to keep this as simple as possible)

pablogsal avatar Apr 05 '24 18:04 pablogsal

Because that's a new feature and we are considering a bugfix. (I'm on my phone now, I can answer in more detail later but this is basically why my intent is to keep this as simple as possible)

Okay that's reasonable, but can we also make it extensible for the future? I work on pdb a lot and it would be nice to have the source code for dymanically generated objects.

So strictly for the bug fix:

  1. We keep the file name <string> for -c to avoid hassles for test cases (and that's our target for the future as well)
  2. Keep the current code register interface so we do not need to touch C code at all. Also for now make it private and internal only.
  3. When we register code, put it in a different container than linecache.cache with the id of the code object as the key(different source code can build to the code objects that equal to each other).
  4. Do one of the following:
    • In traceback, try to get from that container if source is not available (or code starts with <?). this requires a new private interface for linecache
    • Change linecache.getline so that linecache handles this internally - it checks the file name, then look for it in the alternate container if applicable

I prefer the latter.

With this change, pdb can also use it. We may or may not promote this to public feature in the future, but with this pdb is already happy.

I don't think this solution is much more complicated than changing the code file name and translate it back in traceback, and I think this solution is helpful in long term.

gaogaotiantian avatar Apr 05 '24 18:04 gaogaotiantian

3. When we register code, put it in a different container than linecache.cache with the id of the code object as the key(different source code can build to the code objects that equal to each other).

How do you retrieve this? The public APIs to the line cache module work on filename, not on code object. Specially if you want linecache.getline to do this internally, we cannot change the interface.

Also, the traceback module in most places where we call line cache we don't have the code object at hand and I don't think we want to push down the code object itself or some secret ID everywhere.

pablogsal avatar Apr 10 '24 17:04 pablogsal

Like I listed in 4., we have two ways to do it - either add a private interface like _register_code, or add an extra argument to linecache.getline(). I don't see why we can't change the interface ever - it's okay if we want to keep this bug fix private and avoiding changing any public interface, but I'm proposing the new feature to make the source code available for all users, which could be done in a separate PR.

The reason I repeatedly mentioned the possibility for a new feature is that - it's backward compatible, which means it will not break any existing code, unlike the naive fix to change the name of the code object.

There's only one place in traceback.py that calls linecache.getline() - it's in FrameSummary and we only need to add a code attribute to it - which is handy because we always have the frame. Dynamically generated code does not need lazy load because it's already in memory, no disk operation is involved.

And yes we need to change FrameSummary, secretly or not.

I don't think a clean and simple fix here is possible, because we will always have the name conflict issue for dynamically generate code objects. We have to use a separate identifier for them, in order to retrieve the source code reliably. I think we had occurences where we implemented a new feature to solve a bug. This issue is new in 3.13 so we do not need any backport. Seems like a good candidate to "fix with feature".

gaogaotiantian avatar Apr 10 '24 22:04 gaogaotiantian

I am not comfortable conflating the bugfix with a new feature. This has the danger to rush the new feature and as the issue is about a bug, it doesn't allow people to find the feature request and chime in and provide feedback. I very much prefer to have a suboptimal bugfix first and then consider the new feature and when implemented fix the issue in a better way. This is because we want to allow other core devs and users the opportunity to consider the new feature and balance the API in the context of what it will provide, not focused on fixing a given bug.

Don't get me wrong. I agree with you and I appreciate that you are trying to help and I like what you propose, it's just that I don't think fixing a bug with a feature (precedent or not) it's the right thing to do in this issue.

In particular, for this bugfix I think I prefer to fix it in a way that doesn't involve passing code objects around but just implementing a second linecache storage for interactive code that it's checked if the filename starts with < and it's not in the primary cache and it's only populated via _register_code. This has the problem that if someone runs Python with -c and then registers something called <string> by hand we will get the wrong code, but that's a very rare edge case I am comfortable with it for this bugfix. We can then solve this with the new feature and remove the suboptimal fix once it's approved. I am confident we should be able to do this before beta freeze

pablogsal avatar Apr 10 '24 22:04 pablogsal

But your fix does not fix the issue. The bug is when someone using -c then tried to retrieve the source code of a lambda function, or basically the corner case you mentioned. It's never solely about linecache.cache. linecache.cache is just a way to show the problem. If you check the original description of the issue, they used inspect which used getline for the source code and the same path still exists and they will still get the same problem. The proposed fix won't solve the bug at all.

Actually, not only won't their issue be fixed, the proposed fix will break their workaround that pops the entry from linecache.cache.

gaogaotiantian avatar Apr 10 '24 23:04 gaogaotiantian

Ok, I am convinced then. Thanks for the clarification and apologies for the back and forth. Let's make the changes you propose here in a private way and then we can open an issue to make them public and add them to the documentation.

pablogsal avatar Apr 10 '24 23:04 pablogsal

I will fix this PR tomorrow to implement the code object backed check via a private API then then we could eliminate if we want when we made the feature public

pablogsal avatar Apr 10 '24 23:04 pablogsal

That sounds good, thanks for the patient discussion! We can do a separate issue/PR to make it public and polish it to the production level.

gaogaotiantian avatar Apr 10 '24 23:04 gaogaotiantian

@gaogaotiantian I have changed the PR to implement a version of what you proposed, but there is an important problem. As we would be doing check by code object, it means that any code object created inside the string itself will not have the source ready because it will not be registered in the line cache properly (at least not without changes to the compiler). For example, before this change we can get the source in the function and the module level:

>>> def foo():
...   1/0
...
>>> foo()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    foo()
    ~~~^^
  File "<stdin>", line 2, in foo
    1/0
    ~^~
ZeroDivisionError: division by zero

but with this approach:

>>> def foo():
...   1/0
...
>>> foo()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    foo()
    ~~~^^
  File "<stdin>", line 2, in foo
ZeroDivisionError: division by zero

This hints that we may need to register the interactive source still by name in the alternative location

pablogsal avatar Apr 24 '24 17:04 pablogsal

How about

def _register_code(code, string, name):
    entry = (len(string),
             None,
             [line + '\n' for line in string.splitlines()],
             name)
    stack = [code]
    while stack:
        code = stack.pop()
        for const in code.co_consts:
            if isinstance(const, types.CodeType):
                stack.append(const)
        _interactive_cache[id(code)] = entry

gaogaotiantian avatar Apr 24 '24 19:04 gaogaotiantian

Ah, good idea! I am going to tweak it a bit to not depend on types to preserve the optimizations done in 689ada79150f28b0053fa6c1fb646b75ab2cc200 but that can work indeed!

pablogsal avatar Apr 24 '24 20:04 pablogsal

This is quite tricky unfortunately. Passing down the frame to the FrameSummary extends the lifetime of the frame and that makes several tests in test_asyncio fail because coroutines are kept alive in some cases. For the time being I am going to just pass down code objects instead.

pablogsal avatar Apr 26 '24 19:04 pablogsal

Also some tests in test_gdb were failing because we are setting breakpoints in builtin_id and this is being triggered before the code that's passed as a template if we use id(code) as a key. For now I am going to use the code object itself and I will address this in a different PR.

pablogsal avatar Apr 26 '24 19:04 pablogsal

Also yet another problem: this doesn't work easily with the warnings module because the API it's based on getting a "message" (defined here https://github.com/python/cpython/blob/194fd17bc6cb73138e2fe8eb5ca34b19a6c3b25a/Lib/warnings.py#L420) and that doesn't easily get the code object that generated it. For now I think it's an OK limitation but I am not fully happy with how all of this it's turning.

pablogsal avatar Apr 26 '24 21:04 pablogsal