Add support for fine grained diff with line numbers
The problem
Imagine you have a toml file tables.toml with sql tables defined:
[tables.users]
id = "int"
name = "string"
[tables.companies]
id = "int"
name = "string"
Now you want to use these tables in 2 different jobs that process the data from the tables. You can write your own plugin that defines a new target sql_table:
sql_table(name='users', source='tables.toml', table='users')
sql_table(name='companies', source='tables.toml', table='companies')
and then add dependencies to your code with InferDependenciesRequest. So far so good.
Now imagine I add the company address = "string" in the toml:
[tables.users]
id = "int"
name = "string"
[tables.companies]
id = "int"
name = "string"
address = "address"
Now if I do pants --changed-since it will show me that both sql_table(name='users') and sql_table(name='companies') has changed, but I only want sql_table(name='companies') to change.
Solution
This pr adds logic to get information about changed line numbers from git and pass it to plugins. This pr does NOT implement the logic to figure out the targets that own these changed lines, this is delegated to plugins.
Fine grained diff might be expensive to process, so it's hidden behind a flag. If you want to use it, you have to explicitly specify all the files that need it with --changed-files-with-line-numbers:
pants --changed-since=origin/main --changed-files-with-line-numbers=tables.toml dependents
Then you need to inherit BlockOwnersRequest and add a rule for it:
@dataclass(frozen=True)
class TomlBlockOwnersRequest(BlockOwnersRequest):
pass
@rule
async def get_toml_block_owners(request: TomlBlockOwnersRequest) -> Owners:
...
def rules():
return [
*collect_rules(),
UnionRule(BlockOwnersRequest, TomlBlockOwnersRequest),
]
Here is POC repo that shows how to use this.
To clarify, when you say "You run pants --changed-since=origin/main dependents and... it doesn't work", what do you mean by "it doesn't work"? I would think that what it does is correct - it invalidates everything that should be invalidated. But you're saying that it invalidates too much?
@benjyw Exactly, it invalidates all the targets if the corresponding file has changed.
Imagine in the example I add the company address = "string" in the toml:
[tables.users]
id = "int"
name = "string"
[tables.companies]
id = "int"
name = "string"
address = "address"
Now if I do pants --changed-since it will show me that both sql_table(name='users') and sql_table(name='companies') has changed, but I want only sql_table(name='companies') to change.
Nice work. I agree this deserves addressing. I would prefer however if we could make it happen less intrusive. That is, not put the heavy lifting on the plugins and not rely on target fields for the positioning in the source file. The plugins would still need to provide location details when targets are created, but that should be the only thing it needs to do (e.g. no owner lookup hook..)
Some ideas: it is during target generation/creation we have the perfect opportunity to provide additional target meta data. Pants could leverage the very same thing when creating targets out of BUILD files. (So you can update a BUILD file and not invalidate every target defined in it would be awesome.)
Having this as an opt in/opt out feature either globally or per source file seems sensible to me, out of a performance tradeoff perspective.
The block parsing and how it integrates with the changed logic looks good.
Thank you for your comments!
Some ideas: it is during target generation/creation we have the perfect opportunity to provide additional target meta data. Pants could leverage the very same thing when creating targets out of BUILD files.
@kaos Could you please elaborate on this? What target meta data are you talking about?
(So you can update a BUILD file and not invalidate every target defined in it would be awesome.)
This is not quite the thing I'm trying to solve, I'm changing not the BUILD file, but the file in SingleSourceField of multiple targets.
For toml file in the example:
[tables.users] # owned by :users
id = "int" # owned by :users
name = "string" # owned by :users
[tables.companies] # owned by :companies
id = "int" # owned by :companies
name = "string" # owned by :companies
So If I change the toml file:
[tables.users] # owned by :users
id = "int" # owned by :users
name = "string" # owned by :users
[tables.companies] # owned by :companies
id = "int" # owned by :companies
name = "string" # owned by :companies
address = "string" # owned by :companies
I want to see :companies changed, but not :users.
This can be abstracted to any file, for example toml, yaml or even python:
table_users = { # owned by :users
'id': 'int', # owned by :users
'name': 'string', # owned by :users
} # owned by :users
table_companies = { # owned by :companies
'id': 'int', # owned by :companies
'name': 'string', # owned by :companies
} # owned by :companies
In fact, I've already implemented a plugin that does exactly that. It defines a PythonConstantTarget that has lineno and end_lineno which correspond to the range of lines this target owns.
But this implementation is not limited to ranges of lines, you define any logic for your targets, for example, they can own arbitrary sets of line numbers in a file.
Thank you for your comments!
Some ideas: it is during target generation/creation we have the perfect opportunity to provide additional target meta data. Pants could leverage the very same thing when creating targets out of BUILD files.
@kaos Could you please elaborate on this? What target meta data are you talking about?
Of course. I'll do that towards the end of this comment.
(So you can update a BUILD file and not invalidate every target defined in it would be awesome.)
This is not quite the thing I'm trying to solve, I'm changing not the BUILD file, but the file in SingleSourceField of multiple targets. [...]
Yes, I got that. I merely observed that pants does the same thing when parsing BUILD files, as your target generator does when parsing your yaml/json/python source, creating targets based on the data it finds.
And thanks for the example, that really helped show how this all fits together. (I did see your original link to the POC repo for this.)
Target meta data
Targets are created in two phases: phase one creates TargetAdaptors during BUILD file parsing and are collected into AddressFamilys. Phase two is whenever a rule requires a target, it is created by looking up the TargetAdaptor from the relevant AddressFamily based on the provided target Address. (and in case of a generated target, get the generated target from the target generator.)
When creating these adaptors we can attach additional information to them. Currently the only additional meta data pants are tracking on the adaptors is which BUILD file it was defined in (and line number), in the __description_of_origin__:
https://github.com/pantsbuild/pants/blob/b3072b8bc52894474bcb82152222f52109d8fbd8/src/python/pants/engine/internals/target_adaptor.py#L34-L40
Which is then propagated to actual Targets when created:
https://github.com/pantsbuild/pants/blob/b3072b8bc52894474bcb82152222f52109d8fbd8/src/python/pants/engine/target.py#L285
This is a kind of "target meta data", that I think we could add to a little and formalize the range(s?) of content it depends on from the source data (be it BUILD file or other data).
When pants parses BUILD files, it would include this information on the TargetAdaptor which would then be propagated to the Target, where as target generators would have the option to provide this information directly on the targets it generates.
This means that plugins would not need to have any additional hooks to support this, as we could implement it at a fundamental level.
There is already a use case in the Pants repo where we can try this out during implementation to help evaluate and iterate the concept; namely for the python_requirements target and its parsing of requirements.txt files. (besides the parsing of BUILD files one.)
The idea being that if you change (or add) a requirement to your requirements.txt, it should not invalidate targets that depend on the unchanged lines.
I have long wanted to get this fixed, so I'm very excited to see you've started to tackle this :)
@benjyw Exactly, it invalidates all the targets if the corresponding file has changed.
Imagine in the example I add the company
address = "string"in the toml:[tables.users] id = "int" name = "string" [tables.companies] id = "int" name = "string" address = "address"Now if I do
pants --changed-sinceit will show me that bothsql_table(name='users')andsql_table(name='companies')has changed, but I want onlysql_table(name='companies')to change.
OK, so what happens today is correct, just not as efficient as it could be. This is important when considering the tradeoffs.
Anyway, semantics aside, this is an exciting new capability to have!
I'll dive into the code soon, but this has a lot of potential for interesting features, particularly (as @kaos mentions) if we applied it to BUILD file changes.
But I'm also envisioning things like - if only comment lines have changed, don't rerun tests...
Cool, I'm glad you like the idea!
Targets are created in two phases: phase one creates TargetAdaptors during BUILD file parsing and are collected into AddressFamilys
Thanks for detailed explanation, I'm not familiar with all this stuff, so I'll need some time to read the code.
Cool, I'm glad you like the idea!
Targets are created in two phases: phase one creates TargetAdaptors during BUILD file parsing and are collected into AddressFamilys
Thanks for detailed explanation, I'm not familiar with all this stuff, so I'll need some time to read the code.
You're welcome. And feel free to ask for more details as needed, and possible next steps when ready if you're available to dive in fully :)
@kaos I'm trying to understand your idea. I can add blocks: tuple[Block, ...] to class Target, but blocks don't make sense without the file they correspond to, so I'll have to add the filename blocks: FrozenDict[str, Block]. But this is weird, because filename is usually provided by SingleSourceField, so maybe we don't want to patch the class Target and instead provide BlocksField?
class BlocksField(Field):
alias = "block"
value: dict[str, tuple[Block, ...]]
And then targets that want to use line-aware targets would have to define BlocksField. Specifying block start and count by hand is error prone, so most users will probably define a TargetGenerator with some custom logic and this custom generator could have it's own SingleSourceField to generate blocks from.
wdyt?
@grihabor Hi, no worries, there's quite a few moving parts to get to grips with here.
The file a target corresponds to is tracked in description_of_origin, so that information is available already. For target generators, this field is usually not provided so are derived from residence_dir instead which will be inaccurate (i.e. the directory level only not the particular file) so generators wishing to use this block feature need to provide the description_of_origin explicitly, in addition to the new stuff we'll come up with for this new feature.
As I see it, this new feature merely adds more details (the start pos/end pos range in the source file) to the already existing description_of_origin data (which describes the file name and line number where the target was declared). For generated targets, this would instead be describing "the file name and line number used to generate the target", where this makes sense (it does less so for file level generators, such as python_sources et al.)
The reason I don't want this as a field are two-fold. a) it is not something the user need/should know about, and b) I want low level support to support using this information on TargetAdaptors.
So my suggestion is to have something like this on the TargetAdaptor (and similarly on Target):
class TargetAdaptor:
origin_code_blocks: tuple[CodeBlock, ...]
Adapting your POC example, the generator would then look something like this:
return GeneratedTargets(
request.generator,
[
PythonConstantTarget(
{
**request.template,
PythonConstantNameField.alias: python_contant.python_contant,
},
request.template_address.create_generated(
python_contant.python_contant
),
description_of_origin=f"{digest_files[0].path}:{python_constant.lineno}",
origin_code_blocks=(
Block(start=python_constant.lineno, end=python_constant.end_lineno),
),
)
for python_contant in python_contants
],
)
With that in place, Pants have all the information it needs to determine which targets are affected and not when using the changed subsystem with the more detailed git diff stats with no further input from the plugin, such as when inferring owners etc.
I don't know if a target generator would ever consult multiple sources for a single target, in which case the filename could make sense to include in the Block definition. But that could also be added as a future feature in that case as needed of course.
Ah yea, re-reading your comment again, I see that I mostly agree with your suggestion about what information you've included in the suggested blocks field, only to not have it as a field, but a dedicated field on the target class itself (i.e. not as a pants target field).
[...] But this is weird, because filename is usually provided by
SingleSourceField
just noting, that I don't find this weird at all, as the source field doesn't necessarily always map to the same thing as what we want to look at for this.
Consider a target generator, the generated targets likely have source fields that map to the relevant file for this, where as regular targets declared in BUILD files have source fields that point to some arbitrary file that have no impact on the definition of the target itself at all.
@kaos Ok, here is my take on this: https://github.com/pantsbuild/pants/pull/20531/files/82f9e4baf693f00a57450e2c130517dd9a093345..3a4ea156382e91f6548c57fa816e4c0ac18f35e9
I've added a test for fine grained --changed-since with line numbers here 4115588. I had to add a dummy plugin to declare targets that have text blocks, I'm not sure if this is a good place to put it, wdyt?
A good use case to try this on is to add SourceBlocks to the *_requirements generators, I think. :)
It could be done in a separate PR based on this one, just to keep the changes separate, but showing the correctness/usability of this one before merging it.
Thanks. Will take a closer look this week.
Hey @kaos, could you take another look? Thanks
Hey @kaos, could you take another look? Thanks
ops, this must've fallen prey to my GH notifications cleanup. (had many hundreds of unread notifications, so in an attempt to get back on top, I wiped them clean.)
Will take a look.
cc @pantsbuild/maintainers Would appreciate a few more 👀 on this as well before we land it.
I've skimmed the thread but not all the review comments, so I may've missed this. This capability sounds great!
I have a questions: The PR description makes it looks like this is focused on computing dependencies for --changed-since=.... Does this influence process execution caching?
Concrete example:
- Imagine I have this structure, with two tests, that depends on the user and address tables respectively, and have run
pants test ::.
flowchart BT
u[:users sql_table] --> t[tables.toml]
a[:address sql_table] --> t
tu[:test_users python_test] --> u
ta[:test_address python_test] --> a
- If I change the
[table.address]fields, there's two testing invocations I could do:-
pants --changed-since=... --changed-files-with-line-numbers=tables.toml --changed-dependents=transitive test: AIUI, this PR means onlytest_addresswill run. -
pants test ::(potentially with extra args): will this (a) onlytest_addressrun (i.e. this PR does influence process caching), or (b) both tests run (i.e. the PR does not)?
-
Does this influence process execution caching?
Very good and relevant question. I hope I'm not wrong here, but I think the answer is no. If you run pants test :: it will use the same caching as before (i.e. the unit of test modules in the batch.)
Does this influence process execution caching?
To be honest, I don't know. Here are the things I do know:
- the pr adds a new
TargetAdaptorattributeorigin_sources_blocks - the attribute is not used by any existing pants code, so it has no effect (yet)
- when used the attribute changes the dependency relations essentially by reducing false positives, this affects the
--changed-sincelogic and everything that uses the dependencies.
Since a lot of code implicitly expects all targets to have file-based sources rather than block-based sources, I expect many things will not work with it and will require more adjustments. The good thing is that it's opt-in, so nothing will use the feature by default.
@kaos Ready for another (final?) round
- added notes
- removed source_blocks/
Sure
@benjyw Thanks for the review! I've added the logic for staged and untracked files. I didn't implement --changed-diffspec yet, I think it's not essential for the feature, we can do it in a separate pr.
@huonw Yeah, thanks, moved notes from 2.22 to 2.23
Merged! Thanks again @grihabor, great stuff!
Yay, thank you for your guidance!
I was wondering why this was not released to any of the dev tags released yesterday
I was wondering why this was not released to any of the dev tags released yesterday
What are you missing? According to the commit https://github.com/pantsbuild/pants/commit/2413448f5a65758e168935bc1cd849f056df065d it's been released since 2.23.0.dev1.