Semantic comments
Having a way to semantically describe a message would benefit tooling. It would allow tools to better inform the user what they can do in the translation, and give hints and suggestions.
Perhaps we could consider using something similar to JSDoc, in particular the @param tag: http://usejsdoc.org/tags-param.html. JSDoc conveniently allows specifying the type, the description and the default value, which tools could use to display an example of a formatted translation.
# @param {number} [$num = 4] Number of new messages
new-messages = { $num ->
*[one] You have 1 new message.
[other] You have { $num } new messages.
}
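As a rough sketch of how a tool might consume such a comment, the snippet below parses the JSDoc-like line from the example and uses the default value to preview one variant. The regular expression, the `Param` tuple and the `parse_param` and `preview` helpers are illustrative assumptions, not part of any existing Fluent tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape, copied from the example above:
#   # @param {type} [$name = default] Description
PARAM_RE = re.compile(
    r"^#\s*@param\s+\{(?P<type>[^}]+)\}\s+"
    r"\[\s*\$(?P<name>[\w-]+)\s*(?:=\s*(?P<default>[^\]]*?))?\s*\]\s*"
    r"(?P<description>.*)$"
)


class Param(NamedTuple):
    name: str
    type: str
    default: Optional[str]
    description: str


def parse_param(line: str) -> Optional[Param]:
    """Return the parsed @param data, or None for an ordinary comment."""
    m = PARAM_RE.match(line.strip())
    if m is None:
        return None
    return Param(m["name"], m["type"].strip(), m["default"], m["description"].strip())


def preview(pattern: str, param: Param) -> str:
    """Substitute the documented default into a single-line pattern."""
    placeable = re.compile(r"\{\s*\$" + re.escape(param.name) + r"\s*\}")
    value = param.default if param.default is not None else "?"
    return placeable.sub(lambda _: value, pattern)


param = parse_param("# @param {number} [$num = 4] Number of new messages")
print(preview("You have { $num } new messages.", param))
```

The preview here is a plain text substitution. A real tool would format the message through a Fluent implementation so that plural selection and number formatting apply.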
Would it make sense to make this meta-information first-class? Rust differentiates between regular comments (//) and doc comments (///). We could do something similar by making the @ sigil special:
@param {number} [$num = 4] Number of new messages
new-messages = { $num ->
*[one] You have 1 new message.
[other] You have { $num } new messages.
}
This is possibly related to #7.
Some more thoughts on how this relates to #7: we could introduce our custom @-tags for language-specific meta information which would be ignored by tools like compare-locales:
@meta masculine
brand-name = Firefox
We could also introduce versioning for messages. This would allow making small changes to the original copy without having to change the identifier. The default and implicit revision would be revision 0:
accept = Accept the Terms and Conditions.
and some time later:
@rev 1
accept = Please accept the Terms and Conditions.
Some more thoughts on how this relates to #7: we could introduce our custom @-tags for language-specific meta information which would be ignored by tools like compare-locales
In https://github.com/projectfluent/syntax/issues/7#issuecomment-275671311 @Pike suggested that we separate semantic comments and grammatical data due to their having different owners.
I love that. One possible use case is to define maximum string length (we already use it in .lang files, e.g. for translating promotional tweets).
We could also introduce versioning for messages. This would allow making small changes to the original copy without having to change the identifier.
I'll bet you 100€ that some developer will forget to version a string and that we'll end up with significant changes in the source language that will be missing from the target languages - while you might still control this in QA somehow for Mozilla's core products, you can't count on every extension developer to be aware of the problem. The only way that works reliably in my book is the way gettext handles this: If the string content has changed, it will become fuzzy. It is annoying for localizers if strings become invalid due to a typo having been fixed in the source language, but it's the lesser of 2 evils.
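To illustrate the "changed content becomes fuzzy" idea without relying on developers to bump a revision, here is a minimal sketch. It assumes the tool records a fingerprint of the source text each translation was made against. This is not how gettext itself implements fuzzy matching, and `fingerprint` and `find_fuzzy` are hypothetical names.

```python
import hashlib


def fingerprint(source: str) -> str:
    """Stable digest of the source-language text of a message."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def find_fuzzy(sources: dict[str, str], translated_from: dict[str, str]) -> set[str]:
    """Return ids whose source text changed since they were translated.

    `sources` maps message ids to the current source text;
    `translated_from` maps ids to the fingerprint recorded at translation time.
    """
    return {
        msg_id
        for msg_id, digest in translated_from.items()
        if msg_id in sources and fingerprint(sources[msg_id]) != digest
    }


recorded = {"accept": fingerprint("Accept the Terms and Conditions.")}
current = {"accept": "Please accept the Terms and Conditions."}
print(find_fuzzy(current, recorded))
```

With this approach a typo fix in the source also invalidates the translation, which is exactly the trade-off described above.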
In #59 @zbraniecki asked about the possibility of using semantic comments for tags. I'm pasting my reply below to keep the discussion in one place.
My understanding of the scope of semantic comments is that they would be a place to put extra data available to tools rather than the runtime. In fact they wouldn't be parsed on runtime at all. This would make it impossible to use them for tags.
Even before we move forward with this, can we get a consensus on what we'd like such comments to look like, so that we can start writing them even without them being "semantic" yet?
I'm trying to decide between:
# Description
#
# Variables
# $num (String) - Description
and:
# Description
#
# $num {String}: Description
or something in between? Thoughts?
Even before we move forward with this, can we get a consensus on what we'd like such comments to look like (…)?
Isn't this the exact goal of this issue? :)
Some prior art:
JSDoc:
# @param {number} $num - The $num value.
Some Python projects use this style:
# :param $num: The $num value.
# :type $num: number
I like simple ideas of the form:
# $num (Number) - Description
But without a @param or :param how would we introduce other information like max-length or revision? Could we just use @ for data not related to arguments?
# $num (Number) Description
# @max-length 140
If the goal is to parse comments and extract information about parameters, I feel like we should enforce @param. If it's only about documenting the parameters, then your last example looks good to me.
One good use of semantic comments would be to instruct the localization tool like Pontoon on what context the string will be used in.
The particular case is where a message like this:
key = Click on <a>my</a> link to paste <a> into the textbox
The latter <a> should be displayed literally, so we probably want to escape it (as &lt;a&gt;) at least, and without knowing whether the key goes to alert() or to the DOM, it's impossible for the tool or localizer to know how to encode it.
Semantic comments could make it trivial:
# @env: html
key = Click on <a>my</a> link to paste <a> into the textbox
and to prevent having to place it in front of every string, we could use group comments and resource comments to annotate the whole file.
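A minimal sketch of how such inheritance could be resolved, assuming that `###` (resource), `##` (group) and `#` (message) comments may each carry an `@env` tag and that the closest one wins. The line-based scanning and the `resolve_env` helper are simplifications for illustration, not a real Fluent parser.

```python
import re

# Tag shape taken from the "# @env: html" example above; the colon is optional here.
ENV_RE = re.compile(r"^(#{1,3})\s*@env:?\s*(\S+)\s*$")
MESSAGE_RE = re.compile(r"^([a-zA-Z][\w-]*)\s*=")


def resolve_env(ftl: str, default: str = "text") -> dict[str, str]:
    """Map each message id to its effective @env value."""
    resource_env = group_env = message_env = None
    prev_was_group = False
    result: dict[str, str] = {}
    for line in ftl.splitlines():
        is_group = line.startswith("##") and not line.startswith("###")
        if is_group and not prev_was_group:
            group_env = None  # a new group comment block starts a new group
        prev_was_group = is_group

        tag = ENV_RE.match(line)
        if tag:
            level, value = len(tag[1]), tag[2]
            if level == 3:
                resource_env = value
            elif level == 2:
                group_env = value
            else:
                message_env = value
        elif not line.strip():
            message_env = None  # a blank line detaches a message-level comment
        else:
            message = MESSAGE_RE.match(line)
            if message:
                result[message[1]] = message_env or group_env or resource_env or default
                message_env = None
    return result
```

Given a resource that starts with `### @env: html`, every message without a closer annotation would resolve to `html`.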
Semantic Comments v1 proposal
(updated: April 5 2018)
Description
Semantic comments are a concept that brings basic machine-readable structure to comments. The idea is to design a set of codifiable patterns that enable algorithmic interpretation of a comment. Such models allow comments to be parsed and their data to be used in tooling.
The core design goal is to develop rules that are easy to naively interpret and memorize by humans with minimal overhead, while at the same time allowing computers to assign meaning.
Semantic comments may serve several high level roles:
- They can unify the way meta information is being stored and presented improving readability of the comments and reducing cognitive load on the reader.
- They can help tools present contextual information from the developer to localizers in a well organized manner.
- They can help tools interpret the translation and provide additional checks and UI options.
In principle, the nature of the data stored in the comments is limited - runtime parsers should be able to skip comments without parsing them and failure to retrieve information from the comment should not result in any serious reduction in usability of the system.
Experience from other programming languages shows that some form of semantic comments is helpful in most languages, from JavaScript, Python and C++ to Rust and CSS.
Below is my initial proposal for the first version of Semantic Comments.
Title Line
It would be useful to be able to capture a short description of the section for UI tools to use when operating on long lists of strings.
A great example of such use case is the current preferences.ftl with hundreds of messages clustered into sections.
My proposal is to identify the title line of any comment as fitting into one of two conditions:
- Being the only non-empty line of the comment
- Being the first non-empty line of the comment with a blank line right below it.
That means that the following two are title lines:
## Privacy Section - Site Data
## Privacy Section - Site Data
##
## This section will contain several messages
## that should be translated by a lawyer if possible.
And the result in Pontoon, for example, may look like this
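The two title-line conditions above could be checked roughly as follows. This is a sketch that assumes the comment is already available as a list of raw lines; `comment_title` is a hypothetical helper.

```python
from typing import Optional


def comment_title(comment_lines: list[str]) -> Optional[str]:
    """Return the title line of a comment, or None if it has no title."""
    # Strip the comment sigils (#, ## or ###) and surrounding whitespace.
    content = [line.lstrip("#").strip() for line in comment_lines]
    non_empty = [i for i, text in enumerate(content) if text]
    if not non_empty:
        return None
    first = non_empty[0]
    # Condition 1: the only non-empty line of the comment.
    if len(non_empty) == 1:
        return content[first]
    # Condition 2: the first non-empty line, with a blank line right below it.
    if not content[first + 1]:
        return content[first]
    return None


print(comment_title([
    "## Privacy Section - Site Data",
    "##",
    "## This section will contain several messages",
    "## that should be translated by a lawyer if possible.",
]))
```

A two-line comment without a blank separator yields no title under these rules, so ordinary prose comments are left alone.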
Meta-information
Meta information should by definition be an open-ended system. That means that while we can specify the syntax around it and define a set of known values, the system should also be open to extension with new keys in the future. For that reason, I believe that a key-value param system would work well.
The initial uses of meta information may provide details such as: In which context is the message used? Which communication style should it use? Are there any legal requirements associated with it (branding policy etc.)? Which UI toolkit will it use? What is the string version?
Another use case is to instruct the tooling about any soft-limitations imposed on the translation. For example we may want to instruct the localization tool that a given file/group/message should remain simple - no formatters, no variants etc.
Such a limitation may result from some dumbing-down conversion that will happen to the file later in the release cycle, but a semantic comment may be useful early on to help CAT tools.
JSDoc has a nice system of block and inline tags. Copying it, it may look like this:
### Preferences
###
### @license LGPL
### @toolkit HTML
## General Section - Site Data
# @context Window Title
sitedata-exceptions-title =
.label = Exceptions
.accesskey = E
# @context Menu button
sitedata-exceptions =
.label = Exceptions
.accesskey = E
# @policy "Firefox" should be treated as a brand and kept in English,
# while "Home" and "(Default)" can be localized.
home-mode-choice-default =
.label = Firefox Home (Default)
Since the system is open ended, the first step is to define the syntax for it. I'd like it to be:
@name value
@name value {type}
@name value {type} - description
with all three being optional.
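One possible reading of this syntax, sketched as a single regular expression. It assumes that the value runs up to the first `{` or to a ` - ` separator, and it ignores multi-line tags such as the @policy example. `Tag` and `parse_tag` are illustrative names only.

```python
import re
from typing import NamedTuple, Optional

TAG_RE = re.compile(
    r"^@(?P<name>[\w.-]+)"                # @name
    r"(?:\s+(?P<value>[^{]*?))?"          # optional value
    r"(?:\s*\{(?P<type>[^}]*)\})?"        # optional {type}
    r"(?:\s+-\s+(?P<description>.*))?$"   # optional - description
)


class Tag(NamedTuple):
    name: str
    value: Optional[str]
    type: Optional[str]
    description: Optional[str]


def parse_tag(comment_line: str) -> Optional[Tag]:
    """Parse one comment line; return None if it carries no @-tag."""
    text = comment_line.lstrip("#").strip()
    m = TAG_RE.match(text)
    if m is None:
        return None
    return Tag(m["name"], m["value"] or None, m["type"], m["description"])


print(parse_tag("# @arg $name {string} - Name of a search engine."))
```

Under this reading a value that itself contains ` - ` would be split too early, which is one of the ambiguities the final syntax would need to settle.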
Variables
Variables could either be a particular type of block tags, or a separate thing:
# @arg $name {string} - Name of a search engine.
search-keyword-warning-engine = You have chosen a keyword that is currently in use by “{ $name }”. Please select another.
or:
# $name {string} - Name of a search engine.
search-keyword-warning-engine = You have chosen a keyword that is currently in use by “{ $name }”. Please select another.
I'm not very opinionated here, and we could start with the former and maybe one day add the latter as a convenience mechanism ("($.*)" becomes a "@arg $1").
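The convenience mechanism could be a simple normalization pass that runs before tag parsing. A sketch, assuming the shorthand is only recognized when `$` is the first character of the comment content; `expand_shorthand` is a hypothetical helper.

```python
import re

# A comment line whose content starts with a $variable.
SHORTHAND_RE = re.compile(r"^(#+\s*)(\$[\w-]+.*)$")


def expand_shorthand(comment_line: str) -> str:
    """Rewrite "# $name ..." as "# @arg $name ..."; leave other lines untouched."""
    return SHORTHAND_RE.sub(r"\1@arg \2", comment_line)


print(expand_shorthand("# $name {string} - Name of a search engine."))
```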
Syntax coloring / validation
The last item I'd like us to consider is syntax coloring and augmentation.
There are three areas where we may end up placing a syntax from another programming language:
- In the DOM Overlay
- As an argument (style: width: 15em)
- In a comment
For the DOM Overlay, I believe that block tag @toolkit HTML on file/section/message comment should be sufficient for any form of syntax highlighting to kick in.
For arguments it's a bit more tricky, and I thought we could annotate it like this:
# This string is currently used only in Firefox 60 and will be removed when not
# needed for x-channel. See bug 1445686 for details.
#
# @toolkit.style CSS
search-input =
.title = My Window Title
.style = width: 15.4em
to allow us to specify that the .style attribute is in CSS.
Finally, in the comment, I really like the RST way:
# This message will be displayed inside a :html:`<span/>` with :css:`font-weight: bold`.
#
# @arg $name {string} - Name of a search engine.
search-keyword-warning-engine = You have chosen a keyword that is currently in use by “{ $name }”. Please select another.
This could be introduced gradually: we could now specify "`" as the sigil for code only and let tooling autoguess the syntax highlighting for it, and one day extend it with a :XXX: prefix to allow for specifying the language if needed.
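Both stages can be served by one extraction step in which the language prefix is optional. A minimal sketch, with `code_spans` as a hypothetical helper:

```python
import re
from typing import Optional

# A backtick-delimited span, optionally preceded by an RST-like :lang: role.
CODE_SPAN_RE = re.compile(r"(?::(?P<lang>\w+):)?`(?P<code>[^`]+)`")


def code_spans(comment: str) -> list[tuple[Optional[str], str]]:
    """Return (language, code) pairs; language is None when no prefix is given."""
    return [(m["lang"], m["code"]) for m in CODE_SPAN_RE.finditer(comment)]


print(code_spans(
    "This message will be displayed inside a :html:`<span/>` with :css:`font-weight: bold`."
))
```

Spans without a prefix come back with `None` as the language, which is where a tool could fall back to guessing.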
The total outcome might look like this
@mathjazz @Pike @stasm @flodolo @adngdb @hkasemir - can I get your feedback here please? I'd like to start unifying our comments around those concepts if they're approved.
Nice work, @zbraniecki! This is very much in line with what I had in mind for this, thanks for adding substance and providing details.
Title line
I see your point and I like the Pontoon mockup. I'm not sure this practice needs to be codified as a rule. Pontoon could simply show the first line of the comment truncated to fit the UI and the effect would be the same, I think?
Meta information
This looks great and using @prop name makes sense to me.
Variables
I like basing the syntax on JSDoc. One thing that I didn't see in your proposal is the syntax for example values. In my original comment I used JSDoc's syntax for default values of optional params:
# @param {number} [$num = 4] Number of new messages
I'm not sure this would be a good fit for Fluent. There is no notion of optional parameters/arguments, so the square brackets [ ] are not required. They also add visual clutter. Perhaps we could add this information to the type? Like so:
# @param $name {Type, "example value"} - Description
Which would give us:
# @param $num {Number, 4} - Number of notifications.
# @param $username {String, "Anne"} - User's first name.
An alternative inspired by TypeScript, Rust and a few others:
# @param $num: Number (4) - Number of notifications.
# @param $username: String ("Anne") - User's first name.
IIUC any such derivation will make our comments syntax incompatible with JSDoc. Should we try to maintain the compatibility? Or is that a non-goal?
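For the first of the two forms above, `{Type, example}`, a parser sketch might look like the following. It assumes example values are written as JSON literals, which fits `4` and `"Anne"` but is not something the proposal specifies. `parse_example` is a hypothetical name.

```python
import json
import re

PARAM_RE = re.compile(
    r"^#\s*@param\s+\$(?P<name>[\w-]+)\s+"
    r"\{\s*(?P<type>\w+)\s*(?:,\s*(?P<example>[^}]+?)\s*)?\}"
    r"(?:\s+-\s+(?P<description>.*))?$"
)


def parse_example(line: str):
    """Return (name, type, example, description), or None if the line does not match."""
    m = PARAM_RE.match(line.strip())
    if m is None:
        return None
    raw = m["example"]
    # Assumption: examples are JSON literals; json.loads raises on anything else.
    example = json.loads(raw) if raw is not None else None
    return m["name"], m["type"], example, m["description"]


print(parse_example("# @param $num {Number, 4} - Number of notifications."))
print(parse_example('# @param $username {String, "Anne"} - User\'s first name.'))
```

An example string containing `}` would not parse with this pattern, so a real grammar would need an escaping rule.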
Syntax coloring
I recommend sticking to Markdown rather than adding features from RST. As such, I think comment contents should simply be allowed to be valid Markdown. This would make it possible to use backticks for inline code fragments, without any syntax highlighting (like `this`). While this is a limitation, it's not a big one. GitHub comments work quite well despite it :)
Title Line
I don't see a great advantage in having a title for section comments. In case, I would prefer something more explicit than relying on position and empty lines, i.e.
## @title Privacy Section - Site Data
## @title Privacy Section - Site Data
##
## This section will contain several messages
## that should be translated by a lawyer if possible.
Which would make it fall into the next group.
Meta-information
I'm trying to imagine how we could practically use this information, but I'm failing. For example, for us @policy is implicit, since we put brand names in specific files and paths. @context is normally part of the content of the comment itself.
I think we need some valid use cases to justify the added complexity of parsing these comments.
Variables
I agree that we should standardize this type of information, and I'm fine with the @arg approach.
We could even go as far as failing some tests if a string has placeables but not associated comments.
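Such a test could be quite small. The sketch below assumes the @arg form from the proposal and works on the raw comment and pattern text rather than on a parsed AST; `undocumented_variables` is a hypothetical helper.

```python
import re

# Fluent identifiers start with a letter; a literal "$" followed by a letter
# in the text would be a false positive in this simplified check.
VARIABLE_RE = re.compile(r"\$([a-zA-Z][\w-]*)")
DOCUMENTED_RE = re.compile(r"@arg\s+\$([a-zA-Z][\w-]*)")


def undocumented_variables(comment: str, pattern: str) -> set[str]:
    """Return variables used in the pattern that have no @arg entry in the comment."""
    used = set(VARIABLE_RE.findall(pattern))
    documented = set(DOCUMENTED_RE.findall(comment))
    return used - documented


missing = undocumented_variables("", "You have { $num } new messages.")
print(missing)
```

A test suite could then fail whenever the returned set is non-empty for any message.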
Syntax coloring / validation
I don't think there's value in highlighting syntax in comments (last part of the proposal). It adds a ton of complexity for little gain.
I'm not sure if there should be highlighting in strings either, but I'd be more open about that.
@toolkit.style CSS
I think this should be something more like
@validation .style CSS
Which could be used to both validate the attribute externally (compare-locales), and highlight strings in Pontoon.
(Title Line) I'm not sure this practice needs to be codified as a rule.
I think there is a value.
IIUC any such derivation will make our comments syntax incompatible with JSDoc. Should we try to maintain the compatibility? Or is that a non-goal?
non-goal
As such, I think comment contents should simply be allowed to be valid Markdown. This would make it possible to use backticks for inline code fragments, without any syntax highlighting (like `this`).
I agree about backticks, but I'd be concerned if we tried to say that all markdown syntax is supported in our comments. AFAIK Markdown supports much more and tying us to markdown seems a bit excessive (and adds a strong dependency).
I don't see a great advantage in having a title for section comments. In case, I would prefer something more explicit than relying on position and empty lines, i.e.
I'm not opposed to using @title here. I think it may be redundant (since the position and white line communicate the same thing both to the human reader and can be easily parsed), but we could start with explicit param and consider adding an implicit support later.
I think we need some valid use cases to justify the added complexity of parsing these comments.
If I read your statement correctly it starts with "I don't understand" and finishes with "and thus I believe the proposal is invalid" :) I'm happy to answer your questions and explain further, but I do believe the examples listed are valid.
I think this should be something more like
@validation .style CSS
Hmm, how would you denote the syntax of the value then?
If I read your statement correctly it starts with "I don't understand" and finishes with "and thus I believe the proposal is invalid" :) I'm happy to answer your questions and explain further, but I do believe the examples listed are valid.
Uhm, where did I say “I don't understand”? You gave a few examples:
- @license: I'm not sure that would work, from a legal perspective, given that we had to copy and paste the same license header to all files for a while now.
- I've explained why I believe that @context and @policy are not going to be useful in our case.
I'm not against them, in fact I suggested to use @title (and potentially @validation), and I agree on using @arg. The disagreement is more about the open nature you're suggesting.
Hmm, how would you denote the syntax of the value then?
@validation CSS or @validation value CSS? The latter would more intuitively apply only to the value, in case there are more attributes.
On the same subject, I see these should apply only to individual strings, not file wide.
I think there is a value.
What's the value that you're seeing? :) In particular, what is the value over what I suggested:
Pontoon could simply show the first line of the comment truncated to fit the UI and the effect would be the same, I think?
I agree about backticks, but I'd be concerned if we tried to say that all markdown syntax is supported in our comments.
You're right, we should be explicit about only supporting a strict subset of Markdown.
I think we should split this up. This is way too big to reason about at this point.
High-level comments:
- @foo: I don't think that JSDoc is a good example for us. There's a couple of overlapping concepts, and the most important @param one is not well defined. That doesn't mean that we should avoid syntax overlap, but I also think we should be strict in how we talk about this.
- group and resource comments: Right now, Pontoon doesn't do a good job at showing group comments, and I'd like to avoid adding comment area for developers if we don't have a good way to show them.
Suggestions:
I'd recommend to have this issue focus on the @foo syntax, and how to parse that. I'd split out individual foos to individual issues. (Yes, having a use-cases helps with the general syntax, but only so far.)
I'd recommend to split out group and resource comment handling into a completely different thread, and possibly have pontoon be the driving force behind changes to that. That effort might also be something to be done post-translate-view-refactor. Or something based on more realistic interactive mock-ups that let us experience how the comments in a group get shown as you translate entity for entity.
I'm not sure that would work, from a legal perspective, given that we had to copy and paste the same license header to all files for a while now.
I've recently seen several conversations on #developers indicating that this is no longer true. I'd like to verify that so I'll seek further confirmation, but in general, it's a per-project policy and a header like that may be useful. Please remember that we're designing syntax not just for Gecko.
For example, for us @policy is implicit, since we put brand names in specific files and paths. @context is normally part of the content of the comment itself.
That's not always true. We have a lot of branding related policies in other files and assuming that all brands will end up in separate FTL files is IMHO not going to hold. Having a parameter to provide policy information seems like a low hanging fruit.
Regarding @context - the basic premise behind semantic comments is to extract pieces of the "content of the comment" into bits that are interpretable by software.
For example, while currently a comment may contain contextual information, it would be hard or impossible for Pontoon to reason about whether such a comment contains any contextual information and which part of the comment does so.
Having a @context parameter allows tooling to use a particular message that is semantically described to contain contextual information to be presented in a form that is more relevant to the reader.
Examples may be screenshots of the UI, or even more semantic information like is it a title, message, button label etc, which could be further used by the tool to improve the graphical representation of the message and help the localizer understand how to translate.
A particular example here is that knowing the context of the message may help Pontoon prioritize the messages in translation memory which share the same context over ones that have the same English value but a different context. Those are of course just examples.
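The translation memory case can be sketched as a stable sort that moves same-context matches to the front. The `TMEntry` shape and `rank_matches` function are assumptions for illustration and do not reflect Pontoon's actual data model; the translations are placeholder strings.

```python
from typing import NamedTuple, Optional


class TMEntry(NamedTuple):
    source: str
    translation: str
    context: Optional[str]  # value of a hypothetical @context tag, if any


def rank_matches(entries: list[TMEntry], source: str, context: Optional[str]) -> list[TMEntry]:
    """Return exact source matches, with entries sharing the context first."""
    matches = [e for e in entries if e.source == source]
    # sorted() is stable and False sorts before True, so entries with the
    # same context move to the front while the rest keep their order.
    return sorted(matches, key=lambda e: e.context != context)


tm = [
    TMEntry("Exceptions", "translation A", "Menu button"),
    TMEntry("Exceptions", "translation B", "Window Title"),
    TMEntry("Accept", "translation C", None),
]
print(rank_matches(tm, "Exceptions", "Window Title"))
```

This only helps if context values are chosen consistently across files, which is why a defined vocabulary for @context would matter.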
On the same subject, I see these should apply only to individual strings, not file wide.
Agree.
I'd split out individual foos to individual issues.
Agree. I'll file an issue per proposal, assuming that we're past the stage where a single issue for all elements of the proposal makes it easier to discuss them.
Thanks!
Please remember that we're designing syntax not just for Gecko.
I think that's one point that I tend to forget in such discussions.
And translation memory seems definitely an interesting application, the challenge would be making sure values for these are chosen consistently.
Based on conversation with Stas I added an example for meta-data about simple strings to instruct CAT to warn against using any placeables in a given file/group/message.
We talked about modularization of the Fluent specs, and I think semantic comments would be a good example. Should we have a repo for just semantic comments (projectfluent/comments or projectfluent/semantic-comments), with issues to move individual aspects forward, and possibly a markdown file per spec/facet?
I'd prefer to keep everything in a single repo and use labels and projects. We can add a new file in the spec/ directory. There's already a draft there of how errors should be handled (to be revised, for sure).
Separated out into issues. Skipped colors for now.
I created a GitHub project for tracking the design and implementation of semantic comments: https://github.com/projectfluent/fluent/projects/5. @zbraniecki, should we close this issue given that we now have separate issues for each proposal?