Add missing docstring normalization
Without this docstrings with UTF8 withing [] have miscalculated lengths.
This adds (some of?) the normalization needed for [...] docstring elements. Currently this change also incorrectly applies to preformatted code block elements {[...]}. I do not know the best way to distinguish these cases in odoc, see https://github.com/ocaml-ppx/ocamlformat/issues/1735.
I don't know the best implementation strategy. The issues I am trying to resolve are, I think, similar in nature to the following example. Formatting:
val trms : t -> trm iter
(** [trms a] enumerates the maximal foreign or noninterpreted proper
subterms of [a]. Considering an arithmetic term as a polynomial,
[trms (c × (Σᵢ₌₁ⁿ cᵢ × Πⱼ₌₁ᵐᵢ Xᵢⱼ^pᵢⱼ))] is the sequence of monomials
[Πⱼ₌₁ᵐᵢ Xᵢⱼ^pᵢⱼ] for each [i]. If the arithmetic term is a monomial,
[trms (Πⱼ₌₁ᵐ Xⱼ^pⱼ)] is the sequence of factors [Xⱼ] for each [j]. *)
yields
val trms : t -> trm iter
(** [trms a] enumerates the indeterminate terms appearing in [a].
Considering an arithmetic term as a polynomial,
[trms (c × (Σᵢ₌₁ⁿ cᵢ × Πⱼ₌₁ᵐᵢ
Xᵢⱼ^pᵢⱼ))] is the sequence of terms [Xᵢⱼ] for each [i] and
[j]. *)
while with this PR this docstring is unchanged, as desired.
There seems to be something amiss where without this PR, the length of strings with UTF8 characters is not calculated correctly, where with this PR they are. @Julow, this behavior that I see doesn't match up with your explanation in the inline comment, so I'm confused. Do you see a different / better solution?
Are you using 0.18.0 as a comparison ?
I get no diff between this PR and the current master with your example but I do see it compared with 0.18.0. UTF8 length (grapheme clusters) is used on every strings (including keywords and identifiers) since https://github.com/ocaml-ppx/ocamlformat/pull/1673
Hmm, I am literally comparing this PR with master. I'll dig some more to see if I can understand what is happening.
Ugh, this is an instance of https://github.com/ocaml-ppx/ocamlformat/issues/1675. If you start with master and format
val trms : t -> trm iter
(** [trms a] enumerates the indeterminate terms appearing in [a].
Considering an arithmetic term as a polynomial,
[trms (c × (Σᵢ₌₁ⁿ cᵢ × Πⱼ₌₁ᵐᵢ
Xᵢⱼ^pᵢⱼ))] is the sequence of terms [Xᵢⱼ] for each [i] and
[j]. *)
then it is unchanged. Switch to this PR and format and it produces:
val trms : t -> trm iter
(** [trms a] enumerates the maximal foreign or noninterpreted proper
subterms of [a]. Considering an arithmetic term as a polynomial,
[trms (c × (Σᵢ₌₁ⁿ cᵢ × Πⱼ₌₁ᵐᵢ Xᵢⱼ^pᵢⱼ))] is the sequence of monomials
[Πⱼ₌₁ᵐᵢ Xᵢⱼ^pᵢⱼ] for each [i]. If the arithmetic term is a monomial,
[trms (Πⱼ₌₁ᵐ Xⱼ^pⱼ)] is the sequence of factors [Xⱼ] for each [j]. *)
The remaining question is: How should be change the formatting of code spans.
This PR allows every spaces in a code span to break. This makes it more stable (there's only one stable formatting) but it changes Odoc's AST.
(** [a
b] *)
This is parsed as "a\n\ \ \ \ b" by Odoc. (The HTML output is "a b" but not in other backends)
Perhaps we could change Odoc to normalize whitespaces inside code spans ? For now we can't break code spans.
Hmm, this feels like another case of odoc just doing something different from what is in the ocaml manual, so I don't know. To my understanding, {[...]} are described as preformatted, and hence parsing their contents in a whitespace-preserving way makes sense, while [...] are not described as preformatted, and so their contents should be normalized. If the current parsing in odoc prevents this, then it seems like an odoc improvement is needed. Does that make sense to anyone other than me?
I tend to agree that [] contents should be normalized while {[]} should be kept verbatim, is there an issue open on odoc? We will comply with whatever consensus there will be between ocaml and odoc on the language
I don't think that there is an odoc issue about this. I'm not sure of the constraints on the odoc side, and on the interop between ocamlformat and odoc. We are currently using this PR internally (this branch) with no issues, even though we do full docstring normalization. So I'm not sure what the observed problem just for odoc looks like.
I would support normalizing [ ... ] code blocks but this cannot be done yet. With the current Odoc, this means changing the AST and inserting unwanted newlines into the rendered output.
Also, we must be careful at how code blocks are used, this might break some documentations:
- (** [trim " "] is [""] *)
+ (** [trim " "] is [""] *)
I believe that this PR doesn't fix the original issue (which was fixed in https://github.com/ocaml-ppx/ocamlformat/pull/1673).
Perhaps we should open an issue on Odoc ?
In the original report I was mistaken to bring up utf length computation, that turned out not to be the issue. The issue in the comment above remains though, and is fixed by this PR.
@Julow, are you saying that any change in ocamlformat that allows every space in a code span to break would cause problems with odoc? Are the odoc ast changes you mention something that are checked in ocamlformat analogous to how we check the main parsetree is preserved by formatting? I'm curious as we have been running with this PR in production for a while with no issues.
@Julow, are you saying that any change in ocamlformat that allows every space in a code span to break would cause problems with odoc? Are the odoc ast changes you mention something that are checked in ocamlformat analogous to how we check the main parsetree is preserved by formatting? I'm curious as we have been running with this PR in production for a while with no issues.
Any change in whitespace/newlines in code spans would change the rendered documentation. Every strings in doc comments are normalized: https://github.com/ocaml-ppx/ocamlformat/blob/main/lib/Normalize.ml#L79-L83 which is wrong at least for code spans with the current implementation of Odoc.
I believe it is reasonable to change Odoc to normalize line breaks (and the following spaces) in code spans (but not consecutive spaces).