
Widechar charmap support

Open SnDream opened this issue 9 years ago • 17 comments

Currently, RGBDS can only support

charmap "C", $01

Could RGBDS support wide-character charmaps like

charmap "C", $01F0

or

charmap "C", $01F0, 2

?

SnDream avatar Sep 29 '16 15:09 SnDream

With ASCII, db "ABC" acts like db $41, $42, $43. Current master adds dw and dl support, so dw "ABC" acts like dw $41, $42, $43 (i.e. db $41, $00, $42, $00, $43, $00 since it's little-endian).

Presumably this would allow widths 1, 2, 3, or 4, since that's as many bytes as can be stored in one unsigned value.

  • How would endianness work? If you have charmap "C", $01F0, 2 and do db "C", would that act like db $01, $F0 or db $F0, $01?
  • Furthermore, how would it interact with db, dw, and dl? Should widths be handled per-character, so dw "AC" would act like db $41, $00, $F0, $01? Or should charmapping convert to a sequence of bytes first, so dw "AC" would act like dw $41, $F0, $01 (i.e. db $41, $00, $F0, $00, $01, $00)? (Or one of those options but with big-endian order for the bytes of "C"?)
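The two dw interpretations asked about above can be compared with a small Python model. This is only a sketch of the question, not RGBDS behavior: `CHARMAP`, `dw_per_character`, and `dw_bytes_first` are hypothetical names, and the `(value, width)` pairs mirror the proposed `charmap "C", $01F0, 2` syntax.

```python
# Hypothetical model of the two options; none of these names exist in RGBDS.
CHARMAP = {"A": (0x41, 1), "C": (0x01F0, 2)}  # char -> (value, width in bytes)

def dw_per_character(text):
    """Option 1: every character becomes one little-endian 16-bit word."""
    out = bytearray()
    for ch in text:
        value, _width = CHARMAP[ch]
        out += value.to_bytes(2, "little")
    return bytes(out)

def dw_bytes_first(text, char_order="little"):
    """Option 2: expand each character to its bytes, then widen each byte to a word."""
    out = bytearray()
    for ch in text:
        value, width = CHARMAP[ch]
        for b in value.to_bytes(width, char_order):
            out += b.to_bytes(2, "little")
    return bytes(out)

print(dw_per_character("AC").hex(" "))          # 41 00 f0 01
print(dw_bytes_first("AC", "little").hex(" "))  # 41 00 f0 00 01 00
print(dw_bytes_first("AC", "big").hex(" "))     # 41 00 01 00 f0 00
```

The `char_order` argument corresponds to the parenthetical question about using big-endian order for the bytes of "C".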

Rangi42 avatar Jan 11 '21 20:01 Rangi42

With ASCII, db "ABC" acts like db $41, $42, $43. Current master adds dw and dl support, so dw "ABC" acts like dw $41, $42, $43 (i.e. db $41, $00, $42, $00, $43, $00 since it's little-endian).

Presumably this would allow widths 1, 2, 3, or 4, since that's as many bytes as can be stored in one unsigned value.

  • How would endianness work? If you have charmap "C", $01F0, 2 and do db "C", would that act like db $01, $F0 or db $F0, $01?
  • Furthermore, how would it interact with db, dw, and dl? Should widths be handled per-character, so dw "AC" would act like db $41, $00, $F0, $01? Or should charmapping convert to a sequence of bytes first, so dw "AC" would act like dw $41, $F0, $01 (i.e. db $41, $00, $F0, $00, $01, $00)? (Or one of those options but with big-endian order for the bytes of "C"?)

I haven't thought much about how dw and dl would be used; for my part, I only want db to support wide characters. This requirement comes from CJK text support. For 1, it should behave like UTF-8, with the leading byte first. Although the CPU is little-endian, when wide and single-byte characters are mixed in text processing, little-endian order makes it difficult to determine the width of the next character. Of course, if little-endian order is enforced, I can simply reverse the bytes when defining the charmap. For 2, I don't have an opinion.

Reference: my own modification of rgbds 0.4.1, used to translate pokecrystal. It does not support a definition like charmap "C", $00FF (which is parsed as charmap "C", $FF), but I do not need that in practice.

My translation of pokecrystal, the charmap used: Singlechar is used for the original characters, and Widechar is used for the additional Chinese characters.

Example of the text in my translation of pokecrystal

SnDream avatar Jan 12 '21 12:01 SnDream

Thank you, that makes sense.

If db outputs the bytes in big-endian order, it would probably be convenient for dw/dl to do the opposite, in case someone explicitly wants the other behavior. For example:

charmap "A", $42
charmap "W", $1122, 2
TX_LF EQU -2

    db "AWAW", TX_LF ; 42 11 22 42 11 22 fe
    dw "AWAW", TX_LF ; 42 00 22 11 42 00 22 11 fe ff
    dl "AW", TX_LF   ; 42 00 00 00 22 11 00 00 fe ff ff ff

(This behavior of dw would be convenient for pokepicross.)
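To double-check the byte listings in the example above, here is a rough Python model of the proposed rule: db writes a wide character's bytes in big-endian order, while dw and dl write one little-endian unit per character. The `emit` helper and the `CHARMAP` table are illustrative assumptions, not part of RGBDS, and numeric arguments are assumed to be truncated to the unit size.

```python
# Hypothetical model of the proposal; `emit` and CHARMAP are illustrative only.
CHARMAP = {"A": (0x42, 1), "W": (0x1122, 2)}  # char -> (value, width in bytes)
TX_LF = -2

def emit(unit_size, *args):
    """unit_size is 1 for db, 2 for dw, 4 for dl."""
    out = bytearray()
    for arg in args:
        if isinstance(arg, str):
            for ch in arg:
                value, width = CHARMAP[ch]
                if unit_size == 1:
                    # db: the character's own bytes, most significant first
                    out += value.to_bytes(width, "big")
                else:
                    # dw/dl: one little-endian unit per character
                    out += value.to_bytes(unit_size, "little")
        else:
            # plain numbers are truncated to the unit size
            mask = (1 << (8 * unit_size)) - 1
            out += (arg & mask).to_bytes(unit_size, "little")
    return bytes(out)

print(emit(1, "AWAW", TX_LF).hex(" "))  # 42 11 22 42 11 22 fe
print(emit(2, "AWAW", TX_LF).hex(" "))  # 42 00 22 11 42 00 22 11 fe ff
print(emit(4, "AW", TX_LF).hex(" "))    # 42 00 00 00 22 11 00 00 fe ff ff ff
```

The sketch does not handle a character whose value is wider than the unit (for example a 4-byte character in dw); `to_bytes` would raise `OverflowError` in that case.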

One other feature that could go with this: a charwidth or setcharwidth command to change the current default character width. This would make it easier to define many multi-byte characters.

newcharmap "CJK"
setcharwidth 2
charmap "啊", $04c3
charmap "阿", $04c4
charmap "埃", $04c5
...

Rangi42 avatar Jan 12 '21 13:01 Rangi42

So basically you can think of a wide char in two ways: a single multi-byte value, or short for a sequence of bytes. (Like the difference between "Σ" being Unicode code point U+03A3, or being UTF-8 encoded CE A3.)

If characters are to consistently be seen as byte sequences—which seems like the convenient behavior for db, since db "Σ" should perhaps act like db $03, $A3 or db $A3, $03 but not like the overflowing one-byte db $03A3—then they could be specified as such.
So instead of charmap "Σ", $03a3, 2, you could do charmap "Σ", $03, $a3; or charmap "Σ", $a3, $03 if that's the order you want the bytes in.
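The distinction between "one multi-byte value" and "a sequence of bytes" can be seen directly in Python, which exposes both views of "Σ". This is just an illustration using the standard `ord`, `str.encode`, and `int.to_bytes` functions; the variable names are arbitrary.

```python
sigma = "Σ"

# View 1: a single value (the Unicode code point)
code_point = ord(sigma)
print(hex(code_point))                 # 0x3a3

# View 2: a sequence of bytes (here, the UTF-8 encoding)
utf8_bytes = sigma.encode("utf-8")
print(utf8_bytes.hex(" "))             # ce a3

# The single value only becomes bytes once a width and byte order are chosen,
# which is what the two charmap spellings above make explicit.
big_endian = code_point.to_bytes(2, "big")        # like charmap "Σ", $03, $a3
little_endian = code_point.to_bytes(2, "little")  # like charmap "Σ", $a3, $03
print(big_endian.hex(" "), "|", little_endian.hex(" "))  # 03 a3 | a3 03
```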

This would be limited to taking 1 to 4 bytes. It would also make interaction with dw and dl potentially confusing, so maybe #623 could be reverted. (No need for dw "AZ" to act like db $41, $00, $5a, $00 when you could have done charmap "A", $41, $00, charmap "Z", $5a, $00, and db "AZ" instead.)

Rangi42 avatar Jan 12 '21 20:01 Rangi42

I'm starting to become concerned by the number of different data types that are overloaded into quoted strings. What would ld hl, "xy" do? What if y is a wide character?

aaaaaa123456789 avatar Jan 12 '21 22:01 aaaaaa123456789

In 0.4.2, there are two contexts where charmap_Convert gets called: constlist_8bit_entry (making a db entry for a string), and relocexpr (any context where a string gets treated as a number, which includes ld hl, "string"). Current master adds constlist_16bit_entry and constlist_32bit_entry conversion for dw and dl support, but as I mentioned above, that might be redundant with wide chars and could be removed before a release.

When a string is treated as a number, I'd expect it to work just the same as with a db: imagine it output byte by byte, and take the last 4 bytes as a number.
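As a sketch of that rule: convert the string to its output bytes, keep the last four, and read them as one number. The function name below is made up, and the byte order (first byte most significant) is an assumption chosen to agree with reading a two-byte character $22, $01 as $2201.

```python
def string_to_number(converted: bytes) -> int:
    """Hypothetical model: treat the last 4 output bytes as one number.

    `converted` stands for the bytes a db would have emitted for the string.
    Assumption: earlier bytes are more significant (big-endian accumulation).
    """
    tail = converted[-4:]
    return int.from_bytes(tail, "big")

print(hex(string_to_number(bytes([0x22, 0x01]))))                    # 0x2201
print(hex(string_to_number(bytes([0x42, 0x11, 0x22]))))              # 0x421122
print(hex(string_to_number(bytes([0x01, 0x02, 0x03, 0x04, 0x05]))))  # 0x2030405
```

In the last call the leading $01 is dropped because only four bytes fit in the result.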

Rangi42 avatar Jan 12 '21 22:01 Rangi42

It sounds reasonable to implement this, but #568 actually causes issues. The following need to be considered:

  • Behavior of db "abc啊阿埃"
  • Behavior of dw "abc啊阿埃" (and dl by extension)
  • Behavior of converting "埃" to a numeric value, and uses of that value

Imo, we need charmaps to map not to a byte, not to a u32, not to a byte array, but to a u32 array. Here's a demonstration:

charmap "啊", $04, $c3
charmap "de", $6564
charmap "fghi", $66, $67, $6968

db "abc啊" ; db $61, $62, $63, $04, $C3
db "abcde" ; db $61, $62, $63, $64 ; Truncation warning
dw "abc啊" ; dw $61, $62, $63, $04, $C3
dw "abcde" ; dw $61, $62, $63, $6564
dw "abcdefghi" ; dw $61, $62, $63, $6564, $66, $67, $6968
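The demonstration above can be reproduced with a short Python model of "charmap entries map to a u32 array". Everything here is a sketch under assumptions: `convert` and `emit` are hypothetical helpers, matching is assumed to prefer the longest charmap key, and unmapped characters are assumed to fall back to their UTF-8 bytes.

```python
import sys

# Each charmap entry maps a string to a list of values (a "u32 array").
CHARMAP = {
    "啊": [0x04, 0xC3],
    "de": [0x6564],
    "fghi": [0x66, 0x67, 0x6968],
}

def convert(text):
    """Greedy longest-match conversion of text into a flat list of values."""
    keys = sorted(CHARMAP, key=len, reverse=True)
    values, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                values.extend(CHARMAP[key])
                i += len(key)
                break
        else:
            # Assumption: unmapped characters emit their UTF-8 bytes.
            values.extend(text[i].encode("utf-8"))
            i += 1
    return values

def emit(unit_size, text):
    """unit_size is 1 for db, 2 for dw; oversized values are truncated with a warning."""
    limit = 1 << (8 * unit_size)
    out = []
    for v in convert(text):
        if v >= limit:
            print(f"warning: ${v:X} truncated to {unit_size} byte(s)", file=sys.stderr)
            v %= limit
        out.append(v)
    return out

print([hex(v) for v in emit(1, "abcde")])      # ['0x61', '0x62', '0x63', '0x64']
print([hex(v) for v in emit(2, "abcdefghi")])
```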

ISSOtm avatar Apr 20 '21 13:04 ISSOtm

I was looking for a way to eliminate one of those dimensions and allow just a sequence of N bytes (charmap "de", $65, $64) or just an N-byte value (charmap "de", $6564, 2) (limiting N to 4), but on reflection you're right, there are use cases for both.

charmap ".", $2e
charmap "…", ".", ".", "."

newcharmap utf16
; remember to just use 'dw' with these
charmap "A", $0041
charmap "Z", $005A
charmap "é", $00E9
charmap "啊", $554A
charmap "😀", $D83D, $DE00 ; surrogate pair
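The 16-bit values for a utf16 charmap like this one can be derived rather than looked up. A minimal Python sketch, using the standard `utf-16-be` codec; the helper name `utf16_units` is made up for illustration:

```python
def utf16_units(ch):
    """Return the UTF-16 code units for a character as a list of 16-bit values."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

for ch in "AZé啊😀":
    units = ", ".join(f"${u:04X}" for u in utf16_units(ch))
    print(f'charmap "{ch}", {units}')
```

Characters outside the Basic Multilingual Plane, such as "😀" (U+1F600), come out as two units, a high surrogate followed by a low surrogate.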

What would the value be of strings coerced to integers? Presumably multi-byte characters would act like those bytes in sequence, e.g. ("…") == ("..."), but what if single values are over $FF?

charmap "X", $22, $01
charmap "Y", $2201
assert ("X") == $2201
assert ("Y") == $2201 ; ???
assert ("YX") == $4401 ; this?
assert ("YX") == $222301 ; or this?
assert ("YX") == $22012201 ; or this?

Rangi42 avatar Apr 20 '21 19:04 Rangi42

Maybe we should just deprecate converting strings that aren't 1 character. There's no incentive to, especially not now that CHARSUB and CHARLEN have been implemented. (This would mean postponing this to 0.5.2, however.)

EDIT: Doing so would also decrease the need for 'A', too, I believe. Which would allow reserving that syntax for something else, perhaps.

ISSOtm avatar Apr 20 '21 19:04 ISSOtm

I'd be fine with that. Deprecate string-to-number conversion in one version except for single-character strings, and also add 'character' literals from #609; then in the next not-patch version, strings wouldn't be syntactically valid in numeric expressions.

The use case I've heard for multi-char strings as numbers is FourCC values, but if the individual characters can be u32s then that's not a problem: charmap "IHDR", $49484452.

Edit: seems to me like it would increase the need for character literals, since otherwise the parser has to allow strings in general and print an error if the length is greater than 1.

Rangi42 avatar Apr 20 '21 20:04 Rangi42

What I'm saying is that we wouldn't need 'char' then, which is a plus for backwards compat, and leaves single quotes available for a separate feature.

ISSOtm avatar Apr 20 '21 20:04 ISSOtm

If string to number conversion gets deprecated, either n = "string" will have to still be a valid parse, printing an error if there's no single-character charmap "string", $1234; or such assignments should use character literals instead, n = 'string'.

Rangi42 avatar Apr 20 '21 20:04 Rangi42

I was only talking about deprecating string conversions that don't yield a single conversion i.e. string conversions that aren't char conversions.

ISSOtm avatar Apr 20 '21 20:04 ISSOtm

I understand. But that would amount to a change like this:

 relocexpr	: relocexpr_no_str
 		| string {
 			uint8_t *output = malloc(strlen($1)); /* Cannot be longer than that */
 			int32_t length = charmap_Convert($1, output);
+			if (length != 1) {
+				warning(WARNING_OBSOLETE, "String is not a single character\n");
+			}
 			uint32_t r = str2int2(output, length);
 
 			free(output);
 			rpn_Number(&$$, r);
 		}
 ;

If we're going to do that, I'd rather just introduce real character constants: there's no major demand for a different meaning for single quotes, and it would let you write 'string' to clearly mean "I expect there's a charmap for "string" with one value."

Rangi42 avatar Apr 20 '21 20:04 Rangi42

There's no major demand for single-quoted character literals, either; the demand is more for a clear distinction between the "traditional" behaviour of languages and what RGBASM currently does. But if we replace the latter with the former, for the sake of consistency with this charmap change, then I don't think there's a need to use a different syntax for now. You thought about using quotes in another issue just today, which is where my suggestion to take it slowly is coming from.

ISSOtm avatar Apr 20 '21 20:04 ISSOtm

For those who need it, here are macros for rgbds v0.5.2 that implement simple wide characters:

MACRO w_init
    def W_PLANE_MAX = \1
    newcharmap w_length
    FOR i, 0, \1
        newcharmap w_plane_{d:i}
    ENDR
    setcharmap w_plane_0
ENDM

MACRO w_charmap
    IF _NARG >= 2 && _NARG < W_PLANE_MAX + 2
        setcharmap w_length
        charmap \1, _NARG - 1
        FOR i, 0, W_PLANE_MAX
            setcharmap w_plane_{d:i}
            IF i < _NARG - 1
                def j = i + 2
                charmap \1, \<{d:j}>
            ELSE
                charmap \1, 0
            ENDC
        ENDR
        setcharmap w_plane_0
    ELSE
        warn "Define w_char failed."
    ENDC
ENDM

MACRO w_text
    REPT _NARG
        setcharmap w_length
        FOR i, 1, charlen(\1) + 1
            setcharmap w_length
            def j = charsub(\1, i)
            IF j <= W_PLANE_MAX
                FOR k, 0, j
                    setcharmap w_plane_{d:k}
                    db charsub(\1, i)
                ENDR
            ELSE
                warn strcat("Get w_char failed: ", charsub(\1, i))
            ENDC
        ENDR
        shift
    ENDR
    setcharmap w_plane_0
ENDM

    w_init 5

    w_charmap "<CTRL>", $fe
    w_charmap "T", $50, $60
    w_charmap "e", $50, $61, $62
    w_charmap "s", $50, $63, $64, $65
    w_charmap "t", $50, $66, $67, $68, $69

    ; db $fe, $50, $60, $50, $61, $62, $50, $63
    ; db $64, $65, $50, $66, $67, $68, $69, $fe
    w_text "<CTRL>Tes", "t<CTRL>"

Even so, this is still not as good as native support would be, for example:

    charmap "wchar", $1345
    ld a, HIGH("wchar") ; $13
    ld h, a
    ld a, LOW("wchar"); $45
    ld l, a

    db "wchar", $50, "wchar"

SnDream avatar Jul 23 '22 06:07 SnDream

Update: I have published a wide charmap implementation that covers the basic functionality without modifying rgbds. Just use the macros it provides instead of the original expressions. Even complex projects like pokecrystal can be handled! https://github.com/SnDream/charmap_w.inc

SnDream avatar Apr 05 '24 16:04 SnDream