Widechar charmap support
Currently, RGBDS only supports
charmap "C", $01
Could RGBDS support wide-character charmaps like
charmap "C", $01F0
or
charmap "C", $01F0, 2
?
With ASCII, db "ABC" acts like db $41, $42, $43. Current master adds dw and dl support, so dw "ABC" acts like dw $41, $42, $43 (i.e. db $41, $00, $42, $00, $43, $00 since it's little-endian).
Presumably this would allow widths 1, 2, 3, or 4, since that's as many bytes as can be stored in one unsigned value.
- How would endianness work? If you have charmap "C", $01F0, 2 and do db "C", would that act like db $01, $F0 or db $F0, $01?
- Furthermore, how would it interact with db, dw, and dl? Should widths be handled per-character, so dw "AC" would act like db $41, $00, $F0, $01? Or should charmapping convert to a sequence of bytes first, so dw "AC" would act like dw $41, $F0, $01 (i.e. db $41, $00, $F0, $00, $01, $00)? (Or one of those options but with big-endian order for the bytes of "C"?)
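To make the options in these two questions concrete, here is a minimal Python sketch that models them. It is not RGBDS code: the CHARMAP table and the function names (db, dw_per_character, dw_bytes_first) are hypothetical and exist only to show which bytes each interpretation would produce.

```python
# Hypothetical charmap for illustration: character -> (value, width in bytes).
CHARMAP = {"A": (0x41, 1), "C": (0x01F0, 2)}


def db(text, order="big"):
    """Emit each character as `width` bytes in the requested byte order."""
    out = []
    for ch in text:
        value, width = CHARMAP[ch]
        out += value.to_bytes(width, order)
    return out


def dw_per_character(text):
    """Option 1: each character becomes one little-endian 16-bit word."""
    out = []
    for ch in text:
        value, _width = CHARMAP[ch]
        out += value.to_bytes(2, "little")
    return out


def dw_bytes_first(text, order="big"):
    """Option 2: convert to bytes first, then widen each byte to a word."""
    out = []
    for byte in db(text, order):
        out += byte.to_bytes(2, "little")
    return out


def show(data):
    """Format a list of byte values as space-separated hex."""
    return " ".join(f"{b:02x}" for b in data)


print(show(db("C", "big")))                   # 01 f0
print(show(db("C", "little")))                # f0 01
print(show(dw_per_character("AC")))           # 41 00 f0 01
print(show(dw_bytes_first("AC", "little")))   # 41 00 f0 00 01 00
```

The last two lines correspond to the two dw "AC" readings described above; passing "big" to dw_bytes_first gives the big-endian variant mentioned in parentheses.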
I haven't thought much about how dw, dl, etc. should behave. For my purposes, I only need db to support wide characters; the requirement comes from CJK text support.
For 1, it should be like UTF-8 (most significant byte first). Although the CPU is little-endian, when wide and single-byte characters are mixed in a string, little-endian order makes it hard to determine the width of the next character. Of course, if little-endian order were enforced, I could just reverse the bytes when defining the charmap.
For 2, I don't have an opinion.
Reference:
My own rgbds 0.4.1 modification, used to translate pokecrystal:
My modification does not allow a definition like charmap "C", $00FF (it is parsed as charmap "C", $FF), but I don't need that in practice.
My translation of pokecrystal, and the charmap it uses: single-byte characters are used for the original characters, and wide characters are used for the additional Chinese characters.
Thank you, that makes sense.
If db outputs the bytes in big-endian order, it would probably be convenient for dw/dl to do the opposite, in case someone explicitly wants the other behavior. For example:
charmap "A", $42
charmap "W", $1122, 2
TX_LF EQU -2
db "AWAW", TX_LF ; 42 11 22 42 11 22 fe
dw "AWAW", TX_LF ; 42 00 22 11 42 00 22 11 fe ff
dl "AW", TX_LF ; 42 00 00 00 22 11 00 00 fe ff ff ff
(This behavior of dw would be convenient for pokepicross.)
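As a sanity check on the example above, this Python sketch models the proposed rule: db writes a wide character's bytes in big-endian order, while dw and dl write each character as one little-endian value of the directive's width. The emit function and the CHARMAP table are hypothetical helpers for illustration, not part of RGBDS.

```python
# Hypothetical model of the proposal: character -> (value, width in bytes).
CHARMAP = {"A": (0x42, 1), "W": (0x1122, 2)}
TX_LF = -2


def emit(directive_width, *args):
    """Model db (width 1), dw (width 2), or dl (width 4) and return hex text."""
    out = []
    for arg in args:
        if isinstance(arg, str):
            for ch in arg:
                value, char_width = CHARMAP[ch]
                if directive_width == 1:
                    # db: the character's own bytes, big-endian
                    out += value.to_bytes(char_width, "big")
                else:
                    # dw/dl: one little-endian value per character
                    out += value.to_bytes(directive_width, "little")
        else:
            # Plain numbers wrap to the directive's width (two's complement)
            mask = (1 << (8 * directive_width)) - 1
            out += (arg & mask).to_bytes(directive_width, "little")
    return " ".join(f"{b:02x}" for b in out)


print(emit(1, "AWAW", TX_LF))  # 42 11 22 42 11 22 fe
print(emit(2, "AWAW", TX_LF))  # 42 00 22 11 42 00 22 11 fe ff
print(emit(4, "AW", TX_LF))    # 42 00 00 00 22 11 00 00 fe ff ff ff
```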
One other feature that could go with this: a charwidth or setcharwidth command to change the current default character width. This would make it easier to define many multi-byte characters.
newcharmap "CJK"
setcharwidth 2
charmap "啊", $04c3
charmap "阿", $04c4
charmap "埃", $04c5
...
So basically you can think of a wide char in two ways: a single multi-byte value, or short for a sequence of bytes. (Like the difference between "Σ" being Unicode code point U+03A3, or being UTF-8 encoded CE A3.)
If characters are to consistently be seen as byte sequences—which seems like the convenient behavior for db, since db "Σ" should perhaps act like db $03, $A3 or db $A3, $03 but not like the overflowing one-byte db $03A3—then they could be specified as such.
So instead of charmap "Σ", $03a3, 2, you could do charmap "Σ", $03, $a3; or charmap "Σ", $a3, $03 if that's the order you want the bytes in.
This would be limited to taking 1 to 4 bytes. It would also make interaction with dw and dl potentially confusing, so maybe #623 could be reverted. (No need for dw "AZ" to act like db $41, $00, $5a, $00 when you could have done charmap "A", $41, $00, charmap "Z", $5a, $00, and db "AZ" instead.)
I'm starting to become concerned by the number of different data types that are overloaded into quoted strings. What would ld hl, "xy" do? What if y is a wide character?
In 0.4.2, there are two contexts where charmap_Convert gets called: constlist_8bit_entry (making a db entry for a string), and relocexpr (any context where a string gets treated as a number, which includes ld hl, "string"). Current master adds constlist_16bit_entry and constlist_32bit_entry conversion for dw and dl support, but as I mentioned above, that might be redundant with wide chars and could be removed before a release.
When a string is treated as a number, I'd expect it to work just the same as with a db: imagine it output byte by byte, and take the last 4 bytes as a number.
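A minimal sketch of that rule in Python, assuming the string has already been converted to a list of byte values. The function name string_to_number is hypothetical, and this is a model of the described behavior rather than the RGBDS source.

```python
def string_to_number(byte_values):
    """Pack the last (up to) four bytes into one big-endian number."""
    result = 0
    for b in byte_values[-4:]:
        result = (result << 8) | (b & 0xFF)
    return result


# "AB" as ASCII bytes packs to $4142
print(hex(string_to_number([0x41, 0x42])))                    # 0x4142
# With five bytes, the first one is dropped
print(hex(string_to_number([0x01, 0x02, 0x03, 0x04, 0x05])))  # 0x2030405
```

Under this rule a two-byte wide character behaves the same as two single-byte characters, which is why ld hl, "xy" stays well-defined as long as everything reduces to bytes first.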
It sounds reasonable to implement this, but #568 actually causes issues. The following need to be considered:
- Behavior of db "abc啊阿埃"
- Behavior of dw "abc啊阿埃" (and dl by extension)
- Behavior of converting "埃" to a numeric value, and uses of that value
Imo, we need charmaps to map not to a byte, not to a u32, not to a byte array, but to a u32 array. Here's a demonstration:
charmap "啊", $04, $c3
charmap "de", $6564
charmap "fghi", $66, $67, $6968
db "abc啊" ; db $61, $62, $63, $04, $C3
db "abcde" ; db $61, $62, $63, $64 ; Truncation warning
dw "abc啊" ; dw $61, $62, $63, $04, $C3
dw "abcde" ; dw $61, $62, $63, $6564
dw "abcdefghi" ; dw $61, $62, $63, $6564, $66, $67, $6968
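The demonstration above can be modeled with a small Python sketch in which each charmap entry maps to a list of u32 values, and the directive width decides how each value is truncated. The names convert and emit are hypothetical. Two simplifying assumptions are made here: matching is greedy longest-first, and unmapped characters fall back to their code point, which is only meaningful for the ASCII letters in this example.

```python
# Each key maps to a list of 32-bit values (the "u32 array" idea).
CHARMAP = {
    "啊": [0x04, 0xC3],
    "de": [0x6564],
    "fghi": [0x66, 0x67, 0x6968],
}


def convert(text):
    """Greedy longest-match conversion of text into a list of values."""
    keys = sorted(CHARMAP, key=len, reverse=True)
    values = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                values += CHARMAP[key]
                i += len(key)
                break
        else:
            # Assumption: unmapped characters use their code point (ASCII here)
            values.append(ord(text[i]))
            i += 1
    return values


def emit(text, width):
    """Return (values truncated to `width` bytes, whether any were truncated)."""
    mask = (1 << (8 * width)) - 1
    out = []
    truncated = False
    for value in convert(text):
        if value > mask:
            truncated = True
        out.append(value & mask)
    return out, truncated


print(emit("abcde", 1))  # ([97, 98, 99, 100], True): $6564 truncated to $64
print(emit("abcde", 2))  # ([97, 98, 99, 25956], False): $6564 fits in a word
```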
I was looking for a way to eliminate one of those dimensions and allow just a sequence of N bytes (charmap "de", $65, $64) or just an N-byte value (charmap "de", $6564, 2) (limiting N to 4), but on reflection you're right, there are use cases for both.
charmap ".", $2e
charmap "…", ".", ".", "."
newcharmap utf16
; remember to just use 'dw' with these
charmap "A", $0041
charmap "Z", $005A
charmap "é", $00E9
charmap "啊", $554A
charmap "😀", $D83D, $DE00 ; surrogate pair
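For reference, the 16-bit values for a utf16 charmap like this can be derived mechanically from the code point. This Python sketch follows the standard UTF-16 surrogate rule; the helper name utf16_units is just for illustration.

```python
def utf16_units(ch):
    """Return the UTF-16 code units for a single character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


for ch in "AZé啊😀":
    print(ch, " ".join(f"${u:04X}" for u in utf16_units(ch)))
```

For "😀" (U+1F600) this yields $D83D, $DE00, and each value would then be emitted with dw.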
What would the value be of strings coerced to integers? Presumably multi-byte characters would act like those bytes in sequence, e.g. ("…") == ("..."), but what if single values are over $FF?
charmap "X", $22, $01
charmap "Y", $2201
assert ("X") == $2201
assert ("Y") == $2201 ; ???
assert ("YX") == $4401 ; this?
assert ("YX") == $222301 ; or this?
assert ("YX") == $22012201 ; or this?
Maybe we should just deprecate converting strings that aren't 1 character. There's no incentive to, especially not now that CHARSUB and CHARLEN have been implemented. (This would mean postponing this to 0.5.2, however.)
EDIT: Doing so would also decrease the need for 'A', too, I believe. Which would allow reserving that syntax for something else, perhaps.
I'd be fine with that. Deprecate string-to-number conversion in one version except for single-character strings, and also add 'character' literals from #609; then in the next not-patch version, strings wouldn't be syntactically valid in numeric expressions.
The use case I've heard for multi-char strings as numbers is FourCC values, but if the individual characters can be u32s then that's not a problem: charmap "IHDR", $49484452.
Edit: seems to me like it would increase the need for character literals, since otherwise the parser has to allow strings in general and print an error if the length is greater than 1.
What I'm saying is that we wouldn't need 'char' then, which is a plus for backwards compat, and leaves single quotes available for a separate feature.
If string to number conversion gets deprecated, either n = "string" will have to still be a valid parse, printing an error if there's no single-character charmap "string", $1234; or such assignments should use character literals instead, n = 'string'.
I was only talking about deprecating string conversions that don't yield a single conversion i.e. string conversions that aren't char conversions.
I understand. But that would amount to a change like this:
relocexpr : relocexpr_no_str
| string {
uint8_t *output = malloc(strlen($1)); /* Cannot be longer than that */
int32_t length = charmap_Convert($1, output);
+ if (length != 1) {
+ warning(WARNING_OBSOLETE, "String is not a single character\n");
+ }
uint32_t r = str2int2(output, length);
free(output);
rpn_Number(&$$, r);
}
;
If we're going to do that, I'd rather just introduce real character constants: there's no major demand for a different meaning for single quotes, and it would let you write 'string' to clearly mean "I expect there's a charmap for "string" with one value."
There's no major demand for single character quotes, either, moreso for a clear distinction between the "traditional" behaviour of languages, versus what RGBASM currently does. But if we replace the latter with the former, for the sake of consistency with this charmap change, then I don't think there's a need to use a different syntax for now. You thought about using quotes in another issue just today, which is where my suggestion to take it slowly is coming from.
For those who need it, here are macros that implement simple wide characters in rgbds v0.5.2:
MACRO w_init
    def W_PLANE_MAX = \1
    newcharmap w_length
    FOR i, 0, \1
        newcharmap w_plane_{d:i}
    ENDR
    setcharmap w_plane_0
ENDM

MACRO w_charmap
    IF _NARG >= 2 && _NARG < W_PLANE_MAX + 2
        setcharmap w_length
        charmap \1, _NARG - 1
        FOR i, 0, W_PLANE_MAX
            setcharmap w_plane_{d:i}
            IF i < _NARG - 1
                def j = i + 2
                charmap \1, \<{d:j}>
            ELSE
                charmap \1, 0
            ENDC
        ENDR
        setcharmap w_plane_0
    ELSE
        warn "Define w_char failed."
    ENDC
ENDM

MACRO w_text
    REPT _NARG
        setcharmap w_length
        FOR i, 1, charlen(\1) + 1
            setcharmap w_length
            def j = charsub(\1, i)
            if j <= W_PLANE_MAX
                FOR k, 0, j
                    setcharmap w_plane_{d:k}
                    db charsub(\1, i)
                ENDR
            ELSE
                warn strcat("Get w_char failed: ", charsub(\1, i))
            ENDC
        ENDR
        shift
    ENDR
    setcharmap w_plane_0
ENDM

w_init 5
w_charmap "<CTRL>", $fe
w_charmap "T", $50, $60
w_charmap "e", $50, $61, $62
w_charmap "s", $50, $63, $64, $65
w_charmap "t", $50, $66, $67, $68, $69
; db $fe, $50, $60, $50, $61, $62, $50, $63
; db $64, $65, $50, $66, $67, $68, $69, $fe
w_text "<CTRL>Tes", "t<CTRL>"
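The idea behind these macros is one charmap that stores each character's length (w_length) plus one charmap per byte position (w_plane_0, w_plane_1, ...). The Python sketch below models that idea so the expected output can be checked. The WideCharmap class and its method names are hypothetical; this illustrates the technique and is not a translation of every macro detail.

```python
class WideCharmap:
    """Model of the length-map plus per-byte-plane trick."""

    def __init__(self, plane_max):
        self.plane_max = plane_max
        self.length = {}                                  # like w_length
        self.planes = [{} for _ in range(plane_max)]      # like w_plane_N

    def charmap(self, name, *values):
        """Register a character of 1..plane_max bytes (like w_charmap)."""
        if not 1 <= len(values) <= self.plane_max:
            raise ValueError("Define w_char failed.")
        self.length[name] = len(values)
        for i, plane in enumerate(self.planes):
            plane[name] = values[i] if i < len(values) else 0

    def _split(self, text):
        """Split text into mapped names using greedy longest match."""
        keys = sorted(self.length, key=len, reverse=True)
        i = 0
        while i < len(text):
            for key in keys:
                if text.startswith(key, i):
                    yield key
                    i += len(key)
                    break
            else:
                raise KeyError(f"Get w_char failed: {text[i]}")

    def text(self, *strings):
        """Emit the bytes of each character plane by plane (like w_text)."""
        out = []
        for s in strings:
            for name in self._split(s):
                for k in range(self.length[name]):
                    out.append(self.planes[k][name])
        return out


w = WideCharmap(5)
w.charmap("<CTRL>", 0xFE)
w.charmap("T", 0x50, 0x60)
w.charmap("e", 0x50, 0x61, 0x62)
w.charmap("s", 0x50, 0x63, 0x64, 0x65)
w.charmap("t", 0x50, 0x66, 0x67, 0x68, 0x69)
print(" ".join(f"${b:02x}" for b in w.text("<CTRL>Tes", "t<CTRL>")))
```

The printed bytes match the two db comment lines in the usage example above.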
Even so, it is still not as good as native support, such as:
charmap "wchar", $1345
ld a, HIGH("wchar") ; $13
ld h, a
ld a, LOW("wchar"); $45
ld l, a
db "wchar", $50, "wchar"
Update: a wide charmap implementation that covers the basic functionality without modifying rgbds. Just use the macros it provides instead of the original expressions. Even complex projects like pokecrystal can be handled! https://github.com/SnDream/charmap_w.inc