Extend the characters allowed for selectors
Hello! Using the latest version of SmartFormat, I ran into an issue: a template cannot be processed when a selector contains Cyrillic letters.
The code:
using System;
using System.Collections.Generic;
using SmartFormat;
namespace TestSmartFormat
{
    public class AttributeData
    {
        public string DisplayText { get; set; }
    }
    public class Document
    {
        public Dictionary<string, AttributeData> ArchiveAttributes { get; set; }
    }
    public class UnifiedDocumentData
    {
        public Document Document { get; set; }
    }
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize sample data with Cyrillic keys
            var data = new UnifiedDocumentData
            {
                Document = new Document
                {
                    ArchiveAttributes = new Dictionary<string, AttributeData>
                    {
                        { "София", new AttributeData { DisplayText = "София" } },
                        { "Пловдив", new AttributeData { DisplayText = "Пловдив" } }
                    }
                }
            };
            // Define templates
            string template1 = "{ArchiveAttributes[София].DisplayText}";
            string template2 = "{ArchiveAttributes[Пловдив].DisplayText}";
            var formatter = Smart.CreateDefaultSmartFormat();
            try
            {
                // Perform formatting
                string result1 = formatter.Format(template1, data.Document);
                string result2 = formatter.Format(template2, data.Document);
                // Display results
                Console.WriteLine($"Template 1 Result: {result1}"); // Expected: София
                Console.WriteLine($"Template 2 Result: {result2}"); // Expected: Пловдив
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Unexpected Error: {ex.Message}");
            }
        }
    }
}
The error:
Unexpected Error: The format string has 5 issues:
'0x421': Invalid character in the selector, '0x43E': Invalid character in the selector, '0x444': Invalid character in the selector, '0x438': Invalid character in the selector, '0x44F': Invalid character in the selector
In: "{ArchiveAttributes[София].DisplayText}"
At: -------------------^^^^^
However, when I change the dictionary keys and the selectors in the templates to Latin letters only, it works perfectly well:
var data = new UnifiedDocumentData
{
    Document = new Document
    {
        ArchiveAttributes = new Dictionary<string, AttributeData>
        {
            { "Sofia", new AttributeData { DisplayText = "София" } },
            { "Plovdiv", new AttributeData { DisplayText = "Пловдив" } }
        }
    }
};
// Define templates
string template1 = "{ArchiveAttributes[Sofia].DisplayText}";
string template2 = "{ArchiveAttributes[Plovdiv].DisplayText}";
I am unsure if I am missing something, or if it is genuinely a bug.
Thanks in advance for your help!
The default characters that are allowed for selectors are "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-". This is by design. You may, however, add more characters if needed:
// Create a list of the characters in the Cyrillic block (U+0400–U+04FF)
var cyrillicChars = new List<char>();
for (var c = '\u0400'; c <= '\u04FF'; c++)
{
    cyrillicChars.Add(c);
}
// Add the Cyrillic Supplement block (U+0500–U+052F)
for (var c = '\u0500'; c <= '\u052F'; c++)
{
    cyrillicChars.Add(c);
}
var settings = new SmartSettings();
settings.Parser.AddCustomSelectorChars(cyrillicChars);
var formatter = Smart.CreateDefaultSmartFormat(settings);
With this extension, your sample will work.
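The two loops above cover adjacent Unicode blocks, so they can also be written as one contiguous range. A minimal sketch, assuming a hypothetical helper named `CharRange` (it is not part of SmartFormat; only `AddCustomSelectorChars` comes from the snippet above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SelectorChars
{
    // Hypothetical helper (not a SmartFormat API): all UTF-16 code units
    // from first to last, inclusive.
    public static List<char> CharRange(char first, char last)
    {
        if (last < first)
            throw new ArgumentException("last must not precede first", nameof(last));
        return Enumerable.Range(first, last - first + 1)
                         .Select(i => (char) i)
                         .ToList();
    }

    // Cyrillic (U+0400–U+04FF) plus Cyrillic Supplement (U+0500–U+052F),
    // the same characters the two loops above collect.
    public static List<char> Cyrillic() => CharRange('\u0400', '\u052F');
}
```

The result could then be passed on the same way as before: `settings.Parser.AddCustomSelectorChars(SelectorChars.Cyrillic());`.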
It might be an enhancement to drop the list of valid characters and only exclude specific reserved characters instead.
@karljj1 What do you think about extending the valid characters for selectors in general?
That's a good idea, especially when dealing with localization. I'd expect any character to work for a selector except those that have other functions.
@axunonb That works perfectly well, it solved the issue. Thank you!
@karljj1, @axunonb, how about allowing string literals (e.g. {ArchiveAttributes["Sofia"]} to access the dictionary item with key "Sofia")? This way one can access dictionary items with different key types without extending the valid characters.
@vdachev-david Using a numeric index is already implemented, but you mean a string index here, as in {ArchiveAttributes["Sofia"]}?
I'd expect any character to work for a selector except those that have other functions.
Characters serving special purposes that would be excluded under this definition:
/// <summary>
/// The character literal escape character for <see cref="PlaceholderBeginChar"/> and <see cref="PlaceholderEndChar"/>,
/// but also others like for \t (TAB), \n (NEW LINE), \\ (BACKSLASH) and others defined in <see cref="EscapedLiteral"/>.
/// </summary>
internal char CharLiteralEscapeChar { get; set; } = '\\';
/// <summary>
/// The character which separates the formatter name (if any exists) from other parts of the placeholder.
/// E.g.: {Variable:FormatterName:argument} or {Variable:FormatterName}
/// </summary>
internal char FormatterNameSeparator { get; } = ':';
/// <summary>
/// The standard operator characters.
/// Contiguous operator characters are parsed as one operator (e.g. '?.').
/// </summary>
internal List<char> OperatorChars() => new()
    { SelectorOperator, NullableOperator, AlignmentOperator, ListIndexBeginChar, ListIndexEndChar };
/// <summary>
/// The character which separates the selector for alignment. <c>E.g.: Smart.Format("Name: {name,10}")</c>
/// </summary>
internal char AlignmentOperator { get; } = ',';
/// <summary>
/// The character which separates two or more selectors <c>E.g.: "First.Second.Third"</c>
/// </summary>
internal char SelectorOperator { get; } = '.';
/// <summary>
/// The character which flags the selector as <see langword="nullable"/>.
/// The character after <see cref="NullableOperator"/> must be the <see cref="SelectorOperator"/>.
/// <c>E.g.: "First?.Second"</c>
/// </summary>
internal char NullableOperator { get; } = '?';
/// <summary>
/// Gets the character indicating the start of a <see cref="Placeholder"/>.
/// </summary>
internal char PlaceholderBeginChar { get; } = '{';
/// <summary>
/// Gets the character indicating the end of a <see cref="Placeholder"/>.
/// </summary>
internal char PlaceholderEndChar { get; } = '}';
/// <summary>
/// Gets the character indicating the begin of formatter options.
/// </summary>
internal char FormatterOptionsBeginChar { get; } = '(';
/// <summary>
/// Gets the character indicating the end of formatter options.
/// </summary>
internal char FormatterOptionsEndChar { get; } = ')';
/// <summary>
/// Gets the character indicating the begin of a list index, like in "{Numbers[0]}"
/// </summary>
internal char ListIndexBeginChar { get; } = '[';
/// <summary>
/// Gets the character indicating the end of a list index, like in "{Numbers[0]}"
/// </summary>
internal char ListIndexEndChar { get; } = ']';
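To compare the two approaches, here is a minimal, self-contained sketch of an allow-list check next to a deny-list check for a single selector segment. The class and method names are hypothetical and this is not how the SmartFormat parser is implemented; the character sets are copied from the defaults quoted in this thread.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SelectorCheck
{
    // Characters with a special meaning inside a placeholder, taken from the
    // defaults quoted above: escape char, formatter name separator, operators,
    // placeholder braces, formatter option parentheses, list index brackets.
    private static readonly HashSet<char> Reserved = new HashSet<char>
    {
        '\\', ':', ',', '.', '?', '{', '}', '(', ')', '[', ']'
    };

    // The default allow-list quoted earlier in this thread.
    private const string DefaultAllowed =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-";

    // Allow-list approach: lists the code units (as hex) that are not allowed.
    public static List<string> RejectedByAllowList(string selector) =>
        selector.Where(c => !DefaultAllowed.Contains(c))
                .Select(c => $"0x{(int) c:X}")
                .ToList();

    // Deny-list approach: a segment passes unless it contains a reserved character.
    public static bool PassesDenyList(string selector) =>
        !selector.Any(c => Reserved.Contains(c));
}
```

With this sketch, `SelectorCheck.RejectedByAllowList("София")` yields 0x421, 0x43E, 0x444, 0x438 and 0x44F, the same code points reported in the error message at the top of this issue, while `SelectorCheck.PassesDenyList("София")` returns true.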
Should "any character" include the following characters?
1. Control Characters (C0 and C1 Controls)
- These are non-printable characters that can cause issues in parsing, rendering, or data processing.
- \x00–\x1F (ASCII 0–31) and \x7F (DEL).
- C1 controls (\x80–\x9F), though they are less common.
2. Unicode Surrogate Pairs (Invalid UTF-16)
- \uD800–\uDFFF are reserved for surrogate pairs in UTF-16 and should not appear alone.
3. Non-Characters (Unicode Spec)
- Unicode reserves certain codepoints as "non-characters," which should never appear in text.
- Exclude:
  - \uFFFE, \uFFFF (BOM-related)
  - Last two codepoints of every Unicode plane (e.g., U+10FFFE, U+10FFFF).
4. Bidirectional Control Characters
- These can cause text rendering issues (e.g., RLO, LRO).
- Exclude: \u202A–\u202E, \u2066–\u2069.
5. Deprecated/Obsolete Characters
- Discouraged in modern usage (e.g., \u206A–\u206F).
6. Whitespace
- `\t`, `\n`, `\r`, `\v`, `\f`, ` ` (space).
7. Unassigned/Private-Use Characters
- \uE000–\uF8FF (Private Use Area)
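The categories in the list above can mostly be tested with built-in .NET `char` methods. The following is a rough sketch for experimenting with the question, not a proposal for the parser; the class name, method name and labels are made up for illustration. It works on single `char` values (UTF-16 code units), so non-characters outside the BMP such as U+10FFFE only ever show up as surrogates here.

```csharp
#nullable enable
using System.Globalization;

public static class QuestionableChars
{
    // Returns a label for the first category from the list above that
    // c falls into, or null if it falls into none of them.
    public static string? Classify(char c)
    {
        // Covers C0 (U+0000–U+001F), DEL (U+007F) and C1 (U+0080–U+009F).
        // Note: \t, \n, \v, \f and \r are control characters too, so they
        // are reported here and not as whitespace.
        if (char.IsControl(c)) return "control";
        if (char.IsSurrogate(c)) return "surrogate";          // U+D800–U+DFFF
        if (c == '\uFFFE' || c == '\uFFFF') return "non-character";
        if ((c >= '\u202A' && c <= '\u202E') ||
            (c >= '\u2066' && c <= '\u2069')) return "bidi control";
        if (c >= '\u206A' && c <= '\u206F') return "deprecated";
        if (char.IsWhiteSpace(c)) return "whitespace";
        if (char.GetUnicodeCategory(c) == UnicodeCategory.PrivateUse)
            return "private use";                             // U+E000–U+F8FF
        return null;
    }
}
```

For example, `QuestionableChars.Classify(' ')` returns "whitespace", `QuestionableChars.Classify('\t')` returns "control", and an ordinary letter such as 'я' returns null.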
@karljj1 DisallowedSelectorChars() defines the characters that cannot be used in Selectors. Spaces and other Unicode characters would be allowed. Does this make sense to you?
/// <summary>
/// The list of characters which are delimiting a selector.
/// </summary>
internal HashSet<char> SelectorDelimitingChars() =>
[
    FormatterNameSeparator,
    PlaceholderBeginChar, PlaceholderEndChar,
    FormatterOptionsBeginChar, FormatterOptionsEndChar
];
/// <summary>
/// Gets the set of control characters (ASCII 0-31 and 127).
/// </summary>
internal IEnumerable<char> ControlChars()
{
    for (var i = 0; i <= 31; i++) yield return (char) i;
    yield return (char) 127; // delete character
}
/// <summary>
/// The list of characters which are disallowed in a selector.
/// </summary>
internal HashSet<char> DisallowedSelectorChars()
{
    var chars = SelectorDelimitingChars();
    chars.Add(CharLiteralEscapeChar); // avoid confusion with escape sequences
    foreach (var c in OperatorChars()) chars.Add(c); // no overlaps
    foreach (var c in CustomOperatorChars()) chars.Add(c); // no overlaps
    // Hard to visualize and debug, disallow by default - can be added back as custom selector chars
    foreach (var c in ControlChars()) chars.Add(c);
    // Remove characters used as custom selector chars
    foreach (var c in _customSelectorChars) chars.Remove(c); // control characters, if needed
    return chars;
}
Example:
[TestCase("German |öäüßÖÄÜ!")]
[TestCase("Russian абвгдеёжзийклмн")]
[TestCase("French >éèêëçàùâîô")]
[TestCase("Spanish <áéíóúñü¡¿")]
[TestCase("Portuguese !ãõáâêéíóúç")]
[TestCase("Chinese 汉字测试")]
[TestCase("Arabic مرحبا بالعالم")]
[TestCase("Turkish çğöşüİı")]
[TestCase("Hindi नमस्ते दुनिया")]
public void Selector_WorksWithAllUnicodeChars(string selector)
{
    // See https://github.com/axuno/SmartFormat/issues/454
    const string expected = "The Value";
    // The default formatter with default settings should be able to handle any
    // Unicode characters in selectors except the "magic" disallowed ones
    var formatter = Smart.CreateDefaultSmartFormat();
    // Use the Unicode string as a selector of the placeholder
    var template = $"{{{selector}}}";
    var result = formatter.Format(template, new Dictionary<string, string> { { selector, expected } });
    Assert.That(result, Is.EqualTo(expected));
}
Draft: https://github.com/axuno/SmartFormat/commit/07aeaec93ecfefe10a80039241bc74fb239d996c