Extend the characters allowed for selectors

Open Mirchev98 opened this issue 1 year ago • 6 comments

Hello! Using the latest version of SmartFormat, I ran into an issue: my template cannot be processed when it contains Cyrillic letters.

The code:

using System;
using System.Collections.Generic;
using SmartFormat;
namespace TestSmartFormat
{
	public class AttributeData
	{
		public string DisplayText { get; set; }
	}

	public class Document
	{
		public Dictionary<string, AttributeData> ArchiveAttributes { get; set; }
	}

	public class UnifiedDocumentData
	{
		public Document Document { get; set; }
	}

	class Program
	{
		static void Main(string[] args)
		{
			// Initialize sample data with Cyrillic keys
			var data = new UnifiedDocumentData
			{
				Document = new Document
				{
					ArchiveAttributes = new Dictionary<string, AttributeData>
					{
						{ "София", new AttributeData { DisplayText = "София" } },
						{ "Пловдив", new AttributeData { DisplayText = "Пловдив" } }
					}
				}
			};

			// Define templates
			string template1 = "{ArchiveAttributes[София].DisplayText}";
			string template2 = "{ArchiveAttributes[Пловдив].DisplayText}";

			var formatter = Smart.CreateDefaultSmartFormat();

			try
			{
				// Perform formatting
				string result1 = formatter.Format(template1, data.Document);
				string result2 = formatter.Format(template2, data.Document);

				// Display results
				Console.WriteLine($"Template 1 Result: {result1}"); // Expected: София
				Console.WriteLine($"Template 2 Result: {result2}"); // Expected: Пловдив
			}
			catch (Exception ex)
			{
				Console.WriteLine($"Unexpected Error: {ex.Message}");
			}
		}
	}
}

The error:

Unexpected Error: The format string has 5 issues:
'0x421': Invalid character in the selector, '0x43E': Invalid character in the selector, '0x444': Invalid character in the selector, '0x438': Invalid character in the selector, '0x44F': Invalid character in the selector
In: "{ArchiveAttributes[София].DisplayText}"
At: -------------------^^^^^

However, when I change the following part so that the dictionary keys and selectors contain only English letters, it works perfectly well:

var data = new UnifiedDocumentData
{
	Document = new Document
	{
		ArchiveAttributes = new Dictionary<string, AttributeData>
		{
			{ "Sofia", new AttributeData { DisplayText = "София" } },
			{ "Plovdiv", new AttributeData { DisplayText = "Пловдив" } }
		}
	}
};

// Define templates
string template1 = "{ArchiveAttributes[Sofia].DisplayText}";
string template2 = "{ArchiveAttributes[Plovdiv].DisplayText}";

I am unsure if I am missing something, or if it is genuinely a bug.

Thanks in advance for your help!

Mirchev98 avatar Dec 06 '24 13:12 Mirchev98

The default characters that are allowed for selectors are "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-". This is by design. You may, however, add more characters if needed:

// Create a list of all Cyrillic characters
var cyrillicChars = new List<char>();
for (var c = '\u0400'; c <= '\u04FF'; c++)
{
    cyrillicChars.Add(c);
}
for (var c = '\u0500'; c <= '\u052F'; c++)
{
    cyrillicChars.Add(c);
}

var settings = new SmartSettings();
settings.Parser.AddCustomSelectorChars(cyrillicChars);
var formatter = Smart.CreateDefaultSmartFormat(settings);

With this extension, your sample will work.

It might be an enhancement to drop the fixed set of valid characters and only exclude specific reserved characters.
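To see why the original template fails, here is a minimal standalone sketch (it does not call SmartFormat) that checks a selector against the default character set quoted above. The class `SelectorCharCheck` and its methods are hypothetical names made up for this illustration; the real parser's logic may differ in detail.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SelectorCharCheck
{
    // Default selector characters as quoted in this thread.
    public const string DefaultSelectorChars =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-";

    // Hypothetical helper: returns the characters of a selector that are
    // neither in the default set nor in the extra set supplied by the caller.
    public static List<char> FindUnsupported(string selector, IEnumerable<char> extraChars)
    {
        var allowed = new HashSet<char>(DefaultSelectorChars);
        allowed.UnionWith(extraChars);
        return selector.Where(c => !allowed.Contains(c)).ToList();
    }

    // Cyrillic (U+0400–U+04FF) and Cyrillic Supplement (U+0500–U+052F)
    // are adjacent blocks, so one range covers both loops shown above.
    public static IEnumerable<char> CyrillicChars() =>
        Enumerable.Range(0x0400, 0x052F - 0x0400 + 1).Select(i => (char) i);

    public static void Main()
    {
        // Prints 0x421, 0x43E, 0x444, 0x438, 0x44F: the code points from the error message.
        foreach (var c in FindUnsupported("София", Enumerable.Empty<char>()))
            Console.WriteLine($"0x{(int) c:X}");

        // Prints 0: with the Cyrillic range added, nothing is rejected.
        Console.WriteLine(FindUnsupported("София", CyrillicChars()).Count);
    }
}
```

The list returned by `CyrillicChars().ToList()` could be passed to `settings.Parser.AddCustomSelectorChars(...)` in place of the two loops.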

axunonb avatar Dec 06 '24 15:12 axunonb

@karljj1 What do you think about extending the valid characters for selectors in general?

axunonb avatar Dec 06 '24 15:12 axunonb

@karljj1 What do you think about extending the valid characters for selectors in general?

That's a good idea, especially when dealing with localization. I'd expect any character to work for a selector except those that have other functions.

karljj1 avatar Dec 06 '24 15:12 karljj1

@axunonb That works perfectly well, it solved the issue. Thank you!

Mirchev98 avatar Dec 09 '24 06:12 Mirchev98

@karljj1, @axunonb, how about allowing literals (e.g. {ArchiveAttributes["Sofia"]} to access the dictionary item with key "Sofia")? This way one could access dictionary items with different key types without extending the valid characters.

vdachev-david avatar Dec 09 '24 09:12 vdachev-david

@vdachev-david Using a numeric index is already implemented, but you mean a string index here, as in {ArchiveAttributes["Sofia"]}?

axunonb avatar Jan 06 '25 09:01 axunonb

I'd expect any character to work for a selector except those that have other functions.

Characters that serve special purposes and would be excluded under this definition:

/// <summary>
/// The character literal escape character for <see cref="PlaceholderBeginChar"/> and <see cref="PlaceholderEndChar"/>,
/// but also others like for \t (TAB), \n (NEW LINE), \\ (BACKSLASH) and others defined in <see cref="EscapedLiteral"/>.
/// </summary>
internal char CharLiteralEscapeChar { get; set; } = '\\';

/// <summary>
/// The character which separates the formatter name (if any exists) from other parts of the placeholder.
/// E.g.: {Variable:FormatterName:argument} or {Variable:FormatterName}
/// </summary>
internal char FormatterNameSeparator { get; } = ':';

/// <summary>
/// The standard operator characters.
/// Contiguous operator characters are parsed as one operator (e.g. '?.').
/// </summary>
internal List<char> OperatorChars() => new()
    {SelectorOperator, NullableOperator, AlignmentOperator, ListIndexBeginChar, ListIndexEndChar};

/// <summary>
/// The character which separates the selector for alignment. <c>E.g.: Smart.Format("Name: {name,10}")</c>
/// </summary>
internal char AlignmentOperator { get; } = ',';

/// <summary>
/// The character which separates two or more selectors <c>E.g.: "First.Second.Third"</c>
/// </summary>
internal char SelectorOperator { get; } = '.';

/// <summary>
/// The character which flags the selector as <see langword="nullable"/>.
/// The character after <see cref="NullableOperator"/> must be the <see cref="SelectorOperator"/>.
/// <c>E.g.: "First?.Second"</c>
/// </summary>
internal char NullableOperator { get; } = '?';

/// <summary>
/// Gets the character indicating the start of a <see cref="Placeholder"/>.
/// </summary>
internal char PlaceholderBeginChar { get; } = '{';

/// <summary>
/// Gets the character indicating the end of a <see cref="Placeholder"/>.
/// </summary>
internal char PlaceholderEndChar { get; } = '}';

/// <summary>
/// Gets the character indicating the begin of formatter options.
/// </summary>
internal char FormatterOptionsBeginChar { get; } = '(';

/// <summary>
/// Gets the character indicating the end of formatter options.
/// </summary>
internal char FormatterOptionsEndChar { get; } = ')';

/// <summary>
/// Gets the character indicating the begin of a list index, like in "{Numbers[0]}"
/// </summary>
internal char ListIndexBeginChar { get; } = '[';

/// <summary>
/// Gets the character indicating the end of a list index, like in "{Numbers[0]}"
/// </summary>
internal char ListIndexEndChar { get; } = ']';

Should "any character" include the following characters?

1. Control Characters (C0 and C1 Controls)

  • These are non-printable characters that can cause issues in parsing, rendering, or data processing.
  • \x00–\x1F (ASCII 0–31) and \x7F (DEL).
  • C1 controls (\x80–\x9F), though they are less common.

2. Unicode Surrogate Pairs (Invalid UTF-16)

  • \xD800–\xDFFF are reserved for surrogate pairs in UTF-16 and should not appear alone.

3. Non-Characters (Unicode Spec)

  • Unicode reserves certain codepoints as "non-characters," which should never appear in text.
  • Exclude:
    • \uFFFE, \uFFFF (BOM-related)
    • Last two codepoints of every Unicode plane (e.g., U+10FFFE, U+10FFFF).

4. Bidirectional Control Characters

  • These can cause text rendering issues (e.g., RLO, LRO).
  • Exclude: \u202A–\u202E, \u2066–\u2069.

5. Deprecated/Obsolete Characters

  • Discouraged in modern usage (e.g., \u206A–\u206F).

6. Whitespace

  • \t, \n, \r, \v, \f, ' ' (space).

7. Unassigned/Private-Use Characters

  • \uE000–\uF8FF (Private Use Area)
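Most of the categories in the list above can be detected with built-in .NET APIs rather than hard-coded ranges. The following is only a sketch: `SelectorCharClasses.Classify` is a hypothetical helper, not part of SmartFormat, and it works on single UTF-16 code units, so non-characters outside the BMP (such as U+10FFFE) are only ever seen as their surrogate halves.

```csharp
using System;
using System.Globalization;

public static class SelectorCharClasses
{
    // Hypothetical helper: names the category from the list above that a
    // UTF-16 code unit falls into, or "ordinary" if none applies.
    public static string Classify(char c)
    {
        if (char.IsSurrogate(c)) return "surrogate";    // U+D800–U+DFFF
        if (char.IsControl(c)) return "control";        // C0, DEL and C1 (tab and newline land here too)
        if (char.IsWhiteSpace(c)) return "whitespace";  // space and other Unicode separators

        return char.GetUnicodeCategory(c) switch
        {
            UnicodeCategory.Format => "format",               // bidi controls, U+206A–U+206F, but also ZWJ/ZWNJ
            UnicodeCategory.PrivateUse => "private-use",      // U+E000–U+F8FF
            UnicodeCategory.OtherNotAssigned => "unassigned", // includes U+FFFE and U+FFFF
            _ => "ordinary"
        };
    }

    public static void Main()
    {
        foreach (var c in new[] { 'я', ' ', '\t', '\u007F', '\u202E', '\uE000', '\uFFFF', '\uD800' })
            Console.WriteLine($"U+{(int) c:X4}: {Classify(c)}");
    }
}
```

One caveat with excluding the whole "format" category: it also contains the zero-width joiner and non-joiner (U+200D, U+200C), which occur in ordinary text in some scripts, so a blanket exclusion may be too strict for localized selectors.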

axunonb avatar Jul 01 '25 19:07 axunonb

@karljj1 DisallowedSelectorChars() defines the characters that cannot be used in Selectors. Spaces and other Unicode characters would be allowed. Does this make sense to you?

/// <summary>
/// The list of characters which are delimiting a selector.
/// </summary>
internal HashSet<char> SelectorDelimitingChars() =>
[
    FormatterNameSeparator,
    PlaceholderBeginChar, PlaceholderEndChar,
    FormatterOptionsBeginChar, FormatterOptionsEndChar
];

/// <summary>
/// Gets the set of control characters (ASCII 0-31 and 127).
/// </summary>
internal IEnumerable<char> ControlChars()
{
    for (var i = 0; i <= 31; i++) yield return (char) i;
    yield return (char) 127; // delete character
}

/// <summary>
/// The list of characters which are disallowed in a selector.
/// </summary>
internal HashSet<char> DisallowedSelectorChars()
{
    var chars = SelectorDelimitingChars();
    chars.Add(CharLiteralEscapeChar); // avoid confusion with escape sequences
    foreach (var c in OperatorChars()) chars.Add(c); // no overlaps
    foreach (var c in CustomOperatorChars()) chars.Add(c); // no overlaps
    // Hard to visualize and debug, disallow by default - can be added back as custom selector chars
    foreach (var c in ControlChars()) chars.Add(c);
    // Remove characters used as custom selector chars
    foreach (var c in _customSelectorChars) chars.Remove(c); // control characters, if needed
    return chars;
}

Example:

[TestCase("German |öäüßÖÄÜ!")]
[TestCase("Russian абвгдеёжзийклмн")]
[TestCase("French >éèêëçàùâîô")]
[TestCase("Spanish <áéíóúñü¡¿")]
[TestCase("Portuguese !ãõáâêéíóúç")]
[TestCase("Chinese 汉字测试")]
[TestCase("Arabic مرحبا بالعالم")]
[TestCase("Turkish çğöşüİı")]
[TestCase("Hindi नमस्ते दुनिया")]
public void Selector_WorksWithAllUnicodeChars(string selector)
{
    // See https://github.com/axuno/SmartFormat/issues/454

    const string expected = "The Value";
    // The default formatter with default settings should be able to handle any
    // Unicode characters in selectors except the "magic" disallowed ones
    var formatter = Smart.CreateDefaultSmartFormat();
    // Use the Unicode string as a selector of the placeholder
    var template = $"{{{selector}}}";
    var result = formatter.Format(template, new Dictionary<string, string> { { selector, expected } });
    Assert.That(result, Is.EqualTo(expected));
}
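For readers without the library source at hand, the deny-list idea behind DisallowedSelectorChars() can be approximated in a few standalone lines. This is a sketch under assumptions: the reserved characters are copied from the property defaults quoted earlier in this thread, custom operator characters are omitted because their values are not shown here, and `DenyListSketch` is a hypothetical name. Unlike the real parser, which splits the placeholder at operator characters, the sketch treats the whole string as a single selector.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DenyListSketch
{
    // Delimiters, escape character and operators, taken from the defaults quoted in this thread.
    private static readonly char[] Reserved =
        { ':', '{', '}', '(', ')', '\\', '.', '?', ',', '[', ']' };

    // Builds the disallowed set: reserved characters plus ASCII control
    // characters, minus anything the caller explicitly re-enables.
    public static HashSet<char> BuildDisallowed(IEnumerable<char> customSelectorChars)
    {
        var chars = new HashSet<char>(Reserved);
        for (var i = 0; i <= 31; i++) chars.Add((char) i);
        chars.Add((char) 127); // delete character
        foreach (var c in customSelectorChars) chars.Remove(c);
        return chars;
    }

    // A selector is accepted if it is non-empty and contains no disallowed character.
    public static bool IsValidSelector(string selector, HashSet<char> disallowed) =>
        selector.Length > 0 && selector.All(c => !disallowed.Contains(c));

    public static void Main()
    {
        var disallowed = BuildDisallowed(Enumerable.Empty<char>());
        Console.WriteLine(IsValidSelector("Russian абвгдеёжзийклмн", disallowed)); // True
        Console.WriteLine(IsValidSelector("First.Second", disallowed));            // False: '.' is an operator
        Console.WriteLine(IsValidSelector("a\tb", disallowed));                    // False: tab is a control character

        // Re-enabling tab as a custom selector character makes the last case pass.
        Console.WriteLine(IsValidSelector("a\tb", BuildDisallowed(new[] { '\t' }))); // True
    }
}
```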

Draft: https://github.com/axuno/SmartFormat/commit/07aeaec93ecfefe10a80039241bc74fb239d996c

axunonb avatar Nov 06 '25 10:11 axunonb