Unidecode.NET

Huge Memory Consumption

Open adamfur opened this issue 4 years ago • 3 comments

I created a .NET 5 project with the NuGet reference, and the application consumed 70 MiB of RSS memory.

static async Task Main(string[] args)
{
    while (true)
    {
        Console.WriteLine(Guid.NewGuid().ToString().Unidecode());
        await Task.Delay(TimeSpan.FromSeconds(1));
    }
}

I thought that was quite a lot of memory for such a small library, so I created a new .NET 5 project with a project reference to the source code of the master branch, and the application consumed 200 MiB of RSS memory??? I then removed the partial (diff below), and the RSS memory shrank to 5 MiB.

diff --git a/src/Unidecoder.Characters.cs b/src/Unidecoder.Characters.cs
index dd8cdac..3b4d792 100644
--- a/src/Unidecoder.Characters.cs
+++ b/src/Unidecoder.Characters.cs
@@ -27,11 +27,11 @@ using System.Collections.Generic;
 
 namespace Unidecode.NET
 {
-    public static partial class Unidecoder
+    public static class Unidecoder2
     {
-        private static readonly Dictionary<int, string[]> characters;
+        public static readonly Dictionary<int, string[]> characters;
 
-        static Unidecoder()
+        static Unidecoder2()
         {
             characters = new Dictionary<int, string[]> {
                 {0 /*0 000*/, new[]{
@@ -605,4 +605,4 @@ namespace Unidecode.NET
             }
         }
     }
-    
\ No newline at end of file
+    
diff --git a/src/Unidecoder.cs b/src/Unidecoder.cs
index 558c725..1e1653e 100644
--- a/src/Unidecoder.cs
+++ b/src/Unidecoder.cs
@@ -6,7 +6,7 @@ namespace Unidecode.NET
     /// <summary>
     /// ASCII transliterations of Unicode text
     /// </summary>
-    public static partial class Unidecoder
+    public static class Unidecoder
     {
         /// <summary>
         /// Transliterate Unicode string to ASCII string.
@@ -42,7 +42,7 @@ namespace Unidecode.NET
                 {
                     var high = c >> 8;
                     var low = c & 0xff;
-                    if (characters.TryGetValue(high, out var transliterations))
+                    if (Unidecoder2.characters.TryGetValue(high, out var transliterations))
                     {
                         sb.Append(transliterations[low]);
                     }
@@ -71,7 +71,7 @@ namespace Unidecode.NET
             {
                 var high = c >> 8;
                 var low = c & 0xff;
-                result = characters.TryGetValue(high, out var transliterations) ? transliterations[low] : "";
+                result = Unidecoder2.characters.TryGetValue(high, out var transliterations) ? transliterations[low] : "";
             }
 
             return result;
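
For anyone who wants to reproduce the RSS figures quoted above from inside the test program, here is a minimal sketch. `MemoryProbe` and `WorkingSetMiB` are hypothetical names, not part of Unidecode.NET, and `Process.WorkingSet64` is assumed to be a reasonable stand-in for RSS (on Linux it roughly corresponds to the resident set size; tools such as `ps` or `top` may report slightly different numbers).

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of Unidecode.NET.
public static class MemoryProbe
{
    // Returns the current process's working set in MiB.
    public static double WorkingSetMiB()
    {
        using var process = Process.GetCurrentProcess();
        process.Refresh(); // make sure the cached counters are up to date
        return process.WorkingSet64 / (1024.0 * 1024.0);
    }
}
```

Calling `Console.WriteLine($"{MemoryProbe.WorkingSetMiB():F1} MiB");` inside the `while (true)` loop of the repro prints the figure once per second, which makes it easy to compare the number before and after applying the diff.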

adamfur avatar Jan 29 '21 10:01 adamfur

Can you make a PR for this? Seems like a huge memory improvement

RubenMateus avatar Mar 10 '21 11:03 RubenMateus

The memory consumption issue actually turned out to be stranger than that. I have a project that includes your code with the patch applied, and it seemed to be working fine. Then, after I added some unrelated code or a NuGet package that had nothing to do with the library, the memory usage ballooned again.

I was unable to solve it in a managed way. I tried lists instead of dictionaries, then a matrix, then allocating the strings with string.Intern(), and various other methods. I ended up using your script to generate C code and calling it through P/Invoke in https://github.com/adamfur/Unidecode.NET/.

You could merge it if you like, but as of now the code would become non-portable.

adamfur avatar Mar 10 '21 11:03 adamfur

My pull request (https://github.com/thecoderok/Unidecode.NET/pull/16) now also contains a fix for this memory usage problem.

This issue has also been mentioned here: https://github.com/dotnet/runtime/issues/54688

Reading that thread, I learned that the problem is in the static initialization of the characters dictionary: it is a huge initializer, and it generates some complex code. I applied one of the fixes suggested in that thread: I completely scrapped the Unidecoder.Characters.cs source file and now initialize the characters variable at runtime from data read from an embedded resource file. I have modified the Python script so that it generates that embedded resource file.
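
To illustrate the general idea (not the actual code in the pull request), here is a minimal sketch of loading a transliteration table from an embedded resource instead of a giant static initializer. The class name `TransliterationTable`, the resource name, and the binary layout (an `Int32` count followed by that many `BinaryWriter`-style length-prefixed strings, indexed by code point) are all assumptions made for this example; the real fork may use a different name and format.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical sketch; names and file format are assumptions, not the fork's code.
public static class TransliterationTable
{
    // Assumed resource name; it depends on the project's default namespace and file name.
    private const string ResourceName = "Unidecode.NET.unidecoder-data.bin";

    // The table is read once, on first use, rather than built by compiled initializer code.
    private static readonly Lazy<string[]> Table = new Lazy<string[]>(LoadFromAssembly);

    private static string[] LoadFromAssembly()
    {
        using var stream = typeof(TransliterationTable).Assembly
            .GetManifestResourceStream(ResourceName)
            ?? throw new InvalidOperationException($"Resource '{ResourceName}' not found.");
        return Read(stream);
    }

    // Assumed layout: Int32 count, then `count` strings written with BinaryWriter.Write(string).
    public static string[] Read(Stream stream)
    {
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);
        var count = reader.ReadInt32();
        var table = new string[count];
        for (var i = 0; i < count; i++)
        {
            table[i] = reader.ReadString();
        }
        return table;
    }

    // Returns the transliteration for a character, or "" if it is outside the table.
    public static string Lookup(char c)
    {
        var table = Table.Value;
        return c < table.Length ? table[c] : "";
    }
}
```

With this approach the data lives in the assembly as plain bytes, so the JIT never has to compile a method containing thousands of array and dictionary initializations. The file itself would be marked as an `EmbeddedResource` in the project file and produced by the generator script.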

You can give it a try by pulling from this branch of mine: https://github.com/csm101/Unidecode.NET/tree/array_instead_of_dictionary

This fork is also 3 times faster at decoding (which was my original reason for forking it).

csm101 avatar Jul 03 '23 11:07 csm101