Serialization issues with Mono 5
I get a serialization error any time I'm passing data at runtime. This happens with any calls using a lambda expression with data originating outside of the lambda expression. Using broadcast variables also gives the same error.
Gives serialization error:
string x="/path"; var results = rdd.Map(input => { Console.WriteLine(x); });
but, no serialization error here:
var results = rdd.Map(input => { Console.WriteLine("/path"); });
This is running 2.0.2 Spark, Linux Mono 5.10.1.20, built with msbuild. I've also tested Mono4.8.1 with xbuild and get the same error.
actual error:
ERROR System.Runtime.Serialization.SerializationException: Type '(MyClassName+<>c__DisplayClass4_0' in Assembly 'MyAssembly, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' is not marked as serializable.
at System.Runtime.Serialization.FormatterServices.InternalGetSerializableMembers (System.RuntimeType type) [0x00045] in <8fbafb724c144c9dad69bccfec38ae40>:0
at System.Runtime.Serialization.FormatterServices+<>c__DisplayClass9_0.<GetSerializableMembers>b__0 (System.Runtime.Serialization.MemberHolder _) [0x00000] in <8fbafb724c144c9dad69bccfec38ae40>:0
at System.Collections.Concurrent.ConcurrentDictionary2[TKey,TValue].GetOrAdd (TKey key, System.Func2[T,TResult] valueFactory) [0x00034] in <8fbafb724c144c9dad69bccfec38ae40>:0
at System.Runtime.Serialization.FormatterServices.GetSerializableMembers (System.Type type, System.Runtime.Serialization.StreamingContext context) [0x0005e] in <8fbafb724c144c9dad69bccfec38ae40>:0
Ok, so for anyone else out there thinking their pyspark style lambda expressions are going to work you are going to have a bad day. There is a fundamental issue, namely the C# compiler generates anonymous functions as non-serializable. SparkCLR does some work to serialize fully closed lambda expressions, but any lambda's that reference outside their scope will be picked up by the C# compiler and made non-serializable.
I'm sure there are ways around this to elegantly turn any anonymous function into a method on a serializable class, but it has thus far eluded me. The bottom line is that you need to make a serializable helper class and use a method from that class directly in map, etc. without any lambda expressions that are not fully closed.
From the sharp CLR samples there is a BroadcastHelper class that can help you transfer data as a broadcast variable, and you can use this architecture to send any sort of data to the worker threads by first initializing a new object with the data you want used in the delegate:
[Serializable] internal class BroadcastHelper<T,U> { private readonly Broadcast<T> broadcastVar; internal BroadcastHelper(Broadcast<T> broadcastVar) { this.broadcastVar = broadcastVar; } internal T Execute(U i) { return broadcastVar.Value; } }
now, this can work:
string parallel = "test serialization string"; var broadcast = sc.Broadcast(parallel); Console.WriteLine(test.Map(new BroadcastHelper<string,int>(broadcast).Execute).First());
whereas
string parallel = "test serialization string"; var broadcast = sc.Broadcast(parallel); Console.WriteLine(test.Map(x=>broadcast.Value).First());
== bad day
I would love to chat with anyone out there who has the experience to make these kind of lambda expressions work out of the box without making helper classes.