warcbase
WARCRecord NotSerializableException when trying to get rid of duplicate pages
I am trying to remove duplicate pages as follows:
val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
.keepValidPages()
.groupBy(_.getUrl).values.map(_.head) // remove duplicates
.map(r => r.getUrl)
.take(10)
but I get this exception:
java.io.NotSerializableException: org.archive.io.warc.WARCRecord
Serialization stack:
- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@28158a29)
- field (class: org.warcbase.spark.archive.io.GenericArchiveRecord, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
- object (class org.warcbase.spark.archive.io.GenericArchiveRecord, org.warcbase.spark.archive.io.GenericArchiveRecord@4258e51d)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
Any idea what causes this, or how else to achieve the same result?
For now, I am doing the following instead:
val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
.keepValidPages()
.map(r => (r.getUrl, r.getContentString))
.reduceByKey { case (contentString1, contentString2) => contentString1 }
...
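A likely reason the second version works: groupBy shuffles whole record objects, and Spark's default Java serializer then has to write the wrapped WARCRecord, whereas the (url, content) pairs are plain strings. The sketch below reproduces that difference without Spark. RawRecord, Wrapper, and ShuffleSketch are hypothetical stand-ins invented for illustration, not warcbase classes.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for org.archive.io.warc.WARCRecord:
// deliberately NOT Serializable.
class RawRecord(val url: String, val body: String)

// Hypothetical stand-in for GenericArchiveRecord: Serializable itself,
// but it holds a field that is not.
class Wrapper(val raw: RawRecord) extends Serializable {
  def getUrl: String = raw.url
  def getContentString: String = raw.body
}

object ShuffleSketch {
  // Returns true if plain Java serialization accepts the value,
  // false if it fails with NotSerializableException.
  def serializes(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }
}
```

With `val w = new Wrapper(new RawRecord("http://example.com/", "<html></html>"))`, `ShuffleSketch.serializes(w)` returns false, while `ShuffleSketch.serializes((w.getUrl, w.getContentString))` returns true. This suggests that any de-duplication step should project each record to serializable fields before the shuffle, as the reduceByKey version does.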