WARCRecord NotSerializableException when trying to get rid of duplicate pages

Open dportabella opened this issue 9 years ago • 1 comments

I am trying to remove duplicate pages as follows:

val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
  .keepValidPages()
  .groupBy(_.getUrl).values.map(_.head)  // remove duplicates
  .map(r => r.getUrl)
  .take(10)

but I get this exception:
java.io.NotSerializableException: org.archive.io.warc.WARCRecord
Serialization stack:
	- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@28158a29)
	- field (class: org.warcbase.spark.archive.io.GenericArchiveRecord, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
	- object (class org.warcbase.spark.archive.io.GenericArchiveRecord, org.warcbase.spark.archive.io.GenericArchiveRecord@4258e51d)
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)

Any idea what causes this, or how else to achieve the same objective?

dportabella, Jan 20 '17 14:01

I am currently working around it as follows:

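val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
  .keepValidPages()
  // extract plain strings before the shuffle, so no record object is serialized
  .map(r => (r.getUrl, r.getContentString))
  // keep the first content string seen for each URL
  .reduceByKey { case (contentString1, contentString2) => contentString1 }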
...
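
A plausible reading of the serialization stack above is that groupBy shuffles whole records, which requires serializing them, and that GenericArchiveRecord holds a WARCRecord field that is not serializable. Mapping each record to a (url, content) pair of strings before reduceByKey means only strings are shuffled. The sketch below illustrates that difference in plain Scala, without Spark or warcbase. RawRecord, Wrapper and canSerialize are hypothetical stand-ins made up for this illustration, not warcbase classes, and the sketch calls Java serialization directly rather than going through Spark.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for WARCRecord: deliberately NOT Serializable.
class RawRecord(val url: String, val body: String)

// Hypothetical stand-in for GenericArchiveRecord: Serializable itself,
// but it keeps a reference to the non-serializable RawRecord.
class Wrapper(val raw: RawRecord) extends Serializable {
  def getUrl: String = raw.url
  def getContentString: String = raw.body
}

object SerializationSketch {
  // Returns true if Java serialization accepts the value, false if it
  // fails with NotSerializableException.
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    }
  }

  def main(args: Array[String]): Unit = {
    val rec = new Wrapper(new RawRecord("http://example.com/", "<html></html>"))

    // The wrapper drags the non-serializable field along with it.
    println(canSerialize(rec)) // false

    // A tuple of strings contains nothing but serializable values.
    println(canSerialize((rec.getUrl, rec.getContentString))) // true
  }
}
```

If this reading is right, any projection to serializable values before the shuffle should avoid the exception. For the original goal of listing unique URLs, mapping to the URL alone and calling distinct() on the resulting RDD would be one option.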

dportabella, Jan 24 '17 14:01