MS Word document extracted via RTFEmbeddedObject.getData() byte array cannot be opened
Hello, I'm trying to extract an MS Word file embedded in an RTF file by using RTFEmbeddedObject.getEmbeddedObjects(String file). The method returns a list with four instances, which is expected. When I check the resulting data array with Apache Tika, it returns the application/x-tika-msoffice mime type, which seems correct.
However, when I try to open the resulting file, MS Word doesn't show the expected content. I will attach both files to this issue.
Here's the code that I'm using:
` List<List<RTFEmbeddedObject>> rtfl = RTFEmbeddedObject.getEmbeddedObjects(readLineByLine(file));
for (List<RTFEmbeddedObject> l : rtfl) {
    FileUtils.writeByteArrayToFile(new File("test.doc"), l.get(1).getData());
    Tika t = new Tika();
    String s = t.detect(l.get(1).getData());
    System.out.println("Mimetype: " + s);
}
`
Attachments at: rtfword.zip
Thanks in advance!
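As a quick sanity check on the extracted bytes: as far as I know, application/x-tika-msoffice is Tika's generic type for OLE2 compound documents, so it confirms the container format but not necessarily that the content is a Word document. A minimal sketch for locating the 8-byte OLE2 signature in the array follows. The class and method names (OleSignatureCheck, indexOfOle2Signature) are hypothetical and not part of MPXJ or Tika.

```java
public class OleSignatureCheck {
    // The 8-byte signature that starts every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    /**
     * Returns the offset of the first OLE2 signature in data,
     * or -1 if the signature does not occur at all.
     */
    public static int indexOfOle2Signature(byte[] data) {
        if (data == null) {
            return -1;
        }
        for (int i = 0; i + OLE2_MAGIC.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < OLE2_MAGIC.length; j++) {
                if (data[i + j] != OLE2_MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }
}
```

An offset of 0 means the array starts with a compound file header. A positive offset would mean there are leading bytes before the compound file, and -1 would mean the data is not an OLE2 container at all. Which of these applies to the attached files is not known from this thread.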
Just to confirm, is the test.doc included in the zip file the original file which was embedded in the RTF, or one you have extracted yourself?
Also... if possible could you include the MPP file that the RTF came from?
Hello, the test.doc file is the one that I extracted using the library. About the RTF, it wasn't from an MPP file. It was from a database that exported OLE objects for me, and I've been able to convert them to RTF and access them as embedded objects.
Thanks for the update. Do you have a way to get the original OLE object out of the database without going through the RTF export exercise you describe? I'd like to start with a "known good" file which MS Word can open, then compare that to what we're able to extract from the RTF.
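For the comparison itself, something along these lines would do: a rough sketch that reports the first offset at which a known good file and the extracted file differ. The class name ByteDiff and the command-line usage are illustrative assumptions, not part of MPXJ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ByteDiff {
    /**
     * Returns the index of the first differing byte, or -1 if both arrays
     * are identical. If one array is a prefix of the other, the length of
     * the shorter array is returned.
     */
    public static int firstDifference(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }

    // Usage: java ByteDiff <known-good-file> <extracted-file>
    public static void main(String[] args) throws IOException {
        byte[] good = Files.readAllBytes(Paths.get(args[0]));
        byte[] extracted = Files.readAllBytes(Paths.get(args[1]));
        System.out.println("Lengths: " + good.length + " vs " + extracted.length);
        System.out.println("First difference at offset: " + firstDifference(good, extracted));
    }
}
```

A difference at or near offset 0 would point at a header problem, while a difference only in length would suggest truncated or padded data.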
I'll upload an original OLE file, but MS Word can't open it as is. To open it, I have to add a header and convert it to RTF. Attachment: ole.zip