Better handling of malformed HTML from Baka-Tsuki

Open dteviot opened this issue 9 years ago • 12 comments

Reported by dreamer2908

It seems that Baka-Tsuki would output weird html if a long section is italic/bold/etc and there's anything that is not text inside. Example: HEAVY_OBJECT:Volume11_Chapter_3#Part_12. The weirdness still remains in the generated epub.

This kind of italic/bold usage is so awfully familiar that I'm afraid it's everywhere.

dteviot avatar Jul 27 '16 20:07 dteviot

Can you do this?

  • Delete bold and italic tags unless they're inside a paragraph.
  • If there's any unclosed tag, close it.
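
A minimal sketch of those two rules, assuming Python's standard `html.parser` rather than WebToEpub's own JavaScript. The names `InlineTagCleaner` and `clean_inline_tags` are hypothetical, and the sketch only illustrates the idea; it is not the fix that was eventually made.

```python
import html
from html.parser import HTMLParser

# Inline formatting tags covered by this sketch (an assumption, not a full list).
INLINE_FORMAT_TAGS = {"b", "i", "u", "s"}


class InlineTagCleaner(HTMLParser):
    """Drops formatting tags outside <p> and closes ones left open at </p>."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.in_paragraph = False
        self.open_inline = []  # formatting tags opened inside the current <p>

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_FORMAT_TAGS:
            if not self.in_paragraph:
                return  # rule 1: formatting tag outside a paragraph is dropped
            self.open_inline.append(tag)
        elif tag == "p":
            self.in_paragraph = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in INLINE_FORMAT_TAGS:
            if not self.open_inline or self.open_inline[-1] != tag:
                return  # stray closer whose opener was dropped
            self.open_inline.pop()
        elif tag == "p":
            # rule 2: close anything still open before the paragraph ends
            while self.open_inline:
                self.out.append(f"</{self.open_inline.pop()}>")
            self.in_paragraph = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))


def clean_inline_tags(markup):
    cleaner = InlineTagCleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.out)
```

Note that text inside a dropped wrapper loses its formatting; re-wrapping each paragraph's contents would be a more faithful but more involved repair.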

toshiya44 avatar Jul 29 '16 18:07 toshiya44

@toshiya44 We will see. I'm going to try.

dteviot avatar Jul 30 '16 05:07 dteviot

@toshiya44

In the above example, there's nothing wrong with the HTML from Baka-Tsuki that I can see. Each paragraph contains a <b> tag. If you open the chapter to edit, you can see a single <b> tag at the start and a closing </b> at the end. So MediaWiki is making it all bold by putting a bold tag inside each paragraph.

There's not much I can do in this case. Baka-Tsuki is generating valid HTML, and WebToEpub is accepting it.

dteviot avatar Jul 30 '16 10:07 dteviot

I think this is the part he was talking about.

<p><b>
The scent of roses filled the small cockpit.
</b></p><p><b>That may have been because she had sprayed the wine of hatred on her skin and hair to cover up the rusted iron smell.
</b></p><p><b>Azureyfear Winchell wore the kind of full-body skintight suits that Pilot Elites wore. Its design was reminiscent of blue mourning clothes with a long skirt. A thin smile could be seen on the lips covered by a translucent veil and a trail of blood dripped from the corner of her mouth.
</b></p><p><b>She was not a Pilot Elite, so she could not actually pilot the Object. The control system was still reliant on the Orchestra System that used tens of thousands of people in dozens of stealth submarines and satellites.
</b></p><p><b>Normally, her presence there would have been unnecessary.
</b></p><p><b>Pilot Elites had been thoroughly optimized in unspeakable ways, but not even they could endure the burden of the inertial Gs produced by the Destruction Fes. Boarding that Object was more than just reckless; it was suicidal.
</b></p><b>
<div class="thumb tright"><div class="thumbinner" style="width:302px;"><a href="/project/index.php?title=File:HO_v11_277.jpg" class="image"><img alt="HO v11 277.jpg" src="/project/thumb.php?f=HO_v11_277.jpg&amp;width=300" class="thumbimage" srcset="/project/thumb.php?f=HO_v11_277.jpg&amp;width=450 1.5x, /project/thumb.php?f=HO_v11_277.jpg&amp;width=600 2x" data-file-width="1258" data-file-height="1798" height="429" width="300"></a>  <div class="thumbcaption"><div class="magnify"><a href="/project/index.php?title=File:HO_v11_277.jpg" class="internal" title="Enlarge"></a></div></div></div></div>
<p>Then why had Azureyfear done so?
</p><p>She sat in that tiny and cramped cockpit with her limbs sprawled out limply and her breathing far too shallow.
</p><p>Red blood dripped not just from her mouth, but from her nose, ears, and eyes too. Nevertheless, the Blue Rose smiled.
</p><p>“Honestly, what a hopeless brother.”
</p><p>The conflict between the Winchell and Vanderbilt families could not be stopped.
</p><p>It had continued for centuries and it would likely continue for centuries more.
</p><p>She had no way of fighting that, so she had reached a certain conclusion.
</p><p>“If he intends to push aside all the many other candidates and inherit the Winchell family, then he cannot be so indecisive.”
</p><p>A hint of happiness filled the Blue Rose’s face.
</p><p>“Winchell or Vanderbilt? Your own sister or your lover? The time has come to choose.”
</p><p>If she had simply brought the issue to him directly, Heivia would likely have insisted that he would protect it all, that he would not lose his family or his lover, and that he did not care if that made him indecisive.
</p><p>But that gray decision would create many enemies. He could easily find himself ostracized by both Winchell and Vanderbilt.
</p><p>She could not let that happen.
</p><p>She had said from the beginning that she was taking on the decision her brother should have made.
</p><p>“Now, brother. Please work towards your own happiness.”
</p><p>She did not have the courage to publically celebrate his marriage.
</p><p>Unlike her brother, she could not boldly claim she would bring an end to a centuries-old tradition.
</p><p>Nevertheless, she had made her decision.
</p><p>Since her indecisive brother would not, she would choose his lover over her in his stead. If the son of Winchell risked his life to protect the daughter of Vanderbilt, he would have the starting point he needed to make an attack on his seemingly impossible task.
</p></b><p><b>“Kill me and continue forward!!”
</b>
</p>

Notice that there's a <b> tag opening right before the image, and closes AFTER a <p> tag towards the end. This is not valid as <b> is not a block level element. The browser handles this fine though.

toshiya44 avatar Jul 30 '16 16:07 toshiya44

Yeah, unclosed <b> tag as toshiya44 said.

It's pretty much the same as the previous case of Utsuro_no_Hako:Volume2_May_2 where the <i> tag opens before <h3> and only closes much later, after several <p> tags.

I made some samples here.

Thumbnails, tables, and headings mixed into text inside <i>, <u>, <b>, or <s> all result in similarly malformed HTML. Inline images have no problem, though.
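
To find pages affected by this, one option is to scan for block elements that open while a formatting tag is still open. The following is a rough sketch using Python's standard `html.parser`; the `BLOCK_TAGS` set and the names `MisnestingFinder` and `find_misnested` are assumptions made for illustration and are not part of WebToEpub.

```python
from html.parser import HTMLParser

INLINE_FORMAT_TAGS = {"i", "u", "b", "s"}
# Block-level elements checked by this sketch (a partial, assumed list).
BLOCK_TAGS = {"p", "div", "table", "h1", "h2", "h3", "h4", "h5", "h6"}


class MisnestingFinder(HTMLParser):
    """Records (inline_tag, block_tag) pairs where a block opens inside an inline tag."""

    def __init__(self):
        super().__init__()
        self.open_inline = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_FORMAT_TAGS:
            self.open_inline.append(tag)
        elif tag in BLOCK_TAGS and self.open_inline:
            self.problems.append((self.open_inline[-1], tag))

    def handle_endtag(self, tag):
        if tag in self.open_inline:
            # Pop back to (and including) the most recent matching opener.
            while self.open_inline.pop() != tag:
                pass


def find_misnested(markup):
    finder = MisnestingFinder()
    finder.feed(markup)
    finder.close()
    return finder.problems
```

An empty result means no formatting tag in the fragment wraps a block element from the assumed list.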

dreamer2908 avatar Jul 30 '16 18:07 dreamer2908

@toshiya44

Notice that there's a <b> tag opening right before the image, and closes AFTER a <p> tag towards the end. This is not valid as <b> is not a block level element. The browser handles this fine though.

D'oh! Obviously I didn't look carefully enough. Thanks.

@dreamer2908

Well, it's taken all day and it was more difficult than I expected, but WebToEpub now handles that nasty little stress test you wrote

  • without giving any error messages and
  • producing output that epubcheck is happy with.

Seriously, thanks very much for writing it.

Updated code is on the Advanced Options branch if you want to take it for a test drive. Points to note.

  • epubcheck didn't like <s> tags.
  • inline tags that are in the wrong place are deleted.

dteviot avatar Jul 31 '16 04:07 dteviot

@dreamer2908

producing output that epubcheck is happy with.

I told a lie. epubcheck doesn't like the filenames generated for some of the images.

dteviot avatar Jul 31 '16 04:07 dteviot

@dreamer2908

And NOW it's fixed.

dteviot avatar Jul 31 '16 05:07 dteviot

@dteviot

Yep, it works well now.

On the other hand, I've just made the stress test slightly nastier. Just slightly. :mischievous_face:

dreamer2908 avatar Jul 31 '16 13:07 dreamer2908

@dreamer2908

On the other hand, I've just made the stress test slightly nastier. Just slightly.

OK, it looks like you've added a hyperlink into the table, so the cleanup code now thinks the table is the “links to next book/previous book/main page” table at the end of the page, which it removes. Note that the table is only being removed because it's the last (only) table on the page and has hyperlinks.

I'm not sure what you're trying to prove with this test case. Yes, you've created a case that breaks the parser. (Well, it doesn't actually break it; it just results in a table being deleted that should not be. Epubcheck is fine with the output.) But I seriously doubt that there are any real pages that satisfy the conditions:

  • Do not have a “next/previous/main” table at the end of the document.
  • Have one or more other tables in the document.
  • The last table in the document has hyperlinks.

On the other hand, if there are real Baka-Tsuki page(s) with this issue, I'll fix it. Otherwise, I'm going to ignore it, as I believe I can find more beneficial uses for my time.

Side note, given the effort I've put into fixing this (which doesn't seem to give the epub viewers any problems) I should probably fix the <br> tag issue.

dteviot avatar Jul 31 '16 20:07 dteviot

@dteviot

Nah. Don't be so serious.

I just wanted to say that it needs to check if the links in the table really point to pages in Baka-Tsuki. It's the first condition that came to my mind when I see “links to next book/previous book/main page”.
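
Such a check might look roughly like the sketch below, written in Python for illustration rather than in WebToEpub's JavaScript. The host names in `SITE_HOSTS` and the function names are assumptions; the `/project/` path prefix is taken from the links in the HTML sample earlier in this thread.

```python
from urllib.parse import urlparse

# Assumed host names for Baka-Tsuki; adjust if the site uses others.
SITE_HOSTS = {"www.baka-tsuki.org", "baka-tsuki.org"}


def is_site_link(href):
    """True if href appears to point at a Baka-Tsuki wiki page."""
    parts = urlparse(href)
    if parts.netloc:
        return parts.netloc.lower() in SITE_HOSTS
    # Relative links on the site start with /project/ in the sample above.
    return parts.path.startswith("/project/")


def looks_like_navigation_table(hrefs):
    """Treat a table as next/previous/main navigation only if it has links
    and every one of them points back into the site."""
    return bool(hrefs) and all(is_site_link(h) for h in hrefs)
```

With this, a table whose links go elsewhere would be kept, while a table with only internal links would still be treated as navigation.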

dreamer2908 avatar Aug 01 '16 00:08 dreamer2908

@dreamer2908

I just wanted to say that it needs to check if the links in the table really point to pages in Baka-Tsuki. It's the first condition that came to my mind when I see “links to next book/previous book/main page”.

I considered that too, but decided that tables are rare and hyperlinks are also rare, so the combination made the test not worth the effort, because I would have needed to figure out how to do it.

dteviot avatar Aug 01 '16 00:08 dteviot