Feature request - implement "passthrough" option to preserve selected elements
I'm using markdownify 0.11.6, and it is working like a charm except for one thing. I'm scraping a site that has a YouTube video embedded in an iframe. In this case I need to keep it unchanged from how the site had it.
An option like --skip 'iframe', for example, would be great (ideally with some criteria, such as matching an id or regex).
The following change produces the desired outcome. It's obviously just a quick hack, but it demonstrates the functionality.
In __init__.py, line ~143:
for el in node.children:
    if isinstance(el, Comment) or isinstance(el, Doctype):
        continue
    elif isinstance(el, NavigableString):
        text += self.process_text(el)
    else:
        if el.name == 'iframe':
            # Quick hack: emit the iframe's own markup instead of converting it
            text += self.process_text(el)
        else:
            text += self.process_tag(el, convert_children_as_inline)
test.html
<p>Need a way to preserve the original html for a given element.</p>
<i>Please don't discard my iframe :) </i>
<div class="ratio ratio-16x9" data-video="">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="border" data-video="" src="https://www.youtube.com/embed/EHfq0miBu8c?modestbranding=0&rel=0&showinfo=0"></iframe>
</div>
<hr />
Produces:
===================
Need a way to preserve the original html for a given element.
*Please don't discard my iframe :)*
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="border" data-video="" src="https://www.youtube.com/embed/EHfq0miBu8c?modestbranding=0&rel=0&showinfo=0"></iframe>
---
I have a similar issue. I have custom tags that I want to retain when converting, e.g. I'd like to be able to call something like:
md("<ul><li><foo>bar</foo></li></ul>", keep=['foo'])
and get back:
* <foo>bar</foo>
instead of:
* bar
Or, alternatively (or in addition), have an option to keep all unrecognized elements.
I can handle this with a custom converter, but it seems like it should be a pretty common use case, so it'd be nice if there were a simple option for it.
How do you write a custom converter for the foo tag?
I want to keep something like <span custom-style="MyStyle">.
I know I can edit __init__.py line ~143 like this, but I want to know how to achieve this in a custom converter.
else:
    # el.get() avoids a KeyError on spans that have no custom-style attribute
    if el.name == 'span' and el.get('custom-style'):
        text += self.process_text(el)
    else:
        text += self.process_tag(el, convert_children_as_inline)
Something like this, I guess:
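from markdownify import MarkdownConverter

class MyConverter(MarkdownConverter):
    def convert_span(self, el, text, convert_as_inline):
        if el.get('custom-style'):
            # Styled span: emit the element's original HTML
            return self.process_text(el)
        else:
            # Plain span: return the already-converted children unchanged;
            # calling process_tag here would re-enter convert_span
            return text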
Then you get:
>>> MyConverter().convert('<i>hello <span custom-style="MyStyle">world</span> and <span>all</span></i>')
'*hello <span custom-style="MyStyle">world</span> and all*'
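For the "some criteria, such as matching an id or regex" part of the original request, here is a rough sketch of how the matching itself might look. To be clear, none of this is markdownify API: KeepRule, should_keep and FakeTag are hypothetical names made up for illustration. The only assumption about the element is that it has .name and .get(attr), which BeautifulSoup Tag objects provide.

```python
import re
from dataclasses import dataclass
from typing import Optional, Pattern


@dataclass
class KeepRule:
    """Hypothetical rule: keep a tag verbatim when its criteria match."""
    tag: str                                # tag name, e.g. "iframe"
    attr: Optional[str] = None              # attribute that must be present
    pattern: Optional[Pattern[str]] = None  # regex the attribute value must match


def should_keep(el, rules):
    """Return True if `el` matches any rule.

    `el` only needs `.name` and `.get(attr)` (as on a BeautifulSoup Tag).
    """
    for rule in rules:
        if el.name != rule.tag:
            continue
        if rule.attr is None:
            return True  # matching on tag name alone
        value = el.get(rule.attr)
        if value is None:
            continue  # required attribute is absent
        if isinstance(value, list):
            # BeautifulSoup returns a list for multi-valued attributes such as class
            value = " ".join(value)
        if rule.pattern is None or rule.pattern.search(value):
            return True
    return False


class FakeTag:
    """Stand-in for a BeautifulSoup Tag so the sketch runs without dependencies."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs

    def get(self, key, default=None):
        return self.attrs.get(key, default)


# Example rules covering the three cases raised in this thread
RULES = [
    KeepRule("iframe", "src", re.compile(r"^https://www\.youtube\.com/embed/")),
    KeepRule("foo"),
    KeepRule("span", "custom-style"),
]

if __name__ == "__main__":
    youtube = FakeTag("iframe", {"src": "https://www.youtube.com/embed/EHfq0miBu8c"})
    print(should_keep(youtube, RULES))
    print(should_keep(FakeTag("span", {}), RULES))
```

Inside a custom converter method like convert_span above, you could then presumably do something like "if should_keep(el, RULES): return str(el)" and fall back to "return text" otherwise, since str() on a BeautifulSoup Tag gives back its markup. I haven't checked this against every markdownify version.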