python-markdownify

Feature request - implement "passthrough" option to preserve selected elements

Open danncasey opened this issue 3 years ago • 3 comments

I'm using markdownify 0.11.6, and it is working like a charm except for one thing. I'm scraping a site that has a YouTube video embedded in an iframe. In this case I need to keep the iframe unchanged from how the site had it.

An option like --skip 'iframe', for example, would be great (ideally with some criteria, such as matching an id or regex).

The following change produces the desired outcome. It's obviously just a quick hack, but it demonstrates the functionality.

In __init__.py, line ~143:

        for el in node.children:
            if isinstance(el, Comment) or isinstance(el, Doctype):
                continue
            elif isinstance(el, NavigableString):
                text += self.process_text(el)
            elif el.name == 'iframe':
                # Quick hack: process_text() emits str(el), so the iframe passes through as raw HTML.
                text += self.process_text(el)
            else:
                text += self.process_tag(el, convert_children_as_inline)

test.html

<p>Need a way to preserve the original html for a given element.</p>
<i>Please don't discard my iframe :) </i>
<div class="ratio ratio-16x9" data-video="">
    <iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="border" data-video="" src="https://www.youtube.com/embed/EHfq0miBu8c?modestbranding=0&amp;rel=0&amp;showinfo=0"></iframe>
</div>
<hr />

Produces:

===================


Need a way to preserve the original html for a given element.


*Please don't discard my iframe :)* 

<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="border" data-video="" src="https://www.youtube.com/embed/EHfq0miBu8c?modestbranding=0&amp;rel=0&amp;showinfo=0"></iframe>



---
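On the "some criteria" part of the request, here is a rough sketch of how matching by tag name, id, or regex could be expressed. Everything in it (PassthroughRule, should_passthrough, the field names) is hypothetical and not part of markdownify; it only illustrates the matching logic, assuming attributes arrive as a plain name-to-string mapping.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class PassthroughRule:
    """Hypothetical rule describing elements to leave as raw HTML."""
    tag: str                           # element name, e.g. "iframe"
    element_id: Optional[str] = None   # exact id to require, if any
    attr: Optional[str] = None         # attribute that must be present
    pattern: Optional[str] = None      # regex searched in that attribute (needs attr)

    def matches(self, name: str, attrs: Mapping[str, str]) -> bool:
        if name != self.tag:
            return False
        if self.element_id is not None and attrs.get("id") != self.element_id:
            return False
        if self.attr is not None:
            value = attrs.get(self.attr)
            if value is None:
                return False
            if self.pattern is not None and re.search(self.pattern, value) is None:
                return False
        return True


def should_passthrough(name: str, attrs: Mapping[str, str],
                       rules: Iterable[PassthroughRule]) -> bool:
    """Return True if any rule matches the element."""
    return any(rule.matches(name, attrs) for rule in rules)


# Example rules: keep YouTube embeds and one specific div.
rules = [
    PassthroughRule("iframe", attr="src", pattern=r"youtube\.com/embed/"),
    PassthroughRule("div", element_id="keep-me"),
]

print(should_passthrough("iframe", {"src": "https://www.youtube.com/embed/EHfq0miBu8c"}, rules))  # True
print(should_passthrough("iframe", {"src": "https://example.com/ad"}, rules))  # False
```

In the hack above, the el.name == 'iframe' test could then become something like should_passthrough(el.name, el.attrs, rules), though multi-valued attributes such as class would need extra handling.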

danncasey, Sep 30 '22 01:09

I have a similar issue. I have custom tags that I want to retain when converting, e.g. I'd like to be able to call something like:

md("<ul><li><foo>bar</foo></li></ul>", keep=['foo'])

and get back:

* <foo>bar</foo>

instead of:

* bar

Or, alternatively (or in addition), have an option to keep all unrecognized elements.

I can handle this with a custom converter, but it seems like it should be a pretty common use case, so it'd be nice if there were a simple option for it.

sopoforic, Oct 24 '22 14:10

How do you write a custom converter for the foo tag?

I want to keep something like <span custom-style="MyStyle">

I know I can edit __init__.py line ~143 like this, but I want to know how to achieve this in a custom converter.

            elif el.name == 'span' and el.get('custom-style'):
                # el.get() avoids a KeyError on spans that lack the attribute
                text += self.process_text(el)
            else:
                text += self.process_tag(el, convert_children_as_inline)

ZobaJakColbert, Feb 03 '23 23:02

Something like this, I guess:

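from markdownify import MarkdownConverter

class MyConverter(MarkdownConverter):
    def convert_span(self, el, text, convert_as_inline):
        if el.get('custom-style'):
            # Emit the tag's own HTML instead of converting it
            return self.process_text(el)
        # text already holds the converted children, so plain spans need nothing more
        return text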

Then you get:

>>> MyConverter().convert('<i>hello <span custom-style="MyStyle">world</span> and <span>all</span></i>')
'*hello <span custom-style="MyStyle">world</span> and all*'

sopoforic, Feb 07 '23 17:02