Figure out a sustainable way to do importers
a718b684d9f857694c1545f574e4599c39c48575 happened because there were some imports necessary for particular converters that weren't included. In general the number of importers has ballooned (which is great), but that means that there are a ton of dependencies that aren't actually used for simple hosting.
Some importers are very general (e.g. SubstackImporter) and some are more specific (e.g. ReactRouterImporter).
We should over time get this under control.
The most obvious thing is to move the importer package to a separate repo with its own requirements. In the future, ideally we'd have some kind of late binding so you don't need the dependencies for an importer until you actually use it. And then in the very far future we'd likely have something where each importer is a separate package, and there's an import command that knows how to fetch importers when requested and find them in some maintained directory of known importers.
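To make the late-binding idea concrete, here is a minimal sketch, assuming importers are looked up by name and each one lives in its own module. The names `IMPORTER_MODULES` and `load_importer` are hypothetical, and the stdlib modules in the mapping are stand-ins so the sketch runs anywhere; real entries would point at importer modules.

```python
import importlib
from types import ModuleType

# Hypothetical mapping from importer name to the module implementing it.
# Stdlib modules are used as stand-ins for real importer modules.
IMPORTER_MODULES = {
    "csv": "csv",
    "html": "html.parser",
}

_loaded = {}  # cache of modules that have already been imported


def load_importer(name: str) -> ModuleType:
    """Import the module for `name` only when it is first requested."""
    if name not in IMPORTER_MODULES:
        raise KeyError(f"unknown importer: {name!r}")
    if name not in _loaded:
        try:
            _loaded[name] = importlib.import_module(IMPORTER_MODULES[name])
        except ImportError as exc:
            # A missing third-party dependency surfaces here, at use time,
            # instead of breaking simple hosting at startup.
            raise RuntimeError(
                f"importer {name!r} needs a dependency that is not installed: {exc}"
            ) from exc
    return _loaded[name]
```

With something like this, a host that never calls `load_importer` never pays for any importer's dependencies, and a missing dependency only fails the one importer that needs it.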
We should also probably do something to make it easier to change the importer signature. Maybe have an abstract base class and use e.g. @override so a signature change shows up in every importer that needs updating.
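A rough sketch of what that could look like, assuming a single abstract method per importer. `BaseImporter`, `extract_chunks`, and `LinesImporter` are made-up names for illustration, not the current polymath interface. `typing.override` only exists on Python 3.12+, so the sketch falls back to a no-op decorator on older versions.

```python
from abc import ABC, abstractmethod
from typing import Iterator

try:
    from typing import override  # available on Python 3.12+
except ImportError:
    def override(method):
        """Fallback no-op so the sketch runs on older Pythons."""
        return method


class BaseImporter(ABC):
    """Hypothetical base class; the method name and shape are assumptions."""

    @abstractmethod
    def extract_chunks(self, source: str) -> Iterator[dict]:
        """Yield one dict per chunk of `source`."""


class LinesImporter(BaseImporter):
    """Toy importer: every non-empty line becomes a chunk."""

    @override
    def extract_chunks(self, source: str) -> Iterator[dict]:
        for number, line in enumerate(source.splitlines()):
            if line.strip():
                yield {"id": str(number), "text": line.strip()}
```

The abstract base class catches a missing method at instantiation time. `@override` does nothing at runtime; it lets a type checker flag a subclass method that no longer matches anything in the base class after a rename.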
My sense is that we need to have a two-stage importer pipeline:
- Stage one is open to everyone and is effectively the wild west: might even be a separate repo. The output of this stage is some JSON format (proto-library?)
- Stage two is inside of the `polymath` package and is generic, focused on the right cleaning, chunking, etc.
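As a sketch of how the two stages might hand off, assuming stage one emits a JSON document list and stage two consumes it: the JSON shape (`version`, `documents`, `url`, `title`, `text`) is invented here, and the fixed-size character split is only a stand-in for whatever the real chunker does.

```python
import json


def stage_one_export(docs: list) -> str:
    """Stage one (any importer, any repo): dump raw documents as JSON."""
    return json.dumps({"version": 1, "documents": docs})


def stage_two_chunks(payload: str, max_chars: int = 200) -> list:
    """Stage two (generic): clean the text and split it into chunks."""
    data = json.loads(payload)
    chunks = []
    for doc in data["documents"]:
        # Generic cleanup: collapse all runs of whitespace to single spaces.
        text = " ".join(doc["text"].split())
        # Placeholder chunking: fixed-size character windows.
        for start in range(0, len(text), max_chars):
            chunks.append(
                {
                    "url": doc["url"],
                    "title": doc.get("title", ""),
                    "text": text[start:start + max_chars],
                }
            )
    return chunks
```

The point of the split is that stage one only needs `json` and whatever the source format requires, while everything shared lives in stage two.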
There are a couple of separate concerns here IMO:
- make it easy to build an importer using shared infra. I find that I do the same thing again and again:
- give either a directory or one file to start the journey. Often want to ignore certain files.
- loop through the doc(s) and compute the url/title/desc. Obvious helpers here based on filename, url, ...
- work out how to split it up to feed the chunker, which means random cleanup on the content. There is a lot of room to do more here so that importers can be dumber and more of the cleaning happens in the chunker, e.g. handling formats within formats. Tried like getting
doesn't work on many HTML inputs etc.
- let an importer register itself. Invert the concern so by loading an importer it registers etc. Then you don't need dependencies unless you are actually using a particular importer.
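Both concerns above could be sketched roughly like this, assuming a module-level registry and a couple of shared helpers. All names here (`IMPORTERS`, `register_importer`, `iter_source_files`, `title_from_filename`, `MarkdownImporter`) are hypothetical, not existing polymath APIs.

```python
import fnmatch
from pathlib import Path

# Hypothetical registry, filled in as importer modules are loaded.
IMPORTERS = {}


def register_importer(name):
    """Class decorator: importing the defining module registers the importer."""
    def decorator(cls):
        IMPORTERS[name] = cls
        return cls
    return decorator


def iter_source_files(start, ignore=()):
    """Yield files under `start` (a directory or a single file), skipping
    any whose file name matches one of the glob patterns in `ignore`."""
    root = Path(start)
    if root.is_file():
        paths = [root]
    else:
        paths = sorted(p for p in root.rglob("*") if p.is_file())
    for path in paths:
        if not any(fnmatch.fnmatch(path.name, pattern) for pattern in ignore):
            yield path


def title_from_filename(path):
    """Derive a readable title from a file name, e.g. for missing metadata."""
    stem = Path(path).stem
    return stem.replace("-", " ").replace("_", " ").strip().title()


@register_importer("markdown")
class MarkdownImporter:
    """Toy importer built on the shared helpers."""

    def run(self, start, ignore=()):
        for path in iter_source_files(start, ignore):
            yield {
                "title": title_from_filename(path),
                "text": path.read_text(encoding="utf-8"),
            }
```

Because registration happens when the defining module is imported, an importer with heavy dependencies only pulls them in if its module is actually loaded.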