Compression

Open Michael-F-Bryan opened this issue 8 years ago • 6 comments

To avoid bloating binaries too much, let's introduce a feature flag which will allow data to be compressed when it is embedded in the binary, then lazily decompressed when it is accessed.

I imagine this feature flag would alter the include_dir::File type from this...

https://github.com/Michael-F-Bryan/include_dir/blob/0536fd6191d855154039fadf29b0f2506088c187/include_dir/src/file.rs#L8-L13

... to something like this:

use std::cell::OnceCell; // in std since Rust 1.70; the once_cell crate offers the same API

pub struct File<'a> {
    path: &'a str,
    contents: FileContents<'a>,
    #[cfg(feature = "metadata")]
    metadata: Option<crate::Metadata>,
}

impl<'a> File<'a> {
    // Callers still get a plain byte slice; decompression is an implementation detail.
    fn contents(&self) -> &[u8] { self.contents.get() }
}

struct FileContents<'a> {
    compressed: &'a [u8],
    // Filled in on first access.
    uncompressed: OnceCell<Vec<u8>>,
}

impl<'a> FileContents<'a> {
    fn get(&self) -> &[u8] {
        self.uncompressed.get_or_init(|| decompress(self.compressed))
    }
}

fn decompress(compressed: &[u8]) -> Vec<u8> { todo!() }

Some things that need to be considered are:

  • Which compression algorithm do we use?
  • Compression support needs to be added to both the macro and the main crate
  • Decompression should be done lazily without the user knowing (i.e. use &self and interior mutability)
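
To make the lazy-decompression point concrete, here is a minimal sketch rather than a proposed implementation. It assumes std::sync::OnceLock instead of OnceCell, so that the type is Sync and can sit in a static, and it uses a toy run-length decoder as a stand-in for whichever real algorithm gets picked. The FileContents::new constructor, the decompress body, and the HELLO static are all hypothetical.

```rust
use std::sync::OnceLock;

/// Embedded bytes plus a lazily filled cache of their decoded form.
pub struct FileContents<'a> {
    compressed: &'a [u8],
    uncompressed: OnceLock<Vec<u8>>,
}

impl<'a> FileContents<'a> {
    /// Hypothetical constructor; `const` so it can initialise a `static`.
    pub const fn new(compressed: &'a [u8]) -> Self {
        Self { compressed, uncompressed: OnceLock::new() }
    }

    /// Decodes on the first call, then hands back the cached bytes.
    pub fn get(&self) -> &[u8] {
        self.uncompressed.get_or_init(|| decompress(self.compressed))
    }
}

/// Toy run-length decoder standing in for a real codec (assumption):
/// the input is a sequence of (count, byte) pairs.
fn decompress(compressed: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for pair in compressed.chunks_exact(2) {
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    out
}

// Roughly what macro-generated data could look like (hypothetical).
pub static HELLO: FileContents<'static> =
    FileContents::new(&[1, b'h', 2, b'e', 1, b'y']);
```

Calling HELLO.get() twice returns the same slice; only the first call runs the decoder. The cost of this design is that the decoded Vec stays allocated for the rest of the program.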

Jun 07 '17 23:06 Michael-F-Bryan

Ideally, we would also provide several types of compression to choose from.

Mar 22 '21 00:03 anton-dutov

In isolation, I'd say zstd would probably serve almost every purpose you'd need: you can spend a lot of time compressing very well, or a tiny amount of time to get a decent amount of compression.

The one reason to support multiple compression algorithms: if a program already needs a specific algorithm for some other purpose, it'd be nice to use the same one and avoid having two decompression libraries present.

Apr 28 '22 02:04 joshtriplett

> The one reason to support multiple compression algorithms: if a program already needs a specific algorithm for some other purpose, it'd be nice to use the same one and avoid having two decompression libraries present.

One of my goals for the project is to not pull in unnecessary dependencies, so this lines up well.

However, one thing I'd like to ensure is that the include_dir crate is portable and will Just Work out of the box.

Without having read the zstd-sys build script too closely and just speaking in general terms, most crates that bind to native libraries will work fine for a Windows/Linux/macOS host, but then be impossible to cross-compile. This is especially the case when the target is something like iOS, Android, or WebAssembly.

We've wasted more engineering hours than I'd care to admit at work just because of C build systems :disappointed:

Apr 28 '22 06:04 Michael-F-Bryan

https://crates.io/crates/snap might also be a good choice. The license might be a problem, though.

Apr 30 '22 05:04 NHodgesVFX

I used the lz4_compression crate. It seems lightweight and it's written in pure Rust (which I presume means it will run on multiple targets).

Even if this isn't the library you want to go with, @Michael-F-Bryan, I thought I'd get the ball rolling.

May 05 '22 08:05 LordRatte

My few thoughts on this:

  • The feature shouldn't replace the default types, but should instead create a separate set of types (that way, if a dependency opts into compression, it would still be possible to avoid it when runtime speed is a higher priority for other crates).
  • Being able to use multiple algorithms would be beneficial, so the feature flags could be (e.g.) compression-zstd, compression-lz4.
  • To enable this, I suggest creating a Compression trait and making most of this crate generic over it, then providing a None implementation, available without any feature flags, that does no preprocessing. On the macros side, it will probably be best to make a separate macro for each algorithm (e.g. include_dir_zstd! and include_dir_lz4!, with the existing include_dir! returning a Dir<compress::None>), since compression has to happen at build time, where the main crate's trait isn't available for use.

In pseudo-Rust, this becomes:

pub struct Dir<'a, C: Compression> { /* ... */ }
pub struct File<'a, C: Compression> { /* ... */ }
pub enum DirEntry<'a, C: Compression> { /* ... */ }

pub trait Compression {
    // using a `Cow<[u8]>` benefits the performance of `compress::None`
    fn decompress(data: &[u8]) -> Cow<[u8]>;
}

pub mod compress {
    pub enum None {}
    impl Compression for None { /* ... */ }

    #[cfg(feature = "compress-foo")]
    pub enum Foo {}
    #[cfg(feature = "compress-foo")]
    impl Compression for Foo { /* ... */ }
}

pub include_dir!;
#[cfg(feature = "compress-foo")]
pub include_dir_foo!;

It may be possible to do Dir<'a, C: Compression = compress::None> to simplify the basic case.
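
For what it's worth, a default type parameter on the struct does compile. Here is a trimmed-down sketch under some assumptions: only File is shown, its fields are simplified and are not the crate's real layout, and compress::Invert is a made-up toy codec (each byte XORed with 0xFF) used only to exercise the non-borrowing path.

```rust
use std::borrow::Cow;
use std::marker::PhantomData;

/// Strategy for turning embedded bytes back into file contents.
pub trait Compression {
    fn decompress(data: &[u8]) -> Cow<'_, [u8]>;
}

pub mod compress {
    use super::Compression;
    use std::borrow::Cow;

    /// Pass-through: bytes are stored as-is, so nothing is allocated.
    pub enum None {}
    impl Compression for None {
        fn decompress(data: &[u8]) -> Cow<'_, [u8]> {
            Cow::Borrowed(data)
        }
    }

    /// Toy codec (hypothetical): every byte is stored XORed with 0xFF.
    pub enum Invert {}
    impl Compression for Invert {
        fn decompress(data: &[u8]) -> Cow<'_, [u8]> {
            Cow::Owned(data.iter().map(|b| b ^ 0xFF).collect())
        }
    }
}

/// `C` defaults to `compress::None`, so a bare `File<'a>` keeps working.
pub struct File<'a, C: Compression = compress::None> {
    path: &'a str,
    raw: &'a [u8],
    _codec: PhantomData<C>,
}

impl<'a, C: Compression> File<'a, C> {
    pub fn new(path: &'a str, raw: &'a [u8]) -> Self {
        Self { path, raw, _codec: PhantomData }
    }

    pub fn path(&self) -> &'a str {
        self.path
    }

    /// Borrowed for `compress::None`, owned for codecs that must decode.
    pub fn contents(&self) -> Cow<'a, [u8]> {
        C::decompress(self.raw)
    }
}
```

One caveat: the default only applies where the type is written out, such as `let f: File = ...` or a static's type. It does not steer inference for a bare `File::new(...)` expression, so macro-generated code would probably want to spell out the type.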

An unrelated open question is whether to compress all files as one unit (presumably better compression for directories with many small files) or each file individually (presumably better seek performance), or even to leave the choice to Compression implementations (compress-zip?).

Aug 05 '22 06:08 zombiepigdragon