lemmy icon indicating copy to clipboard operation
lemmy copied to clipboard

Build times have deteriorated significantly

Open Nutomic opened this issue 2 years ago • 24 comments

I noticed that incremental builds are now much slower than they used to be, so I did some testing. Im doing this with the following commands:

# make sure dependencies are built, can be cancelled after it reaches lemmy_utils
cargo build
# should be a full incremental build, dont reuse any 
cargo clean -p lemmy_utils
# measure build performance, disable sccache
RUSTC_WRAPPER='' cargo build --timings

I only did one run per commit and the results arent exactly consistent, but we can still draw some conclusions:

  • current main (6310f55): 6m 42s
  • after tracingcall in spawn_try_task , rebased on main: 3m 29s

~~So https://github.com/LemmyNet/lemmy/pull/3581 doesnt improve compile time at all.~~

Edit: Looks like this actually helps a lot but I want to do more testing.

  • after connections work with both conn and pool (1d38aad9): 6m 45s
  • before connections work with both conn and pool (73492af4): 2m 47s

https://github.com/LemmyNet/lemmy/pull/3420 is the worst offender, the build time is almost than doubled. We definitely need to find a solution for this, or otherwise revert the change

  • tag 0.18.1: 179s
  • tag 0.18.1 with strip symbols and disable debug info: 118s
  • tag 0.18.0: 47s

0.18.1 is much slower, specifically for compiling lemmy_server according to the html report. This is most likely from adding additional code in that crate like prometheus metrics. ~~The solution is to move as much as possible into lemmy_routes so that it can be compiled in parallel with api and apub crates.~~

cc @dullbananas

Edit: All measurements updated so they are not polluted by sccache. They should be quite accurate now.

Nutomic avatar Jul 13 '23 13:07 Nutomic

That looks bad. Did you also try turning the [profile.dev] debug =0 option back on and see if that helps? i changed that in some pr last week and thought it didn't affect it, but not 100% sure

phiresky avatar Jul 13 '23 13:07 phiresky

screeenshot of timings on current main: image image

phiresky avatar Jul 13 '23 14:07 phiresky

Okay the timings above are all crap, turns out sccache wasnt actually disabled because I have it configured via ~/.cargo/config.toml and not env var. Gonna have to redo the measurements.

That looks bad. Did you also try turning the [profile.dev] debug =0 option back on and see if that helps? i changed that in some pr last week and thought it didn't affect it, but not 100% sure

Good point, I didnt remember that change. I tried this on 0.18.0 (for faster builds):

  • with strip symbols and debug 0: 43s
  • without: 169s

So this is a huge difference and the dev profile should definitely be added back. Debug symbols are very rarely needed, and if so the lines can be temporarily commented out.

Nutomic avatar Jul 13 '23 14:07 Nutomic

Doesn't look extremely useful, but:

$ cargo rustc --bin lemmy_server -- -Zself-profile
$ summarize summarize lemmy_server-3845390.mm_profdata  -p 3
+------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| Item                         | Self time | % of total time | Time     | Item count | Incremental load time | Incremental result hashing time |
+------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| run_linker                   | 19.24s    | 81.818          | 19.24s   | 1          | 0.00ns                | 0.00ns                          |
+------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| LLVM_module_codegen_emit_obj | 1.63s     | 6.948           | 1.63s    | 42         | 0.00ns                | 0.00ns                          |
+------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| LLVM_passes                  | 708.47ms  | 3.012           | 708.47ms | 1          | 0.00ns                | 0.00ns                          |
+------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
Total cpu time: 23.519738655s
Filtered results account for 91.778% of total time.
+----------------------------+------------------+
| Item                       | Artifact Size    |
+----------------------------+------------------+
| cgu_instructions           | 142648 bytes     |
+----------------------------+------------------+
| codegen_unit_size_estimate | 185917 bytes     |
+----------------------------+------------------+
| dep_graph                  | 21596791 bytes   |
+----------------------------+------------------+
| linked_artifact            | 2133937888 bytes |
+----------------------------+------------------+
| object_file                | 37797040 bytes   |
+----------------------------+------------------+
| query_cache                | 5940646 bytes    |
+----------------------------+------------------+
| work_product_index         | 14816 bytes      |
+----------------------------+------------------+

and:

$ cargo rustc --lib -- -Zself-profile
+---------------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| Item                                  | Self time | % of total time | Time     | Item count | Incremental load time | Incremental result hashing time |
+---------------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| LLVM_module_codegen_emit_obj          | 4.41s     | 37.757          | 4.41s    | 40         | 0.00ns                | 0.00ns                          |
+---------------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| LLVM_passes                           | 2.26s     | 19.314          | 2.26s    | 1          | 0.00ns                | 0.00ns                          |
+---------------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| codegen_module                        | 1.05s     | 8.987           | 1.85s    | 40         | 0.00ns                | 0.00ns                          |
+---------------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| monomorphization_collector_graph_walk | 454.77ms  | 3.893           | 1.34s    | 1          | 0.00ns                | 0.00ns                          |
+---------------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
| evaluate_obligation                   | 453.54ms  | 3.882           | 471.67ms | 14493      | 0.00ns                | 0.00ns                          |
+---------------------------------------+-----------+-----------------+----------+------------+-----------------------+---------------------------------+
Total cpu time: 11.681932637s
Filtered results account for 73.833% of total time.
+----------------------------+-----------------+
| Item                       | Artifact Size   |
+----------------------------+-----------------+
| cgu_instructions           | 417456 bytes    |
+----------------------------+-----------------+
| codegen_unit_size_estimate | 880614 bytes    |
+----------------------------+-----------------+
| crate_metadata             | 2808877 bytes   |
+----------------------------+-----------------+
| dep_graph                  | 49658931 bytes  |
+----------------------------+-----------------+
| linked_artifact            | 303985238 bytes |
+----------------------------+-----------------+
| object_file                | 74554352 bytes  |
+----------------------------+-----------------+
| query_cache                | 16796196 bytes  |
+----------------------------+-----------------+
| work_product_index         | 14786 bytes     |
+----------------------------+-----------------+

phiresky avatar Jul 13 '23 14:07 phiresky

Hey just out of curiosity are you guys building your dev environments directly from your device or is there some standardization?

I measured the impact of #3420 with a more consistent measurement method.

I made one folder contain the contents of the repository before the merge, and another folder with the contents after the merge:

cd lemmy
git checkout 73492af
git worktree add ../lemmy-after 1d38aad
git -C ../lemmy-after* submodule update
cd ..

I created this script called time.sh:

set -o errexit -o nounset
cd $1

# Parallelism might cause worse consistency. For example, jobs could become ide for different durations, and crates could be compiled in different parallel groups.
export CARGO_BUILD_JOBS=1

# Disk space runs out unless I set these environment variables.
export CARGO_PROFILE_DEV_DEBUG=0
export CARGO_BUILD_INCREMENTAL=false

cargo clean

cargo build --bin lemmy_server --timings

I ran test.sh for both folders in parallel and made the outputs visible in text files:

((bash time.sh lemmy) &> out1.txt) &
((bash time.sh lemmy-after*) &> out2.txt) &

Running one compilation at a time could cause the 2 results to be affected by different conditions (such as different CPU usage of other processes).

For both compilations, the timing was extremely similar for all dependencies and some lemmy crates. When switching the editor view between out1.txt and out2.txt, I wasn't totally sure that they were different processes until I saw the inconsistent ordering.

The build time of lemmy_apub changed from 72.2 secs to 78.8 secs. I couldn't find significant differences in other crates.

lemmy timings.zip

dullbananas avatar Jul 13 '23 20:07 dullbananas

@dullbananas You are testing full builds while Im testing incremental builds so its a completely different thing. Also the env vars you are setting might affect the results. And of course running two builds in parallel will slow both of them down too. So you should check with the steps described above.

One problem Im having is that the timings are extremely inconsistent, one run is over 5 minutes and the next identical build is only 2 minutes. Almost like theres still some caching active.

@phiresky What does that mean?

@JustAnotherSoftwareDeveloper I dont understand, this requires only cargo build as dev environment.

Nutomic avatar Jul 14 '23 08:07 Nutomic

@phiresky What does that mean?

Nothing changeable for this issue I guess, you can ignore it.

I also had the thing where incremental builds sometimes took 2min and sometimes 4min, even though there was no real difference. Also sometimes it started rebuilding from scratch when changing these settings and sometimes it didn't.

phiresky avatar Jul 14 '23 22:07 phiresky

I ran it again without CARGO_BUILD_INCREMENTAL=false. Here's the big increases in seconds:

db_views: 25.1 to 390.5

db_views_actor: 20.5 to 196.8

lemmy incremental timings.zip

dullbananas avatar Jul 15 '23 00:07 dullbananas

I changed time.sh to run this after cd ..:

# Delete previous output and succeed even if no files match
rm *.mm_profdata || true

# Running `clean -p lemmy_db_views` then building that crate with `cargo rustc` doesn't cause a lengthy rebuild
cargo +nightly clean
RUSTFLAGS="-Zself-profile" CARGO_PROFILE_DEV_DEBUG=0 cargo +nightly build --lib

After compiling twice in parallel like before, the profiling data for a single crate can be compared by running this:

# Install `summarize` (stable branch is not used because it does a segmentation fault)
cargo install --git https://github.com/rust-lang/measureme.git summarize

# Compare for `lemmy_db_views`
summarize diff {lemmy,lemmy-after*}/lemmy_db_views-*.mm_profdata > diff.txt

Here's the top part of the diff for lemmy_db_views:

+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| Item                                                | Self Time       | Self Time Change | Time            | Time Change | Item count | Cache hits | Blocked time | Incremental load time | Incremental hashing time |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| evaluate_obligation                                 | +211.016411783s | +1882.86%        | +211.186271064s | +1829.21%   | +2808      | +0         | +0ns         | +0ns                  | +1.894187ms              |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| type_op_prove_predicate                             | +61.559385085s  | +9331.05%        | +73.60123312s   | +8078.20%   | +562       | +0         | +0ns         | +0ns                  | +111.482µs               |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| normalize_projection_ty                             | +50.426422858s  | +5991.19%        | +78.595032244s  | +5779.51%   | +226       | +0         | +0ns         | +0ns                  | +706.391µs               |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| type_op_ascribe_user_type                           | +2.099559137s   | +1226.18%        | +2.104304975s   | +1210.39%   | +19        | +0         | +0ns         | +0ns                  | +48.362µs                |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| typeck                                              | +1.144748815s   | +6.12%           | +172.281216328s | +583.21%    | +0         | +0         | +0ns         | +0ns                  | +153.797µs               |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| incr_comp_encode_dep_graph                          | +319.49277ms    | +227.74%         | +319.49277ms    | +227.74%    | +214448    | +0         | +0ns         | +0ns                  | +0ns                     |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| specialization_graph_of                             | -273.819877ms   | -81.77%          | -495.97361ms    | -67.81%     | +0         | +0         | +0ns         | +0ns                  | -3.74261ms               |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| LLVM_module_codegen_emit_obj                        | +254.782767ms   | +67.54%          | +254.782767ms   | +67.54%     | +1         | +0         | +0ns         | +0ns                  | +0ns                     |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| implementations_of_trait                            | -142.80069ms    | -74.84%          | -125.632089ms   | -54.91%     | +0         | +0         | +0ns         | +0ns                  | -71.97774ms              |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| impl_trait_ref                                      | -140.059498ms   | -78.04%          | -152.183986ms   | -56.85%     | +4         | +0         | +0ns         | +0ns                  | -136.168477ms            |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| incr_comp_intern_dep_graph_node                     | -133.852969ms   | -56.58%          | -102.34653ms    | -33.99%     | +9516      | +0         | +0ns         | +0ns                  | +0ns                     |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| expand_proc_macro                                   | -123.962352ms   | -65.35%          | -123.963342ms   | -65.35%     | +0         | +0         | +0ns         | +0ns                  | +0ns                     |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| metadata_register_crate                             | -117.498081ms   | -78.83%          | -262.029345ms   | -71.98%     | +0         | +0         | +0ns         | +0ns                  | +0ns                     |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| LLVM_passes                                         | +91.742232ms    | +51.62%          | +91.742232ms    | +51.62%     | +0         | +0         | +0ns         | +0ns                  | +0ns                     |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| mir_borrowck                                        | +83.383766ms    | +45.83%          | +308.602548347s | +6752.16%   | +0         | +0         | +0ns         | +0ns                  | -15.478µs                |
+-----------------------------------------------------+-----------------+------------------+-----------------+-------------+------------+------------+--------------+-----------------------+--------------------------+
| metadata_decode_entry_exported_symbols              | -75.321233ms    | -40.33%          | -75.321233ms    | -40.33%     | +0         | +0         | +0ns         | +0ns                  | +0ns                     |

lemmy_db_views diff.zip

dullbananas avatar Jul 15 '23 04:07 dullbananas

Maybe this is because of the increased amount of lifetimes caused by replacing &'a DbPool with &'a mut DbPool<'b>. I will try to fix this by create a type alias that equals &'a mut DbPool<'a> and remove the need for 2 separate lifetimes by reborrowing the reference inside of DbPool whenever a smaller lifetime is needed.

dullbananas avatar Jul 15 '23 04:07 dullbananas

This is a difficult one for sure. Spent a few hours trying some tricks and nothing made a dent. My guess is some weird interaction with Diesel types.

googleben avatar Jul 16 '23 10:07 googleben

It's not perfect, but I came up with a solution that compiles and I think should work, with the caveat that the unsafe is only sound as long as you always treat its origin as borrowed (not great, I know). Took my compile times on a Ryzen 5600x from around 150 seconds for lemmy_db_views_actor down to about 15 seconds, and ~280 to ~19 for lemmy_db_views. Here's the new DbPool with accoutrements:

pub type ActualDbPool = Pool<AsyncPgConnection>;

pub struct DbPool {
  pub pool: ActualDbPool,
  conn: Option<DbPoolConn>
}

pub enum DbPoolConn {
  Orig(PooledConnection<AsyncPgConnection>),
  Ref(*mut AsyncPgConnection)
}

unsafe impl Send for DbPoolConn {}
unsafe impl Sync for DbPoolConn {}

impl Deref for DbPoolConn {
  type Target = AsyncPgConnection;
  fn deref(&self) -> &Self::Target {
    match self {
      Self::Orig(conn) => conn,
      Self::Ref(conn) => unsafe { &**conn }
    }
  }
}

impl DerefMut for DbPoolConn {
  fn deref_mut(&mut self) -> &mut Self::Target {
    match self {
      Self::Orig(conn) => conn,
      Self::Ref(conn) => unsafe { &mut **conn }
    }
  }
}

impl DbPool {
  pub fn new(pool: ActualDbPool) -> Self {
    Self {
      pool,
      conn: None
    }
  }
  pub async fn connect(&mut self) -> Result<DbPoolConn, DieselError> {
    if self.conn.is_none() {
        self.conn = Some(DbPoolConn::Orig(self.pool.get().await.map_err(|e| QueryBuilderError(e.into()))?));
    }
    Ok(self.conn.take().unwrap())
  }
  pub fn is_connected(&self) -> bool {
    self.conn.is_some()
  }
  pub fn from_connection(pool: &DbPool, conn: DbPoolConn) -> Self {
    Self {
      pool: pool.pool.clone(),
      conn: Some(conn)
    }
  }
  /// # Safety
  /// You MUST treat `conn` as fully mutably borrowed (unusable) for the lifetime of the returned `DbPool`.
  pub unsafe fn from_raw_connection(pool: &DbPool, conn: &mut AsyncPgConnection) -> Self {
    Self {
      pool: pool.pool.clone(),
      conn: Some(DbPoolConn::Ref(conn as *mut AsyncPgConnection))
    }
  }
}

#[macro_export]
macro_rules! try_join_with_pool {
  ($pool:ident => ($($func:expr),+)) => {{
    // Check type
    let _: &mut $crate::utils::DbPool = $pool;

    if $pool.is_connected() {
      // Run sequentially
      async {
        Ok(($({
          // `?` prevents the error type from being inferred in an `async` block, so `match` is used instead
          match ($func)($pool).await {
            ::core::result::Result::Ok(__v) => __v,
            ::core::result::Result::Err(__v) => return ::core::result::Result::Err(__v),
          }
        }),+))
      }.await
    } else {
      // Run concurrently with `try_join`
      ::futures::try_join!(
        $(async {
          let mut tmp = $crate::utils::DbPool::new($pool.pool.clone());
          ($func)(&mut tmp).await
        }),+
      )
    }
  }};
}

It mostly works like the old version. And the unsafe+raw pointer are only necessary for one thing as far as I can tell, which is when you need to use a DbPool inside conn.build_transaction().run(|conn| {}), e.g. https://github.com/LemmyNet/lemmy/blob/main/crates/db_schema/src/impls/federation_blocklist.rs#L24. Maybe someone better than me at Rust can fix the unsoundness problem, but I'm really not sure it's possible without either closure wizardry or the lifetime parameters that caused this issue.

Also I don't have an environment set up yet so I haven't tested it - there may be bugs.

googleben avatar Jul 16 '23 19:07 googleben

Unsafe code should be a last resort.

I'm currently trying to reduce compile time by removing use of TypedBuilder, which generates lots of code. It's almost finished.

Also, I recently realized that &mut AsyncPgConnection can be converted to tokio::sync::Mutex<&mut AsyncPgConnection>. Maybe &mut DbPool<'_> can be replaced with DbPool<'_> if DbPool is defined like this:

#[derive(Clone, Copy)]
enum DbPool<'a> {
  Pool(&'a DbPool),
  Conn(&'a Mutex<&'a mut AsyncPgConnection>),
}

dullbananas avatar Jul 16 '23 21:07 dullbananas

@dullbananas Maybe instead of those structs it would be easier to make the get_conn function a trait? something like...

trait GetConn {
     fn get_conn(&mut self) -> &mut AsyncPgConnection;
}
impl GetConn for Pool<AsyncPgConnection> { ...self.get() }
impl GetConn for Object<AsyncPgConnection> { &mut self }

and then just pass around &dyn GetConn? That would get rid of all the monomorphization and complexity for a bit of overhead.

phiresky avatar Jul 17 '23 00:07 phiresky

#3637 is a major improvement, it improves build time from 2m 16s to 42s in my measurements. This completely resolves the issue in my view, although it can never hurt to improve compile time further. Using unsafe is not an option though because we have zero experience with that, and its too risky in a web server which is exposed to the internet.

Nutomic avatar Jul 17 '23 10:07 Nutomic

I think this is making it slow: https://github.com/diesel-rs/diesel/issues/3223

dullbananas avatar Jul 17 '23 15:07 dullbananas

It looks like removing TypedBuilder completely fixed the promblem caused by #3420. This is supported by the lack of significant change in build time of db_views_moderator, which didn't use TypedBuilder. So changing DbPool is unlikely to improve build time.

dullbananas avatar Jul 23 '23 02:07 dullbananas

Gonna close this issue then.

Nutomic avatar Jul 26 '23 10:07 Nutomic