simplefs icon indicating copy to clipboard operation
simplefs copied to clipboard

Use hash func to boost file creation and lookup

Open RoyWFHuang opened this issue 2 months ago • 9 comments

Previously, SimpleFS used a sequential insertion method to create files, which worked efficiently when the filesystem contained only a small number of files. However, in real-world use cases, filesystems often manage a large number of files, making sequential search and insertion inefficient. Inspired by Ext4’s hash-based directory indexing, this change adopts a hash function to accelerate file indexing and improve scalability.

Change: Implemented hash-based file index lookup Improved scalability for large directory structures

hash_code = file_hash(file_name);

extent index = hash_code / SIMPLEFS_MAX_BLOCKS_PER_EXTENT block index = hash_code % SIMPLEFS_MAX_BLOCKS_PER_EXTENT;

 inode
  +-----------------------+
  | i_mode = IFDIR | 0755 |      block 123 (simplefs_file_ei_block)
  | ei_block = 123    ----|--->  +----------------+
  | i_size = 4 KiB        |      | nr_files  = 7  |
  | i_blocks = 1          |      |----------------|
  +-----------------------+    0 | ee_block  = 0  |
              (extent index = 0) | ee_len    = 8  |
                                 | ee_start  = 84 |--->  +-------------+ block 84(simplefs_dir_block)
                                 | nr_file   = 2  |      |nr_files = 2 | (block index = 0)
                                 |----------------|      |-------------|
                               1 | ee_block  = 8  |    0 | inode  = 24 |
              (extent index = 1) | ee_len    = 8  |      | nr_blk = 1  |
                                 | ee_start  = 16 |      | (foo)       |
                                 | nr_file   = 5  |      |-------------|
                                 |----------------|    1 | inode  = 45 |
                                 | ...            |      | nr_blk = 14 |
                                 |----------------|      | (bar)       |
                             341 | ee_block  = 0  |      |-------------|
            (extent index = 341) | ee_len    = 0  |      | ...         |
                                 | ee_start  = 0  |      |-------------|
                                 | nr_file   = 12 |   14 | 0           |
                                 +----------------+      +-------------+ block 85(simplefs_dir_block)
                                                         |nr_files = 2 | (block index = 1)
                                                         |-------------|
                                                       0 | inode  = 48 |
                                                         | nr_blk = 15 |
                                                         | (foo1)      |
                                                         |-------------|
                                                       1 | inode  = 0  |
                                                         | nr_blk = 0  |
                                                         |             |
                                                         |-------------|
                                                         | ...         |
                                                         |-------------|
                                                      14 | 0           |
                                                         +-------------+

Performance test

  • Random create 30600 files into filesystem

legacy:

         168140.12 msec task-clock                       #    0.647 CPUs utilized
            111367      context-switches                 #  662.346 /sec
             40917      cpu-migrations                   #  243.351 /sec
           3736053      page-faults                      #   22.220 K/sec
      369091680702      cycles                           #    2.195 GHz
      168751830643      instructions                     #    0.46  insn per cycle
       34044524391      branches                         #  202.477 M/sec
         768151711      branch-misses                    #    2.26% of all branches

     259.842753513 seconds time elapsed
      23.000247000 seconds user
     150.380145000 seconds sys

full_name_hash

         167926.13 msec task-clock                       #    0.755 CPUs utilized
            110631      context-switches                 #  658.808 /sec
             43835      cpu-migrations                   #  261.037 /sec
           3858617      page-faults                      #   22.978 K/sec
      392878398961      cycles                           #    2.340 GHz
      207287412692      instructions                     #    0.53  insn per cycle
       42556269864      branches                         #  253.423 M/sec
         840868990      branch-misses                    #    1.98% of all branches

     222.274028604 seconds time elapsed
      20.794966000 seconds user
     151.941876000 seconds sys

  • Random remove 30600 files into filesystem

legacy:

         104332.44 msec task-clock                       #    0.976 CPUs utilized
             56514      context-switches                 #  541.672 /sec
              1174      cpu-migrations                   #   11.252 /sec
           3796962      page-faults                      #   36.393 K/sec
      258293481279      cycles                           #    2.476 GHz
      153853176926      instructions                     #    0.60  insn per cycle
       30434271757      branches                         #  291.705 M/sec
         532967347      branch-misses                    #    1.75% of all branches

     106.921706288 seconds time elapsed
      16.987883000 seconds user
      91.268661000 seconds sys

full_name_hash

          83278.61 msec task-clock                       #    0.967 CPUs utilized
             52431      context-switches                 #  629.585 /sec
              1309      cpu-migrations                   #   15.718 /sec
           3796501      page-faults                      #   45.588 K/sec
      199894058328      cycles                           #    2.400 GHz
      110625460371      instructions                     #    0.55  insn per cycle
       20325767251      branches                         #  244.069 M/sec
         490549944      branch-misses                    #    2.41% of all branches

      86.132655220 seconds time elapsed
      19.180209000 seconds user
      68.476075000 seconds sys
  • Random check (ls -la filename) 30600 files into filesystem Use perf stat ls -la to measure the query time for each file and sum up all elapsed times to calculate the total lookup cost.

Legacy : min time: 0.00171 s max time: 0.03799 s avg time: 0.00423332 s tot time: 129.539510 s

full_name_hash: min time: 0.00171 s max time: 0.03588 s avg time: 0.00305601 s tot time: 93.514040 s


Summary by cubic

Switched SimpleFS to hash-based directory indexing using full_name_hash to speed up file creation and lookup by mapping filenames to extent/block slots. On 30.6k files: create ~33% faster, delete ~12% faster, lookup ~41% faster.

  • New Features

    • Hash-guided placement/lookup in create, link, symlink, rename, and unlink.
    • Added fast __file_lookup and early-stop iteration using per-extent counts; reclaims empty extents on unlink/rename.
  • Refactors

    • Added hash.c (simplefs_hash via full_name_hash).

Written for commit 304f288ff74da622abb395c43971995327ae8924. Summary will update automatically on new commits.

RoyWFHuang avatar Nov 10 '25 03:11 RoyWFHuang

How can you determine which hash function is the most suitable?

jserv avatar Nov 10 '25 03:11 jserv

How can you determine which hash function is the most suitable?

I’m not sure if "fnv" is the most suitable, but index in SimpleFS is relatively small, using a more complex algorithm might not provide significant benefits. I think fnv is a reasonable balance between simplicity and performance.

RoyWFHuang avatar Nov 10 '25 21:11 RoyWFHuang

I saw that your PR description includes some performance benchmarks, but the commit message lacks any performance numbers to support your improvements. Please improve the commit message.

Added all hash results into the commit.

RoyWFHuang avatar Nov 13 '25 22:11 RoyWFHuang

Quoted from patch 2:

Align the print function with the Simplefs print format for consistency. Also adjust variable declarations to fix compiler warnings when building under the C90 standard.

I'm unsure which Linux kernel versions simplefs currently intends to support, but AFAIK, the Linux kernel currently uses gnu c11 as its standard.

SimpleFS targets kernel 5.10, and since 5.10 recommends using C11, this commit is no longer needed. I have removed it.

Furthermore, the word "Also" is often a sign that the change should be in a separate patch. In my view, you are performing two distinct actions here:

a) Changing printk -> pr_err. b) Fixing a compiler warning.

I also remain confused as to whether the printk to pr_err change is truly warranted, and what relevance it has to the PR's title, which is "Use hash func to boost file creation and lookup".

OK, I agree. I’ve removed these commits. If needed, I will create another PR to address it.

As far as I know, the pr_info rule was already established in the initial commit, and printk appeared only in later changes. I think we should follow the original rule and keep it consistent.

RoyWFHuang avatar Nov 16 '25 02:11 RoyWFHuang

Hmmm, I see my comments were marked as resolved, but I don't see any changes or replies. Since you re-requested a review, did you perhaps forget to push the new changes, or am I missing something?

visitorckw avatar Nov 21 '25 17:11 visitorckw

Hmmm, I see my comments were marked as resolved, but I don't see any changes or replies. Since you re-requested a review, did you perhaps forget to push the new changes, or am I missing something?

I replied to the comments to clarify why I added ei_nr and why I am still using simplefs_hash. Could you please help check it?

RoyWFHuang avatar Nov 22 '25 03:11 RoyWFHuang

I checked my email and didn't see any replies between the two review requests. I also checked the gihub web page and expanded the 'show resolved' threads on my last two comments, but I still don't see a response there. t seems like somehow your reply didn't go through, or maybe I just can't see it on my end?

visitorckw avatar Nov 22 '25 06:11 visitorckw

I checked my email and didn't see any replies between the two review requests. I also checked the gihub web page and expanded the 'show resolved' threads on my last two comments, but I still don't see a response there. t seems like somehow your reply didn't go through, or maybe I just can't see it on my end?

Sorry, I forgot to submit these comments

RoyWFHuang avatar Nov 22 '25 08:11 RoyWFHuang

@cubic-dev-ai Continue analyzing and warning until:

  1. Fix all buffer head leaks
  2. Add missing mark_buffer_dirty() calls
  3. Fix the rename error path panic
  4. Extract wraparound logic into helper function

jserv avatar Nov 28 '25 04:11 jserv