RMI icon indicating copy to clipboard operation
RMI copied to clipboard

Off-by-one-Error in `train_two_layer` implementation

Open Jonas-Heinrich opened this issue 4 years ago • 0 comments

Hi,

I encountered an assert issue when training RMI on very small (100 items) datasets for testing purposes:

thread '<unnamed>' panicked at 'start index was 100 but end index was 100', [...]/RMI/rmi_lib/src/train/two_layer.rs:27:5

Upon closer investigation, I think I have found an off-by-one error in the train_two_layer function implementation. I did not take a look at the context beyond the function, therefore take what I am about to say with a grain of salt:


Link to source file

  1. The value of split_idx is calculated here:

https://github.com/learnedsystems/RMI/blob/5fdff45d0929beaccf6bc56f8f4c0d82baf10304/rmi_lib/src/train/two_layer.rs#L132

  1. split_idx should be in the interval [0, md_container.len())

https://github.com/learnedsystems/RMI/blob/5fdff45d0929beaccf6bc56f8f4c0d82baf10304/rmi_lib/src/train/two_layer.rs#L139

Now lets look at the case where split_idx == md_container.len() - 1, which is valid per [2.]:

  1. The else branch is taken, since split_idx < md_container.len()

https://github.com/learnedsystems/RMI/blob/5fdff45d0929beaccf6bc56f8f4c0d82baf10304/rmi_lib/src/train/two_layer.rs#L147

  1. split_idx + 1 (== md_container.len()) is passed to build_models_from as start_idx
|| build_models_from(&md_container, &top_model, layer2_model,
                                   split_idx + 1, md_container.len(),
                                   split_idx_target,
                                   second_half_models)
fn build_models_from<T: TrainingKey>(data: &RMITrainingData<T>,
                                    top_model: &Box<dyn Model>,
                                    model_type: &str,
                                    start_idx: usize, end_idx: usize,
                                    first_model_idx: usize,
                                    num_models: usize) -> Vec<Box<dyn Model>>

Link 4.1

Link 4.2

  1. Assert fails, since md_container.len() > md_container.len() is false
assert!(end_idx > start_idx,
        "start index was {} but end index was {}",
        start_idx, end_idx
);

Link 5


An obvious fix would be to change the condition in [3.] to split_idx >= md_container.len() - 1, however I am not entirely certain whether that leads to issues in other contexts. I guess a similar issue would happen if similar_idx == 0, only for the first call. I changed the condition in my local version and re-ran the tests - it seems to work just fine:

Running test cache_fix_osm
Test cache_fix_osm finished.
Running test cache_fix_wiki
Test cache_fix_wiki finished.
Running test max_size_wiki
Test max_size_wiki finished.
Running test radix_model_wiki
Test radix_model_wiki finished.
Running test simple_model_osm
Test simple_model_osm finished.
Running test simple_model_wiki
Test simple_model_wiki finished.
============== TEST RESULTS ===============
python3 report.py
PASS cache_fix_osm
PASS cache_fix_wiki
PASS max_size_wiki
PASS radix_model_wiki
PASS simple_model_osm
PASS simple_model_wiki

I can open a pull request with that fix if you would like.

Jonas-Heinrich avatar Jul 13 '21 07:07 Jonas-Heinrich