
Reproducing results from paper

Open carriex opened this issue 1 year ago • 17 comments

Hi Jack,

Thanks for the great work and sharing the code! I am trying to reproduce results from the paper and want to confirm if I am doing it correctly.

Specifically I ran the below code

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct",
    use_less_data=-1,  # use all data
)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

And got the below results {'eval_loss': 0.6015774011611938, 'eval_pred_num_tokens': 31.0, 'eval_true_num_tokens': 32.0, 'eval_token_set_precision': 0.9518449167645596, 'eval_token_set_recall': 0.9564611513833035, 'eval_token_set_f1': 0.9538292487776809, 'eval_token_set_f1_sem': 0.004178347129611342, 'eval_n_ngrams_match_1': 23.128, 'eval_n_ngrams_match_2': 20.244, 'eval_n_ngrams_match_3': 18.212, 'eval_num_true_words': 24.308, 'eval_num_pred_words': 24.286, 'eval_bleu_score': 83.32868888524891, 'eval_bleu_score_sem': 1.1145241315071208, 'eval_rouge_score': 0.9550079258714326, 'eval_exact_match': 0.578, 'eval_exact_match_sem': 0.022109039310618563, 'eval_emb_cos_sim': 0.9910151958465576, 'eval_emb_cos_sim_sem': 0.0038230661302804947, 'eval_emb_top1_equal': 0.75, 'eval_emb_top1_equal_sem': 0.11180339753627777, 'eval_runtime': 253.6454, 'eval_samples_per_second': 1.971, 'eval_steps_per_second': 0.126}

Are the numbers here supposed to correspond to "GTR - NQ - Vec2Text [20 steps]" in table 1 (row 7)? I think most of the numbers are close, except for exact match, for which I got a higher number (57.8 vs. 40.2 in the paper).

Thanks again!

carriex avatar Mar 19 '24 15:03 carriex

Yep, this looks right to me. I think we trained the model for more steps after submission which is why the scores went up a little bit. To get the higher score, you have to set sequence beam width to 8 and the number of steps to 50.
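Concretely, with the loading code you posted above, that setting would look like this (just a sketch, reusing the same attributes as in your snippet):

trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 8
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)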

jxmorris12 avatar Mar 19 '24 16:03 jxmorris12

awesome, thanks for the quick response!

carriex avatar Mar 19 '24 16:03 carriex

One follow-up question -- how are the train/dev splits for the NQ experiments constructed (are they split randomly at the article level or at the truncated-passage level)?

The dev dataset looks like randomly sampled passages from different articles (i.e., the second row is not a continuation of the first row).

[screenshot: sample rows from the dev dataset]

A bit more background on this: I am trying to test the model on longer sequences (e.g., 2x-length Wikipedia passages), so I was thinking of simply concatenating the passages in the dev set (which I think only makes sense if they are consecutive). It also seems like some experiments in the paper (table 2) look at decoding from lengths longer than the training sequences. I'd appreciate it if you could provide pointers on how to reproduce some of those results too!
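Concretely, I was imagining something like this rough sketch (it assumes the dev split exposes the raw passage text under a column I'm calling "text" here -- that name is a guess -- and that row i+1 really continues row i, which is exactly what I'm unsure about):

dev = train_datasets["validation"]
# Pair up adjacent passages and join their text to get roughly 2x-length inputs.
long_texts = [
    dev[i]["text"] + " " + dev[i + 1]["text"]
    for i in range(0, len(dev) - 1, 2)
]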

Thanks a lot!

carriex avatar Mar 19 '24 18:03 carriex

Hi! I took the train and validation sets from DPR (https://arxiv.org/abs/2004.04906 / https://github.com/facebookresearch/DPR). I'll send you a message offline to discuss further.

jxmorris12 avatar Mar 20 '24 01:03 jxmorris12

Oh, but I don't think table 2 involves decoding from any lengths longer than the training sequences. I train on sequences up to 128 tokens and use those for testing too. I never test on embedded sequences of more than 128 tokens, but that sounds really interesting!

jxmorris12 avatar Mar 20 '24 01:03 jxmorris12

Oh, I see! Are the results in table 2 then reported for the model trained on OpenAI embeddings over the MSMARCO dataset with a mix of different sequence lengths (looking at the section below)?

[screenshot: section of the paper describing the mixed sequence-length training setup]

thanks again!

carriex avatar Mar 20 '24 01:03 carriex

yes, the MSMarco longer-sequence-length dataset included sequences from 1 to 128 tokens

jxmorris12 avatar Mar 20 '24 01:03 jxmorris12

Hi there!

I am trying to reproduce the results for the OpenAI model trained on MSMARCO (up to 128 tokens, last section in table 1). Is the below the correct command/model to run?

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/vec2text__openai_ada002__msmarco__msl128__corrector",
    use_less_data=-1,  # use all data
)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

I am currently running into some errors (hard-coded path not found, etc.), but wanted to make sure this is the right model/setup to look at. Thanks!

[screenshot: error traceback]

carriex avatar Apr 16 '24 19:04 carriex

Hi @carriex -- this looks right! I'm pretty sure that's the right model. Can you share the error with me? Or maybe we can work out of a Colab to get this figured out. Sorry for the hardcoded path; I'm not sure where it is but I will remove it for you!

jxmorris12 avatar Apr 17 '24 22:04 jxmorris12

Sorry for the late reply! Here is a colab notebook showing the error.

carriex avatar Apr 29 '24 16:04 carriex

Ok, there was something weird with the pre-trained model from Hugging Face, which I will look into. For now, I developed a workaround; here's some code that properly loads the hypothesizer model from its pre-trained checkpoint:

import torch

from vec2text.analyze_utils import args_from_config
from vec2text.models.config import InversionConfig
from vec2text.run_args import DataArguments, ModelArguments, TrainingArguments

from vec2text import experiments

def load_experiment_and_trainer_from_pretrained(name: str, use_less_data: int = 1000):
    config = InversionConfig.from_pretrained(name)
    model_args = args_from_config(ModelArguments, config)
    data_args = args_from_config(DataArguments, config)
    training_args = args_from_config(TrainingArguments, config)

    data_args.use_less_data = use_less_data
    #######################################################################
    from accelerate.state import PartialState

    training_args._n_gpu = 1 if torch.cuda.is_available() else 0  # Don't load in DDP
    training_args.bf16 = 0  # no bf16 in case no support from GPU
    training_args.local_rank = -1  # Don't load in DDP
    training_args.distributed_state = PartialState()
    training_args.deepspeed_plugin = None  # For backwards compatibility
    # training_args.dataloader_num_workers = 0  # no multiprocessing :)
    training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"
    training_args.use_wandb = False
    training_args.report_to = []
    training_args.mock_embedder = False
    training_args.output_dir = "saves/" + name.replace("/", "__")
    ########################################################################

    experiment = experiments.experiment_from_args(
        model_args,
        data_args,
        training_args,
    )
    trainer = experiment.load_trainer()
    trainer.model = trainer.model.__class__.from_pretrained(name)
    trainer.model.to(training_args.device)
    return experiment, trainer
  
experiment, trainer = load_experiment_and_trainer_from_pretrained(
    "jxm/vec2text__openai_ada002__msmarco__msl128__corrector",
    use_less_data=1000,

)

print(" >>>> test ")
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


print(" >>>> loaded datasets ")

trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

jxmorris12 avatar Apr 29 '24 18:04 jxmorris12

(The only line I changed was adding this:)

    training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"

jxmorris12 avatar Apr 29 '24 18:04 jxmorris12

Hi, I want to know if this is the right command to reproduce the gtr-nq-32-50iter-sbeam result:

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct"
)
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)

val_datasets = experiment._load_val_datasets_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)
trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 4
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

I got this: [screenshot of evaluation results]

Hannibal046 avatar Jun 22 '24 13:06 Hannibal046

Hmm, the command looks right and the numbers are close but a little low. Oddly the dataset looks different -- I've never seen that example ("Toonimo Toonimo is a...") before. Are you using the proper MSMarco split? Maybe a newer dataset version was uploaded or something else changed that's dropping the score a bit.

Also how many samples are you using from the validation set?
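(If it helps to compare on a fixed-size subset -- e.g. the first 500 examples -- you can truncate the validation set before evaluating. This is just a sketch, assuming the split is a regular datasets.Dataset:)

eval_subset = train_datasets["validation"].select(range(500))
trainer.evaluate(eval_dataset=eval_subset)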

jxmorris12 avatar Jun 24 '24 16:06 jxmorris12

Hi, thanks for the response! If I understand this correctly, "jxm/gtr__nq__32__correct" would use an NQ split for testing, not MSMARCO? I didn't change the number of evaluated samples, and the script directly loads the trainer state from "jxm/gtr__nq__32__correct".

To clarify, what is the expected number for this model? Is it the last row in the figure? Thanks in advance.

[screenshot: results table from the paper with the relevant row highlighted]

Hannibal046 avatar Jun 24 '24 18:06 Hannibal046

Yep it should be the last number in the figure, the one you highlighted. And you're right -- it should be the NQ validation set (not MSMARCO, my mistake). Something else must have changed between your setup and mine because the numbers in red are correct. I will put some thought into what it may be.

jxmorris12 avatar Jun 24 '24 20:06 jxmorris12

Hi @jxmorris12, do you think this might be relevant? https://github.com/ielab/vec2text-dense_retriever-threat/issues/1

The default value of return_best_hypothesis is set to False in the snippet above. After manually setting it to True, this is what I got:

[screenshot of evaluation results]

It looks much better now, but is still a little bit lower. I also want to confirm whether this is the exact data split reported in the paper: the first 500 samples from the dev split of jxm/nq_corpus_dpr.
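For reference, this is roughly how I set return_best_hypothesis (assuming the flag lives directly on the trainer, as in the linked issue):

trainer.return_best_hypothesis = True  # defaults to False
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)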

Hannibal046 avatar Jun 27 '24 05:06 Hannibal046