
Reproducing results from paper

Open carriex opened this issue 1 year ago • 17 comments

Hi Jack,

Thanks for the great work and sharing the code! I am trying to reproduce results from the paper and want to confirm if I am doing it correctly.

Specifically I ran the below code

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct",
    use_less_data=-1,  # use all data
)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

And got the below results {'eval_loss': 0.6015774011611938, 'eval_pred_num_tokens': 31.0, 'eval_true_num_tokens': 32.0, 'eval_token_set_precision': 0.9518449167645596, 'eval_token_set_recall': 0.9564611513833035, 'eval_token_set_f1': 0.9538292487776809, 'eval_token_set_f1_sem': 0.004178347129611342, 'eval_n_ngrams_match_1': 23.128, 'eval_n_ngrams_match_2': 20.244, 'eval_n_ngrams_match_3': 18.212, 'eval_num_true_words': 24.308, 'eval_num_pred_words': 24.286, 'eval_bleu_score': 83.32868888524891, 'eval_bleu_score_sem': 1.1145241315071208, 'eval_rouge_score': 0.9550079258714326, 'eval_exact_match': 0.578, 'eval_exact_match_sem': 0.022109039310618563, 'eval_emb_cos_sim': 0.9910151958465576, 'eval_emb_cos_sim_sem': 0.0038230661302804947, 'eval_emb_top1_equal': 0.75, 'eval_emb_top1_equal_sem': 0.11180339753627777, 'eval_runtime': 253.6454, 'eval_samples_per_second': 1.971, 'eval_steps_per_second': 0.126}

Are the numbers here supposed to correspond to "GTR - NQ - Vec2Text [20 steps]" in table 1 (row 7)? I think most of the numbers are close, except for exact match, for which I got a higher number (57.8 vs. 40.2 in the paper).

Thanks again!

carriex avatar Mar 19 '24 15:03 carriex

Yep, this looks right to me. I think we trained the model for more steps after submission which is why the scores went up a little bit. To get the higher score, you have to set sequence beam width to 8 and the number of steps to 50.
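Concretely, with the loading code you posted above, that setting would look like this (just a sketch, reusing the same attributes as in your snippet):

trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 8
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)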

jxmorris12 avatar Mar 19 '24 16:03 jxmorris12

awesome, thanks for the quick response!

carriex avatar Mar 19 '24 16:03 carriex

One follow-up question -- how are the train/dev splits for the NQ experiments constructed (are they split randomly at the article level or at the truncated-passage level)?

The dev dataset looks like randomly sampled passages from different articles (i.e., the second row is not a continuation of the first row).

[screenshot: sample rows from the dev dataset]

A bit more background on this: I am trying to test the model on longer sequences (e.g., 2x-length Wikipedia passages), so I was thinking of simply concatenating the passages in the dev set (which I think only makes sense if they are consecutive). It also seems like some experiments in the paper (table 2) look at decoding from lengths longer than the training sequences. I'd appreciate it if you could provide pointers on how to reproduce some of those results too!
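Concretely, I was imagining something like this rough sketch (it assumes the dev split exposes the raw passage text under a column I'm calling "text" here -- that name is a guess -- and that row i+1 really continues row i, which is exactly what I'm unsure about):

dev = train_datasets["validation"]
# Pair up adjacent passages and join their text to get roughly 2x-length inputs.
long_texts = [
    dev[i]["text"] + " " + dev[i + 1]["text"]
    for i in range(0, len(dev) - 1, 2)
]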

Thanks a lot!

carriex avatar Mar 19 '24 18:03 carriex

Hi! I took the train and validation sets from DPR (https://arxiv.org/abs/2004.04906 / https://github.com/facebookresearch/DPR). I'll send you a message offline to discuss further.

jxmorris12 avatar Mar 20 '24 01:03 jxmorris12

Oh, but I don't think table 2 involves decoding from any lengths longer than the training sequences. I train on sequences up to 128 tokens and use those for testing too. I never test on embedded sequences of more than 128 tokens, but that sounds really interesting!

jxmorris12 avatar Mar 20 '24 01:03 jxmorris12

Oh, I see! Are the results in table 2 then reported for the model trained on OpenAI embeddings over the MSMARCO dataset with a mix of different sequence lengths (looking at the section below)?

[screenshot: section of the paper describing the mixed sequence-length training setup]

thanks again!

carriex avatar Mar 20 '24 01:03 carriex

yes, the MSMarco longer-sequence-length dataset included sequences from 1 to 128 tokens

jxmorris12 avatar Mar 20 '24 01:03 jxmorris12

Hi there!

I am trying to reproduce the results for the OpenAI model trained on MSMARCO (up to 128 tokens, last section in table 1). Is the below the correct command/model to run?

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/vec2text__openai_ada002__msmarco__msl128__corrector",
    use_less_data=-1,  # use all data
)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

I am currently running into some errors (hard-coded path not found, etc.), but wanted to make sure this is the right model/setup to look at. Thanks!

[screenshot: error traceback]

carriex avatar Apr 16 '24 19:04 carriex

Hi @carriex -- this looks right! I'm pretty sure that's the right model. Can you share the error with me? Or maybe we can work out of a Colab to get this figured out. Sorry for the hardcoded path; I'm not sure where it is but I will remove it for you!

jxmorris12 avatar Apr 17 '24 22:04 jxmorris12

Sorry for the late reply! Here is a colab notebook showing the error.

carriex avatar Apr 29 '24 16:04 carriex

Ok, there was something weird with the pre-trained model from Hugging Face, which I will look into. For now, I developed a workaround; here's some code that properly loads the hypothesizer model from its pre-trained checkpoint:

import torch

from vec2text.analyze_utils import args_from_config
from vec2text.models.config import InversionConfig
from vec2text.run_args import DataArguments, ModelArguments, TrainingArguments

from vec2text import experiments

def load_experiment_and_trainer_from_pretrained(name: str, use_less_data: int = 1000):
    config = InversionConfig.from_pretrained(name)
    model_args = args_from_config(ModelArguments, config)
    data_args = args_from_config(DataArguments, config)
    training_args = args_from_config(TrainingArguments, config)

    data_args.use_less_data = use_less_data
    #######################################################################
    from accelerate.state import PartialState

    training_args._n_gpu = 1 if torch.cuda.is_available() else 0  # Don't load in DDP
    training_args.bf16 = 0  # no bf16 in case no support from GPU
    training_args.local_rank = -1  # Don't load in DDP
    training_args.distributed_state = PartialState()
    training_args.deepspeed_plugin = None  # For backwards compatibility
    # training_args.dataloader_num_workers = 0  # no multiprocessing :)
    training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"
    training_args.use_wandb = False
    training_args.report_to = []
    training_args.mock_embedder = False
    training_args.output_dir = "saves/" + name.replace("/", "__")
    ########################################################################

    experiment = experiments.experiment_from_args(
        model_args,
        data_args,
        training_args,
    )
    trainer = experiment.load_trainer()
    trainer.model = trainer.model.__class__.from_pretrained(name)
    trainer.model.to(training_args.device)
    return experiment, trainer
  
experiment, trainer = load_experiment_and_trainer_from_pretrained(
    "jxm/vec2text__openai_ada002__msmarco__msl128__corrector",
    use_less_data=1000,

)

print(" >>>> test ")
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


print(" >>>> loaded datasets ")

trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

jxmorris12 avatar Apr 29 '24 18:04 jxmorris12

(The only line I changed was adding this:)

    training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"

jxmorris12 avatar Apr 29 '24 18:04 jxmorris12

Hi, I want to know if this is the right command to reproduce the gtr-nq-32-50iter-sbeam result:

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct"
)
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)

val_datasets = experiment._load_val_datasets_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)
trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 4
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

I got this: [screenshot of evaluation results]

Hannibal046 avatar Jun 22 '24 13:06 Hannibal046

Hmm, the command looks right and the numbers are close but a little low. Oddly the dataset looks different -- I've never seen that example ("Toonimo Toonimo is a...") before. Are you using the proper MSMarco split? Maybe a newer dataset version was uploaded or something else changed that's dropping the score a bit.

Also how many samples are you using from the validation set?
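(If it helps to compare on a fixed-size subset -- e.g. the first 500 examples -- you can truncate the validation set before evaluating. This is just a sketch, assuming the split is a regular datasets.Dataset:)

eval_subset = train_datasets["validation"].select(range(500))
trainer.evaluate(eval_dataset=eval_subset)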

jxmorris12 avatar Jun 24 '24 16:06 jxmorris12

Hi, thanks for the response! If I understand this correctly, "jxm/gtr__nq__32__correct" would use an NQ split for testing, not MSMARCO? I didn't change the number of evaluated samples, and the script directly loads the trainer state from "jxm/gtr__nq__32__correct".

To clarify, what is the expected number for this model? Is it the last row in the figure? Thanks in advance.

[screenshot: results table from the paper with the relevant row highlighted]

Hannibal046 avatar Jun 24 '24 18:06 Hannibal046

Yep it should be the last number in the figure, the one you highlighted. And you're right -- it should be the NQ validation set (not MSMARCO, my mistake). Something else must have changed between your setup and mine because the numbers in red are correct. I will put some thought into what it may be.

jxmorris12 avatar Jun 24 '24 20:06 jxmorris12

Hi @jxmorris12, do you think this might be relevant? https://github.com/ielab/vec2text-dense_retriever-threat/issues/1

The default value of return_best_hypothesis is set to False in the snippet above. After manually setting it to True, this is what I got:

[screenshot of evaluation results]

It looks much better now, but is still a little bit lower. I also want to confirm whether this is the exact data split reported in the paper: the first 500 samples from the dev split of jxm/nq_corpus_dpr.
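For reference, this is roughly how I set return_best_hypothesis (assuming the flag lives directly on the trainer, as in the linked issue):

trainer.return_best_hypothesis = True  # defaults to False
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)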

Hannibal046 avatar Jun 27 '24 05:06 Hannibal046