
Questions concerning train.py and its parameters

Open ThomasHoppe opened this issue 9 months ago • 1 comment

  1. Running training with the following call does not store the final model:

python train.py --dataset_dir ../datasets --train_dataset enron --N 120000 --B 20 --total_steps 601 --encoder_spec OAI --use_oai_embd --key_embd_src key --use_data_aug --use_cached_embd --hf_token MYTOKEN

Which argument needs to be given to store the trained model?

  2. What are the semantics of key_embd_src?

In the paper you talk about name-property-value triples. I understand that the key is composed from the name embedding and the property embedding.

I understand that with key_embd_src == "key", key_embd is set to the embedding of the name. But is value_embd then set, following the construction in the paper, to the embedding of the property? Or, as its name suggests, to the embedding of the value part of the knowledge triple?

And what do key_embd_src == "answer" and key_embd_src == "question" mean, respectively?
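To make the question concrete, here is a minimal sketch of what I would expect the key_embd_src switch to do, based only on my reading of the paper. The function name, the field names, and the mapping itself are my assumptions for illustration, not the repo's actual API:

```python
# Hypothetical sketch (not KBLaM's actual code): key_embd_src might select
# which text field of a dataset entry is embedded to form the key.
def select_key_text(entry: dict, key_embd_src: str) -> str:
    if key_embd_src == "key":
        # the "<name>'s <property>" style string from the triple
        return entry["key_string"]
    elif key_embd_src == "question":
        return entry["Q"]
    elif key_embd_src == "answer":
        return entry["A"]
    raise ValueError(f"unknown key_embd_src: {key_embd_src}")
```

If this reading is correct, "question" and "answer" would embed the Q and A strings instead of the key string; it would be good to have that confirmed.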

  3. The enron and synthetic datasets consist of entries with the fields:

    • name
    • description_type
    • description
    • Q
    • A
    • key_string
    • (extended_q)
    • (extended_a)

Which parts are used for the name-property-value pairs, and how do they relate to the query?
  4. As far as I could figure out from the code of train.py, the latter two relate to the argument "use_extended_qa", and a corresponding dataset is loaded in lines 857–858:

if use_extended_qa:
    dataset = json.load(open(os.path.join(dataset_dir, f"{dataset_name}_augmented.json")))

But the corresponding file "synthetic_augmented.json" is missing from the datasets directory. Does it suffice to copy the base file and rename it accordingly?

ThomasHoppe avatar May 01 '25 10:05 ThomasHoppe

Answer to question 1:

  1. total_steps needs to be a multiple of save_intervall in the train.py code; otherwise the final model is never written.
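The interaction can be sketched in a few lines. This is an illustration of the general pattern, not train.py's actual loop, and the interval value below is made up; assuming checkpoints are written only when the step count is divisible by the save interval, a run of 601 steps with an interval of 100 saves at step 600 but never at step 601:

```python
# Illustrative sketch: which steps produce a checkpoint if saving happens
# only when step % save_interval == 0 (numbers are examples, not defaults).
def saved_steps(total_steps: int, save_interval: int) -> list[int]:
    return [s for s in range(1, total_steps + 1) if s % save_interval == 0]
```

So with --total_steps 601 the last checkpoint lands at step 600, and picking total_steps as an exact multiple of the interval makes the final save coincide with the final step.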

ThomasHoppe avatar Jun 24 '25 16:06 ThomasHoppe