Training shortlist on large-ish dataset
Hello, I have gotten this model training on custom data. It works through the surrogate and extreme steps, but during the ANN/shortlist calculation it seems to be trying to pickle an extremely large object, and it throws this error:
File "/data/ebay/notebooks/chmo/astec/programs/deepxml/deepxml/main.py", line 428, in main
output = train(model, params)
File "/data/ebay/notebooks/chmo/astec/programs/deepxml/deepxml/main.py", line 107, in train
surrogate_mapping=params.surrogate_mapping)
File "/data/ebay/notebooks/chmo/astec/programs/deepxml/deepxml/libs/model.py", line 528, in fit
use_intermediate_for_shorty, precomputed_intermediate)
File "/data/ebay/notebooks/chmo/astec/programs/deepxml/deepxml/libs/model.py", line 337, in _fit
self.save_checkpoint(model_dir, epoch+1)
File "/data/ebay/notebooks/chmo/astec/programs/deepxml/deepxml/libs/model.py", line 695, in save_checkpoint
model_dir, self.tracking.saved_checkpoints[-1]['ANN']))
File "/data/ebay/notebooks/chmo/astec/programs/deepxml/deepxml/libs/shortlist.py", line 105, in save
self.knn.save(fname+'.knn')
File "/home/chmo/.local/lib/python3.7/site-packages/xclib-0.97-py3.7-linux-x86_64.egg/xclib/utils/shortlist.py", line 396, in save
'space': self.space}, open(fname+".metadata", 'wb'))
OverflowError: cannot serialize a bytes object larger than 4 GiB
Is this where it is saving the KNN checkpoint? Why is the object so large that it exceeds 4 GB? I am wondering because the paper reports the model was trained on a dataset with a label space of 62 million. I am also running this training with a P100 and 61 workers.
Stats on my dataset: number of features: 111,786; number of labels: 699,001.
And my config:
{
    "global": {
        "dataset": "leaf",
        "feature_type": "sparse",
        "num_labels": 699001,
        "arch": "Astec",
        "A": 0.55,
        "B": 1.5,
        "use_reranker": true,
        "surrogate_threshold": 512,
        "surrogate_method": 1,
        "embedding_dims": 391,
        "top_k": 100,
        "beta": 0.3,
        "save_predictions": true,
        "trn_label_fname": "trn_X_Y.txt",
        "val_label_fname": "tst_X_Y.txt",
        "tst_label_fname": "tst_X_Y.txt",
        "trn_feat_fname": "trn_X_Xf.txt",
        "val_feat_fname": "tst_X_Xf.txt",
        "tst_feat_fname": "tst_X_Xf.txt"
    },
    "surrogate": {
        "num_epochs": 20,
        "dlr_factor": 0.5,
        "learning_rate": 0.01,
        "batch_size": 255,
        "dlr_step": 14,
        "normalize": true,
        "init": "token_embeddings",
        "optim": "Adam",
        "embeddings": "fasttextB_embeddings_391d.npy",
        "validate": true,
        "save_intermediate": true
    },
    "extreme": {
        "num_epochs": 20,
        "dlr_factor": 0.5,
        "learning_rate": 0.007,
        "batch_size": 255,
        "dlr_step": 14,
        "ns_method": "ensemble",
        "num_centroids": 1,
        "efC": 300,
        "efS": 400,
        "M": 100,
        "num_nbrs": 500,
        "ann_threads": 18,
        "beta": 0.5,
        "surrogate_mapping": true,
        "num_clf_partitions": 1,
        "optim": "Adam",
        "freeze_intermediate": true,
        "validate": true,
        "model_method": "shortlist",
        "normalize": true,
        "shortlist_method": "hybrid",
        "init": "intermediate",
        "use_shortlist": true,
        "use_intermediate_for_shorty": true
    },
    "reranker": {
        "num_epochs": 20,
        "dlr_factor": 0.5,
        "learning_rate": 0.005,
        "batch_size": 255,
        "dlr_step": 10,
        "beta": 0.6,
        "num_clf_partitions": 1,
        "optim": "Adam",
        "validate": true,
        "model_method": "reranker",
        "shortlist_method": "static",
        "surrogate_mapping": true,
        "normalize": true,
        "use_shortlist": true,
        "init": "token_embeddings",
        "save_intermediate": false,
        "keep_invalid": true,
        "freeze_intermediate": false,
        "update_shortlist": false,
        "use_pretrained_shortlist": true,
        "embeddings": "fasttextB_embeddings_391d.npy"
    }
}
Things should be fine even when the model size is greater than 4 GB. This seems to be a pickle issue.
Can you please confirm:
- Python version
- Number of training points
- Embedding dimensionality: is it 391D?
- Python 3.7.10
- 4,196,900 training points
- Yes, embedding dim is 391
It is probably a pickle issue. The problem does not arise when I set ns_method to kcentroid, so it seems to be related to the ensemble ns_method.
The issue was with pickle. pyxclib now uses the highest protocol (to store) - things should be fixed now. Please take a look and let me know.
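For context on the fix: pickle protocol 3 (the default on Python 3.7) cannot serialize a single bytes object larger than 4 GiB, while protocol 4 and above can. A minimal sketch of the kind of change this implies for the metadata dump shown in the traceback; save_metadata is just an illustrative helper here, and the actual xclib code may differ:

import pickle

def save_metadata(fname, metadata):
    # The default protocol on Python 3.7 (protocol 3) raises OverflowError
    # for objects larger than 4 GiB; HIGHEST_PROTOCOL (>= 4) does not.
    with open(fname + ".metadata", "wb") as fp:
        pickle.dump(metadata, fp, protocol=pickle.HIGHEST_PROTOCOL)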
Thanks for looking into it. When ns_method in the config is set to ensemble, it still gives this error:
~/.local/lib/python3.7/site-packages/xclib-0.97-py3.7-linux-x86_64.egg/xclib/utils/shortlist.py in save(self, fname)
394
395 def query(self, data, *args, **kwargs):
--> 396 indices, distances = self.index.predict(data)
397 indices = indices.astype(np.int64)
398 indices, similarities = self._remap(indices, distances)
OverflowError: cannot serialize a bytes object larger than 4 GiB
However, when ns_method is kcentroid, it works fine.
Would it be possible for you to share the size of the label matrix and its nnz? I'll try to create a dummy label matrix to debug this.
nnz is set to 100 (I think that's the default), the number of labels is 533,213, and the number of training data points is 328,641.
Thanks, I got the dimensions/shape. I meant the size (as in GB or MB) of the label matrix; I want to create a dummy matrix of the same size and use that to debug.
Note that the KNN graph needs to save the train label matrix and that's where it is throwing the error.
Got it. The label file for training is 261 MB (trn_X_Y.txt).
Thank you. Would it be possible to share the size in memory or nnz in the label matrix? I should be able to create a dummy csr_matrix with that info.
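Roughly, what I have in mind is a random CSR matrix with the same shape and sparsity as yours. A sketch using the shape you shared above; the average nnz per row is a placeholder until we know the real value:

import scipy.sparse as sp

# Shape from the numbers shared above; avg_nnz_per_row is a placeholder.
num_points, num_labels, avg_nnz_per_row = 328641, 533213, 5
dummy = sp.random(num_points, num_labels,
                  density=avg_nnz_per_row / num_labels,
                  format='csr')
print(dummy.shape, dummy.nnz)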
The size of the label matrix is 64. This is how I got the size:
import sys
from xclib.data.data_utils import read_sparse_file

label_path = "trn_X_Y.txt"
label_matrix = read_sparse_file(label_path, safe_read=False)
sys.getsizeof(label_matrix)  # returns 64
Is that the info you are looking for?
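As a side note, sys.getsizeof on a scipy sparse matrix only measures the Python wrapper object (hence the 64 bytes), not the underlying arrays. A sketch of how the in-memory size and nnz could be reported instead, assuming read_sparse_file returns a CSR matrix as above:

from xclib.data.data_utils import read_sparse_file

label_matrix = read_sparse_file("trn_X_Y.txt", safe_read=False)

# Bytes actually held by the CSR arrays (data + indices + indptr), plus nnz.
nbytes = (label_matrix.data.nbytes
          + label_matrix.indices.nbytes
          + label_matrix.indptr.nbytes)
print("shape:", label_matrix.shape)
print("nnz:", label_matrix.nnz)
print("in-memory size (MB): %.1f" % (nbytes / 1e6))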