Training data format

Open NitinAggarwal1 opened this issue 2 years ago • 0 comments

Hi, the documentation describes the following layout for the training data set:

DatasetName (e.g. LF-AmazonTitles-131K)
│   trn_X.txt (text for trn documents, one text in each line)
│   tst_X.tst (text for tst documents, one text in each line)
│   Y.txt (text for labels, one text in each line)
│   trn_X_Y.txt (trn labels in spmat format)
│   tst_X_Y.txt (tst labels in spmat format)
│   filter_labels_test.txt (filter labels where label and test documents are same)
│
└───XXCondensedData (embeddings for tst, trn documents and labels, for benchmark datasets, XX=DX[Astec])
        trn_point_embs.npy (2D numpy matrix for trn document embeddings)
        tst_point_embs.npy (2D numpy matrix for tst document embeddings)
        label_embs.npy (2D numpy matrix for label embeddings)

I could not understand what "trn labels in spmat format" means. Is there a script that creates trn_X_Y.txt and tst_X_Y.txt from the input files (trn_X.txt, tst_X.txt and Y.txt)? This is for the case where we want to use the label embeddings as well.

I want to generate it for my custom dataset.
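
To make the question concrete, here is a minimal sketch of what I think such a script would do. It assumes that "spmat format" is the sparse text layout used by the Extreme Classification Repository: a header line `num_documents num_labels`, then one line per document with space-separated `label_index:value` pairs, where `label_index` is the 0-based line number of the label in Y.txt. That layout is my assumption and is not confirmed by the documentation quoted above. The function name `format_spmat` and the sample data are hypothetical.

```python
from typing import Sequence


def format_spmat(doc_labels: Sequence[Sequence[int]], num_labels: int) -> str:
    """Render per-document label indices as sparse-matrix text.

    ASSUMED layout (not confirmed by the project documentation):
      line 1:  "<num_documents> <num_labels>"
      line i:  "<label_index>:1 <label_index>:1 ..." for document i
    Label indices are 0-based positions of the labels in Y.txt.
    """
    lines = [f"{len(doc_labels)} {num_labels}"]
    for row, labels in enumerate(doc_labels):
        unique = sorted(set(labels))  # drop duplicates, keep indices ordered
        for j in unique:
            if not 0 <= j < num_labels:
                raise ValueError(f"document {row}: label index {j} out of range")
        lines.append(" ".join(f"{j}:1" for j in unique))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical toy data: 3 documents, 3 labels.
    # Document 0 has labels 0 and 2, document 1 has none, document 2 has label 1.
    print(format_spmat([[2, 0], [], [1]], num_labels=3), end="")
```

If this is the right layout, the returned string could be written to trn_X_Y.txt, with row i describing the document on line i of trn_X.txt, and likewise for tst_X_Y.txt.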

Jun 28 '23 04:06 NitinAggarwal1