Try using doc2vec as similarity algorithm
After #637, we should try with doc2vec too and see what works better.
@marco-c is this issue resolved or still open? I'm new to open source, so I'd need your help to make some contributions.
No, it's not resolved yet. It depends on #637, so that one must be fixed first before fixing this one. Once that is fixed, something similar needs to be done but using doc2vec instead of word2vec.
@Goku2699 #637 is done. Feel free to give this a try.
I'm trying to run the word2vec code, but bugzilla.get_bugs() in the __init__ function returns nothing, and because of that an error pops up.
You should train a model first, so that the bugs DB will be downloaded.
python3 run.py --goal defect --train
I downloaded the DB with that command, but it still gets nothing.
Can you post the output of ls -al data/?
What error pops up?
Try to give as much info as possible, otherwise it'll be hard to help you :)
This is the result of ls
total 6278256
drwxr-xr-x 10 yunseoblee staff 320 Jul 2 14:22 .
drwxr-xr-x 34 yunseoblee staff 1088 Jul 12 22:25 ..
-rw-r--r-- 1 yunseoblee staff 1410490733 Jul 2 14:22 bugs.json
-rw-r--r-- 1 yunseoblee staff 1 Jun 21 15:53 bugs.json.version
-rw-r--r-- 1 yunseoblee staff 235073932 Jul 2 14:22 bugs.json.zst
-rw-r--r-- 1 yunseoblee staff 34 Jul 2 14:22 bugs.json.zst.etag
-rw-r--r-- 1 yunseoblee staff 1398900302 Jul 2 14:22 commits.json
-rw-r--r-- 1 yunseoblee staff 1 Jun 21 15:53 commits.json.version
-rw-r--r-- 1 yunseoblee staff 150652308 Jul 2 14:22 commits.json.zst
-rw-r--r-- 1 yunseoblee staff 34 Jul 2 14:22 commits.json.zst.etag
This is the error traceback
Traceback (most recent call last):
File "/Users/yunseoblee/Desktop/bugbug/scripts/evaluate_similarity.py", line 38, in <module>
main(parse_args(sys.argv[1:]))
File "/Users/yunseoblee/Desktop/bugbug/scripts/evaluate_similarity.py", line 32, in main
model = Word2VecWmdSimilarity()
File "/Users/yunseoblee/Desktop/bugbug/bugbug/similarity.py", line 201, in __init__
self.w2vmodel = Word2Vec(self.corpus, size=100, min_count=5)
File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
fast_version=FAST_VERSION)
File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 763, in __init__
end_alpha=self.min_alpha, compute_loss=compute_loss)
File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/word2vec.py", line 910, in train
queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 1081, in train
**kwargs)
File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 536, in train
total_words=total_words, **kwargs)
File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 1187, in _check_training_sanity
raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model
@probaku1234 Are you in the correct directory? bugzilla.get_bugs() works only when you are in bugbug/. From the root directory of this project, run python scripts/evaluate_similarity.py --algorithm=word2vec_wmd.
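The "you must first build vocabulary" error in the traceback above appears to be what gensim raises when the training corpus is empty, which is what happens when bugzilla.get_bugs() yields nothing. A minimal sketch of a guard that fails earlier with a clearer message; ensure_non_empty is a hypothetical helper, not something that exists in bugbug, and the sample documents are made up:

```python
def ensure_non_empty(corpus, source="bugzilla.get_bugs()"):
    """Materialize `corpus` into a list and fail early if it has no documents."""
    documents = list(corpus)
    if not documents:
        raise RuntimeError(
            f"{source} returned no documents; check that you are running from "
            "the repository root and that the bugs DB has been downloaded"
        )
    return documents


# Stand-in data: each document is a list of tokens.
corpus = ensure_non_empty([["crash", "on", "startup"], ["tab", "freezes"]])
print(len(corpus))
```

Calling a check like this before training would turn the gensim error into one that points at the likely cause.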
If I run python scripts/evaluate_similarity.py --algorithm=word2vec_wmd or python3 scripts/evaluate_similarity.py --algorithm=word2vec_wmd, I get the error below
Traceback (most recent call last):
File "scripts/evaluate_similarity.py", line 11, in <module>
from bugbug.similarity import LSISimilarity, NeighborsSimilarity, Word2VecWmdSimilarity
ModuleNotFoundError: No module named 'bugbug'
@probaku1234 Have you added bugbug to your PYTHONPATH? If you're on Linux, open the .bashrc file and add:
export PYTHONPATH=path_to_bugbug_directory
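To check whether the PYTHONPATH change took effect, something like the following can be run from a new shell. This is only a sketch using the standard library; is_importable is a hypothetical helper name:

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once the bugbug checkout is visible to Python, False otherwise.
print(is_importable("bugbug"))
```

If this prints False, the ModuleNotFoundError above is expected, since Python cannot locate the package.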
@ayush-1506 So... the error is gone now, but the console shows the output below and the program keeps running.
INFO:gensim.summarization.textcleaner:'pattern' package not found; tag filters are not available for English
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/yunseoblee/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
@probaku1234 Yes! That's because the vocabulary is pretty big, so it takes time to build the word2vec model. For testing, you can use a smaller portion of the bug dataset.
@ayush-1506 Oh, I see. Then how can I set it to use a small portion of the dataset?
@probaku1234 Add from itertools import islice in the similarity.py file and replace https://github.com/mozilla/bugbug/blob/master/bugbug/similarity.py#L191 with for bug in islice(bugzilla.get_bugs(), 15000): so that only the first 15000 bugs are used.
(There isn't an option, you'll have to do it manually)
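A self-contained sketch of what the islice change does. The get_bugs generator below is a stand-in for bugzilla.get_bugs() and its dictionary fields are made up for illustration:

```python
from itertools import islice


def get_bugs():
    """Stand-in for bugzilla.get_bugs(): lazily yields bug dictionaries."""
    bug_id = 0
    while True:
        bug_id += 1
        yield {"id": bug_id, "summary": f"bug {bug_id}"}


# islice stops after the first 15000 items without consuming the rest,
# so the remainder of the dataset is never read.
subset = [bug["id"] for bug in islice(get_bugs(), 15000)]
print(len(subset), subset[0], subset[-1])
```

Because islice works on any iterator, the same one-line change applies regardless of how many bugs the real DB holds.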
@probaku1234 Are you in the correct directory?
bugzilla.get_bugs() works only when you are in bugbug/. From the root directory of this project, run python scripts/evaluate_similarity.py --algorithm=word2vec_wmd.
@ayush-1506
I get the following error when trying to run that command:
Traceback (most recent call last):
File "/home/lemvig/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/lemvig/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/lemvig/Desktop/github/bugbug/scripts/trainer.py", line 123, in <module>
main()
File "/home/lemvig/Desktop/github/bugbug/scripts/trainer.py", line 119, in main
retriever.go(args)
File "/home/lemvig/Desktop/github/bugbug/scripts/trainer.py", line 60, in go
metrics = model_obj.train()
File "/home/lemvig/Desktop/github/bugbug/bugbug/model.py", line 317, in train
classes, self.class_names = self.get_labels()
AttributeError: 'BugModel' object has no attribute 'get_labels'
What do I do?
@aditya-hari Are you on master? Looks like you're a few commits behind master.
Oh whoops, sorry. Didn't notice. Will get back to you soon.
You should train a model first, so that the bugs DB will be downloaded.
python3 run.py --goal defect --train
Has that changed now, because I can't find a run.py...
Yes, there is a trainer script now (see the README for updated info).
@marco-c I would like to work on this.
@sd1998 feel free to work on any open issue (following the rules in CONTRIBUTING.md), no need to ask.
@sd1998 Are you still working on the issue?
@Divya063 all issues are unassigned until there is a PR open to fix them, so feel free to work on this if it's interesting for you.
@marco-c I would like to work on this problem. Can you please provide some details on how to approach it? I'm new to open source contribution.
@srajgure there is quite a bit of info in this issue report; just read through the comments and ask if you have specific questions.
Hello @marco-c
I have read the above discussions.
In order to fix this issue, I will have to create a class Doc2VecSimilarity, and then compute the similarity of two documents.
Am I right?
It might be better to add an option to the already existing word2vec classes to use doc2vec instead of word2vec (in the end they're very similar).
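One way such an option might look, as a rough sketch only. build_embedding_model and its use_doc2vec flag are hypothetical names, not existing bugbug code, and the keyword arguments assume gensim 4.x (the traceback earlier in this thread uses the older size= keyword for Word2Vec):

```python
def build_embedding_model(corpus, use_doc2vec=False, vector_size=100, min_count=5):
    """Train Word2Vec or Doc2Vec on `corpus`, a list of token lists.

    Hypothetical helper; keyword names assume gensim 4.x.
    """
    # gensim is imported lazily so this module loads even without it installed.
    if use_doc2vec:
        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        # Doc2Vec needs each document tagged; here the tag is its position.
        tagged = [
            TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)
        ]
        return Doc2Vec(tagged, vector_size=vector_size, min_count=min_count)

    from gensim.models import Word2Vec

    return Word2Vec(corpus, vector_size=vector_size, min_count=min_count)
```

The two branches differ only in how the corpus is wrapped and which class is constructed, which is why a flag on the existing class may be simpler than a separate one.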
@marco-c Hi there! I would love to work on this issue if it is still relevant. I have gone through the similarity algorithm scripts (more specifically the word2vec implementation) to get insight into the project. I have a few questions regarding the implementation of doc2vec:
- Should I use the PV-DM doc2vec algorithm? (default algorithm in gensim)
- Would you like to have the cosine similarity of document vectors as the similarity function between two docs?
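For reference on the two questions: in gensim's Doc2Vec, dm=1 (the default) selects PV-DM and dm=0 selects PV-DBOW. Cosine similarity between two document vectors can be sketched as below. The vectors are made-up placeholders; in practice they would come from the trained model, for example via Doc2Vec.infer_vector:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Placeholder document vectors, not real model output.
doc_a = [0.2, 0.1, 0.7]
doc_b = [0.4, 0.2, 1.4]   # same direction as doc_a
doc_c = [-0.7, 0.1, 0.2]  # nearly orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Cosine similarity ignores vector length, so two documents pointing in the same direction score close to 1 even if one vector is a scaled copy of the other.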