bugbug Try using doc2vec as similarity algorithm

After #637, we should try with doc2vec too and see what works better.

Jul 05 '19 11:07 marco-c

@marco-c is this issue resolved or still on? I'm new to open source so need your help to make some contributions.

Jul 08 '19 06:07 Goku2699

No, it's not resolved yet. It depends on #637, so that one must be fixed first before fixing this one. Once that is fixed, something similar needs to be done but using doc2vec instead of word2vec.

Jul 08 '19 09:07 marco-c

@Goku2699 #637 is done. Feel free to give this a try.

Jul 08 '19 10:07 ashridh

I'm trying to run the word2vec code, but bugzilla.get_bugs() in the __init__ function returns nothing, and that causes an error.

Jul 13 '19 21:07 probaku1234

You should train a model first, so that the bugs DB will be downloaded.

python3 run.py --goal defect --train

Jul 13 '19 22:07 marco-c

I downloaded the DB with that command, but it still returns nothing.

Jul 13 '19 22:07 probaku1234

Can you post the output of ls -al data/? What error pops up?

Try to give as much info as possible, otherwise it'll be hard to help you :)

Jul 14 '19 00:07 marco-c

This is the output of ls -al data/

total 6278256
drwxr-xr-x  10 yunseoblee  staff         320 Jul  2 14:22 .
drwxr-xr-x  34 yunseoblee  staff        1088 Jul 12 22:25 ..
-rw-r--r--   1 yunseoblee  staff  1410490733 Jul  2 14:22 bugs.json
-rw-r--r--   1 yunseoblee  staff           1 Jun 21 15:53 bugs.json.version
-rw-r--r--   1 yunseoblee  staff   235073932 Jul  2 14:22 bugs.json.zst
-rw-r--r--   1 yunseoblee  staff          34 Jul  2 14:22 bugs.json.zst.etag
-rw-r--r--   1 yunseoblee  staff  1398900302 Jul  2 14:22 commits.json
-rw-r--r--   1 yunseoblee  staff           1 Jun 21 15:53 commits.json.version
-rw-r--r--   1 yunseoblee  staff   150652308 Jul  2 14:22 commits.json.zst
-rw-r--r--   1 yunseoblee  staff          34 Jul  2 14:22 commits.json.zst.etag

This is the error traceback

Traceback (most recent call last):
  File "/Users/yunseoblee/Desktop/bugbug/scripts/evaluate_similarity.py", line 38, in <module>
    main(parse_args(sys.argv[1:]))
  File "/Users/yunseoblee/Desktop/bugbug/scripts/evaluate_similarity.py", line 32, in main
    model = Word2VecWmdSimilarity()
  File "/Users/yunseoblee/Desktop/bugbug/bugbug/similarity.py", line 201, in __init__
    self.w2vmodel = Word2Vec(self.corpus, size=100, min_count=5)
  File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
    fast_version=FAST_VERSION)
  File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 763, in __init__
    end_alpha=self.min_alpha, compute_loss=compute_loss)
  File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/word2vec.py", line 910, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 1081, in train
    **kwargs)
  File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 536, in train
    total_words=total_words, **kwargs)
  File "/Users/yunseoblee/Desktop/bugbug/venv/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 1187, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

Jul 14 '19 01:07 probaku1234

@probaku1234 Are you in the correct directory? bugzilla.get_bugs() works only when you are in bugbug/. From the root directory of this project, run python scripts/evaluate_similarity.py --algorithm=word2vec_wmd.
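
A quick way to check this, sketched under the assumption that the bugs DB is read from the relative path data/bugs.json (the file shown in the ls output above). The helper name is hypothetical and not part of bugbug:

```python
import os


def bugs_db_visible(relative_path="data/bugs.json"):
    """Return True if the bugs DB can be found from the current working directory."""
    # A relative path is resolved against os.getcwd(), not against the
    # location of the script that is being run.
    return os.path.isfile(relative_path)


if __name__ == "__main__":
    print("cwd:", os.getcwd())
    print("bugs DB visible:", bugs_db_visible())
```

If this prints False, the script is probably being started from a directory other than the project root.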

Jul 14 '19 10:07 ashridh

If I run python scripts/evaluate_similarity.py --algorithm=word2vec_wmd or python3 scripts/evaluate_similarity.py --algorithm=word2vec_wmd, I get the error below

Traceback (most recent call last):
  File "scripts/evaluate_similarity.py", line 11, in <module>
    from bugbug.similarity import LSISimilarity, NeighborsSimilarity, Word2VecWmdSimilarity
ModuleNotFoundError: No module named 'bugbug'

Jul 14 '19 14:07 probaku1234

@probaku1234 Have you added bugbug to your PYTHONPATH? If you're on Linux, open the .bashrc file and add export PYTHONPATH=path_to_bugbug_directory.
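
For context: when a script is run as python scripts/some_script.py, Python puts the scripts/ directory on sys.path, not the project root, which is why the bugbug package is not found. As an alternative to editing .bashrc, a rough sketch of doing the same thing from inside a script follows. It assumes the script lives one level below the project root, and the helper name is hypothetical, not part of bugbug:

```python
import os
import sys


def add_repo_root_to_path(script_path):
    """Put the parent of the script's directory (assumed project root) on sys.path."""
    repo_root = os.path.dirname(os.path.dirname(os.path.abspath(script_path)))
    # Insert at the front so the local checkout is found first; skip duplicates.
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    return repo_root
```

Calling add_repo_root_to_path(__file__) near the top of a script under scripts/ would have an effect similar to setting PYTHONPATH, but only for that script.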

Jul 14 '19 16:07 ashridh

@ayush-1506 So... the error is gone now. But the console shows the output below and the program keeps running.

INFO:gensim.summarization.textcleaner:'pattern' package not found; tag filters are not available for English
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yunseoblee/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Jul 14 '19 17:07 probaku1234

@probaku1234 Yes! That's because the vocabulary is pretty big, so it takes time to build the word2vec model. For testing, you can use a smaller portion of the bug dataset.

Jul 14 '19 17:07 ashridh

@ayush-1506 Oh, I see. Then how can I make it use a small portion of the dataset?

Jul 14 '19 17:07 probaku1234

@probaku1234 Add from itertools import islice in the similarity.py file and replace https://github.com/mozilla/bugbug/blob/master/bugbug/similarity.py#L191 with for bug in islice(bugzilla.get_bugs(), 15000), so that only the first 15000 bugs are used.
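
A minimal sketch of how islice caps an iterator, assuming bugzilla.get_bugs() returns an iterable of bugs. The get_bugs_stub generator and first_n_bugs helper below are stand-ins for illustration, not bugbug code:

```python
from itertools import islice


def get_bugs_stub():
    """Stand-in for bugzilla.get_bugs(): lazily yields bug-like dicts forever."""
    bug_id = 0
    while True:  # pretend the DB is very large
        yield {"id": bug_id, "summary": f"bug number {bug_id}"}
        bug_id += 1


def first_n_bugs(bugs, limit=15000):
    """Return an iterator over at most `limit` bugs; the rest is never read."""
    return islice(bugs, limit)


# Build a small tokenized corpus from the first 5 bugs only.
corpus = [bug["summary"].split() for bug in first_n_bugs(get_bugs_stub(), limit=5)]
print(len(corpus))  # prints 5
```

Because islice stops pulling from the underlying iterator once the limit is reached, the remaining bugs are never loaded.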

Jul 14 '19 18:07 ashridh

(There isn't an option, you'll have to do it manually)

Jul 14 '19 18:07 ashridh

@probaku1234 Are you in the correct directory? bugzilla.get_bugs() works only when you are in bugbug/. From the root directory of this project, run python scripts/evaluate_similarity.py --algorithm=word2vec_wmd.

@ayush-1506
I get the following error when I try to run that command:

Traceback (most recent call last):
  File "/home/lemvig/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/lemvig/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/lemvig/Desktop/github/bugbug/scripts/trainer.py", line 123, in <module>
    main()
  File "/home/lemvig/Desktop/github/bugbug/scripts/trainer.py", line 119, in main
    retriever.go(args)
  File "/home/lemvig/Desktop/github/bugbug/scripts/trainer.py", line 60, in go
    metrics = model_obj.train()
  File "/home/lemvig/Desktop/github/bugbug/bugbug/model.py", line 317, in train
    classes, self.class_names = self.get_labels()
AttributeError: 'BugModel' object has no attribute 'get_labels'

What do I do?

Sep 14 '19 03:09 aditya-hari

@aditya-hari Are you on master? Looks like you're a few commits behind master.

Sep 14 '19 05:09 ashridh

Oh whoops, sorry. Didn't notice. Will get back to you soon.

Sep 15 '19 17:09 aditya-hari

You should train a model first, so that the bugs DB will be downloaded.
python3 run.py --goal defect --train

Has that changed now, because I can't find a run.py...

Sep 17 '19 03:09 aditya-hari

Yes, there is a trainer script now (see the README for updated info).

Sep 17 '19 08:09 marco-c

@marco-c I would like to work on this.

Dec 25 '19 08:12 shashvat-kedia

@sd1998 feel free to work on any open issue (following the rules in CONTRIBUTING.md), no need to ask.

Jan 02 '20 10:01 marco-c

@sd1998 Are you still working on the issue?

Jan 14 '20 13:01 Divya063

@Divya063 all issues are unassigned until there is a PR open to fix them, so feel free to work on this if it's interesting for you.

Jan 14 '20 23:01 marco-c

@marco-c I would like to work on this problem. Can you please provide some details on how to approach it? I'm new to open source contribution.

Jan 29 '20 14:01 srajgure

@srajgure there is quite a bit of info in this issue report, just read through the comments and ask if you have specific questions.

Feb 03 '20 21:02 marco-c

Hello @marco-c, I have read the discussion above. To fix this issue, I will have to create a class Doc2VecSimilarity and then compute the similarity of two documents. Am I right?

Jul 18 '20 04:07 bhushan-borole

It might be better to add an option to the already existing word2vec classes to use doc2vec instead of word2vec (in the end they're very similar).

Jul 20 '20 09:07 marco-c

@marco-c Hi there! I would love to work on this issue if it is still relevant. I have gone through the similarity algorithm scripts (more specifically the word2vec implementation) to get insight into the project. I have a few queries regarding the implementation of doc2vec:

Should I use the PV-DM doc2vec algorithm? (default algorithm in gensim)
Would you like to have the cosine similarity of document vectors as the similarity function between two docs?
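
To make the second question concrete, here is a minimal sketch of cosine similarity between two document vectors in plain Python. It assumes the vectors have already been produced by a trained doc2vec model (for example with gensim's Doc2Vec.infer_vector). The function name is illustrative and not part of bugbug:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros, since the angle is undefined.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Values close to 1 mean the two vectors point in nearly the same direction, values near 0 mean they are roughly orthogonal.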

Jun 14 '21 07:06 rock420