
Methods for generating node embeddings from word embeddings

Open • caufieldjh opened this issue Jun 09 '22 • 10 comments

While updating NEAT to use the most recent grape release, @justaddcoffee, @hrshdhgd, and I took a look at what we're using to generate node embeddings from pretrained word embeddings such as BERT: https://github.com/Knowledge-Graph-Hub/NEAT/blob/main/neat/graph_embedding/graph_embedding.py

We know we can run something like get_okapi_tfidf_weighted_textual_embedding() on a graph, but is there a more "on demand" way to run this in grape now for an arbitrary graph?
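For concreteness, here is a rough sketch of the technique the function name refers to: tokenize each node's text, weight each pretrained token vector by its Okapi BM25 (TF-IDF-style) score, and take the weighted average as the node embedding. This is a toy illustration under those assumptions, not grape's actual implementation:

import numpy as np

def bm25_weighted_node_embedding(node_tokens, token_vectors, k1=1.5, b=0.75):
    # node_tokens: one token list per node (e.g. from the name/description columns).
    # token_vectors: dict mapping token -> pretrained word vector (1D numpy array).
    n_docs = len(node_tokens)
    avg_len = sum(len(t) for t in node_tokens) / max(n_docs, 1)
    # Document frequency of each token across all node texts.
    df = {}
    for tokens in node_tokens:
        for tok in set(tokens):
            df[tok] = df.get(tok, 0) + 1
    dim = len(next(iter(token_vectors.values())))
    embeddings = []
    for tokens in node_tokens:
        acc, total = np.zeros(dim), 0.0
        for tok in set(tokens):
            if tok not in token_vectors:
                continue
            tf = tokens.count(tok)
            # Okapi BM25 term weight: smoothed IDF times saturated,
            # length-normalized term frequency.
            idf = np.log(1 + (n_docs - df[tok] + 0.5) / (df[tok] + 0.5))
            w = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
            acc += w * token_vectors[tok]
            total += w
        embeddings.append(acc / total if total > 0 else acc)
    return np.vstack(embeddings)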

caufieldjh • Jun 09 '22 17:06

Thanks @caufieldjh. Specifically, @LucaCappelletti94 @zommiommy, what we are looking for is something like this:

g = Ensmallen.from_csv(**my_graph_params)
my_embeddings = get_okapi_tfidf_weighted_textual_embedding(g)

If I understand correctly (which I might not), the only way to do this now is:

get_okapi_tfidf_weighted_textual_embedding("KGCOVID19") # <- goes to KG-Hub and downloads graph files, gets text from nodes file, and gets embeddings from name and description columns

justaddcoffee • Jun 09 '22 17:06

Hello @justaddcoffee and @caufieldjh. While there are methods already parametrized for the various repositories, the one you have reported here is the most generic one: it does not work on graph objects but on generic CSVs, and it requires the path of the CSV to parse. You can see its documentation either with Python's help function or with the SHIFT+TAB shortcut in a Jupyter notebook.
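If it helps, calling it on an arbitrary node file would look roughly like this. The file name is made up, and the import path is an assumption based on where the function lives in the ensmallen package:

from ensmallen.datasets import get_okapi_tfidf_weighted_textual_embedding

# Hypothetical node file; see help(get_okapi_tfidf_weighted_textual_embedding)
# for the available parameters.
embedding = get_okapi_tfidf_weighted_textual_embedding("my_graph_nodes.tsv")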

LucaCappelletti94 • Jun 09 '22 17:06

Okay, great - thanks @LucaCappelletti94

@caufieldjh can you have a look and see if this provides what we need in NEAT to switch to grape for text embeddings? I think it should.

justaddcoffee • Jun 09 '22 17:06

It looks like it should work, though there is some kind of name collision between Embiggen's transformers submodule and the transformers package that provides the tokenizer:

>>> get_okapi_tfidf_weighted_textual_embedding(path)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/harry/neat-env/lib/python3.8/site-packages/cache_decorator/cache.py", line 613, in wrapped
    result = function(*args, **kwargs)
  File "/home/harry/neat-env/lib/python3.8/site-packages/ensmallen/datasets/get_okapi_tfidf_weighted_textual_embedding.py", line 88, in get_okapi_tfidf_weighted_textual_embedding
    from transformers import AutoTokenizer
ImportError: cannot import name 'AutoTokenizer' from 'transformers' (/home/harry/neat-env/lib/python3.8/site-packages/embiggen/transformers/__init__.py)
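For reference, a quick standard-library check of which on-disk package the name transformers resolves to:

import importlib.util

# In this broken environment, the printed path points into embiggen's
# transformers submodule rather than the Hugging Face package.
print(importlib.util.find_spec("transformers").origin)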

caufieldjh • Jun 09 '22 21:06

That's extremely odd, I'll look into it.

LucaCappelletti94 • Jun 10 '22 07:06

Ok so, I have managed to reproduce it and tried to resolve this collision for a while. It has turned out to be quite cursed, so I will fall back to the "I'm just going to rename that" option.

I'm thinking about what name would fit better. It's the submodule that, given a node embedding and a graph, gets you the edge embedding and the like. A name like graph_processing seems too vague. Do you have any proposals?
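For anyone following along, this is roughly what the submodule does; the function name and the set of combination strategies here are illustrative, not Embiggen's actual API:

import numpy as np

def edge_embedding(node_embedding, edges, method="concatenate"):
    # node_embedding: (n_nodes, dim) array; edges: iterable of (src, dst) node ids.
    src_ids, dst_ids = zip(*edges)
    src, dst = node_embedding[list(src_ids)], node_embedding[list(dst_ids)]
    if method == "concatenate":
        # Concatenate the endpoint embeddings: (n_edges, 2 * dim).
        return np.hstack([src, dst])
    if method == "hadamard":
        # Element-wise product of the endpoint embeddings.
        return src * dst
    if method == "average":
        return (src + dst) / 2
    raise ValueError(f"Unknown method: {method}")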

LucaCappelletti94 • Jun 12 '22 12:06

Maybe embedding_transformers?

LucaCappelletti94 • Jun 12 '22 12:06

I have renamed it from transformers to embedding_transformers for now. If we can find a better name, I'm absolutely up for it; at least for now there won't be a collision.
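Both imports should now be able to coexist; a quick smoke test, assuming the renamed submodule stays importable from embiggen:

# No ImportError here means the top-level name "transformers" is no longer shadowed.
from transformers import AutoTokenizer        # the Hugging Face package
from embiggen import embedding_transformers   # the renamed Embiggen submodule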

LucaCappelletti94 • Jun 12 '22 12:06

I think that should work fine. At least, I can't see a package on PyPI with that name, so it shouldn't create the same kind of collision.

caufieldjh • Jun 13 '22 15:06

This issue should now be resolved. @caufieldjh, could you confirm?

LucaCappelletti94 • Jun 15 '22 14:06