Cannot find embeddings
Hi, thank you so much for providing this code! Unfortunately I am having issues running SemScale. In the Anaconda Prompt I ran:
python scaler.py C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec C:\Users\SemScale\datadir_test C:\Users\SemScale\output.txt
However, this always yields the error:
WARNING:tensorflow:From C:\Users\Documents\Python\envs\semscale\lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.
Error: File containing pre-trained word embeddings not found.
Are the embeddings no longer working? Thank you very much in advance for your help!
Hi! It seems there's an issue with the path of the embedding file. Could you check two things:
- whether the file is actually in that folder (you should download it from here)
- whether you need to write the path differently, for instance like this:
python scaler.py C:/Users/SemScale/embeddings/wiki.big-five.mapped.vec C:/Users/SemScale/datadir_test C:/Users/SemScale/output.txt
I don't have a Windows PC with me, but it might simply be something to do with the file path specification.
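As a quick extra check, you could also ask Python directly whether it can see the file. A minimal sketch, using the example paths from your message (adjust to wherever your copy actually sits):

from pathlib import Path

emb = Path(r"C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec")
if emb.exists():
    print("found, size in bytes:", emb.stat().st_size)
else:
    print("not found:", emb)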
Thank you very much for the quick response! Unfortunately, I could not resolve the issue. After restarting my device, the path and file were found; however, the program now tells me that the embeddings file contains errors, and I am unsure how to deal with that.
File "C:\Users\noske\Documents\Nikola\SemScale\embeddings\wiki.big-five.mapped.vec", line 9 en__' -0.17489 -0.13695 0.13345 -0.07282 0.038794 0.13294 0.0015304 -0.071056 -0.20026 -0.045437 -0.0019054 -0.17913 0.18241 -0.058909 -0.0088248 0.060522 0.1872 0.2255 -0.11638 0.080349 -0.33614 -0.035788 -0.21518 -0.062891 -0.1322 -0.09628 0.065516 0.16418 -0.014492 0.11139 -0.25025 0.25303 -0.20538 -0.027447 -0.18057 -0.13118 -0.36836 0.055097 0.23968 -0.17034 0.26393 0.30392 -0.18615 0.13712 -0.012511 0.11977 0.00017869 0.059385 -0.05704 -0.046391 0.012484 -0.067036 0.20004 -0.34513 -0.16117 -0.082885 -0.043013 0.031685 -0.01498 0.11803 0.068215 -0.18596 0.11503 -0.020593 -0.15533 0.031101 0.1294 0.038285 -0.075081 -0.095411 0.13559 -0.13448 -0.092657 -0.39257 -0.1617 -0.06562 0.069601 0.26207 -0.039711 0.39187 0.16218 0.053275 -0.066056 0.10139 -0.076679 -0.059841 -0.069376 0.21551 -0.029553 -0.123 0.011586 0.16999 0.17508 0.090918 0.10799 0.085566 -0.0042548 0.097031 0.18012 -0.24137 -0.1599 0.018539 -0.1056 -0.052341 -0.034019 -0.13327 -0.15889 0.033714 0.079085 -0.01673 0.062222 0.16459 -0.021192 0.014571 -0.017858 0.17836 0.13005 0.27747 0.056348 0.13513 0.4205 0.024011 0.18547 0.030009 0.119 -0.058 -0.092228 0.025134 0.003047 -0.024764 0.11025 0.21792 0.12071 0.26308 0.13265 0.058854 -0.36855 -0.04149 0.10599 0.25175 -0.028787 -0.043812 -0.036435 0.0089733 0.066932 0.1702 0.1665 0.094226 -0.14053 -0.18362 -0.035076 0.11685 -0.08793 -0.17653 -0.24763 0.12285 0.0053936 -0.048667 0.23958 0.17958 -0.21611 0.08723 -0.17605 0.17473 0.14182 0.081131 -0.087419 0.071543 0.21449 -0.061005 -0.07196 -0.23685 -0.11879 -0.0071595 -0.071583 0.049396 -0.02676 0.068993 0.0073673 -0.038216 0.16864 0.16553 0.01517 0.15875 -0.1054 0.05747 0.13809 -0.019921 0.36033 0.21684 0.063086 -0.11092 0.35303 0.30894 0.12569 -0.008461 0.25211 -0.073476 -0.442 0.022188 -0.0423 -0.018912 -0.15181 0.19475 0.043222 -0.23028 -0.25009 0.011266 0.14797 0.22005 0.40872 -0.13427 -0.18417 0.011872 -0.1966 -0.18597 0.13815 -0.22767 -0.17908 0.10512 -0.057826 0.071071 -0.23812 -0.0067891 0.036996 -0.029889 -0.17022 0.14456 0.040532 -0.029142 -0.012301 0.2311 -0.14316 -0.22666 -0.19614 0.15429 -0.023078 0.015926 -0.077029 0.065054 -0.30557 0.13245 0.068753 0.11286 0.14658 0.2298 0.18136 0.22165 0.1076 0.0045102 0.1825 0.10714 0.027691 0.13585 0.07148 0.033098 0.030476 -0.13848 0.23759 -0.26323 0.095756 0.15745 0.099187 0.013283 -0.030978 0.10267 0.030753 0.22487 -0.014633 -0.16486 -0.30891 0.0551 -0.15767 -0.11141 0.034447 -0.054475 0.33544 -0.0042994 0.27241 -0.15068 0.096341 0.14226 0.097858 0.00082821 -0.0092396 0.10388 0.18306 0.39652 0.21525 -0.01238 -0.040262 -0.1476 -0.0018151 -0.040134 -0.17208 -0.225 -0.18652 0.13567 0.20318 0.10497 ^ SyntaxError: unterminated string literal (detected at line 9)
Can you re-download the embeddings file, making sure it downloads completely? (It seems the file is broken at that point.) Note that the file size should be around 1.3 GB.
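If a fresh download still fails, here is a rough sanity check you could run, assuming the file is in the usual word2vec/fastText text format (one token plus its vector values per line, possibly preceded by a "vocab_size dimension" header line); adjust the path to wherever your copy sits:

import itertools

with open(r"C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec",
          encoding="utf-8", errors="replace") as f:
    # print token and vector length for the first few lines;
    # apart from a possible header, every line should have the same count
    for line in itertools.islice(f, 10):
        parts = line.rstrip("\n").split(" ")
        print(parts[0], len(parts) - 1)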
Yes, it downloaded correctly, and 1.3 GB is also correct.
From here it is a bit hard to debug. I have just reinstalled everything, and it seems to be working for me using that input embedding file and the textual data from the online appendix.
I'm tagging @irehbein because she might be working on this on Windows (I've just tested on Mac and Linux, and in both cases the embeddings loaded just fine). Sorry, but it has been a long time since we last worked on this!
Ah - check the order of the arguments! You should have:
- input folder (where your documents sit)
- embedding file
- output file
Your example has the embeddings first and the input folder second:
python scaler.py C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec C:\Users\SemScale\datadir_test C:\Users\SemScale\output.txt
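So with your paths the call should presumably be:
python scaler.py C:\Users\SemScale\datadir_test C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec C:\Users\SemScale\output.txt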
I just noticed that this is wrong in the documentation. Above we state the correct order, but here it is inverted! Sorry for this, I'll fix it now.
Fixed it - let me know if this works now (screenshot: https://github.com/umanlp/SemScale/assets/8415204/f0c2c767-8c9b-4eb1-8aaa-d5bd1ff86feb).
Thank you so much! This is working now. However, I am still having some issues with the application. I want to use SemScale on a CSV file containing tweets from German parliament politicians. Since it contains tweets from many years, I now have about a million txt files. I have tried running the code a few times, but it seems that due to memory limitations it is never able to finish. Do I see it correctly that I need a txt file for every tweet, starting with the language code ("de"), a line break, and then the text? And do you have any advice on how I could use the package more efficiently?
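For reference, each of my txt files currently looks like this (placeholder instead of the actual tweet):

de
(text of the tweet)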
I see - maybe you could group tweets together by author to reduce the number of files, so one file for each user. This way you'll be scaling users, not single tweets.
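A minimal sketch of that grouping, assuming your CSV has columns named "author" and "text" (adjust the names and paths to your actual data):

import pandas as pd
from pathlib import Path

df = pd.read_csv("tweets.csv")   # your tweet export; filename assumed
outdir = Path("datadir_users")
outdir.mkdir(exist_ok=True)

for author, group in df.groupby("author"):
    # first line: language code, then all of this user's tweets
    body = "de\n" + "\n".join(group["text"].astype(str))
    # assumes author names are safe to use as filenames
    (outdir / f"{author}.txt").write_text(body, encoding="utf-8")

This way you would end up with one document per politician instead of a million single-tweet files, and the resulting positions would be per user.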