Cannot find embeddings
Hi, thank you so much for providing this code! Unfortunately I am having issues running SemScale. In the Anaconda Prompt I ran:
python scaler.py C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec C:\Users\SemScale\datadir_test C:\Users\SemScale\output.txt
However, this always yields the error:
WARNING:tensorflow:From C:\Users\Documents\Python\envs\semscale\lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.
Error: File containing pre-trained word embeddings not found.
Are the embeddings no longer working? Thank you very much in advance for your help!
Hi! It seems there's an issue with the path of the embedding file. Could you check two things:
- whether the file is actually in that folder (you should download it from here)
- whether you need to write the path differently, for instance like this:
python scaler.py C:/Users/SemScale/embeddings/wiki.big-five.mapped.vec C:/Users/SemScale/datadir_test C:/Users/SemScale/output.txt
I don't have a Windows PC with me, but it might simply be something to do with the file path specification.
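As a quick extra check, you could also ask Python directly whether it can see the file. A minimal sketch, using the example paths from your message (adjust to wherever your copy actually sits):

from pathlib import Path

emb = Path(r"C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec")
if emb.exists():
    print("found, size in bytes:", emb.stat().st_size)
else:
    print("not found:", emb)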
Thank you very much for the quick response! Unfortunately, I could not resolve the issue. After restarting my device, the path and file were found; however, the program now tells me that the embeddings file contains errors, and I am unsure how to deal with that.
File "C:\Users\noske\Documents\Nikola\SemScale\embeddings\wiki.big-five.mapped.vec", line 9 en__' -0.17489 -0.13695 0.13345 -0.07282 0.038794 0.13294 0.0015304 -0.071056 -0.20026 -0.045437 -0.0019054 -0.17913 0.18241 -0.058909 -0.0088248 0.060522 0.1872 0.2255 -0.11638 0.080349 -0.33614 -0.035788 -0.21518 -0.062891 -0.1322 -0.09628 0.065516 0.16418 -0.014492 0.11139 -0.25025 0.25303 -0.20538 -0.027447 -0.18057 -0.13118 -0.36836 0.055097 0.23968 -0.17034 0.26393 0.30392 -0.18615 0.13712 -0.012511 0.11977 0.00017869 0.059385 -0.05704 -0.046391 0.012484 -0.067036 0.20004 -0.34513 -0.16117 -0.082885 -0.043013 0.031685 -0.01498 0.11803 0.068215 -0.18596 0.11503 -0.020593 -0.15533 0.031101 0.1294 0.038285 -0.075081 -0.095411 0.13559 -0.13448 -0.092657 -0.39257 -0.1617 -0.06562 0.069601 0.26207 -0.039711 0.39187 0.16218 0.053275 -0.066056 0.10139 -0.076679 -0.059841 -0.069376 0.21551 -0.029553 -0.123 0.011586 0.16999 0.17508 0.090918 0.10799 0.085566 -0.0042548 0.097031 0.18012 -0.24137 -0.1599 0.018539 -0.1056 -0.052341 -0.034019 -0.13327 -0.15889 0.033714 0.079085 -0.01673 0.062222 0.16459 -0.021192 0.014571 -0.017858 0.17836 0.13005 0.27747 0.056348 0.13513 0.4205 0.024011 0.18547 0.030009 0.119 -0.058 -0.092228 0.025134 0.003047 -0.024764 0.11025 0.21792 0.12071 0.26308 0.13265 0.058854 -0.36855 -0.04149 0.10599 0.25175 -0.028787 -0.043812 -0.036435 0.0089733 0.066932 0.1702 0.1665 0.094226 -0.14053 -0.18362 -0.035076 0.11685 -0.08793 -0.17653 -0.24763 0.12285 0.0053936 -0.048667 0.23958 0.17958 -0.21611 0.08723 -0.17605 0.17473 0.14182 0.081131 -0.087419 0.071543 0.21449 -0.061005 -0.07196 -0.23685 -0.11879 -0.0071595 -0.071583 0.049396 -0.02676 0.068993 0.0073673 -0.038216 0.16864 0.16553 0.01517 0.15875 -0.1054 0.05747 0.13809 -0.019921 0.36033 0.21684 0.063086 -0.11092 0.35303 0.30894 0.12569 -0.008461 0.25211 -0.073476 -0.442 0.022188 -0.0423 -0.018912 -0.15181 0.19475 0.043222 -0.23028 -0.25009 0.011266 0.14797 0.22005 0.40872 -0.13427 -0.18417 0.011872 -0.1966 -0.18597 0.13815 -0.22767 -0.17908 0.10512 -0.057826 0.071071 -0.23812 -0.0067891 0.036996 -0.029889 -0.17022 0.14456 0.040532 -0.029142 -0.012301 0.2311 -0.14316 -0.22666 -0.19614 0.15429 -0.023078 0.015926 -0.077029 0.065054 -0.30557 0.13245 0.068753 0.11286 0.14658 0.2298 0.18136 0.22165 0.1076 0.0045102 0.1825 0.10714 0.027691 0.13585 0.07148 0.033098 0.030476 -0.13848 0.23759 -0.26323 0.095756 0.15745 0.099187 0.013283 -0.030978 0.10267 0.030753 0.22487 -0.014633 -0.16486 -0.30891 0.0551 -0.15767 -0.11141 0.034447 -0.054475 0.33544 -0.0042994 0.27241 -0.15068 0.096341 0.14226 0.097858 0.00082821 -0.0092396 0.10388 0.18306 0.39652 0.21525 -0.01238 -0.040262 -0.1476 -0.0018151 -0.040134 -0.17208 -0.225 -0.18652 0.13567 0.20318 0.10497 ^ SyntaxError: unterminated string literal (detected at line 9)
Can you re-download the embeddings file, making sure it downloads completely? (It seems the file is broken at that point.) Note that the file size should be around 1.3 GB.
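If a fresh download still fails, here is a rough sanity check you could run, assuming the file is in the usual word2vec/fastText text format (one token plus its vector values per line, possibly preceded by a "vocab_size dimension" header line); adjust the path to wherever your copy sits:

import itertools

with open(r"C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec",
          encoding="utf-8", errors="replace") as f:
    # print token and vector length for the first few lines;
    # apart from a possible header, every line should have the same count
    for line in itertools.islice(f, 10):
        parts = line.rstrip("\n").split(" ")
        print(parts[0], len(parts) - 1)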
Yes, it downloaded correctly, and 1.3 GB is also correct.
From here it is a bit hard to debug. I have just reinstalled everything, and it seems to be working for me using that input embedding file and the textual data from the online appendix.
I'm tagging @irehbein because she might be working on this on Windows (I've just tested on Mac and Linux, and in both cases the embeddings loaded just fine). Sorry, but it has been a long time since we last worked on this!
Ah - check the order of the arguments! You should have:
- input folder (where your documents sit)
- embedding file
- output file
Your example has the embeddings first and the input folder second:
python scaler.py C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec C:\Users\SemScale\datadir_test C:\Users\SemScale\output.txt
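So with your paths the call should presumably be:
python scaler.py C:\Users\SemScale\datadir_test C:\Users\SemScale\embeddings\wiki.big-five.mapped.vec C:\Users\SemScale\output.txt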
I just noticed that this is wrong in the documentation. Above we state the correct order, but here it is inverted! Sorry for this, I'll fix it now.
Fixed it - let me know if this works now (screenshot: https://github.com/umanlp/SemScale/assets/8415204/f0c2c767-8c9b-4eb1-8aaa-d5bd1ff86feb).
Thank you so much! This is working now. However, I am still having some issues with the application. I want to use SemScale on a CSV file containing tweets from German parliament politicians. Since it contains tweets from many years, I now have about a million txt files. I have tried running the code a few times, but it seems that due to memory limitations it is never able to finish. Do I see it correctly that I need a txt file for every tweet, starting with the language code ("de"), a line break, and then the text? And do you have any advice on how I could use the package more efficiently?
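For reference, each of my txt files currently looks like this (placeholder instead of the actual tweet):

de
(text of the tweet)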
I see - maybe you could group tweets together by author to reduce the number of files, so one file for each user. This way you'll be scaling users, not single tweets.
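A minimal sketch of that grouping, assuming your CSV has columns named "author" and "text" (adjust the names and paths to your actual data):

import pandas as pd
from pathlib import Path

df = pd.read_csv("tweets.csv")   # your tweet export; filename assumed
outdir = Path("datadir_users")
outdir.mkdir(exist_ok=True)

for author, group in df.groupby("author"):
    # first line: language code, then all of this user's tweets
    body = "de\n" + "\n".join(group["text"].astype(str))
    # assumes author names are safe to use as filenames
    (outdir / f"{author}.txt").write_text(body, encoding="utf-8")

This way you would end up with one document per politician instead of a million single-tweet files, and the resulting positions would be per user.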