Proposal and feedback requested: Wikipedia RAG GenAIExamples
I am mentoring some college students with LAION. One of the students is working on embeddings for Wikipedia; it's not ready to be pushed to OPEA yet, but I want to collect feedback on an issue we discussed.
Do you all prefer to have the entire article text in the vector db, or only the article abstract? Also, I asked him to follow the example in the Hugging Face datasets docs with regard to using the Hugging Face FaissIndex and Elasticsearch index, but I want to confirm that this is the method that works best for you.
@sleepingcat4 is the college student. His WIP repository is located here: https://github.com/sleepingcat4/wikidataset and his WIP dataset is here: https://huggingface.co/datasets/laion/Wikipedia_11_23_BGE-M3_Embeddings (but I have told him he needs to rework both of these, so be aware that this information is going to change).
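For reference, here is a minimal sketch of the Hugging Face `datasets` indexing pattern in question. The column names, query vector, and embedding dimension are illustrative placeholders, not the actual schema of the WIP dataset:

```python
# Minimal sketch of dense (FAISS) and sparse (Elasticsearch) retrieval with the
# Hugging Face `datasets` indexing API. Column names are assumptions.
import numpy as np
from datasets import load_dataset

ds = load_dataset("laion/Wikipedia_11_23_BGE-M3_Embeddings", split="train")

# Dense retrieval: FAISS index over a precomputed embedding column.
ds.add_faiss_index(column="embeddings")
query_vec = np.random.rand(1024).astype("float32")  # stand-in for a real BGE-M3 query vector
scores, examples = ds.get_nearest_examples("embeddings", query_vec, k=5)

# Sparse retrieval: Elasticsearch index over the raw text column
# (assumes an Elasticsearch server is already running locally).
ds.add_elasticsearch_index("text", host="localhost", port=9200)
scores, examples = ds.get_nearest_examples("text", "history of machine translation", k=5)
```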
I prefer the entire text. Btw, OPEA is microservice-based, so please think about how to contribute to OPEA.
Okay, then the entire text :)
@kevinintel what do you think of using Late Chunking to generate our full-text wiki embeddings?
https://colab.research.google.com/drive/1IIAHEomlhUAisIz1NJTdVdtq-2A0SSCS?usp=sharing
It's specifically designed to leverage the larger context windows made available by recent embedding models, and it makes it easy to capture the semantic relationships between sentences in different chunks.
I was thinking we could use this method to generate our full-text embeddings (it hasn't been applied to building large datasets yet).
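For context, a rough sketch of the late-chunking idea: embed the whole article once with a long-context model, then pool token embeddings per chunk after the full forward pass, so each chunk vector is conditioned on the entire document. The model name and chunk size below are assumptions for illustration:

```python
# Hedged sketch of late chunking: one full-context forward pass, then
# per-chunk mean pooling of the token embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "jinaai/jina-embeddings-v2-base-en"  # assumed long-context (8K) model
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk_embeddings(text: str, chunk_len: int = 256):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    # Chunking happens *after* encoding, so every chunk saw the full article.
    return [token_embs[i:i + chunk_len].mean(dim=0)
            for i in range(0, token_embs.shape[0], chunk_len)]
```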
https://github.com/HabanaAI/vllm-fork/pull/144
This week I am going to continue trying to get Llama 405B working with speculative decoding via Llama 8B, and to process some Wikipedia datasets and embeddings. I am going to use recursive summarization and sliding-window embeddings where the article text exceeds the embedding window. Ammar is producing the abstract embeddings right now.
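As a rough illustration of the sliding-window part (window and stride sizes are placeholders, and loading BGE-M3 through sentence-transformers is an assumption; any sentence-transformers checkpoint would do):

```python
# Sketch of sliding-window embeddings for articles longer than the model's
# context window: overlapping token windows, each embedded independently.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")  # assumed checkpoint

def sliding_window_embeddings(text: str, window: int = 512, stride: int = 256):
    tokens = model.tokenizer.tokenize(text)
    chunks = []
    for start in range(0, max(len(tokens), 1), stride):
        chunks.append(model.tokenizer.convert_tokens_to_string(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the article
    return model.encode(chunks)  # one vector per overlapping window
```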
https://huggingface.co/datasets/laion/Wikipedia-X https://huggingface.co/datasets/laion/Wikipedia-X-Full
The datasets for both the abstracts and the full text of Wikipedia in 17 different languages have been created. Embeddings are being run on a 3090 server. (My repo has the updated code to compile the dataset.)
Please try to create a PR first. Late Chunking may not be better than the current embedding approach, but you are welcome to expand the functionalities.
The abstract embeddings are still running. https://huggingface.co/datasets/laion/Wikipedia-M3
I made a repository for searching through the embeddings, and I am working on the embedding generation scripts to recursively summarize and chunk the article text.
https://github.com/endomorphosis/laion-embeddings/tree/master
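A hypothetical sketch of the recursive-summarization step (the `summarize` callable stands in for any LLM summarization call and is not part of the linked repository; splitting on whitespace is a crude placeholder for a real tokenizer):

```python
# Recursively summarize over-long text until it fits the embedding window:
# chunk, summarize each chunk, then recurse on the concatenated summaries.
def recursive_summarize(text: str, summarize, max_tokens: int = 512,
                        chunk_tokens: int = 2048) -> str:
    words = text.split()  # crude token count; swap in a real tokenizer
    if len(words) <= max_tokens:
        return text
    chunks = [" ".join(words[i:i + chunk_tokens])
              for i in range(0, len(words), chunk_tokens)]
    summaries = [summarize(chunk) for chunk in chunks]
    return recursive_summarize(" ".join(summaries), summarize, max_tokens, chunk_tokens)
```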
@endomorphosis @kevinintel
https://huggingface.co/datasets/laion/Wikipedia-M3
Wikipedia M3 is done. In this dataset, I made embeddings of the abstracts for the 10 most widely spoken languages among active research communities.
These languages are:
- English
- German
- Polish
- French
- Spanish
- Portuguese
- Italian
- Russian
- Hebrew
- Chinese
The initial focus for embeddings was on North American, South American, and European languages, with Chinese being the exception. We plan to expand to Japanese and Korean in our next iteration with a different, more advanced embedding model, such as the Jina AI (8K) model from Germany or the Jina AI ColBERT embedding models.
@endomorphosis @sleepingcat4
Are you still working on this? Will you contribute your PRs?
Yeah, I am working on it with Protocol Labs and LAION, but I am entirely self-funded. The people at the libp2p project have asked me to find out whether anyone would like libp2p integrated into their project, and I suggested that OPEA could perhaps use p2p integration. They also wanted to pay to send me to some conferences to teach people how to use libp2p-based projects (e.g. IPFS / Filecoin / Ethereum) and to discuss implementing things like tensor / pipeline parallelism or agent orchestration.
https://laion.ai/blog/laion-intel-cooperation/ https://www.intel.com/content/www/us/en/developer/articles/technical/bud-e-ai-assisted-education-for-all.html
Here is the model server that I am working on right now, aiming for full coverage of every Hugging Face model class on every hardware platform, in Python and Node.js / client-side JavaScript. It is part of an MLOps system I am building around IPFS / libp2p / Filecoin, because there are roughly 4 exabytes of data on that network.
https://github.com/endomorphosis/hallucinate_app/wiki/IPFS-HuggingFace-Bridge-Architecture https://github.com/users/endomorphosis/projects/1/views/9
Here is the current WebNN / OpenVINO / CUDA / Apple / Qualcomm / ROCm model server and endpoint multiplexer that I'm improving for the new "swissknife" MLOps platform based on IPFS / libp2p (the previous one being a closed-source Kubernetes / Docker based system):
- https://github.com/endomorphosis/ipfs_accelerate_py/tree/main/test/fixed_web_platform (Python code generating JavaScript WebNN code)
- https://github.com/endomorphosis/ipfs_accelerate_py/tree/main/test/web_platform_integration
- https://github.com/endomorphosis/ipfs_accelerate_py/blob/main/test/README_WEB_PLATFORM_SUPPORT.md
- https://github.com/endomorphosis/ipfs_accelerate_py/blob/main/test/PHASE16_IMPLEMENTATION_SUMMARY.md
Here are some law text datasets that I have been working on, with the intent of scraping the entire legal corpus, putting it into a GraphRAG system, and then using the GraphRAG system and a reasoning model to create a domain-specific first-order (fuzzy) logic system of equations representing the structure of the legal corpus.
https://huggingface.co/datasets/laion/Caselaw_Access_Project_embeddings https://huggingface.co/datasets/the-ride-never-ends/american_law (from my new junior dev)
I also helped with the datasets we are using to train a new audio language model, which I'm hoping to extend with pose data by tokenizing the audio and the pose vertices, distilling what used to take me 5 models into a single model. https://huggingface.co/datasets/laion/LAION-Audio-300M https://huggingface.co/datasets/laion/laions_got_talent https://huggingface.co/datasets/laion/synthetic_vocal_bursts
This issue has been open for a long time.
Is there anything I need to do for it? If not, I'll close it.