CPU-only installation
I've been using unstructured for a while on a CPU-only machine. I've noticed a lot of NVIDIA files (2 GB+) in my venv folder coming from PyTorch (possibly one of unstructured's dependencies).
Can I install a CPU-only version of unstructured? I've been partitioning documents for a while and no GPU has ever been used.
Here is my requirements.in file:
uvicorn[standard]==0.25.0
fastapi==0.111.0
pyyaml==6.0.1
injector==0.21.0
overrides==7.7.0
langchain==0.2.5
langchain-google-genai==1.0.6
json-repair==0.9.0
unstructured[pptx,image,docx,pdf]==0.14.9
opencv-python-headless==4.9.0.80
jq==1.6.0
pytesseract==0.3.10
pymilvus==2.3.6
langchain-openai==0.1.8
scikit-learn==1.5.0
ruff==0.3.1
pandas==2.2.1
llama-index==0.10.33
python-multipart==0.0.9
llama-index-vector-stores-milvus==0.1.10
playwright==1.43.0
python-magic==0.4.27
llama-index-llms-gemini==0.1.11
opencv-python==4.9.0.80
llama-index-llms-anthropic==0.1.11
llama-index-llms-ollama==0.1.5
llama-index-embeddings-ollama==0.1.2
pymupdf==1.24.4
pypdf[image]==4.2.0
llama-index-multi-modal-llms-ollama==0.1.3
llama-index-llms-groq==0.1.4
gensim==3.6.0
firebase-admin==6.5.0
demjson3==3.0.6
langchain-community==0.2.5
jsonschema==4.22.0
pypdf2==3.0.1
fpdf==1.7.2
moviepy==1.0.3
neo4j==5.21.0
llama-index-graph-stores-neo4j==0.2.5
pylatex==1.4.2
reportlab==4.2.0
psutil==5.9.8
fastapi-utils==0.7.0
colorama==0.4.6
humanize==4.9.0
objgraph==3.6.1
imgkit==1.2.3
pyppeteer==2.0.0
wkhtmltopdf==0.2
llama-agents==0.0.3
click==8.1.7
mypy==1.10.1
Note that torch is not listed in it.
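To see which installed packages account for the NVIDIA files, a small script along these lines may help. It is only a sketch: it assumes the GPU libraries arrive as separate `nvidia-*` distributions (plus `triton`), which is how recent Linux PyTorch wheels appear to ship them, and the helper names `is_gpu_package`, `installed_size_bytes`, and `gpu_packages` are made up for this example.

```python
import os
from importlib import metadata


def is_gpu_package(name: str) -> bool:
    """Heuristic (an assumption): CUDA wheels are named nvidia-*, and triton is GPU-only."""
    normalized = name.lower().replace("_", "-")
    return normalized.startswith("nvidia-") or normalized == "triton"


def installed_size_bytes(dist: metadata.Distribution) -> int:
    """Sum the on-disk size of every file recorded for one installed distribution."""
    total = 0
    for record in dist.files or []:
        try:
            total += os.path.getsize(dist.locate_file(record))
        except OSError:
            # The file is listed in RECORD but missing on disk; skip it.
            pass
    return total


def gpu_packages() -> dict:
    """Map each GPU-related distribution in the current environment to its size in bytes."""
    sizes = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or ""
        if is_gpu_package(name):
            sizes[name] = installed_size_bytes(dist)
    return sizes


if __name__ == "__main__":
    found = gpu_packages()
    for name, size in sorted(found.items(), key=lambda item: -item[1]):
        print(f"{name:40s} {size / 1e6:10.1f} MB")
    print(f"total: {sum(found.values()) / 1e9:.2f} GB")
```

Run it with the venv's own interpreter so that it inspects the right site-packages directory.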
Thanks for the suggestion, @arthurbrenno. We'll take a look at this. I think this would have the side benefit of reducing the size of our CPU images.
Tysm! It would save us about 3gb of storage.
@arthurbrenno see here #2976
Installing the CPU-only build of torch before the unstructured libs should help. That way the NVIDIA GPU libs for PyTorch do not get installed.
This is what I have been doing to build Lambda images.
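After installing in that order, one way to confirm that the environment really has a CPU-only torch is a quick check like the one below. It is a sketch, not part of unstructured: `torch_build_info` is a hypothetical helper, and the pip command in the comment assumes the CPU wheel index at https://download.pytorch.org/whl/cpu (the same URL used as the poetry source later in this thread).

```python
import importlib.util

# Assumed install order (run in the shell, not in Python):
#   pip install torch --index-url https://download.pytorch.org/whl/cpu
#   pip install "unstructured[pptx,image,docx,pdf]==0.14.9"


def torch_build_info():
    """Return basic build details for the installed torch, or None if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "version": torch.__version__,          # CPU wheels usually carry a "+cpu" suffix
        "cuda_build": torch.version.cuda,      # None when torch was built without CUDA
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    info = torch_build_info()
    if info is None:
        print("torch is not installed in this environment")
    elif info["cuda_build"] is None:
        print(f"CPU-only torch {info['version']}")
    else:
        print(f"torch {info['version']} was built against CUDA {info['cuda_build']}")
```

If the script reports a CUDA build, torch was most likely resolved from the default index as a transitive dependency before the CPU wheel was installed.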
Thank you, @sidatcd!
@sidatcd I need to accelerate unstructured. Does it support GPU? If so, what are the steps to make it use the GPU?
For anyone who uses poetry, you can accomplish this in your pyproject.toml with these commands:
$ poetry source add --priority=explicit pytorch-cpu https://download.pytorch.org/whl/cpu
$ poetry add --source pytorch-cpu torch
The result in your pyproject.toml will look like this:
onnxruntime = "^1.18.1"
torch = {version = "^2.5.0+cpu", source = "pytorch-cpu"}
unstructured = {extras = ["csv", "doc", "docx", "pdf", "ppt", "pptx", "xlsx"], version = "^0.16.3"}
[[tool.poetry.source]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
priority = "explicit"
Sources: https://github.com/python-poetry/poetry/issues/7685 https://github.com/python-poetry/poetry/pull/8246/commits/948f3a9b95a200525223b897beaa92c8b255a444
That said, I +1 having a CPU-only unstructured option to handle this.
@jaideep11061982 Were you able to accelerate it? If so, how did you do it?
Thanks
