Python-snappy not found during execution of CSVExampleGen
System information
- Have I specified the code to reproduce the issue (Yes, No): Yes
- Environment in which the code is executed (e.g., Local (Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Zorin OS (Ubuntu 22.04)
- TensorFlow version: 2.15.1
- TFX Version: 1.15.1
- Python version: 3.10.14
- Python dependencies (from pip freeze output):
absl-py==1.4.0 annotated-types==0.7.0 anyio==4.4.0 apache-beam==2.56.0 argon2-cffi==23.1.0 argon2-cffi-bindings==21.2.0 array_record==0.5.1 arrow==1.3.0 asttokens==2.4.1 astunparse==1.6.3 async-lru==2.0.4 async-timeout==4.0.3 attrs==23.2.0 Babel==2.15.0 backcall==0.2.0 beautifulsoup4==4.12.3 bleach==6.1.0 cachetools==5.3.3 certifi==2024.6.2 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 cloudpickle==2.2.1 colorama==0.4.6 comm==0.2.2 contourpy==1.2.1 cramjam==2.8.3 crcmod==1.7 cycler==0.12.1 debugpy==1.8.1 decorator==5.1.1 defusedxml==0.7.1 dill==0.3.1.1 dm-tree==0.1.8 dnspython==2.6.1 docker==4.4.4 docopt==0.6.2 docstring_parser==0.16 etils==1.7.0 exceptiongroup==1.2.1 executing==2.0.1 facets-overview==1.1.1 fastavro==1.9.4 fasteners==0.19 fastjsonschema==2.20.0 flatbuffers==24.3.25 fonttools==4.53.0 fqdn==1.5.1 fsspec==2024.6.0 gast==0.5.4 google-api-core==2.19.0 google-api-python-client==1.12.11 google-apitools==0.5.31 google-auth==2.30.0 google-auth-httplib2==0.2.0 google-auth-oauthlib==1.2.0 google-cloud-aiplatform==1.56.0 google-cloud-bigquery==3.24.0 google-cloud-bigquery-storage==2.25.0 google-cloud-bigtable==2.24.0 google-cloud-core==2.4.1 google-cloud-dataproc==5.9.3 google-cloud-datastore==2.19.0 google-cloud-dlp==3.18.0 google-cloud-language==2.13.3 google-cloud-pubsub==2.21.3 google-cloud-pubsublite==1.10.0 google-cloud-recommendations-ai==0.10.10 google-cloud-resource-manager==1.12.3 google-cloud-spanner==3.47.0 google-cloud-storage==2.17.0 google-cloud-videointelligence==2.13.3 google-cloud-vision==3.7.2 google-crc32c==1.5.0 google-pasta==0.2.0 google-resumable-media==2.7.1 googleapis-common-protos==1.63.1 grpc-google-iam-v1==0.13.0 grpc-interceptor==0.15.4 grpcio==1.64.1 grpcio-status==1.48.2 h11==0.14.0 h5py==3.11.0 hdfs==2.7.3 httpcore==1.0.5 httplib2==0.22.0 httpx==0.27.0 idna==3.7 immutabledict==4.2.0 importlib_resources==6.4.0 ipykernel==6.29.4 ipython==8.25.0 ipython-genutils==0.2.0 ipywidgets==8.1.3 isoduration==20.11.0
jedi==0.19.1 Jinja2==3.1.4 joblib==1.4.2 Js2Py==0.74 json5==0.9.25 jsonpickle==3.2.1 jsonpointer==3.0.0 jsonschema==4.22.0 jsonschema-specifications==2023.12.1 jupyter-events==0.10.0 jupyter-lsp==2.2.5 jupyter_client==8.2.0 jupyter_core==5.7.2 jupyter_server==2.14.1 jupyter_server_terminals==0.5.3 jupyterlab==4.2.2 jupyterlab_pygments==0.3.0 jupyterlab_server==2.27.2 jupyterlab_widgets==3.0.11 keras==2.15.0 keras-tuner==1.4.7 kiwisolver==1.4.5 kt-legacy==1.0.5 kubernetes==12.0.1 libclang==18.1.1 lxml==5.2.2 Markdown==3.6 MarkupSafe==2.1.5 matplotlib==3.9.0 matplotlib-inline==0.1.7 mistune==3.0.2 ml-dtypes==0.3.2 ml-metadata==1.15.0 ml-pipelines-sdk==1.15.1 mplcyberpunk==0.7.1 nbclient==0.10.0 nbconvert==7.16.4 nbformat==5.10.4 nest-asyncio==1.6.0 nltk==3.8.1 notebook==7.2.1 notebook_shim==0.2.4 numpy==1.26.4 oauth2client==4.1.3 oauthlib==3.2.2 objsize==0.7.0 opt-einsum==3.3.0 orjson==3.10.5 overrides==7.7.0 packaging==24.1 pandas==1.5.3 pandocfilters==1.5.1 parso==0.8.4 pathlib==1.0.1 pexpect==4.9.0 pickleshare==0.7.5 pillow==10.3.0 platformdirs==4.2.2 portalocker==2.8.2 portpicker==1.6.0 prometheus_client==0.20.0 promise==2.3 prompt_toolkit==3.0.47 proto-plus==1.23.0 protobuf==3.20.3 psutil==5.9.8 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==10.0.1 pyarrow-hotfix==0.6 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycparser==2.22 pydantic==2.7.4 pydantic_core==2.18.4 pydot==1.4.2 pyfarmhash==0.3.2 Pygments==2.18.0 pyjsparser==2.7.1 pymongo==4.7.3 pyparsing==3.1.2 python-dateutil==2.9.0.post0 python-json-logger==2.0.7 python-snappy==0.7.2 pytz==2024.1 PyYAML==6.0.1 pyzmq==26.0.3 redis==5.0.6 referencing==0.35.1 regex==2024.5.15 requests==2.32.3 requests-oauthlib==2.0.0 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rouge_score==0.1.2 rpds-py==0.18.1 rsa==4.9 sacrebleu==2.4.2 scipy==1.12.0 seaborn==0.13.2 Send2Trash==1.8.3 shapely==2.0.4 simple_parsing==0.1.5 six==1.16.0 sniffio==1.3.1 soupsieve==2.5 sqlparse==0.5.0 stack-data==0.6.3 tabulate==0.9.0 tensorboard==2.15.2 
tensorboard-data-server==0.7.2 tensorflow==2.15.1 tensorflow-data-validation==1.15.1 tensorflow-datasets==4.9.6 tensorflow-estimator==2.15.0 tensorflow-hub==0.15.0 tensorflow-io-gcs-filesystem==0.37.0 tensorflow-metadata==1.15.0 tensorflow-serving-api==2.15.1 tensorflow-transform==1.15.0 tensorflow_model_analysis==0.46.0 termcolor==2.4.0 terminado==0.18.1 tfx==1.15.1 tfx-bsl==1.15.1 timeloop==1.0.2 tinycss2==1.3.0 toml==0.10.2 tomli==2.0.1 tornado==6.4.1 tqdm==4.66.4 traitlets==5.14.3 types-python-dateutil==2.9.0.20240316 typing_extensions==4.12.2 tzlocal==5.2 uri-template==1.3.0 uritemplate==3.0.1 urllib3==2.2.2 wcwidth==0.2.13 webcolors==24.6.0 webencodings==0.5.1 websocket-client==1.8.0 Werkzeug==3.0.3 widgetsnbextension==4.0.11 wrapt==1.14.1 zipp==3.19.2 zstandard==0.22.0
Current Behavior
I have a simple pipeline consisting of a single component (CsvExampleGen) that ingests CSV files and converts them to TFRecords.
However, upon running the pipeline I get the following warning:
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
I have already installed python-snappy and the corresponding C library using the following commands:
sudo apt-get install libsnappy-dev
pip install python-snappy
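To rule out an environment mismatch (for example, the notebook kernel running a different interpreter than the one pip installed into), a quick check along these lines may help. This is only a sketch; the helper name installed_version is hypothetical and not part of TFX or Beam.

```python
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Run this in the same kernel or interpreter that executes the pipeline.
print("interpreter:", sys.executable)
print("python-snappy:", installed_version("python-snappy"))
```

If the reported interpreter is not the one pip installed into, or the version prints as None, the package is not visible to the process that runs the pipeline.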
Expected behavior: The execution of this simple pipeline should be much faster and no such warnings should be produced.
Standalone code to reproduce the issue
Download any moderately sized csv file with numerical data and run the following code:
import os

from tfx.proto import example_gen_pb2
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.orchestration.pipeline import Pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner
from tfx.orchestration import metadata

pipeline_root = 'artifacts'
data_dir = 'data'

# Read a single CSV file from data_dir.
input_config = example_gen_pb2.Input(
    splits=[
        example_gen_pb2.Input.Split(name='data', pattern='data.csv')
    ]
)

# Split the examples 80/20 into train and eval.
output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=8),
            example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=2)
        ]
    )
)

example_gen = CsvExampleGen(
    input_base=data_dir,
    input_config=input_config,
    output_config=output_config,
)

pipeline = Pipeline(
    pipeline_name='testing pipeline',
    pipeline_root=pipeline_root,
    components=[
        example_gen
    ],
    enable_cache=True,
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        os.path.join(pipeline_root, 'metadata.sqlite')
    )
)

LocalDagRunner().run(pipeline)
Hi, sorry for responding to this issue so late.
The warning appears when it fails to import python snappy, as per https://github.com/apache/beam/blob/v2.56.0/sdks/python/apache_beam/io/tfrecordio.py#L48.
Could you please run python3 -c 'import snappy' to check whether it imports properly? Thanks!
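A slightly more detailed check than the one-liner can also show whether the module exposes the private CRC32C helper that, as far as I can tell from the linked Beam source, the TFRecord writer looks for (_crc32c on the module itself or on snappy._snappy). That attribute lookup is my reading of the code and should be treated as an assumption; diagnose_snappy is a hypothetical helper name.

```python
import importlib


def diagnose_snappy(module_name="snappy"):
    """Report whether the module imports and whether a private CRC32C helper is present."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError as exc:
        return {"importable": False, "has_crc32c": False, "detail": str(exc)}

    # Assumption: Beam checks these two private locations for the fast CRC32C helper.
    top_level = getattr(mod, "_crc32c", None)
    nested = getattr(getattr(mod, "_snappy", None), "_crc32c", None)
    return {
        "importable": True,
        "has_crc32c": bool(top_level) or bool(nested),
        "detail": getattr(mod, "__version__", "unknown"),
    }


if __name__ == "__main__":
    print(diagnose_snappy())
```

If importable is True but has_crc32c is False, the import itself is fine, and the warning may instead come from the installed python-snappy version not exposing that helper.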
This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for past 7 days.