tfx Python-snappy not found during execution of CSVExampleGen

System information

Have I specified the code to reproduce the issue (Yes, No): Yes
Environment in which the code is executed: Zorin OS (Ubuntu 22.04)
TensorFlow version: 2.15.1
TFX Version: 1.15.1
Python version: 3.10.14
Python dependencies (from pip freeze output): absl-py==1.4.0 annotated-types==0.7.0 anyio==4.4.0 apache-beam==2.56.0 argon2-cffi==23.1.0 argon2-cffi-bindings==21.2.0 array_record==0.5.1 arrow==1.3.0 asttokens==2.4.1 astunparse==1.6.3 async-lru==2.0.4 async-timeout==4.0.3 attrs==23.2.0 Babel==2.15.0 backcall==0.2.0 beautifulsoup4==4.12.3 bleach==6.1.0 cachetools==5.3.3 certifi==2024.6.2 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 cloudpickle==2.2.1 colorama==0.4.6 comm==0.2.2 contourpy==1.2.1 cramjam==2.8.3 crcmod==1.7 cycler==0.12.1 debugpy==1.8.1 decorator==5.1.1 defusedxml==0.7.1 dill==0.3.1.1 dm-tree==0.1.8 dnspython==2.6.1 docker==4.4.4 docopt==0.6.2 docstring_parser==0.16 etils==1.7.0 exceptiongroup==1.2.1 executing==2.0.1 facets-overview==1.1.1 fastavro==1.9.4 fasteners==0.19 fastjsonschema==2.20.0 flatbuffers==24.3.25 fonttools==4.53.0 fqdn==1.5.1 fsspec==2024.6.0 gast==0.5.4 google-api-core==2.19.0 google-api-python-client==1.12.11 google-apitools==0.5.31 google-auth==2.30.0 google-auth-httplib2==0.2.0 google-auth-oauthlib==1.2.0 google-cloud-aiplatform==1.56.0 google-cloud-bigquery==3.24.0 google-cloud-bigquery-storage==2.25.0 google-cloud-bigtable==2.24.0 google-cloud-core==2.4.1 google-cloud-dataproc==5.9.3 google-cloud-datastore==2.19.0 google-cloud-dlp==3.18.0 google-cloud-language==2.13.3 google-cloud-pubsub==2.21.3 google-cloud-pubsublite==1.10.0 google-cloud-recommendations-ai==0.10.10 google-cloud-resource-manager==1.12.3 google-cloud-spanner==3.47.0 google-cloud-storage==2.17.0 google-cloud-videointelligence==2.13.3 google-cloud-vision==3.7.2 google-crc32c==1.5.0 google-pasta==0.2.0 google-resumable-media==2.7.1 googleapis-common-protos==1.63.1 grpc-google-iam-v1==0.13.0 grpc-interceptor==0.15.4 grpcio==1.64.1 grpcio-status==1.48.2 h11==0.14.0 h5py==3.11.0 hdfs==2.7.3 httpcore==1.0.5 httplib2==0.22.0 httpx==0.27.0 idna==3.7 immutabledict==4.2.0 importlib_resources==6.4.0 ipykernel==6.29.4 ipython==8.25.0 ipython-genutils==0.2.0 
ipywidgets==8.1.3 isoduration==20.11.0 jedi==0.19.1 Jinja2==3.1.4 joblib==1.4.2 Js2Py==0.74 json5==0.9.25 jsonpickle==3.2.1 jsonpointer==3.0.0 jsonschema==4.22.0 jsonschema-specifications==2023.12.1 jupyter-events==0.10.0 jupyter-lsp==2.2.5 jupyter_client==8.2.0 jupyter_core==5.7.2 jupyter_server==2.14.1 jupyter_server_terminals==0.5.3 jupyterlab==4.2.2 jupyterlab_pygments==0.3.0 jupyterlab_server==2.27.2 jupyterlab_widgets==3.0.11 keras==2.15.0 keras-tuner==1.4.7 kiwisolver==1.4.5 kt-legacy==1.0.5 kubernetes==12.0.1 libclang==18.1.1 lxml==5.2.2 Markdown==3.6 MarkupSafe==2.1.5 matplotlib==3.9.0 matplotlib-inline==0.1.7 mistune==3.0.2 ml-dtypes==0.3.2 ml-metadata==1.15.0 ml-pipelines-sdk==1.15.1 mplcyberpunk==0.7.1 nbclient==0.10.0 nbconvert==7.16.4 nbformat==5.10.4 nest-asyncio==1.6.0 nltk==3.8.1 notebook==7.2.1 notebook_shim==0.2.4 numpy==1.26.4 oauth2client==4.1.3 oauthlib==3.2.2 objsize==0.7.0 opt-einsum==3.3.0 orjson==3.10.5 overrides==7.7.0 packaging==24.1 pandas==1.5.3 pandocfilters==1.5.1 parso==0.8.4 pathlib==1.0.1 pexpect==4.9.0 pickleshare==0.7.5 pillow==10.3.0 platformdirs==4.2.2 portalocker==2.8.2 portpicker==1.6.0 prometheus_client==0.20.0 promise==2.3 prompt_toolkit==3.0.47 proto-plus==1.23.0 protobuf==3.20.3 psutil==5.9.8 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==10.0.1 pyarrow-hotfix==0.6 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycparser==2.22 pydantic==2.7.4 pydantic_core==2.18.4 pydot==1.4.2 pyfarmhash==0.3.2 Pygments==2.18.0 pyjsparser==2.7.1 pymongo==4.7.3 pyparsing==3.1.2 python-dateutil==2.9.0.post0 python-json-logger==2.0.7 python-snappy==0.7.2 pytz==2024.1 PyYAML==6.0.1 pyzmq==26.0.3 redis==5.0.6 referencing==0.35.1 regex==2024.5.15 requests==2.32.3 requests-oauthlib==2.0.0 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rouge_score==0.1.2 rpds-py==0.18.1 rsa==4.9 sacrebleu==2.4.2 scipy==1.12.0 seaborn==0.13.2 Send2Trash==1.8.3 shapely==2.0.4 simple_parsing==0.1.5 six==1.16.0 sniffio==1.3.1 soupsieve==2.5 sqlparse==0.5.0 stack-data==0.6.3 
tabulate==0.9.0 tensorboard==2.15.2 tensorboard-data-server==0.7.2 tensorflow==2.15.1 tensorflow-data-validation==1.15.1 tensorflow-datasets==4.9.6 tensorflow-estimator==2.15.0 tensorflow-hub==0.15.0 tensorflow-io-gcs-filesystem==0.37.0 tensorflow-metadata==1.15.0 tensorflow-serving-api==2.15.1 tensorflow-transform==1.15.0 tensorflow_model_analysis==0.46.0 termcolor==2.4.0 terminado==0.18.1 tfx==1.15.1 tfx-bsl==1.15.1 timeloop==1.0.2 tinycss2==1.3.0 toml==0.10.2 tomli==2.0.1 tornado==6.4.1 tqdm==4.66.4 traitlets==5.14.3 types-python-dateutil==2.9.0.20240316 typing_extensions==4.12.2 tzlocal==5.2 uri-template==1.3.0 uritemplate==3.0.1 urllib3==2.2.2 wcwidth==0.2.13 webcolors==24.6.0 webencodings==0.5.1 websocket-client==1.8.0 Werkzeug==3.0.3 widgetsnbextension==4.0.11 wrapt==1.14.1 zipp==3.19.2 zstandard==0.22.0

Current Behavior

I have a simple pipeline consisting of a single component (CsvExampleGen) that ingests CSV files and converts them to TFRecords. However, when I run the pipeline I get the following warning:

WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.

I have already installed python-snappy and the corresponding C library using the following commands:

sudo apt-get install libsnappy-dev
pip install python-snappy

Expected behavior

The execution of this simple pipeline should be much faster, and no such warning should be produced.

Standalone code to reproduce the issue

Download any moderately sized CSV file with numerical data, save it as data/data.csv, and run the following code:

import os  # needed for os.path.join below

from tfx.proto import example_gen_pb2
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.orchestration.pipeline import Pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner
from tfx.orchestration import metadata

pipeline_root = 'artifacts'
data_dir = 'data'

input_config = example_gen_pb2.Input(
    splits=[
        example_gen_pb2.Input.Split(name='data', pattern='data.csv')
    ]
)

output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=8),
            example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=2)
        ]
    )
)

example_gen = CsvExampleGen(
    input_base=data_dir,
    input_config=input_config,
    output_config=output_config,
)

pipeline = Pipeline(
    pipeline_name='testing pipeline',
    pipeline_root=pipeline_root,
    components=[
        example_gen
    ],
    enable_cache=True,
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        os.path.join(pipeline_root, 'metadata.sqlite')
    )
)

LocalDagRunner().run(pipeline)

Jun 23 '24 10:06 sagnik-t

Hi, sorry for responding to this issue so late.

The warning appears when Beam fails to import python-snappy, as per https://github.com/apache/beam/blob/v2.56.0/sdks/python/apache_beam/io/tfrecordio.py#L48.

Could you please run python3 -c 'import snappy' to check whether it imports properly? Thanks!
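If a bare import is not conclusive, a slightly fuller check might help. The sketch below is only an illustration: `diagnose_snappy` is a hypothetical helper name, and the attribute names it probes (`_crc32c` and `_snappy._crc32c`) are an assumption based on what the linked Beam source appears to look for, so they should be verified against the installed Beam version. It reports whether the module imports, which version it claims to be, and whether either CRC32C helper is present.

```python
import importlib


def diagnose_snappy(module_name="snappy"):
    """Report whether a module imports and exposes a CRC32C helper.

    The attribute names checked here are assumptions taken from a reading
    of Beam's tfrecordio.py; they are not a documented python-snappy API.
    """
    report = {"importable": False, "version": None, "crc32c_helper": None}
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        # The module cannot be imported in this interpreter at all.
        return report

    report["importable"] = True
    report["version"] = getattr(mod, "__version__", None)

    # Look for a top-level helper first, then one on a private submodule.
    if getattr(mod, "_crc32c", None):
        report["crc32c_helper"] = f"{module_name}._crc32c"
    elif getattr(getattr(mod, "_snappy", None), "_crc32c", None):
        report["crc32c_helper"] = f"{module_name}._snappy._crc32c"
    return report


if __name__ == "__main__":
    print(diagnose_snappy())
```

If `importable` is True but `crc32c_helper` is None, the package is installed yet may not expose the function Beam is trying to use, which would be worth mentioning in a follow-up. Running the check from the same interpreter or notebook kernel that runs the pipeline avoids testing a different environment by mistake.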

Jul 24 '24 03:07 lego0901

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

Dec 20 '24 02:12 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for past 7 days.

Dec 28 '24 01:12 github-actions[bot]
