datafuzz

Problem adding noise to my dataset example

Open: springcoil opened this issue 8 years ago • 1 comment

  • datafuzz version: 0.1.0a
  • Python version: 3.5
  • Operating System: Linux Ubuntu

Description

I was trying to add some noise to a dataset. It didn't work: I got an error about pandas DataFrame objects (see the traceback below).

What I Did

from datafuzz.generators import DatasetGenerator
from datafuzz import DataSet, NoiseMaker, Duplicator

generator = DatasetGenerator({
    'output': 'pandas',
    'schema': {
        'market': ['lemon','orange','pineapple','banana','kiwi','papaya','passion fruit','guava'],
        'Channel': ['Organic/PPC Brand','PPC_Generic/email','Comparison_site','Comparison_site_preapproved', '3rd_Party', 'Rabbit'],
        'name': 'faker.name',
        'created date': range(2005, 2018),
        'city': 'faker.city',
        'Requested_Amount': range(1000, 25000, 1000)
    },
    'num_rows': 1000,
})

generator.generate()

dataset = generator.to_output()
noiser = NoiseMaker(
    dataset,
    noise=['add_nulls', 'random'],
    columns=['market', 'Channel', 'Requested_Amount'],
    percentage=30,
)
noiser.run_strategy()


print(dataset)

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/datafuzz/noise.py", line 58, in __init__
    self.columns = self.get_numeric_columns(self.columns)
  File "/usr/local/lib/python3.5/dist-packages/datafuzz/strategy.py", line 71, in get_numeric_columns
    if self.dataset.data_type == 'pandas' and any([isinstance(c, str)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3614, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'data_type'

springcoil, Jan 17 '18 17:01

Hi @springcoil ,

So excited you are working with datafuzz! Sorry that the documentation is not yet complete. I think this problem is easily solved, and the fix should be better documented somewhere.

Can you try making it a DataSet object, before you pass it to NoiseMaker?

So you can either change the output from the Generator to 'dataset' or run the following:

dataset_input = DataSet(dataset, output='pandas')

If you need it in pandas form after the noise process, you can run:

dataset_input.to_output()
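
For readers wondering why the wrapper matters: the traceback shows the strategy code reading dataset.data_type, an attribute a raw DataFrame does not have. Below is a minimal, dependency-free sketch of that failure and the fix. PlainFrame, WrappedDataSet, and strategy_data_type are hypothetical stand-ins invented for illustration, not datafuzz's or pandas' actual classes.

```python
class PlainFrame:
    """Hypothetical stand-in for a raw DataFrame: it has no data_type attribute."""

    def __init__(self, rows):
        self.rows = rows


class WrappedDataSet:
    """Hypothetical stand-in for a DataSet-style wrapper (not datafuzz's real code)."""

    def __init__(self, frame, output="pandas"):
        self.frame = frame
        self.data_type = "pandas"  # the attribute the strategy code looks up
        self.output = output

    def to_output(self):
        # Hand back the underlying object in its original form.
        return self.frame


def strategy_data_type(dataset):
    """Mimics the attribute access that fails in the traceback."""
    return dataset.data_type


raw = PlainFrame([{"market": "lemon", "Requested_Amount": 1000}])

# Passing the raw object reproduces the AttributeError from the issue.
try:
    strategy_data_type(raw)
    raw_error = None
except AttributeError as exc:
    raw_error = str(exc)

# Wrapping it first gives the strategy the attribute it expects.
wrapped = WrappedDataSet(raw, output="pandas")
wrapped_type = strategy_data_type(wrapped)

print(raw_error)
print(wrapped_type)
```

In the original script, the equivalent change would be to pass the wrapped object (dataset_input above) to NoiseMaker instead of the raw DataFrame, then call to_output() once the noise has been applied.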

Let me know how it goes, and thanks again for giving it a try!

-kj

kjam, Jan 19 '18 08:01