Python code to generate a report on the validation and credibility of datasets
Description of the Issue
When users download a dataset for their project, we want to give them a clear overview of that dataset in the form of a report, so that they have a better idea of how to use it effectively in their own project.
Expected Behavior
We expect the report to provide:
- More statistical analysis of the dataset
- How its values are distributed, shown as plots
- Checks for corrupted and mismatched data
- Suggestions for the kinds of project the dataset would suit
- Suggestions for preprocessing the dataset so it can be used effectively in a project
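To make the statistical and corruption checks above more concrete, here is a minimal sketch using only the standard library. It is not the code in Main.py. The function name `profile_csv`, the shape of the returned dictionary, and the "at least half of the values parse as numbers" rule for flagging mismatches are all assumptions made for illustration.

```python
import csv
import io
import statistics


def _to_float(value):
    """Return value as a float, or None if it does not parse as a number."""
    try:
        return float(value)
    except ValueError:
        return None


def profile_csv(text):
    """Build a per-column summary from CSV text that has a header row.

    Hypothetical helper: the returned keys are illustrative only.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    profile = {}
    for name in reader.fieldnames or []:
        # Short rows yield None for missing fields, so normalise to "".
        raw = [(row[name] or "").strip() for row in rows]
        present = [v for v in raw if v != ""]
        numbers = [n for n in map(_to_float, present) if n is not None]
        info = {
            "count": len(present),
            "missing": len(raw) - len(present),
            "mismatched": 0,
        }
        # Assumption: a column counts as numeric when at least half of
        # its non-empty values parse as numbers; the rest are mismatches.
        if numbers and len(numbers) >= len(present) / 2:
            info["mismatched"] = len(present) - len(numbers)
            info["mean"] = statistics.mean(numbers)
            info["min"] = min(numbers)
            info["max"] = max(numbers)
        profile[name] = info
    return profile


if __name__ == "__main__":
    sample = "age,city\n30,Paris\n,London\nabc,Rome\n40,\n"
    for column, info in profile_csv(sample).items():
        print(column, info)
```

Plots and project suggestions are left out of this sketch; the per-column dictionary could be written to Report.txt or fed into a plotting step later.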
The report should be generated according to the dataset's format, such as CSV, JSON, or TXT.
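One possible way to handle several formats is to pick a parser based on the file extension and turn every dataset into a list of records before any checks run. The sketch below assumes that approach. `parse_dataset` and `load_dataset` are hypothetical names, and treating each line of a .txt file as one record is only a placeholder choice.

```python
import csv
import io
import json
from pathlib import Path


def parse_dataset(text, fmt):
    """Parse dataset text into a list of dict records.

    fmt is a format name or file suffix such as "csv" or ".json".
    """
    fmt = fmt.lower().lstrip(".")
    if fmt == "csv":
        # Each row becomes a dict keyed by the header row.
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        # Wrap a single top-level object so callers always get a list.
        return data if isinstance(data, list) else [data]
    if fmt == "txt":
        # Assumption: plain text is treated as one record per line.
        return [{"line": line} for line in text.splitlines()]
    raise ValueError(f"Unsupported dataset format: {fmt!r}")


def load_dataset(path):
    """Read a file and parse it according to its extension."""
    path = Path(path)
    return parse_dataset(path.read_text(encoding="utf-8"), path.suffix)
```

With this shape, the report code only needs to understand lists of records, and supporting another format means adding one more branch.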
Current Behavior
Main.py in the Validation folder implements some of the points mentioned above. You can also look at Report.txt for a sample of the report we generate.
Contributions
You can implement the features one at a time and open a pull request for each. We look forward to your valuable contributions and collaboration.
OK, from your issue description I understand that you want:
- Code that gives the user statistical information about all the features of the dataset, such as mean, count, etc.
- Visualization of the features against the target variable on a plot.
- Detection of any missing values and format issues; basically, feature engineering to improve the dataset.
- A judgment, based on the features and the target, of the projects for which the dataset would be useful.
So if I have understood your intentions correctly, could you please assign this issue to me? :)
Thank you for volunteering, @Ayushlion8. You can start by trying out any single feature.
OK @Gladwin001, do you mean that I should do all the feature engineering and data preprocessing on one independent feature?
So from a dataset I would choose one feature, write the code for it, and then either add that file to a folder or create a PR for it directly?
Thank you for volunteering, @Ayushlion8. You can start by trying out any single feature.
I would suggest breaking this issue into small issues so it can be handled by 2 or 3 contributors.
I am also interested in contributing to this issue.
@VigneshRamanathan101 and @Ayushlion8, you can break this issue into smaller issues and proceed.
@Ayushlion8 @VigneshRamanathan101 I started on this before the issue was originally created. Feel free to work off what I've already done: https://github.com/neokd/DataStorehouse/pull/105
@Ayushlion8 @VigneshRamanathan101 any updates on the issue?
@neokd the modifications are in progress; I will update you soon with the PR. Thanks for your patience :)
@Ayushlion8 Yeah sure