pdpipe icon indicating copy to clipboard operation
pdpipe copied to clipboard

Append & Merge PdPipeline Stages for Joining Pandas DataFrames

Open Asrst opened this issue 4 years ago • 6 comments

Thanks for this library, it made my pandas workflow clean & easy to maintain.

Feature Request: pd.merge is a common operation when dealing with 2 or more data frames & pipeline stage for this is missing. Currently, I am implementing a custom pipeline stage pdp.Merge from PdPipelineStage & would like to contribute.

Asrst avatar Jun 05 '21 07:06 Asrst

Sure, I'd appreciate any contribution! :) Let me know if you need any help.

And thank you for the kind words. :)

shaypal5 avatar Jun 05 '21 10:06 shaypal5

Hi, can we close this issue.? I already raised a pull request with this features #48

Let me know, if you need any help in migrating to GitHub actions/releasing the changes.

Asrst avatar Sep 06 '21 10:09 Asrst

Hey @Asrst ,

Sorry for the belated response! :) I've migrated successfully to Github Actions.

I'd love it if you can rebase over the current head of the master branch, and then open a new pull request. Tests should then work properly, including linting and coverage reports by codecov.

Cheers, Shay

shaypal5 avatar Sep 13 '21 08:09 shaypal5

By the way, you can see here your code reduces test coverage below 100% (which we cannot have): https://github.com/pdpipe/pdpipe/pull/53

Please enrich tests to cover all cases you accounted for in your code.

I don't want to use that PR (#53, mine), as it has two merge commits. I rather you rebase over the master and open a PR with a single code commit, which I can then fast-forward to, to avoid a merge commit completely. :)

shaypal5 avatar Sep 13 '21 09:09 shaypal5

hi @shaypal5,

Improved the code coverage, rebased over the master & raised a new pull request. #55

Asrst avatar Sep 26 '21 08:09 Asrst

Hey @Asrst ,

I'm so sorry. I just got to reading the PR, and unfortunately the current implementation does not make a lot of sense. It seems like you wrote two stages that get a single dataframe when creating the pipeline, and then on application append or merge it to the input dataframe.

Can you explain the use case? Share your use of it?

I think stages using the pandas.DataFrame.append and pandas.DataFrame.merge operations must mainly cater to use cases where both dataframes are somehow inputted on application time, and not one of them being statically set when creating the pipeline.

If your use case is very specific, I would prefer you stick with a AdHocStage that implements the same functionality. And even if you show this to be a generalizable use case, stage names will have to be more specific, as this is a very specific use of the methods. Maybe AppendFixedFrame and MergedFixedFrame or something.

shaypal5 avatar Sep 29 '21 11:09 shaypal5