Bug BBQ: Call to `reset_index` in merging lesson not needed?
The merging lesson has the following code excerpt:
# read in first 10 lines of surveys table
survey_sub = surveys_df.head(10)
# grab the last 10 rows
survey_sub_last10 = surveys_df.tail(10)
#reset the index values to the second dataframe appends properly
survey_sub_last10=survey_sub_last10.reset_index(drop=True)
# drop=True option avoids adding new index column with old index values
It's not really explained why reset_index needs to be called for the second frame to append "properly". I tried following along without resetting the index, and it seemed to do the right thing. Of course there's a big gap between the index values in the first and second half of the concatenated df, but that seems fine to me (and even preferable to having duplicate index values, which happens if you call reset_index).
I'm inclined to remove that line, but it's possible I'm missing something.
This might be sensitive to Pandas version. When I run
survey_sub_last10 = surveys_df.tail(10)
I get:

which has the index set to the original DF indices.
But if I used
survey_sub_last10.reset_index(drop=True)
which results in:
which has an index that corresponds to the current DF.
To some extent, which of those things you would like is probably an aesthetic issue. But I think wanting a reset index that doesn't correspond to the original index is a common enough operation that demonstrating that function is warranted.
What do you think?
I see the same behaviour, but what I meant to say was that, in the context of concatenating the dataframes (vertically), resetting the index of the tail frame feels like it's worse than doing nothing, because it leads to duplicate index values (whereas, if we do nothing, we merely have to deal with non-contiguous index values).
But looking at the lesson again, I'm realizing that maybe the reset_index is for the sake of the horizontal stacking (not the vertical stacking, which is what I was thinking of).
I agree that reset_index does seem like a worthwhile thing to teach. But I think the right place to introduce it might be after doing the vertical stack. Which is already mentioned in the lesson:
Have a look at the vertical_stack dataframe? Notice anything unusual? The row indexes for the two data frames survey_sub and survey_sub_last10 have been repeated. We can reindex the new dataframe using the reset_index() method.
Going back to the horizontal stacking part, I have the following thoughts:
- Calling
reset_index()before the horizontal concat is not strictly necessary, in the sense that failing to do so won't raise an Exception. It will just result in a dataframe with more rows and a bunch of NaN values. I'm guessing that that's what the comment about making the dataframe "append properly" is talking about. But it's not clear why one result is more proper than the other, because it's never really stated what the horizontally stacked dataframe is meant to represent, or what problem we're trying to solve. - Expanding on that last part, I'm wondering if the horizontal stacking example should just be cut. The current example seems artificial, and I don't think it leaves the reader with an intuition about why they would ever want to horizontally stack dataframes. Also, my impression is that vertical stacking is a skill that learners are more likely to have a use for.
Sorry, that was a lot text. I hope that makes sense. It might be clearer if I just put together a small P.R. with what I had in mind.
I'm not particularly open to the idea of removing the horizontal stacking, since that's a really useful thing to be able to do. But if you'd like to make a pull request for the rest, have at it.
If that's the case, I'd love to see a note on what kinds of situations it can be useful for.
Or, better yet, for the example to be reworked so that it is one of those situations. i.e. the stacking is being used as a tool to answer some question about the dataset, or solve some problem.
All right, tagging this beginner-friendly if any of our new instructors would like to rework the text or the example.
- [ ] Edit explanation or example on reset_index to better show utility.
@wrightaprilm I'm interested to work on this.
will the issue be fixed by removing the code line for reset_index()
I would suggest as a small 'fix', to make new sub-dataframes for the horizontal stacking, but in this case based on selecting columns. Then you don't need the reindexing and it feels like a more natural use case (although still a bit artificial since we're selecting from the original dataframe).
So we would get:
surveys_sub1 = surveys_df[['month', 'day', 'year']]
surveys_sub2 = surveys_df[['plot_id', 'species_id']]
horizontal_stack = pd.concat([surveys_sub1, surveys_sub2], axis=1)