fix: Extra column in CSV (non-raw CSV) when sending chart on email
SUMMARY
When a chart is sent by email as CSV, the (non-raw) CSV contains an extra column holding a row index from 0 to n. This is unwanted: the index shows up as a numbering column when the file is opened on other platforms.
This PR fixes that by adding a `CSV_INDEX = False` setting to `superset_config.py`, which suppresses the index column. Set it to `True` if you want the index in the CSV.
For more detail on this fix, see https://github.com/apache/superset/issues/22981
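The idea behind the flag can be sketched with pandas, whose `to_csv(index=...)` parameter controls whether the row index is written. This is an illustrative sketch, not the PR's actual code; the `dataframe_to_csv` helper and the way the flag is read are assumptions, though the `CSV_INDEX` name matches the setting added here:

```python
import io

import pandas as pd

# Mirrors the CSV_INDEX flag proposed for superset_config.py (hypothetical wiring).
CSV_INDEX = False


def dataframe_to_csv(df: pd.DataFrame, include_index: bool = CSV_INDEX) -> str:
    """Serialize a DataFrame to CSV, writing the 0..n row index only when requested."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=include_index)
    return buffer.getvalue()
```

With `include_index=False` the leading unnamed column disappears, which is exactly the difference reported in the issue.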
TESTING INSTRUCTIONS
- Verified by sending a chart as CSV on email: the extra index column is no longer present.
ADDITIONAL INFORMATION
- [x] Has associated issue: Fixes #22981
- [ ] Required feature flags:
- [ ] Changes UI
- [ ] Includes DB Migration (follow approval process in SIP-59)
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [ ] Confirm DB migration upgrade and downgrade tested
- [ ] Runtime estimates and downtime expectations provided
- [ ] Introduces new feature or API
- [ ] Removes existing feature or API
Codecov Report
Attention: 4 lines in your changes are missing coverage. Please review.
Comparison is base (7a270a5) 67.20% compared to head (14a7b1b) 69.44%. Report is 1 commit behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| superset/commands/report/execute.py | 55.55% | 4 Missing :warning: |
Additional details and impacted files
```
@@ Coverage Diff @@
##           master   #23155      +/-   ##
==========================================
+ Coverage   67.20%   69.44%   +2.24%
==========================================
  Files        1899     1899
  Lines       74347    74357     +10
  Branches     8263     8263
==========================================
+ Hits        49965    51639   +1674
+ Misses      22333    20669   -1664
  Partials     2049     2049
```
| Flag | Coverage Δ | |
|---|---|---|
| mysql | ? | |
| postgres | 78.09% <60.00%> (-0.02%) | :arrow_down: |
| presto | 53.74% <20.00%> (?) | |
| python | 82.89% <60.00%> (+4.64%) | :arrow_up: |
| sqlite | 77.68% <60.00%> (-0.01%) | :arrow_down: |
| unit | 56.41% <10.00%> (?) | |

Flags with carried forward coverage won't be shown. Click here to find out more.
@betodealmeida I have updated the PR with suggested changes, could you please review it again?
Looks like some of the tests are failing; can you take a look at them?
any updates on this?
@mike-fischer1 Actually, I don't think this fix will work: the post-processing logic is used in other flows, and changing it there breaks things like data import, which relies on the appended indexes to preserve the sequence/order of the data.
There are other approaches we could try:
- We could disable post-processing while emailing the data, but then the feature of sending ordered data, as seen in Explore, would break. Link to the API
- Alternatively, we could drop the index after the data is retrieved from the API in the email flow, by checking that CSV data is being returned.
`reports/command/execute.py`

```python
def _get_csv_data(self) -> bytes:
    url = self._get_url(result_format=ChartDataResultFormat.CSV)
    auth_cookies = machine_auth_provider_factory.instance.get_auth_cookies(
        self._get_user()
    )

    if self._report_schedule.chart.query_context is None:
        logger.warning("No query context found, taking a screenshot to generate it")
        self._update_query_context()

    try:
        logger.info("Getting chart from %s", url)
        csv_data = get_chart_csv_data(url, auth_cookies)
        # Here we check whether CSV_INDEX is enabled; if it is disabled,
        # we drop the CSV index column.
    except SoftTimeLimitExceeded as ex:
        raise ReportScheduleCsvTimeout() from ex
    except Exception as ex:
        raise ReportScheduleCsvFailedError(
            f"Failed generating csv {str(ex)}"
        ) from ex
    if not csv_data:
        raise ReportScheduleCsvFailedError()
    return csv_data
```
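The "drop the csv index" step marked in the comment above could look like the helper below. This is a sketch of the idea rather than the PR's actual change; it assumes the index is the first, unnamed column in the returned CSV bytes (which is how pandas writes it):

```python
import csv
import io


def drop_index_column(csv_data: bytes, encoding: str = "utf-8") -> bytes:
    """Drop the first column when its header is empty (pandas' unnamed index)."""
    rows = list(csv.reader(io.StringIO(csv_data.decode(encoding))))
    if not rows or rows[0][:1] != [""]:
        return csv_data  # no unnamed leading column; leave the data untouched
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(row[1:] for row in rows)
    return out.getvalue().encode(encoding)
```

Guarding on the empty header keeps the helper a no-op for CSVs that never had an index column.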
@betodealmeida I have updated my changes since the test cases were previously failing; could you re-run the workflow tests on my latest changes?
@betodealmeida could you please re-run the checks?
Hello everyone :) Any updates on this issue? We are using version 3.0.1 and getting an index column in the CSV file when sending an email report. Does anyone know if there is another fix? Or should we finish this PR?
> Or should we finish this PR?

I'm for it! @rohitpawar2811 are you able to follow up on this?
Enabling CI checks.... 🤞
@betodealmeida @Antonio-RiveroMartnez looks like CI passed 🎉 Any bandwidth to revisit the requested changes?
Hey folks, we have an account manager that has raised this as a concern. Do you think that there will be any updates on this going forward?
I think @betodealmeida's change request is blocking it. It looks like CI will need to run again, but there seem to be a few unanswered questions on the thread. @SkinnyPigeon feel free to give it a code review if you'd like; that always helps. Maybe @Antonio-RiveroMartnez or @eschutho know where this stands in someone's/anyone's queue otherwise.
Hey @rusackas, @rohitpawar2811, playing about with the API a bit, it seems that there is an Enum called POST_PROCESSED that is what is causing the additional index to be added. This is set here in the report execution for the CSV datatype. If this is left blank, at least in the API, it returns the same data minus the index.
We could add the config check to this point instead. WDYT?
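The config check mentioned above could be sketched like this. Everything here is hypothetical except the `CSV_INDEX` flag name, which comes from this PR: the URL construction is a simplified stand-in for Superset's actual `_get_url` helper, and the point is only that leaving the result type blank returns the data minus the index:

```python
from urllib.parse import urlencode


def build_csv_url(base_url: str, include_index: bool) -> str:
    """Append type=post_processed only when the index column is wanted.

    Per the API behaviour observed above, omitting the result type returns
    the same data without the extra index column.
    """
    params = {"format": "csv"}
    if include_index:
        params["type"] = "post_processed"
    return f"{base_url}?{urlencode(params)}"
```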
@SkinnyPigeon I think this should work, but we need to check the test cases. In the past I made changes inside the post-processing logic, but I think your suggested check here makes more sense.
One concern:
- Checking on the URI will remove the DataFrame index as well as the CSV index.

Should we remove the DataFrame (text-email) index as well? What do you think?
With regard to where to make the change, I would suggest here. You could leave that line as is, but then add a check for the config setting underneath and pass the URL to a function like this if we want the index removed:
```python
def remove_post_processed(url: str) -> str:
    """Remove the type=post_processed parameter from the URL query string.

    Args:
        url (str): The URL to process.

    Returns:
        str: The URL with the type=post_processed parameter removed.
    """
    if "?" not in url:
        return url
    base_url, query_string = url.split("?", 1)
    params = query_string.split("&")
    filtered_params = [param for param in params if param != "type=post_processed"]
    filtered_query_string = "&".join(filtered_params)
    filtered_url = f"{base_url}?{filtered_query_string}"
    return filtered_url
```
This gives us something nice and easy to write some tests for:
```python
def test_remove_post_processed():
    url = "https://superset.com/?param1=value1&type=post_processed&param2=value2"
    expected = "https://superset.com/?param1=value1&param2=value2"
    assert remove_post_processed(url) == expected


def test_retain_other_parameters():
    url = "https://superset.com/?param1=value1&param2=value2"
    expected = "https://superset.com/?param1=value1&param2=value2"
    assert remove_post_processed(url) == expected


def test_no_post_processed_present():
    url = "https://superset.com/?param1=value1&param2=value2"
    expected = "https://superset.com/?param1=value1&param2=value2"
    assert remove_post_processed(url) == expected


def test_empty_query_string():
    url = "https://superset.com/?"
    expected = "https://superset.com/?"
    assert remove_post_processed(url) == expected


def test_no_query_string():
    url = "https://superset.com"
    expected = "https://superset.com"
    assert remove_post_processed(url) == expected
```
I'd leave the design decisions to the code owners in terms of whether to remove it from the DataFrame as well, but I'd say the CSV is probably a good place to start for now.