superset fix: Extra column in csv(or non-raw csv) get while sending on email

SUMMARY

While sending csv on Email getting an extra column(non-raw csv get) which has a count from 0 to n-row . Which I should not want.( And I think no one wants in CSV because we get numbering while opening it on platforms )

I have fix this thing and added CSV_INDEX=False //In superset_config.py Which make to not create an index If You want index in csv make it True.

For More detail on this fix Refer this issue https://github.com/apache/superset/issues/22981

TESTING INSTRUCTIONS

Simply Checked By sending chart as csv on email extra index-column is not coming

ADDITIONAL INFORMATION

[x] Has associated issue: Fixes 22981
[ ] Required feature flags:
[ ] Changes UI
[ ] Includes DB Migration (follow approval process in SIP-59)
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [ ] Confirm DB migration upgrade and downgrade tested
- [ ] Runtime estimates and downtime expectations provided
[ ] Introduces new feature or API
[ ] Removes existing feature or API

Feb 22 '23 12:02 rohitpawar2811

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (7a270a5) 67.20% compared to head (14a7b1b) 69.44%. Report is 1 commits behind head on master.

Files	Patch %	Lines
superset/commands/report/execute.py	55.55%	4 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #23155      +/-   ##
==========================================
+ Coverage   67.20%   69.44%   +2.24%     
==========================================
  Files        1899     1899              
  Lines       74347    74357      +10     
  Branches     8263     8263              
==========================================
+ Hits        49965    51639    +1674     
+ Misses      22333    20669    -1664     
  Partials     2049     2049

Flag	Coverage Δ
mysql	`?`
postgres	`78.09% <60.00%> (-0.02%)`	:arrow_down:
presto	`53.74% <20.00%> (?)`
python	`82.89% <60.00%> (+4.64%)`	:arrow_up:
sqlite	`77.68% <60.00%> (-0.01%)`	:arrow_down:
unit	`56.41% <10.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Mar 07 '23 18:03 codecov[bot]

@betodealmeida I have updated the PR with suggested changes, could you please review it again?

Mar 31 '23 05:03 rohitpawar2811

@betodealmeida I have updated the PR with suggested changes, could you please review it again?

Looks like some of the tests are failing, can you take a look at them?

Mar 31 '23 20:03 betodealmeida

any updates on this?

Sep 22 '23 19:09 mike-fischer1

@mike-fischer1 Actually, I think this fix would not work, as post-processing logic is used in other flows and is breaking things like data-import for maintaining the sequence/order of data through appending indexes.

There are other ways we can try:

We can disable the post_processing while emailing the data, but then the feature of sending ordered data, as we saw in Explore, will break. Link to the API
Alternatively, we could drop the index after the data is retrieved from the API in the Email-Flow by checking that we have to return CSV data."

reports/command/execute.py

def _get_csv_data(self) -> bytes:
    url = self._get_url(result_format=ChartDataResultFormat.CSV)
    auth_cookies = machine_auth_provider_factory.instance.get_auth_cookies(
        self._get_user()
    )

    if self._report_schedule.chart.query_context is None:
        logger.warning("No query context found, taking a screenshot to generate it")
        self._update_query_context()

    try:
        logger.info("Getting chart from %s", url)
        csv_data = get_chart_csv_data(url, auth_cookies)
        **###Here we apply check for CSV_INDEX enable disable and if disable then drop the csv index.**
    except SoftTimeLimitExceeded as ex:
        raise ReportScheduleCsvTimeout() from ex
    except Exception as ex:
        raise ReportScheduleCsvFailedError(
            f"Failed generating csv {str(ex)}"
        ) from ex
    if not csv_data:
        raise ReportScheduleCsvFailedError()
    return csv_data

Sep 25 '23 06:09 rohitpawar2811

@betodealmeida I have updated my changes as previously the test-cases were failing, could you re-run the workflow tests on my latest changes

Oct 02 '23 11:10 rohitpawar2811

@betodealmeida could you please re-run the checks.

Oct 12 '23 05:10 rohitpawar2811

Hello everyone :) Any updates on this issue? We are using Version: 3.0.1 and getting index column in csv file when sending email report. Maybe someone know if there another fix? Or should we finish this PR?

Dec 05 '23 09:12 romran

Or should we finish this PR?

I'm for it! @rohitpawar2811 are you able to follow up on this?

Feb 01 '24 20:02 rusackas

Enabling CI checks.... 🤞

Feb 05 '24 17:02 rusackas

@betodealmeida @Antonio-RiveroMartnez looks like CI passed 🎉 Any bandwidth to revisit the requested changes?

Feb 09 '24 20:02 rusackas

Hey folks, we have an account manager that has raised this as a concern. Do you think that there will be any updates on this going forward?

Jun 07 '24 12:06 SkinnyPigeon

I think @betodealmeida's change request is blocking it. It looks like CI will need to run again, but there seem to be a few unanswered questions on the thread. @SkinnyPigeon feel free to give it a code review if you'd like, that always helps. Mabye @Antonio-RiveroMartnez or @eschutho know where this stands in someone's/anyone's queue otherwise.

Jun 07 '24 19:06 rusackas

Hey @rusackas, @rohitpawar2811, playing about with the API a bit, it seems that there is an Enum called POST_PROCESSED that is what is causing the additional index to be added. This is set here in the report execution for the CSV datatype. If this is left blank, at least in the API, it returns the same data minus the index.

We could add the config check to this point instead. WDYT?

Jun 10 '24 10:06 SkinnyPigeon

Hey @rusackas, @rohitpawar2811, playing about with the API a bit, it seems that there is an Enum called POST_PROCESSED that is what is causing the additional index to be added. This is set here in the report execution for the CSV datatype. If this is left blank, at least in the API, it returns the same data minus the index.

We could add the config check to this point instead. WDYT?

@SkinnyPigeon I think this should work, but we need to check the test cases. Actually, in the past, I made changes inside to the post-processing logic, but I think your suggested check here makes more sense.

Disturbing Part :

Making check on uri will remove csv index as well dataframe index.

Should we remove the DataFrame (text-email) index as well? What do you think?

Jun 10 '24 15:06 rohitpawar2811

Hey @rusackas, @rohitpawar2811, playing about with the API a bit, it seems that there is an Enum called POST_PROCESSED that is what is causing the additional index to be added. This is set here in the report execution for the CSV datatype. If this is left blank, at least in the API, it returns the same data minus the index. We could add the config check to this point instead. WDYT?

@SkinnyPigeon I think this should work, but we need to check the test cases. Actually, in the past, I made changes inside to the post-processing logic, but I think your suggested check here makes more sense.

Disturbing Part :

Making check on uri will remove csv index as well dataframe index.

Should we remove the DataFrame (text-email) index as well? What do you think?

With regards to where to make the change, I would suggest here. You could leave that line as is but then have a check for the config setting underneath and pass the URL to a function like this if we want the index removed:

def remove_post_processed(url: str) -> str:
    """Remove the type=post_processed parameter from the URL query string.
    
    Args:
        url (str): The URL to process.
    Returns:
        str: The URL with the type=post_processed parameter removed."""
    if "?" not in url:
        return url
    base_url, query_string = url.split("?", 1)
    params = query_string.split("&")
    filtered_params = [param for param in params if param != "type=post_processed"]
    filtered_query_string = "&".join(filtered_params)
    filtered_url = f"{base_url}?{filtered_query_string}"
    return filtered_url

This gives us something nice and easy to write some tests for:

def test_remove_post_processed():
    url = "https://superset.com/?param1=value1&type=post_processed&param2=value2"
    expected = "https://superset.com/?param1=value1&param2=value2"
    assert remove_post_processed(url) == expected

def test_retain_other_parameters():
    url = "https://superset.com/?param1=value1&param2=value2"
    expected = "https://superset.com/?param1=value1&param2=value2"
    assert remove_post_processed(url) == expected

def test_no_post_processed_present():
    url = "https://superset.com/?param1=value1&param2=value2"
    expected = "https://superset.com/?param1=value1&param2=value2"
    assert remove_post_processed(url) == expected

def test_empty_query_string():
    url = "https://superset.com/?"
    expected = "https://superset.com/?"
    assert remove_post_processed(url) == expected

def test_no_query_string():
    url = "https://superset.com"
    expected = "https://superset.com"
    assert remove_post_processed(url) == expected

I'd leave the design decisions to the the code owners in terms of whether to remove it from the DataFrame as well, but I'd say the CSV is probably a good place to start for now

Jun 11 '24 07:06 SkinnyPigeon