
Reports with Lots of Data Load Very Slowly

Open bbertucc opened this issue 1 year ago • 15 comments

Problem

Displaying live data limits the filtering we can offer users with large numbers of URLs, and it makes reports with lots of data load very slowly. Users also don't use the live data, since they only check on issues once in a while.

Solution

We are going to step through optimizations, trying things as we go.

First Optimization

  • [ ] Making a smaller dictionary of messages and tags, so that users are sharing data instead of having their own. That should speed up inserts and queries.

We'll bounty that one task to @heythisischris for now, then check in after you do that. Please add a budget here.

Future Optimization Ideas

A "Generate Report" feature would queue report generation, then disable results of generated report when user accesses content via API or Dashboard.

Dashboard User Experience Changes

  • Add a timestamp of the last generated report and an "Update" button to the report page. The "Update" button would be conditionally disabled and show the value "Updating..." while a report is being regenerated.
  • While a report is generating, show the notice "Report is updating" in the timestamp area and disable the "Update" button. When generation finishes, that notice is replaced with "Updated: dd/mm/yy hh:mm CST" and the "Update" button becomes active again. This flow runs after filters are updated, when a report is created for the first time, or when the "Generate" button is clicked.

Accessibility Notes

  • Use a toast notification to announce when a report has been updated.
  • Use a toast notification to announce when a report update has started.
  • Use a toast notification to announce if a report update has failed (a rough sketch of this status/announcement flow is below).
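
A minimal sketch of how the status text, "Update" button, and announcements could fit together; the element IDs and the aria-live toast container are assumptions, not the current dashboard markup:

```ts
// Sketch only: report status flow with an aria-live toast for announcements.
const statusEl = document.getElementById('report-status')!;
const updateBtn = document.getElementById('update-report') as HTMLButtonElement;
const toastEl = document.getElementById('toast')!; // container with aria-live="polite"

// Screen readers announce text changes inside the aria-live region.
function announce(message: string) {
  toastEl.textContent = message;
}

export function reportUpdating() {
  statusEl.textContent = 'Report is updating';
  updateBtn.disabled = true;
  announce('Report update started');
}

export function reportUpdated(finishedAt: Date) {
  // Exact date format is an assumption; the issue calls for "dd/mm/yy hh:mm CST".
  statusEl.textContent =
    `Updated: ${finishedAt.toLocaleString('en-US', { timeZone: 'America/Chicago' })}`;
  updateBtn.disabled = false;
  announce('Report has been updated');
}

export function reportFailed() {
  updateBtn.disabled = false;
  announce('Report update failed');
}
```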

Architecture Notes

  • Generating reports should be processed in a queue-based system like BullMQ (we have a pro license), so we can fairly prioritize generation across clients. I.e., if one client pushes two reports and then another client pushes one report, we should prioritize the other client's first report over the first client's second (rough sketch below).
  • We should build this so we can easily scale the workers doing this work. I.e., if we need to generate more reports in parallel, we'll be able to do that just by spinning up more workers for the task. This helps us scale.
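
Since we have a BullMQ Pro license, here's a rough sketch of fair, per-client queueing using Pro's group feature; the queue name, job payload, and worker body are placeholders, not the real implementation:

```ts
// Sketch only: fair, per-client report generation with BullMQ Pro groups.
import { QueuePro, WorkerPro } from '@taskforcesh/bullmq-pro';

const connection = { host: 'localhost', port: 6379 };

// One queue for all report jobs; each client gets its own group, so BullMQ
// round-robins between groups instead of draining one client's backlog first.
const reportQueue = new QueuePro('report-generation', { connection });

export async function enqueueReport(clientId: string, reportId: string) {
  await reportQueue.add('generate', { reportId }, { group: { id: clientId } });
}

// Workers scale horizontally: every extra process running this worker pulls
// jobs from the same queue, so more workers means more reports in parallel.
new WorkerPro(
  'report-generation',
  async (job) => {
    // The real processor would run the expensive queries and persist the report.
    console.log(`Generating report ${job.data.reportId}`);
  },
  { connection, concurrency: 2 },
);
```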

API Changes

  • /get/results?reportId=... returns generated report data
  • /get/charts?reportId=... returns generated report data
  • /update/reports triggers generation of a report
  • /add/reports triggers generation of a report (an illustrative example call is below)
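
Purely illustrative example of how the dashboard might call these endpoints; the base URL, HTTP methods, and response shape are assumptions:

```ts
// Illustrative only: base URL, HTTP methods, and response shape are assumptions.
const API_BASE = 'https://api.example.com';

// Reads pre-generated report data instead of running live queries.
export async function getGeneratedResults(reportId: string) {
  const res = await fetch(`${API_BASE}/get/results?reportId=${encodeURIComponent(reportId)}`);
  if (!res.ok) throw new Error(`Report ${reportId} failed to load: ${res.status}`);
  return res.json();
}

// Kicks off queued regeneration; the dashboard shows "Report is updating"
// until the new data is ready, rather than blocking on this call.
export async function triggerReportUpdate(reportId: string) {
  await fetch(`${API_BASE}/update/reports?reportId=${encodeURIComponent(reportId)}`, {
    method: 'POST',
  });
}
```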

bbertucc avatar Aug 26 '24 16:08 bbertucc

@heythisischris I added two notes under "Architecture Notes". We'll need to make sure we're thinking about scaling and about our clients getting a fair go at generating their reports. Please let me know your budget and additional development details! We're currently slated to have a budget Monday so we can get this in the next sprint cycle.

bbertucc avatar Aug 31 '24 09:08 bbertucc

@bbertucc Got it! Will have a budget posted by Monday.

heythisischris avatar Aug 31 '24 17:08 heythisischris

@heythisischris FYI- feel free to respond to this with a solution that keeps the existing structure, while also eliminating the problem of reports taking too long to load.

I realize that having live data is a feature in itself. I don't think it's a feature that many people want, but if you have a quicker way of solving the underlying problem, there's no reason we shouldn't just do that. We can talk about whatever you come up with at today's contributor meeting.

bbertucc avatar Sep 02 '24 10:09 bbertucc

@heythisischris Continuing on that last comment: for instance, we could move "expensive" filters that require lots of processing to the frontend, running them as filters on JSON instead of database queries. For example, "status" and "tag" filters could be applied while a user is looking at a report page instead of when a user is editing filters in the backend. We could also cache or save results from an API query, so we don't load all the results every time a user comes to the page, and then load new queries in the background on an interval based on how long a user has been on a page and when they last visited it (for example: while a user is on a page, we run the API result query in the background every five minutes; and if a user visits a page they haven't viewed in the last five minutes, we also run the API results query).

So we're:

  • Rebuilding the API to only accept query filter parameters that are "easy" to load - i.e., messages, URLs, properties (are those even easy to load?)
  • Rebuilding the frontend to fetch API results in the background and to save data instead of streaming it.
  • Rebuilding the frontend to trigger that background task every 5 minutes while a user is on a page, and when a user visits a report page they haven't visited in the last 5 minutes.
  • Rebuilding the dashboard to run the other parameters (tags, status, etc.) as filters on JSON instead of as additional API query parameters - this may also involve UX updates that bring those filters directly onto the report page?? (rough sketch after this list)
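
A rough sketch of that flow; the endpoint path, response shape, and result fields are assumptions for illustration only:

```ts
// Sketch of the "save results, refresh in the background, filter on JSON" idea.
const FIVE_MINUTES = 5 * 60 * 1000;

type Result = { url: string; message: string; status: string; tags: string[] };

let cachedResults: Result[] = [];
let lastFetched = 0;

// Only the "easy" filters (messages, URLs, properties) go to the API.
async function fetchResultsInBackground(reportId: string) {
  const res = await fetch(`/get/results?reportId=${encodeURIComponent(reportId)}`);
  cachedResults = (await res.json()).results ?? [];
  lastFetched = Date.now();
}

export function onReportPageVisit(reportId: string) {
  // Refresh on visit only if the saved data is older than five minutes...
  if (Date.now() - lastFetched > FIVE_MINUTES) void fetchResultsInBackground(reportId);
  // ...and keep refreshing every five minutes while the user stays on the page
  // (the interval should be cleared when the user leaves the page).
  setInterval(() => void fetchResultsInBackground(reportId), FIVE_MINUTES);
}

// The "expensive" filters run on the saved JSON instead of as extra DB queries.
export function filterResults(status?: string, tag?: string): Result[] {
  return cachedResults.filter(
    (r) => (!status || r.status === status) && (!tag || r.tags.includes(tag)),
  );
}
```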

Would something like that create near real time results on huge amounts of data?

Trying to think of clever ways to save time and achieve our end goal of avoiding long loading times when loading reports. That's the key problem.

bbertucc avatar Sep 02 '24 10:09 bbertucc

Another possibility is to refactor the way we get results. Instead of using the results endpoint, /scan could return results. Those results could be saved, and then the frontend's job is just to filter and display the saved JSON. We would have to deal with URLs that were deleted; for those, I would just return blank results.

So that approach would mean we:

  • make sure the API deals with pages that are deleted or re-added. When deleted, return blank data. When re-added, test against previously equalified data and return all results.
  • rebuild the API to return results of a property from /scan. Currently I think we're returning results for each URL, but we should return each property, right?
  • rebuild the dashboard to save the scan results data
  • rebuild the dashboard to use that saved data to display the saved JSON of selected properties (rough sketch after this list)
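
A rough sketch of the save-then-filter idea; the data shapes and helper names are illustrative assumptions, not the real /scan schema:

```ts
// Sketch only: save /scan results per property, then filter/display the saved JSON.
type ScanResult = { url: string; propertyId: string; issues: unknown[] };

// Saved once per property when /scan completes.
const savedScans = new Map<string, ScanResult[]>();

export function saveScanResults(propertyId: string, results: ScanResult[]) {
  savedScans.set(propertyId, results);
}

// Deleted pages return blank data instead of stale issues; re-added pages
// would be re-tested and their full results stored again.
export function resultsForProperty(propertyId: string, deletedUrls: Set<string>): ScanResult[] {
  const results = savedScans.get(propertyId) ?? [];
  return results.map((r) => (deletedUrls.has(r.url) ? { ...r, issues: [] } : r));
}

// The dashboard's job is just to filter and display the saved JSON.
export function displayProperty(propertyId: string, deletedUrls: Set<string>) {
  for (const r of resultsForProperty(propertyId, deletedUrls)) {
    console.log(`${r.url}: ${r.issues.length} issues`);
  }
}
```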

From a business angle, this approach is great because it truly puts the weight on scans as our main processing cost and so simplifies what we're charging customers (we only charge per scan).

Maybe that's the best approach?

Would it be quicker to build and solve our problem of loading huge datasets?

This is an important issue to solve so I'm throwing lots of ideas here to make sure we get it right.

Our work can also inform more common sense UX for future versions.

bbertucc avatar Sep 02 '24 11:09 bbertucc

@heythisischris @azdak came up with another option: database optimization. We could just optimize the database so we get relatively realtime processing and filtering. But yeah, lots of options! Curious to hear what you think is the best solution to the issue of slow-loading reports.

bbertucc avatar Sep 02 '24 16:09 bbertucc

On 9/5 call, we discussed:

  • Making a smaller dictionary of messages and tags, so that users are sharing data instead of having their own. That should speed up inserts and queries.

We'll bounty that one task @heythisischris for now, then check in after you do that. Please add a budget here.

bbertucc avatar Sep 05 '24 16:09 bbertucc

I think this might be related because I am using four filters, but y'all will know better.

I have a five page sitemap. And I am using four filters, "Properties (mine)", "Types: Error", "Tags: wcag2a", and "Tags: wcag2aa".

When I am on the Reports page and click "View Report" it will not load sometimes. Sometimes it takes 90 seconds. After it loads (assuming it is getting cached) it loads quickly.

Expected behavior would be it loading in under 5 seconds. If that is not possible it needs a progress bar. I just assumed it was down. It might have just been slow.

Is this related to filters?

joelhsmith avatar Sep 05 '24 17:09 joelhsmith

Yeah @joelhsmith. We currently have to do lots of queries to display reports. Tags, like "WCAG2A", are particularly taxing.

Fixing this issue is a priority. @heythisischris is going to make a smaller dictionary of all the tags and messages to try to minimize the size of the query.

bbertucc avatar Sep 05 '24 17:09 bbertucc

Hey @bbertucc, here's a gameplan and budget:

Overview

We need to optimize our getResultsAll query to speed up loading reports; we can achieve this by removing redundant data from our database. Messages and tags seem like good candidates because they are shared equally by all users.

This should reduce the database size, speed up indexing, and generally improve performance.

Steps

  • [ ] Remove user_id column from messages and tags
  • [ ] Adjust INSERT queries inside of processScans script and remove user_id specificity
  • [ ] Adjust SELECT queries inside of getResultsAll script and remove user_id specificity (rough sketch of both query changes below)
  • [ ] Look for additional optimizations (this could include disabling row level security, removing user_id enforcement via Hasura permissions, studying the cost of the SQL query, etc).
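
A rough sketch of what the adjusted queries could look like, assuming a Postgres client and a unique constraint on the shared messages dictionary; table and column names here are assumptions, not the actual schema:

```ts
// Sketch only: shared message dictionary without user_id scoping.
import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the standard PG* env vars

// processScans: upsert a message with no user_id, so all users share one row.
// Assumes a UNIQUE constraint on messages(message).
export async function upsertMessage(message: string): Promise<number> {
  const { rows } = await pool.query(
    `INSERT INTO messages (message)
     VALUES ($1)
     ON CONFLICT (message) DO UPDATE SET message = EXCLUDED.message
     RETURNING id`,
    [message],
  );
  return rows[0].id;
}

// getResultsAll: results are still scoped to the user, but the joined
// message dictionary no longer filters on user_id.
export async function getResultsAll(userId: string) {
  const { rows } = await pool.query(
    `SELECT r.*, m.message
     FROM results r
     JOIN messages m ON m.id = r.message_id
     WHERE r.user_id = $1`,
    [userId],
  );
  return rows;
}
```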

Budget

The requested bounty is $500 for roughly 8 hours of development spread across 4 days ($62.50/hour).

Timeline

I intend to have this completed/validated by Friday the 13th 🩻 (September 13th, 2024).

heythisischris avatar Sep 06 '24 05:09 heythisischris

@heythisischris Would implementing object caching via Memcached/Redis be viable options here? If the reports are more or less the same until a new one is generated, there is no reason to keep hammering the database. ElastiCache is a fully managed AWS solution for this.

alexstine avatar Sep 06 '24 17:09 alexstine

@alexstine Definitely a viable option. I'm not sure if we want to go all-out and provision a managed ElastiCache instance (minimum $90/mo)... I'm thinking about starting with Redis on EC2 ($10/mo), or even rolling our own simple caching solution using DynamoDB and leveraging time-to-live to purge expired rows. Will include this effort within the scope of the budget.

  • Strategy 1: Redis on EC2
    • Provision a t4g.micro instance and install Redis.
    • Expose Redis endpoint to equalify-api Lambda function & leverage for all result queries.
  • Strategy 2: Custom caching via DynamoDB
    • Provision an equalify DynamoDB table with pk and sk as primary and sort keys.
    • Inside of getResultsAll, attempt to find cached data with the following identifier:
      • pk: USER#abc, sk: RESULT#xyz, data: {"results": []}
    • If no result found, store new cached data with ttl set to 1 hour in the future.

The only big limitation with the DynamoDB strategy is storing results larger than 400 KB (DynamoDB's item size limit). There are simple strategies for chunking the data across several rows by leveraging the sort key, like so: RESULT#xyz#PAGE#0, RESULT#xyz#PAGE#1, etc.
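
A hedged sketch of Strategy 2 with the chunking approach, using the AWS SDK v3 document client; the table and key names follow the comment above, and the rest (chunk size, TTL handling) is assumed:

```ts
// Sketch only: DynamoDB result cache with TTL and sort-key chunking.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand, QueryCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = 'equalify';
const CHUNK_SIZE = 350_000; // keep each item comfortably under the 400 KB limit

export async function getCachedResults(userId: string, resultId: string) {
  // Fetch every chunk for this result: RESULT#xyz#PAGE#0000, #PAGE#0001, ...
  const { Items } = await ddb.send(new QueryCommand({
    TableName: TABLE,
    KeyConditionExpression: 'pk = :pk AND begins_with(sk, :sk)',
    ExpressionAttributeValues: { ':pk': `USER#${userId}`, ':sk': `RESULT#${resultId}` },
  }));
  if (!Items?.length) return null; // cache miss: run getResultsAll, then cache it
  return JSON.parse(Items.map((i) => i.data).join(''));
}

export async function cacheResults(userId: string, resultId: string, results: unknown) {
  const json = JSON.stringify(results);
  // TTL purging requires the table's TTL setting to point at the "ttl" attribute.
  const ttl = Math.floor(Date.now() / 1000) + 3600;
  for (let page = 0; page * CHUNK_SIZE < json.length; page++) {
    await ddb.send(new PutCommand({
      TableName: TABLE,
      Item: {
        pk: `USER#${userId}`,
        // Zero-padded page index so chunks come back in order from the query.
        sk: `RESULT#${resultId}#PAGE#${String(page).padStart(4, '0')}`,
        data: json.slice(page * CHUNK_SIZE, (page + 1) * CHUNK_SIZE),
        ttl,
      },
    }));
  }
}
```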

heythisischris avatar Sep 06 '24 18:09 heythisischris

@heythisischris If you take the EC2 approach, try not to use the T2/T3 instance families. They suffer on network bandwidth, and caching needs to be quick. Maybe go with one of the memory-optimized instance families. They cost more but should still be cheaper than ElastiCache. Make sure all your connections happen over a private VPC subnet for best performance.

alexstine avatar Sep 06 '24 19:09 alexstine

@heythisischris approved.

Thanks @alexstine for expert additional advice. Feel free to open another ticket with budget to focus on any of the optimizations that @heythisischris isn't. All efforts to speed up reports are definitely worth a bounty.

bbertucc avatar Sep 06 '24 20:09 bbertucc

Contributor call notes: @heythisischris is making progress. Aiming to have it ready middle of this week.

bbertucc avatar Sep 09 '24 17:09 bbertucc

Contributor Call Update: @heythisischris is still working on this. Hoping to have it by our Thursday technical review meeting.

bbertucc avatar Sep 17 '24 16:09 bbertucc