Question Regarding Cleanup of Feeds
Hello all,
I was wondering how to handle feeds being trimmed. For example, in the BaseFeed class, any call to add an item may call self.trim() at random, which trims only that feed's timeline storage. So my question is: once a feed has been trimmed, how do you handle a user scrolling far enough into the past to reach the trimmed region? It seems like those activities are lost for good at that point.
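For reference, a minimal paraphrase of that trim-on-add pattern (not Stream-Framework's exact code; the class name and values are illustrative):

```python
import random

class FeedSketch(object):
    """Paraphrase of the trim-on-add pattern, not Stream-Framework's code."""
    max_length = 1000   # illustrative cap on the timeline's length
    trim_chance = 0.01  # each add has a small random chance of trimming

    def __init__(self):
        self.timeline = []  # stand-in for the Redis timeline storage

    def add(self, activity):
        self.timeline.insert(0, activity)  # newest first
        if random.random() <= self.trim_chance:
            self.trim()

    def trim(self):
        # Anything past max_length is gone for good unless it is also
        # backed up somewhere else, which is exactly my question.
        del self.timeline[self.max_length:]
```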
My idea was to keep a SQL backup of the activities and, if we detect that the user has scrolled too far back in time, start loading activities from SQL. This seems quite onerous to implement, but there must be some solution that preserves the activities while still trimming the feeds to save memory (we're using Redis, by the way).
Is there anyone who has dealt with this?
A long time ago we used to support something like this. We used a fallback storage (e.g. pulling from an RDBMS) and special activities to mark feeds as trimmed (so that we only hit the fallback storage when data had actually been trimmed). We removed that from the library because it was very hard to make it work in a generic way. Let me know if you need help, or even better, if you have something we can merge in ;)
Thanks for your comments @tbarbugli! I would love to hear more, especially from @tschellenbach or anyone who has implemented such a thing. For example, how does GetStream.io do it (if that is not giving away the secret sauce)?
Hi!
We don't use this approach for getstream.io; storage space isn't much of a concern when using Cassandra.
We did it in the past with Redis though. The basic approach is:
1. Create an end-of-feed marker activity class.
2. Override how the data is loaded: start by reading from Redis and, if there are no more results (and no feed-end marker), read from the database and insert that data back into Redis.
Be sure to use primary-key-based filtering when paginating the database results. Limit/offset-based pagination doesn't work well with large datasets and heavy queries.
Cheers, Thierry
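A minimal, self-contained sketch of steps 1 and 2 plus the primary-key pagination point. Every name here is hypothetical (a dict stands in for Redis and sqlite3 for the database backup); this is not Stream-Framework's API:

```python
import sqlite3

FEED_END = "feed_end_marker"  # step 1: sentinel marking the true end of a feed
redis_timeline = {}           # user_id -> list of activity ids, newest first

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, user_id INTEGER)")

def read_feed(user_id, before_id, limit=25):
    """Step 2: serve from 'Redis' first; on a miss, backfill from the database."""
    cached = [a for a in redis_timeline.get(user_id, [])
              if a == FEED_END or a < before_id][:limit]
    if len(cached) == limit or FEED_END in cached:
        return [a for a in cached if a != FEED_END]

    # Keyset (primary key) pagination: "WHERE id < ?" stays fast at any depth,
    # whereas LIMIT/OFFSET rescans every skipped row on each page.
    oldest = cached[-1] if cached else before_id
    rows = db.execute(
        "SELECT id FROM activity WHERE user_id = ? AND id < ? "
        "ORDER BY id DESC LIMIT ?", (user_id, oldest, limit - len(cached)))
    backfill = [r[0] for r in rows]
    redis_timeline.setdefault(user_id, []).extend(backfill)  # insert back into "Redis"
    return cached + backfill
```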
@tschellenbach Thanks for the response.
When you say "We don't use this approach for getstream.io, the storage space isn't much of a concern when using Cassandra", is it because it's easily scalable, versus Redis which can only use memory?
Yes, Cassandra makes it easy to store many TB of data. Redis is great, but only uses memory, which is expensive.
@tschellenbach I am also running into this issue.
Because of memory limitations, I am considering trimming the feeds and the global activities. The basic idea is:
- Randomly trim the aggregated feed to max_length. This is what Stream-Framework already does.
- Create a periodic Celery task that walks the global activity table and deletes rows whose creation time is older than some cutoff (a sketch follows this list).
- When accessing a feed and hitting a reference to a deleted activity, remove that activity from the feed.
- When the user wants to view more aggregated activities that are no longer in the feed, try to retrieve them from the database and add them back into the aggregated feed.
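A minimal sketch of that periodic task (step 2), assuming a Django-style `Activity` model with a `created_at` column; the model, import path, and 90-day cutoff are all hypothetical:

```python
from datetime import timedelta

from celery import shared_task
from django.utils import timezone

from myapp.models import Activity  # hypothetical model backing the global activities

RETENTION = timedelta(days=90)  # illustrative retention window

@shared_task
def purge_old_activities(batch_size=1000):
    """Delete global activities older than the retention window."""
    cutoff = timezone.now() - RETENTION
    # Delete in bounded batches so a large backlog cannot hold long locks
    # or blow up memory in a single call.
    while True:
        ids = list(Activity.objects
                   .filter(created_at__lt=cutoff)
                   .values_list('id', flat=True)[:batch_size])
        if not ids:
            break
        Activity.objects.filter(id__in=ids).delete()
```

Scheduled from Celery beat (e.g. nightly), this keeps the table bounded without touching the hot write path.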
Moving to Cassandra is also an option for me, but before that I would like to try the idea above. Do you have any suggestions or caveats for the steps above, or for the move to Cassandra? I have some concerns about Cassandra's maturity.
Thanks
@microelec
The Redis approach can work. Cassandra is indeed much harder to work with. (Eventually you'll likely end up switching though, at least if your app keeps growing.)
Note that for step 4 you should implement some sort of feed-end marker (see the answers above); otherwise you'll hit the database continuously whenever someone reaches the end of their feed.
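Building on the hypothetical read-path sketch earlier in the thread (same made-up `db`, `redis_timeline`, and `FEED_END` names), that guard could look like this: once a database backfill comes back short, append the marker so the next end-of-feed read is answered from Redis alone.

```python
def backfill_from_db(user_id, oldest, wanted):
    # Uses the same hypothetical stores as the earlier sketch in this thread.
    rows = db.execute(
        "SELECT id FROM activity WHERE user_id = ? AND id < ? "
        "ORDER BY id DESC LIMIT ?", (user_id, oldest, wanted))
    backfill = [r[0] for r in rows]
    if len(backfill) < wanted:
        # The database has nothing older either: record the true end of the
        # feed so future reads stop at Redis instead of re-querying SQL.
        backfill.append(FEED_END)
    redis_timeline.setdefault(user_id, []).extend(backfill)
    return backfill
```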
Just for the record, the Cassandra 1.2 and 2.0 branches are very mature; you can use them as storage for Stream-Framework without any issues.