Provide verifiable proof that this algorithm is in production
**Is your feature request related to a problem? Please describe.**
There is currently no way to tell if this repo is actually being used in production on Twitter. If the goal is actually to "open source the-algorithm", to provide transparency and allow the public to review it, then it is meaningless to post a bunch of code without proving that it is being used.
**Describe the solution you'd like**
- Versioning of the algorithm, complete with a cryptographic hash, should be displayed somewhere on Twitter that is public and easy to find.
- The cryptographic hash is generated for each release so that any member of the public can verify that it is in production (see the sketch after this list).
- Verification does not need to be doable by a non-technical person, but instructions for verification should be provided, and it must be a credible, trustless process.
- Alternatively, but far less ideally, a trusted, independent third party verifies whatever version/SHA is claimed to be in use and outlines how they did it.
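A minimal sketch of what that public verification could look like, assuming Twitter published a SHA-256 digest of the exact source archive it deploys on a public, easy-to-find page (the digest placeholder and archive name below are hypothetical):

```python
# Hedged sketch: verify a downloaded release archive against a published digest.
# PUBLISHED_SHA256 is a hypothetical value copied from a public Twitter page.
import hashlib
import sys

PUBLISHED_SHA256 = "<digest from the public page>"

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large archives don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    digest = sha256_of_file(sys.argv[1])  # e.g. the-algorithm-v1.2.3.tar.gz
    print("computed:", digest)
    print("match" if digest == PUBLISHED_SHA256 else "MISMATCH")
```

Note that a published digest only proves what was *released*, not what is actually *running* in production; closing that gap is the genuinely hard, trustless part.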
First issue opened on this repo that is actually worth something lol
Besides that, ideally we should be able to change the algorithm used in settings by specifying a link to a repository and a commit SHA. Might be very difficult to implement though. But we could then point specifically at this repository and compare whether it gives similar recommendations.
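Purely as illustration, here is a hypothetical sketch of what such a setting and its client-side validation could look like (the field names and rules are made up, not an actual Twitter API):

```python
# Hypothetical sketch of a user-selectable algorithm setting: a repo URL plus a
# pinned commit SHA. Field names and validation rules are assumptions.
import re

COMMIT_SHA_RE = re.compile(r"^[0-9a-f]{40}$")  # full 40-hex git SHA-1

def validate_algorithm_setting(setting: dict) -> bool:
    """Reject anything that isn't an https repo URL pinned to a full commit SHA."""
    repo = setting.get("repo_url", "")
    sha = setting.get("commit_sha", "")
    return repo.startswith("https://") and bool(COMMIT_SHA_RE.fullmatch(sha))

setting = {
    "repo_url": "https://github.com/twitter/the-algorithm",
    "commit_sha": "0" * 40,  # placeholder; would be a real pinned commit
}
print(validate_algorithm_setting(setting))  # True
```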
real
> Besides that, ideally we should be able to change the algorithm used in settings by specifying a link to a repository and a commit SHA. Might be very difficult to implement though. But we could then point specifically at this repository and compare whether it gives similar recommendations.
This is most likely impossible to do in a sustainable way, as deploying an entire set of microservices is already really costly by itself. Sure, Twitter may have (and, to be honest, probably already has) multiple versions based on multiple commits in prod that they can make you use transparently, but it's not really intended to be user-controllable. You'd have to keep all the different versions of all the different publicly released commits online, which would definitely have a huge cost for Twitter, not to mention that it is pretty inefficient and does not scale as well as a globally used, "single" algorithm version, at least in the current state of things.
Also, Twitter keeping control of the algorithm's usage is kind of a good thing, since it permits them to A/B test things completely transparently without disturbing the current process. Giving more choice to users sure is a good thing, but letting the engineers keep control of it will absolutely help get things done the way they usually are.
If a feature like this were actually in production, the people who would use it would most likely be more technical and have a better general understanding of what happens behind the social network, which is especially true with an open-sourced algorithm like this one. As you might have noticed, internal logging, analytics, and experiments are a crucial part of R&D. Those technical users could introduce some kind of bias into the results of those logs and analytics, since keeping track of such things across multiple, different versions of a single project will cause some trouble. Some commit might use a slightly tweaked ML model, some other commit might use different constants... You get the point. It's more likely that the "main" production algorithm would be the only one actually used for analytics.
Moving from the (admittedly) edge case I just mentioned to a more general standpoint, giving users the ability to use a completely different algorithm repo could actually be dangerous. Some bad actor might use this capability to distribute biased algorithms and, with some kind of social engineering technique, get non-technical users to switch to them. This raises a huge security concern, which is absolutely crucial to consider. Take the example of politics: we've already seen Facebook and the Cambridge Analytica scandal. This could be abused in a similar manner, which would absolutely not be acceptable.
With all that said, I do agree that it could be great to provide a technical yet practical way to use a different algorithm. Sure, it might be abused in some way or another, but that already happens, just not as easily as having direct source code access to the algorithm, which is also not complete by itself. This raises many security and usability questions, but it could be interesting to actually think about it and provide solutions one day or another.
I'm afraid it's going to be really complicated if it's ever under consideration, though, both from the technical challenges we mentioned and the security concerns such a feature would bring. I can't really think of a practical technical solution like that, but I tried to share my thoughts about your suggestion anyway. Apologies for the length and possible confusion, or even the bad takes I might have; I'm a sleep-deprived human, of all things lol.
Actually, yeah do this pls
I was going to open an identical ticket. As I am not an expert in such questions, I don't know if this is even possible. Here are my ideas:
- Zero-knowledge proof (ZKP) cryptography is a type of cryptography that tries to address exactly this kind of issue: how can a person prove to anyone that she has some information without revealing any bit of it? You have to build a proving/verification setup for that particular information, where you will have a prover key, a verifier key, and prover and verifier schemes. The prover key is something you have to keep private; the verifier key is publicly available. The actual process:
  - The proving scheme accepts two types of inputs: private and public. The prover has to supply it with the actual information (private), the proving key (private), and auxiliary data (public). The output of this process is a cryptographic proof that can be passed to the verifier.
  - The verifying scheme accepts the following data: the cryptographic proof from the prover, the verifier key, and the auxiliary data (the public input from the proving step). The output is true if the prover has the information, false if not.
  - Fortunately, a proving scheme can be built in a way that produces some useful result besides the proof. For example, you can build a hash algorithm (SHA) that produces a digest and a proof, which means you can verify that the produced digest was created by this particular hash algorithm, as it is part of the proving/verification scheme.
  - Unfortunately, this type of application requires a more generic proving scheme. Some researchers came up with the concept of TinyRAM: http://www.scipr-lab.org/doc/TinyRAM-spec-0.991.pdf. In short, it is an emulation of a RISC processor using ZKPs: the general idea is that you can supply it with assembler code and it will produce the result along with a proof for verification. Here is an implementation: https://github.com/scipr-lab/libsnark/tree/2af440246fa2c3d0b1b0a425fb6abd8cc8b9c54d/libsnark/relations/ram_computations/rams/tinyram. Ideally, for this particular issue, I assume Twitter's apps would have to be built by a ZKP-aware compiler, giving you application binaries plus a proof for verification. (See the sketch below for what the basic prove/verify flow looks like.)
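To make the prover/verifier split concrete, here is a minimal, self-contained sketch of a non-interactive zero-knowledge proof (a Schnorr proof made non-interactive with the Fiat-Shamir transform). It proves knowledge of a secret exponent x such that y = g^x mod p without revealing x. The group parameters are toy values chosen for illustration, not anything production-grade:

```python
# Minimal non-interactive zero-knowledge proof sketch (Schnorr + Fiat-Shamir).
# Toy parameters for illustration only -- not secure for real use.
import hashlib
import secrets

p = 2**127 - 1  # a Mersenne prime; the group is Z_p^* (toy choice)
g = 3           # generator (toy choice)
q = p - 1       # exponents can always be reduced mod p - 1

def fiat_shamir_challenge(*values) -> int:
    """Derive the challenge by hashing the public transcript (Fiat-Shamir)."""
    data = b"|".join(str(v).encode() for v in values)
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

def prove(x: int):
    """Prover: show knowledge of x with y = g^x mod p, without revealing x."""
    y = pow(g, x, p)                    # public value
    k = secrets.randbelow(q)            # fresh secret nonce
    r = pow(g, k, p)                    # commitment
    c = fiat_shamir_challenge(g, y, r)  # challenge from the transcript
    s = (k + c * x) % q                 # response
    return y, (r, s)

def verify(y: int, proof) -> bool:
    """Verifier: check g^s == r * y^c (mod p), recomputing the challenge."""
    r, s = proof
    c = fiat_shamir_challenge(g, y, r)
    return pow(g, s, p) == (r * pow(y, c, p)) % p

secret_x = secrets.randbelow(q)
y, proof = prove(secret_x)
print(verify(y, proof))  # True: verifier is convinced without learning secret_x
```

TinyRAM-style systems generalize this from proving knowledge of a single secret to proving correct execution of an arbitrary program, which is what verifying a deployed algorithm would require.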
zkEVM is probably the most advanced ZKP application at the moment: https://github.com/0xpolygonhermez. This virtual machine executes blockchain smart contracts.
The problem here is that the proving process is very resource-expensive. In other words, the more code your program has, the more time and memory it takes to prove its execution. In Twitter's case, I have doubts it is feasible right now. There is some academic research trying to solve this problem by implementing chips tailored to run with ZKPs (RISC processors that produce proofs of execution).
PS: A couple of years ago I was building such a proving/verification scheme for video transcoding. From my personal experience, I can say it is really f___ hard.
This would be largely solved by the suggestions in #1342, though all these methods should really be combined for validation.
In practice though, having enough companies running tests to validate that their content will not be downranked funds and incentivizes the process of validation.
Correct me if I'm wrong, but since they licensed the code under AGPL, they must continue open-sourcing the algorithm code if it's put into production. They could relicense and modify it before making it live, but that would require approval from every contributor, post-release.
The simplest fix would be to merge minimal, comment-only changes from random pull requests, creating multiple contributors as a sign of good faith? Or just wait till they start merging more front-end-visible changes from community PRs.
> Correct me if I'm wrong, but since they licensed the code under AGPL, they must continue open-sourcing the algorithm code if it's put into production. They could relicense and modify it before making it live, but that would require approval from every contributor, post-release.
>
> The simplest fix would be to merge minimal, comment-only changes from random pull requests, creating multiple contributors as a sign of good faith? Or just wait till they start merging more front-end-visible changes from community PRs.
No one would know.
As the owners of the copyright, they do not require a license to use it, or they may dual-license. Though once they use any contribution from GitHub, you would be correct.
> No one would know.
Yeah no, I feel you. GPL violations run rampant. This won't be an issue if the merges are verifiable/visible on the frontend. But as @stealthpaladin mentioned, since they aren't really merging a lot of pull requests yet, it might very well be a case of dual licensing.
Although, I just read through the CLA... all PRs have to surrender their copyright, so yeah, AGPL would only apply to people who want to fork it for whatever reason. Twitter would retain the rights to change it in production.
> This would be largely solved by the suggestions in #1342, though all these methods should really be combined for validation.
>
> In practice though, having enough companies running tests to validate that their content will not be downranked funds and incentivizes the process of validation.
This probably feels like the way to go.
These are great ideas, but honestly Twitter isn't even close to needing them. As of this writing, the last commit was on May 22, and it was documentation-only. The activity in this repo shows that this is not the code being used for deploys; they just copy-pasted a bunch of code into this repo. They have not done the work to make their systems modular enough to deploy the product with this open-source algorithm as a dependency. That is not easy, to be sure! However, if I saw all the commits in near real time (maybe via a daily sync job, if Twitter doesn't want to add a GitHub dependency), it would be like a costly proof of work that the code in this repo at least matters. Of course, there might be other aspects of the algorithm that are not open-sourced, and I wouldn't know, but that is a HARD problem compared to actually deploying the code in here.
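For what it's worth, a minimal sketch of what such a daily sync job could look like, assuming an internal remote with push rights to the public repo (both remote URLs and the working directory below are hypothetical):

```python
# Hypothetical daily sync job: mirror internal commits to the public repo so the
# public history tracks what is actually deployed. URLs/paths are assumptions.
# A real job would also filter branches and scrub internal-only refs first.
import os
import subprocess

INTERNAL_REMOTE = "git@git.internal.example.com:feeds/the-algorithm.git"  # hypothetical
PUBLIC_REMOTE = "git@github.com:twitter/the-algorithm.git"
WORKDIR = "/var/tmp/the-algorithm-mirror"

def run(*args: str) -> None:
    subprocess.run(args, check=True)

def sync_once() -> None:
    # Clone a bare mirror on first run, then refresh and push it publicly.
    if not os.path.exists(WORKDIR):
        run("git", "clone", "--mirror", INTERNAL_REMOTE, WORKDIR)
    run("git", "-C", WORKDIR, "remote", "update", "--prune")
    run("git", "-C", WORKDIR, "push", "--mirror", PUBLIC_REMOTE)

if __name__ == "__main__":
    sync_once()  # run from cron/CI once a day
```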