Figure out a way to host private content
Much of the content people will host is their public content, just in a more convenient format.
But some of the content will be private (e.g. unpublished notes or drafts). People will want to have that available, but only to a small list of people.
Figure out some way to protect that content.
Just thinking out loud about how this might work in a way that is simple and easy to get started... but might smoothly morph into something more complex in the future if people need it.
Each chunk in a library can have an access_tag. A missing or empty access_tag means it's public. Typically people have one private access tag, called private, but theoretically they could have multiple. Everything around access_tags defaults to using the tag private unless another one is provided.
When a library is opened, an access_tag can be passed to the constructor, which automatically adds that access_tag to all chunks inside. Those access tags flow with the chunks if they are merged into a larger library. As a convenience, if the filename path a library is opened with includes access/FOO, the constructor adds access_tag=foo unless access_tag is explicitly passed to the constructor as ''.
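A rough sketch of how the constructor could resolve its tag, just to make the rules above concrete. The function names are hypothetical, and lowercasing the path segment is an assumption taken from reading the access/FOO to access_tag=foo example literally. The True shorthand mirrors the checklist item further down.

```python
from typing import Optional, Union

DEFAULT_PRIVATE_ACCESS_TAG = "private"


def access_tag_from_path(path: str) -> Optional[str]:
    """Return the tag named by an access/FOO/ directory segment, or None."""
    parts = path.replace("\\", "/").split("/")
    dirs = parts[:-1]  # ignore the final file name
    for i, part in enumerate(dirs[:-1]):
        if part == "access" and dirs[i + 1]:
            # Assumption: the tag is the lowercased directory name.
            return dirs[i + 1].lower()
    return None


def resolve_access_tag(
    path: str, access_tag: Union[str, bool, None] = None
) -> Optional[str]:
    """Decide which access_tag (if any) the chunks of a library should get."""
    if access_tag is True:
        return DEFAULT_PRIVATE_ACCESS_TAG
    if access_tag == "":
        return None  # explicitly public, even under access/FOO/
    if access_tag:
        return access_tag
    return access_tag_from_path(path)
```

Here None stands for "public", matching the rule that a missing or empty access_tag means the chunk is public.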
There is an access.SECRET.json in the root directory, with a .gitignore that ignores any files matching *SECRET*. It is structured like:
{
  //This defaults to "private" if not explicitly set.
  "default_private_access_tag": "private",
  "tokens": {
    //user_vanity_id can be any user-understandable name, like "Alex" or an email address
    <user_vanity_id>: {
      "token": <access_token>,
      //The access_tags this token is allowed to access in this library. If it is omitted it defaults to `["private"]`
      "access_tags": ["private"]
    }
  }
}
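To make the structure concrete, here is a minimal sketch of how a host might turn that file into a token lookup at boot. It assumes the real file is plain JSON without the explanatory comments, and that a token with no access_tags falls back to the file's default_private_access_tag. load_token_index is a made-up name.

```python
import json
from typing import Dict, Set


def load_token_index(raw_json: str) -> Dict[str, Set[str]]:
    """Map each access_token to the set of access_tags it may read."""
    config = json.loads(raw_json)
    default_tag = config.get("default_private_access_tag", "private")
    index: Dict[str, Set[str]] = {}
    for _vanity_id, entry in config.get("tokens", {}).items():
        # Assumption: omitted access_tags means "just the default private tag".
        tags = entry.get("access_tags", [default_tag])
        index[entry["token"]] = set(tags)
    return index


example = """
{
  "default_private_access_tag": "private",
  "tokens": {
    "Alex": {"token": "token-for-alex", "access_tags": ["private", "drafts"]},
    "Sam": {"token": "token-for-sam"}
  }
}
"""
token_index = load_token_index(example)
```

An unknown token simply has no entry in the index, so lookups for it grant nothing.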
There is a convenience script to add items to this file: python3 -m access.main add <user_vanity_id> <access_tag_to_add_1> <access_tag_to_add_2> ... . You can omit the <access_tag_to_add> arguments and it will default to <default_private_access_tag>. When you run the command it generates a cryptographically secure random string as the token for that user. It also prints the string to the console for transmitting to the user, and reminds the owner of the library to redeploy for the changes to take effect. (There's also a script to revoke tokens for a user.) That file is just uploaded as part of the App Engine app, and when the host boots it reads access.SECRET.json into memory to figure out which access_tokens allow access to libraries marked up with which access_tags. In the future maybe that access list is fetched from GFS or something so it doesn't have to be redeployed?
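The core of the add and revoke scripts could look something like the sketch below. It works on the already-parsed config dict and leaves out argument parsing and file I/O. The grant and revoke function names and the 32-byte token length are assumptions, not what the script necessarily does.

```python
import secrets
from typing import Iterable, Optional


def grant(config: dict, user_vanity_id: str,
          access_tags: Optional[Iterable[str]] = None) -> str:
    """Create a fresh token for the user and return it for printing."""
    default_tag = config.get("default_private_access_tag", "private")
    tags = list(access_tags) if access_tags else [default_tag]
    # secrets (not random) gives a cryptographically secure token.
    token = secrets.token_urlsafe(32)
    config.setdefault("tokens", {})[user_vanity_id] = {
        "token": token,
        "access_tags": tags,
    }
    return token


def revoke(config: dict, user_vanity_id: str) -> bool:
    """Remove the user's token; return whether there was one to remove."""
    return config.get("tokens", {}).pop(user_vanity_id, None) is not None
```

After either call the script would write the config back out and remind the owner to redeploy.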
The library owner provides that token to the user in some way. The user adds it to their client_config.SECRET.json, associated with that server endpoint. Now their client will automatically send it to that server endpoint as the access_token parameter.
library.query() includes an access_token parameter. The token is looked up in the access information to see which access_tags it grants access to, and any chunks whose access_tag the token does not cover are filtered out.
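The filtering step might be as simple as the following sketch. It assumes chunks are dicts with an optional "access_tag" key and that the host has a token-to-tags mapping in memory. Both shapes are illustrative, not the actual Library internals.

```python
from typing import Dict, List, Optional, Set


def visible_chunks(chunks: List[dict],
                   token_index: Dict[str, Set[str]],
                   access_token: Optional[str] = None) -> List[dict]:
    """Keep public chunks plus any private ones the token is allowed to see."""
    # Unknown or missing tokens grant nothing, so the filter fails closed.
    allowed = token_index.get(access_token, set()) if access_token else set()
    return [
        chunk for chunk in chunks
        if not chunk.get("access_tag") or chunk["access_tag"] in allowed
    ]


chunks = [
    {"text": "published essay"},
    {"text": "rough draft", "access_tag": "private"},
    {"text": "team notes", "access_tag": "team"},
]
index = {"token-for-alex": {"private"}}
```

With these inputs, a query without a token sees only the published essay, while a query with token-for-alex also sees the rough draft but not the team notes.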
There should also be a way for people querying the library endpoint to explicitly filter down to only content that includes a given access_tag, or that excludes all but a given access_tag.
Just thinking out loud, this is probably overly complicated (we could probably get away with private just being a bool if we wanted for now).
The purpose of the library hosts is to offer up chunks to be remixed by the completion AI, but not to be directly scraped and shown to users. We don't want them to be used as a convenient way to slurp up people's content for some other purpose, especially for content that is access restricted.
Ultimately there's no good way to fully do that (that I've come up with after thinking about it for a little bit) without having the remixing client be hosted somewhere trusted--but even then a query prompt injection could extract context.
So we can't make it impossible, but we can make it so that anyone who tries to just scrape private chunks would have to obviously know they were doing something unsupported by fighting the system, so no one can credibly claim "I had no idea I wasn't allowed."
I don't know how you'd do that either. Maybe something like having the code that accepts a library from a server and then does the remixing hashes its own python file it booted from and checks that it's a known hash before proceeding? Ultimately we're just trying to make it sufficiently hard to defeat that you have to obviously know you're doing something against the wishes of the content authors, but fundamentally it's not possible to guarantee anything in a federated system that has to see the cleartext.
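For what it's worth, the self-hash idea could be sketched like this. It is only a speed bump, since anyone can edit the check out, which is exactly the limitation described above. The function names and the idea of shipping a set of known hashes are assumptions.

```python
import hashlib
from pathlib import Path
from typing import Set


def file_sha256(path: str) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_known_build(path: str, known_hashes: Set[str]) -> bool:
    """True if the file at path matches one of the published hashes."""
    return file_sha256(path) in known_hashes
```

The remixing client would call is_known_build(__file__, KNOWN_HASHES) before accepting a library from a server and refuse to proceed on a mismatch, where KNOWN_HASHES is a hypothetical list published alongside each release.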
This idea I'm really just exploring in a somewhat delirious state, this might be an actively terrible idea or unworkable or any of the above. :-D
Please make it go and then we'll mess around with it and see how this particular bicycle rides!
- [x] Library constructor takes access_tag property.
- [x] Access_tags are stripped out of the JSON before being Serialized unless include_access_tag is True.
- [x] Library() constructor has sugar for a filename that includes access/FOO/
- [x] Library handles access_tag=True --> access_tag = DEFAULT_PRIVATE_ACCESS_TAG
- [x] Create an access.SECRET.json (gitignore), and document format
- [x] Load access.SECRET.json if it exists
- [x] Allow specifying an access file different than access.SECRET.json
- [x] Library.query() filters out any items that have an access_tag in them.
- [x] Verify that access_token really works and filters out items that don't have access_tag but allows access to items that do
- [x] Medium importer can be used to generate a draft output with unpublished access tag.
- [x] Library.query() accepts an access_token and keeps any items that have it.
- [x] Client should be able to be passed a different config file
- [x] Have a client.SECRET.json file that can store access tokens
- [x] Client sends the access_token if it exists.
- [x] Create an access.host add script
- [x] Create an access.host revoke script
- [x] Add a way to pass multiple --access-tag to grant
- [x] Move the default file to host.SECRET.json
- [x] Move the tool to config.host access.grant
- [x] Make the tool have sub commands like config.host access grant
- [ ] Wire through config_file through every library constructor
- [x] Add schemas for library file, client.json, host.json (new issue)
- [ ] Add --mode {dev,prod*} which sets the default filename to host.dev.SECRET.json (prod omits any mode tag in filename)
- [ ] Consider having the host.SECRET.json be passed in via a Config object that has getters, etc
- [ ] access.host should be able to pass a different config file
- [ ] Access token should be passed as an Authorization: Bearer {token} header (update the debug query form in the GET)
- [ ] Add host.SECRET.json:restricted.message to the GET for the endpoint too
- [x] The README should document the best practice of private content without all of the overhead about access tags
- [x] There should be some way for a server to communicate that it has private content. E.g. an opt-in to details.counts.private_chunks and an access_message that is rendered in the response and in the GET endpoint
- [x] Truncate items in personal production items
- [ ] Allow access tokens to be skipped in development (easiest way is probably turning off Library constructors auto-setting access_tags).
- [ ] Allow a way to specify that a given token in host.SECRET.json provides access to all tags. But make sure it has to be explicitly set, so things fail closed.
- [ ] Allow an endpoint in host.SECRET.json that configures where the production endpoint is.
- [x] Add a set endpoint command that sets the endpoint parameter
- [x] Add a set restricted.message command
- [x] Add a set restricted.count command
- [ ] Validate that set endpoint doesn't include a trailing \
- [ ] Add a command to give an example or documentation for each property
- [x] The output of access grant should also print to output f"{endpoint}/?SECRET=sk-key-123" if endpoint is set (and Call set endpoint ... to set the endpoint to print out an easy access key)
- [ ] (Read through the description above to see what else I'm missing in these TODOs)
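For the open Bearer-token item, the host-side parsing might look roughly like this sketch. The standard HTTP header for bearer tokens is Authorization. The function name is made up, and falling back to the access_token query parameter when the header is absent is left out.

```python
from typing import Optional


def token_from_authorization(header_value: Optional[str]) -> Optional[str]:
    """Extract the token from an 'Authorization: Bearer <token>' header value."""
    if not header_value:
        return None
    scheme, _, credentials = header_value.strip().partition(" ")
    credentials = credentials.strip()
    # The scheme name is case-insensitive; anything else is not a bearer token.
    if scheme.lower() != "bearer" or not credentials:
        return None
    return credentials
```

Returning None for anything malformed means a bad header is treated the same as no token at all, so only public chunks are served.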
f95cc4896b276a4219c36d0a4639643397ee4e0a, 68e13143fe468cb1c7666e80f0cc4b2e420c9e1a, d77a4ccb6d00fcea3bdc4245fa05fc9fae0c77ad, 98108cc8d16b8e3627ba23b31f7ab04424deb865 were erroneously marked as being part of #32 but are actually part of #26.