
Discussion: Profile discoverability

Open wesbiggs opened this issue 2 years ago • 9 comments

Should Profiles be discoverable without the aid of a content indexer? A Profile link type could be specified as User Data that references a URL and hash. In effect, we would be moving Profile announcements from offline batches to User Data.

Pros:

  • Better alignment with usage patterns expected in social media.
  • Ability for applications to navigate to profile without the aid of other services.

Cons:

  • Higher storage cost for consensus system.

Rejected design option:

  • Putting user-generated "speech" content directly in a consensus system is an anti-goal of DSNP for legal risk reasons as well as technical storage requirement concerns.

Adjacent concerns:

  • The ability to derive a current URL for a Profile could be used by applications to activate Mention tags without outside knowledge; cf. #266
  • Mastodon uses Person instead of Profile, though most of the fields have overlapping semantics.

wesbiggs avatar Dec 14 '23 15:12 wesbiggs

Separating Profile links out of batched announcements carries a legal risk that is somewhere between slightly and much greater, simply because far more URLs would be stored on the consensus system. It would increase opportunities for corrupted URLs, or for URLs that become corrupting after the fact. Content indexers would ignore these URLs, and the system would potentially lose many "pairs of eyes" on problematic content.

shannonwells avatar Mar 04 '24 23:03 shannonwells

I agree that open URLs in consensus system storage are a risk. Let's consider using CIDs only (à la batch publications).

Latency in finding profile documents on an external distributed file store (e.g. IPFS) could be mitigated by allowing providers to give hints in the form of URL templates that identify their preferred gateways. (For discussion.)
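
To make the gateway-hint idea concrete, here is a minimal sketch of how an application might expand provider-supplied URL templates for a given CID. The `{cid}` placeholder syntax, the function name, and the example hosts are all assumptions for illustration; nothing in this thread defines a template format.

```python
def expand_gateway_hints(cid: str, templates: list[str]) -> list[str]:
    """Turn provider gateway templates into candidate URLs for one CID.

    Assumption: a template marks the CID position with the literal "{cid}".
    Templates without the placeholder are skipped rather than guessed at.
    """
    urls = []
    for template in templates:
        if "{cid}" not in template:
            continue  # not a usable hint
        urls.append(template.replace("{cid}", cid))
    return urls


# Placeholder CID string and hypothetical gateways, for illustration only.
candidates = expand_gateway_hints(
    "bafyexamplecid",
    [
        "https://gateway.example/ipfs/{cid}",
        "https://{cid}.ipfs.gateway.example/",
    ],
)
```

Because only the CID is stored in User Data, a stale or unreachable hint costs latency but cannot change which document is considered valid.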

wesbiggs avatar Mar 08 '24 13:03 wesbiggs

An expansion on some comments I made at the Community Call last week.

Layers

There are several layers here, and I want to lay them out again for clarity around this discussion.

  1. DSNP Id (aka the user account identifier)
  2. Discovery of the existence of a user's profile
  3. Discovery of the user's current profile document location
  4. Retrieval of the user's current profile document
  5. Validation of the user's current profile document
  6. Parsing of the user's current profile document
  7. Attributes on the user's current profile document
    • Public Existence, Public Content (currently the only type)
    • Public Existence, Content Protected (There could also be existence protected, but they wouldn't be in the profile document)
    • Additional/Conditional: Attributes that only show under some circumstances or secondary profile that overrides the "default" in some situations or such.
  8. Retrieval of references from user's current profile document
    • Avatar, etc...

Layers and this Discussion

The only layers being discussed here are really 2-3:

  2. Discovery of the existence of a user's profile
    • Current: Discovered by exhaustive search of all published Profiles
    • Proposed by @wesbiggs: Discovered by direct User Data query of the consensus system
  3. Discovery of the user's current profile document location
    • Current: Most recent profile that returns
    • Proposed by @wesbiggs: Resolved URL or none
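
The difference between the two approaches for layers 2-3 can be sketched with in-memory stand-ins. The announcement field names (`type`, `fromId`, `cid`) and the dictionary used for User Data are assumptions for illustration, and the scan omits the step of checking that the document can actually be retrieved.

```python
from typing import Optional


def find_profile_by_scan(announcements: list[dict], user_id: int) -> Optional[str]:
    """Current approach: walk every published announcement (oldest first)
    and keep the most recent Profile from this user."""
    latest = None
    for announcement in announcements:
        if announcement["type"] == "Profile" and announcement["fromId"] == user_id:
            latest = announcement["cid"]
    return latest


def find_profile_by_user_data(user_data: dict[int, str], user_id: int) -> Optional[str]:
    """Proposed approach: a single keyed lookup in User Data."""
    return user_data.get(user_id)


# Hypothetical data: user 7 published two profiles, user 9 published none.
announcements = [
    {"type": "Profile", "fromId": 7, "cid": "cid-old"},
    {"type": "Broadcast", "fromId": 7, "cid": "cid-post"},
    {"type": "Profile", "fromId": 7, "cid": "cid-new"},
]
user_data = {7: "cid-new"}
```

The scan's cost grows with the total number of announcements ever published, while the lookup's cost is independent of history.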

Constraints

As discussed above, the primary reason for not choosing direct User Data in the initial design is the cost of storage and churn of the additional data.

  • Assuming an IPFS CID, it is approximately 40 bytes per user. For 100 million users, that's about 4 GB. (Likely more, as the overhead isn't exact.)
  • 100 million users, assuming only 10% new profiles each year and 50% churn: that's 60 million transactions per year, or ~2 per second. On a blockchain with a 6-second block time, that's ~12 transactions per block used just by profile changes.

The current system is better assuming that some of that churn is able to be batched, but does have a longer term growth issue.
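
For anyone who wants to vary the assumptions, the estimates above can be reproduced with a short calculation. The function names are illustrative; the inputs (40 bytes per user, 10% new profiles, 50% churn, 6-second blocks) are the figures from this comment.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def storage_bytes(users: int, bytes_per_user: int = 40) -> int:
    """Total pointer storage, ignoring per-record overhead."""
    return users * bytes_per_user


def yearly_updates(users: int, new_fraction: float = 0.10, churn_fraction: float = 0.50) -> int:
    """Profile writes per year: newly created profiles plus changed ones."""
    return round(users * new_fraction) + round(users * churn_fraction)


def updates_per_block(updates_per_year: int, block_seconds: float = 6.0) -> float:
    """Average profile writes landing in each block."""
    return updates_per_year / SECONDS_PER_YEAR * block_seconds


users = 100_000_000
total_bytes = storage_bytes(users)         # 4_000_000_000 bytes, i.e. 4 GB
per_year = yearly_updates(users)           # 60_000_000
per_second = per_year / SECONDS_PER_YEAR   # a little under 2
per_block = updates_per_block(per_year)    # between 11 and 12
```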

wilwade avatar Mar 13 '24 17:03 wilwade

To capture a side discussion: the number of updates required for profiles to be linked from User Data is estimated to be far fewer than the number of social graph updates. In that light, this proposal doesn't change the order of magnitude of consensus system changes significantly.

wesbiggs avatar Apr 04 '24 16:04 wesbiggs

Notes from DSNP Spec Community Call 2024-05-02

  • Trend to move things into User Data from Announcement Data. We need to profile the data usage on the implementation side
    • Various options for implementation, but it makes sense from the perspective of keeping the spec separate.
  • What about private profile data?
  • What about alternate profile expressions, such as within a group?
  • Loss of the ability to know recently changed profiles
    • CID changes at least tell you about a change given you have a cached version

wilwade avatar May 02 '24 17:05 wilwade

I'll let those closer to implementations look at optimal storage solutions, but compared to graph storage and updates I don't think profile CIDs would add significant requirements.

There's a separate discussion to be had about the different types of profile data as noted in the spec call comment. I think the goal at the moment would be to cover the existing (public Activity Content) profile definition, but provide a structure to enable future profile-linked formats via the file type enumeration.

The notion of context-based profiles is a deep one and gets into the notion of identity expression via personas. DSNP currently lacks a two-level structure for this, so it is assumed that a DSNP User Id represents a single persona. Structurally, we might consider a way for a user (e.g. a human) to participate in various social networking activities with their choice of personas (and/or participate anonymously in some activities). For separation of personas to be effective, we would like to create a system that (through cryptographic techniques, say) did not enable easy correlation of multiple personas belonging to an individual by a third party. For this reason I think alternate profile expressions are out of scope for this work item, but I'd be open to further discussion.

Change visibility: DSNP requires systems to generate and emit State Change Records whenever User Data changes. While systems are not required to make historical data available, observers can still build their own history set and watch for changes that impact their caches.
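
As a rough sketch of how an observer could use those records, the class below tracks the last CID seen per user and flags users whose cached profile is out of date. The record shape (`userId`, `cid`) is an assumption for illustration; the actual State Change Record format is defined by the spec and the implementing system.

```python
class ProfileCacheWatcher:
    """Tracks which cached profiles need refetching, based on observed
    User Data changes. Record fields here are hypothetical."""

    def __init__(self) -> None:
        self._cid_by_user: dict[int, str] = {}
        self.stale: set[int] = set()

    def observe(self, record: dict) -> None:
        user_id, new_cid = record["userId"], record["cid"]
        # A CID differing from the last one seen means the document changed.
        if self._cid_by_user.get(user_id) != new_cid:
            self._cid_by_user[user_id] = new_cid
            self.stale.add(user_id)

    def mark_refreshed(self, user_id: int) -> None:
        self.stale.discard(user_id)


watcher = ProfileCacheWatcher()
watcher.observe({"userId": 7, "cid": "cid-a"})
watcher.mark_refreshed(7)
watcher.observe({"userId": 7, "cid": "cid-a"})  # unchanged, stays fresh
watcher.observe({"userId": 7, "cid": "cid-b"})  # changed, becomes stale
```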

wesbiggs avatar May 13 '24 17:05 wesbiggs

@wesbiggs from a Frequency implementation view, usually we build Frequency Schemas to be versioned via the schema identifier as that is needed to parse the particular data.

If we did this, I could see us using an on-chain data structure something like the one below, with the DSNP Profile version inside the data structure instead of outside it, as most are (and should be).

{
  type: "record",
  name: "UserProfilePointer",
  namespace: "org.dsnp",
  fields: [
    {
      name: "cid",
      type: "bytes",
    },
    {
      name: "version",
      type: "int",
    },
  ],
}
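
For a sense of the on-chain size of such a record, here is a hand-rolled sketch of Avro binary encoding for the two fields (zigzag varints for `int`/`long`, length-prefixed `bytes`). It is not a substitute for a real Avro library, and the 36-byte CID is a zero-filled placeholder, on the assumption that a binary CIDv1 with a SHA-256 multihash is 36 bytes long.

```python
def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag encoding followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag for values in the 64-bit range
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(data: bytes) -> bytes:
    """Avro bytes: length as a long, then the raw bytes."""
    return encode_long(len(data)) + data


def encode_user_profile_pointer(cid: bytes, version: int) -> bytes:
    """Avro record: fields concatenated in schema order (cid, version)."""
    return encode_bytes(cid) + encode_long(version)


placeholder_cid = bytes(36)  # zero-filled stand-in, not a real CID
payload = encode_user_profile_pointer(placeholder_cid, 1)
payload_size = len(payload)  # 38: 1 length byte + 36 CID bytes + 1 version byte
```

Under these assumptions the payload comes to 38 bytes, which is in line with the "approximately 40 bytes per user" estimate earlier in the thread.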

This record would be stored in the User data store (Stateful Storage), with a schemas-repo-style deploy config:

{
      model: profile,
      modelType: "AvroBinary",
      payloadLocation: "Paginated",
      settings: [],
      dsnpVersion: "1.x",
}

It wouldn't handle the persona issue, which one could argue should be handled at the metadata level. But even if there were a persona setup in the generalized profile (instead of in a group setting or similar), I think keeping that data in one file (with links to media such as profile pictures) still makes sense, because the profile is quite small even with several duplicated pieces of data.

Side Note: I could also see the on-chain metadata including a length value for the expected IPFS file, as Frequency already has for the IPFS Payload Location structure. This informs consumers of the data before they attempt to download extremely large files that might not actually be real profiles.
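
A consumer-side check built on that length value might look like the sketch below. The 64 KiB cap and the function names are assumptions chosen for illustration; an application would pick its own limit.

```python
MAX_PROFILE_BYTES = 64 * 1024  # assumed application-level cap, not from the spec


def should_download(declared_length: int, max_bytes: int = MAX_PROFILE_BYTES) -> bool:
    """Decide from on-chain metadata alone whether fetching is worthwhile."""
    return 0 < declared_length <= max_bytes


def matches_declared_length(data: bytes, declared_length: int) -> bool:
    """After fetching, reject documents whose size differs from the claim."""
    return len(data) == declared_length
```

The first check avoids the download entirely, and the second catches a file that does not match what the pointer promised. Neither replaces validating the content against its CID.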

wilwade avatar May 14 '24 16:05 wilwade

I'm not following, why is the version needed within the Avro?

I think including the expected byte length is a useful addition.

wesbiggs avatar May 15 '24 01:05 wesbiggs

> I'm not following, why is the version needed within the Avro?

It isn't required, but it is an optimization for the search.

Let's imagine a future where there are 10 non-backward-compatible versions. In this case, each lookup of a profile on Frequency could (assuming you wished to support all 10) require up to 10 queries to the chain. On Frequency, the map for Stateful Storage is keyed by User Id and Schema Id (then by page or item, depending on the storage type).

By shifting the version from effectively the Schema Id layer into the data layer, it allows the content version to shift as long as the metadata doesn't.
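
The query-count argument can be illustrated with a dictionary standing in for Stateful Storage, keyed by (User Id, Schema Id). The schema ids and record contents are made up for the example.

```python
from typing import Optional


def lookup_schema_per_version(
    storage: dict, user_id: int, schema_ids: list[int]
) -> tuple[Optional[dict], int]:
    """One schema per profile version: probe each schema id in turn."""
    queries = 0
    for schema_id in schema_ids:
        queries += 1
        record = storage.get((user_id, schema_id))
        if record is not None:
            return record, queries
    return None, queries


def lookup_single_schema(
    storage: dict, user_id: int, schema_id: int
) -> tuple[Optional[dict], int]:
    """Version carried inside the record: always exactly one query."""
    return storage.get((user_id, schema_id)), 1


# Hypothetical schema ids 101..110, one per incompatible profile version.
version_schemas = list(range(101, 111))
per_version_storage = {(7, 110): {"cid": "cid-x"}}
single_storage = {(7, 200): {"cid": "cid-x", "version": 10}}

_, worst_case_queries = lookup_schema_per_version(per_version_storage, 7, version_schemas)
_, single_queries = lookup_single_schema(single_storage, 7, 200)
```

In the worst case the per-version layout needs all 10 probes (here the record sits under the last schema id tried), and a user with no profile always costs 10.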

wilwade avatar May 15 '24 13:05 wilwade

I was envisioning that the spec itself might add or replace profile document types over time, hence the inclusion of the type enum value for each link record. For example, the DSNP community might start with the existing Activity Content Profile JSON doc, but later decide that something like a Solid WebID Profile is important to enable for various use cases. The user data payload could include both or either of these.

I think if we do this we keep it open for future evolution of types while not suffering version-related issues, as the core data (cid, type, and size) remains consistent regardless of the target file type.
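
A sketch of what such link records could look like to a consumer follows. The enum names and numeric values are hypothetical and are not taken from the spec; only the three core fields (cid, type, size) come from this comment.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class ProfileType(IntEnum):
    # Hypothetical values for illustration only.
    ACTIVITY_CONTENT_PROFILE = 1
    SOLID_WEBID_PROFILE = 2


@dataclass(frozen=True)
class ProfileLink:
    cid: str
    type: ProfileType
    size: int  # expected byte length of the target document


def pick_link(links: list[ProfileLink], supported: list[ProfileType]) -> Optional[ProfileLink]:
    """Return the first link whose type the application supports,
    honoring the application's preference order."""
    for wanted in supported:
        for link in links:
            if link.type == wanted:
                return link
    return None


links = [
    ProfileLink("cid-a", ProfileType.ACTIVITY_CONTENT_PROFILE, 512),
    ProfileLink("cid-b", ProfileType.SOLID_WEBID_PROFILE, 2048),
]
chosen = pick_link(links, [ProfileType.SOLID_WEBID_PROFILE])
```

An application that does not recognize a newer type simply skips that record, so adding types does not break existing readers.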

wesbiggs avatar Jun 03 '24 22:06 wesbiggs

This functionality has been integrated for DSNP 1.3.

wesbiggs avatar Jul 08 '24 21:07 wesbiggs