
Towards Atomizer: tooling for producing atomic data - convert / import / extract data

Open joepio opened this issue 4 years ago • 19 comments

If we want Atomic Data to be successful, we need easy, accessible methods for people to convert their existing data to Atomic Data.

I call this project: Atomizer.

We need to consider a couple of things:

  • Use cases: What are examples of resources users may want to import?
  • Process: What should happen when atomizing data? What does this mean for the UI / UX?
  • Architecture: How do we keep the system and implementations as modular as possible?

Use cases and examples

  • Contacts. Upload .vcf vCard files. Should deal with custom properties (see the sketch after this list).
  • Bookmarks. Users should be able to export bookmarks from popular browsers, and import them into their Atomic Server. This will probably be a manual import / upload step. Can be implemented client-side.
  • Unmapped CSV. Map the columns of the CSV to existing or new Atomic Data Properties. Maybe assign a Class to each row. Perform datatype validation.
  • Import calendar items as files. Upload / parse .ical files.
  • Sync calendar. Integrate with Google / Microsoft / Apple? Periodically sync? Requires server-side logic.
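
As a taste of what the contacts use case involves, here is a minimal sketch of the very first step of a vCard importer (splitting property lines). This is not existing atomizer code: the function name is illustrative, and real .vcf files additionally need line unfolding, parameters (TYPE=...), and escaping.

```rust
// Minimal sketch: split a vCard into (property name, value) pairs.
// Real .vcf handling needs line unfolding, parameters, and escaping;
// this only illustrates the shape of the first parsing step.
fn parse_vcard_lines(input: &str) -> Vec<(String, String)> {
    input
        .lines()
        .filter_map(|line| {
            let line = line.trim_end();
            // Skip structural markers; keep everything else as a raw property.
            if line.is_empty() || line == "BEGIN:VCARD" || line == "END:VCARD" {
                return None;
            }
            line.split_once(':')
                .map(|(name, value)| (name.to_string(), value.to_string()))
        })
        .collect()
}

fn main() {
    let card = "BEGIN:VCARD\nVERSION:4.0\nFN:Ada Lovelace\nEMAIL:ada@example.com\nEND:VCARD\n";
    for (name, value) in parse_vcard_lines(card) {
        println!("{name} = {value}");
    }
}
```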

Process

What needs to happen to convert a piece of data to Atomic Data?

  • Map the properties to Properties, convert values to supported datatypes
  • Prevent duplicate imports
  • Create a URL (subject) for the new Resource
  • Set a parent (optional, but very useful)
  • Host the URL somewhere on the web

Map the properties to Properties and convert the value

This is probably the hardest, most expensive step when dealing with arbitrary new data. The converter (person doing the atomization) will need to either find or create a Property.

  • Find: Have some sort of place on the internet where the converter can search through existing Properties (or Classes). Maybe this should be on atomicdata.dev, maybe it should be a separate website that specializes in aggregating Properties. I think users will want to know how many people are using a Property, and perhaps want to see reviews.
  • Create: Creating an Atomic Data Property means modelling it, and also hosting it, as all Atomic Data Properties should be accessible on the web. Perhaps it would be a good idea to provide some sort of hosted service for this. We can also let Atomic-Server create these properties on the fly.

We can choose to automatically generate Properties for unknown fields, or we can ask the user for input.
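
To make the find-or-create step concrete, here is a minimal sketch of how an importer might map input field names to Properties. The PropertyRegistry type and the URLs are placeholders, not part of atomic_lib; the fallback simply suggests a new Property URL that the user (or the server) could confirm.

```rust
use std::collections::HashMap;

// Hypothetical registry mapping well-known field names to URLs of existing Properties.
struct PropertyRegistry {
    known: HashMap<String, String>,
}

// Outcome of mapping one input field.
enum PropertyMapping {
    // Reuse an existing, already-hosted Property.
    Existing(String),
    // No match found: propose a new Property the user can confirm, edit, or reject.
    Proposed { suggested_url: String, name: String },
}

impl PropertyRegistry {
    fn map_field(&self, server: &str, field: &str) -> PropertyMapping {
        match self.known.get(field) {
            Some(url) => PropertyMapping::Existing(url.clone()),
            // Fallback: suggest letting the server mint a new Property under its own origin.
            None => PropertyMapping::Proposed {
                suggested_url: format!("{server}/properties/{}", field.to_lowercase()),
                name: field.to_string(),
            },
        }
    }
}

fn main() {
    let registry = PropertyRegistry {
        known: HashMap::from([(
            "email".to_string(),
            // Placeholder URL; a real registry would point at Properties hosted on the web.
            "https://example.com/properties/email".to_string(),
        )]),
    };
    for field in ["email", "favourite_color"] {
        match registry.map_field("https://example.com", field) {
            PropertyMapping::Existing(url) => println!("{field} -> reuse {url}"),
            PropertyMapping::Proposed { suggested_url, .. } => {
                println!("{field} -> propose new Property at {suggested_url}")
            }
        }
    }
}
```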

Prevent duplicate imports

  • Use hashes or timestamps
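
A minimal sketch of the hash-based idea, using only the standard library: hash a canonical form of each record and skip anything seen before. The ImportLog type is invented for illustration, and DefaultHasher is not stable across Rust versions, so persistent deduplication would want a real content hash (e.g. SHA-256) stored next to the created resources instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Tracks content hashes of already-imported records within one import run.
// Note: DefaultHasher's output may change between Rust versions, so a
// persistent importer should use a stable content hash instead.
struct ImportLog {
    seen: HashSet<u64>,
}

impl ImportLog {
    fn new() -> Self {
        ImportLog { seen: HashSet::new() }
    }

    // Returns true if the record is new; false if an identical record
    // was already imported (and should be skipped).
    fn try_import(&mut self, canonical_record: &str) -> bool {
        let mut hasher = DefaultHasher::new();
        canonical_record.hash(&mut hasher);
        self.seen.insert(hasher.finish())
    }
}

fn main() {
    let mut log = ImportLog::new();
    let records = ["FN:Ada Lovelace", "FN:Alan Turing", "FN:Ada Lovelace"];
    for r in records {
        if log.try_import(r) {
            println!("importing: {r}");
        } else {
            println!("skipping duplicate: {r}");
        }
    }
}
```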

Create URL for resource

We first need to know the origin / path of the URL.

Various strategies exist:

  • Create a UUID or a hash or something similar (easy): example.com/UUID
  • Use some specific path: example.com/document/UUID
  • Use human-readable names: example.com/document/my-document-title-2022-01. A bit tougher, since we must prevent name collisions, and need to specify which field(s) should construct the URL (see the sketch after this list)
  • Let the user decide for each input
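
A sketch of the human-readable-name strategy with a simple collision counter. The slugify and new_subject helpers are hypothetical, not part of atomic_lib, and the slugifier only handles ASCII.

```rust
use std::collections::HashSet;

// Very small slugifier: lowercase ASCII alphanumerics, everything else
// collapsed into single hyphens. Real-world titles need Unicode handling.
fn slugify(title: &str) -> String {
    let mut slug = String::new();
    let mut last_was_hyphen = true; // avoid a leading hyphen
    for c in title.chars() {
        if c.is_ascii_alphanumeric() {
            slug.push(c.to_ascii_lowercase());
            last_was_hyphen = false;
        } else if !last_was_hyphen {
            slug.push('-');
            last_was_hyphen = true;
        }
    }
    slug.trim_end_matches('-').to_string()
}

// Build a subject URL under origin/path, appending -2, -3, ... on collision.
fn new_subject(origin: &str, path: &str, title: &str, taken: &mut HashSet<String>) -> String {
    let slug = slugify(title);
    let mut candidate = format!("{origin}/{path}/{slug}");
    let mut n = 2;
    while taken.contains(&candidate) {
        candidate = format!("{origin}/{path}/{slug}-{n}");
        n += 1;
    }
    taken.insert(candidate.clone());
    candidate
}

fn main() {
    let mut taken = HashSet::new();
    for title in ["My Document", "My Document", "Another: Doc!"] {
        println!("{}", new_subject("https://example.com", "document", title, &mut taken));
    }
}
```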

Set a Parent / define the hierarchy

Parents help to give structure to data, and to set authorization / ownership. They are optional, but highly recommended. I think the Atomizer should set Parents for all created resources, so the data is owned by something. As a fallback, we could always create an import resource which is the parent of all the imported resources, and create an imports collection which is the parent of all imports. But for many cases, we should be able to find more intuitive hierarchies.

Many common existing data formats are nested, such as XML documents. In these, we often know what the parent is. But we still need some parent above each imported instance.

In the case of importing folders, it might make sense to create Folder resources and set these as parents to the resources inside.
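
A sketch of what folder import could look like under that idea: each directory becomes a Folder resource and the parent of everything directly inside it. The ImportedResource struct and URL scheme are invented for illustration, and name collisions are not handled here.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Illustrative stand-in for an imported resource: a subject URL plus a parent URL.
struct ImportedResource {
    subject: String,
    parent: String,
    is_folder: bool,
}

// Walk a directory, emitting a Folder resource per directory and making it
// the parent of everything directly inside it.
fn import_folder(dir: &Path, parent_url: &str, origin: &str, out: &mut Vec<ImportedResource>) -> io::Result<()> {
    let name = dir.file_name().and_then(|n| n.to_str()).unwrap_or("root");
    let folder_url = format!("{origin}/folder/{name}");
    out.push(ImportedResource {
        subject: folder_url.clone(),
        parent: parent_url.to_string(),
        is_folder: true,
    });

    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            import_folder(&path, &folder_url, origin, out)?;
        } else if let Some(file_name) = path.file_name().and_then(|n| n.to_str()) {
            out.push(ImportedResource {
                subject: format!("{origin}/file/{file_name}"),
                parent: folder_url.clone(),
                is_folder: false,
            });
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut resources = Vec::new();
    import_folder(Path::new("."), "https://example.com/imports", "https://example.com", &mut resources)?;
    for r in resources {
        let kind = if r.is_folder { "[folder] " } else { "" };
        println!("{kind}{} (parent: {})", r.subject, r.parent);
    }
    Ok(())
}
```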

Host the data somewhere on the web

This is something that Atomic-Server makes easy, but would otherwise be kind of a hassle.

Architecture

Some considerations:

  • There are a ton of libraries in the world that we can use, if we choose the right strategy.
  • Some users will want to import things periodically, such as a Twitter feed. This means the importer will need some sort of periodic runner and internet access.
  • A very simple GUI would be awesome.
  • It would make sense to start the import process from within Atomic Data Browser.
  • Some users may prefer a CLI interface
  • Since the Atomic Data model is quite powerful, we can also provide exporters to things like XML, CSV, JSON, whatever. It might be out of scope for this issue, but perhaps it makes sense to combine this functionality.
  • For many importers, we'll need filesystem access. This also probably means we're thinking about a desktop-first UI.
  • Being able to import a folder containing all sorts of files could be really useful.

Approaches:

atomizer Rust libraries for importers, powered by atomic_lib

I'll create a new atomizer repo, which has a CLI binary (atomizer-cli) and a whole lot of importers as library modules (e.g. atomizer-bookmarks, atomizer-vcard, etc.). Since all the importers are written as re-usable modules, we can later re-use them somewhere else. I expect to use some of these in a WASM runtime (see issue), or possibly in an atomic-server context to provide a drag-and-drop UI.

  • Rust only, so we can't re-use python / js / whatever libraries for importing.
  • CLI, GUI, libraries... This approach covers most interactions
  • Might be a bit intimidating for new developers to write for, compared to a JS / python approach.

atomizer JS libraries, powered by @tomic/lib

Similar to the approach mentioned above, but written in JS (probably typescript): atomizer repo with a lot of libraries.

  • Can run client side, in the browser
  • Can run server-side, using the wasmedge runtime, for example. Could prove to be quite complicated, though!

joepio avatar Jan 28 '22 17:01 joepio

Seems you stopped the above post in the middle of a senten

jonassmedegaard avatar Jan 29 '22 01:01 jonassmedegaard

I think this can be separated into two distinct tasks:

  • defining atomic data classes
  • using atomic data classes

Definitions involve deeper knowledge about the data being modelled, often require knowledge about related fields of knowledge for the created model to be most interoperable, and often involve programming, e.g. to express the model as an ontology (ideally reusable also for RDF outside of Atomic Data) and to code an extension library for atomizer.

Using involves knowledge of the concrete dataset, to sensibly classify especially vaguely hinted data (is "Elvis" a musician, or a text editor, or perhaps a personal teddy bear, in the context of the dataset currently being classified?).

jonassmedegaard avatar Jan 29 '22 10:01 jonassmedegaard

For the design of "atoms" (i.e. defining atomic data classes), I recommend reading the GRDDL Primer for inspiration.

jonassmedegaard avatar Jan 29 '22 11:01 jonassmedegaard

For "splitting into atoms" (i.e. using atomic data classes), I very broadly expect these "modes" of use:

  • automated:
    1. scan dataset for all known atoms
    2. classify all items detected at XX% certainty
    3. report "Detected YYY items"
    4. quit
  • interactive:
    1. scan dataset for all known atoms
    2. report "Detected YYY items, with XX% certainty"
    3. offer to either "Cancel" or "Classify detected items" or "Change certainty"
    • if "Cancel" -> quit
    • if "Classify detected items -> classify detected subset of data; quit
    • if "Change certainty" -> adjust threshold; goto scan...
  • structured:
    1. scan dataset and any related sidecars¹ for all known atoms
    2. report "Detected YYY items of ZZ types, with XX% certainty"
    3. offer to either "Cancel" or "Classify detected items" or "Change certainty" or "Save as sidecar"
    • if "Cancel" -> quit
    • if "Classify detected items -> classify detected subset of data; quit
    • if "Change certainty" -> adjust threshold; goto scan...
    • if "Save as sidecar" -> store search pattern used (which atoms to include or exclude for which reasons (i.e. what semantic rules was represented by the "certainty" value) as a "Sidecar" metadata file; quit

¹A "sidecar" is a hint file refining how to classify this kind of dataset. See darktable sidecars for inspiration of the concept of sidecars (but instead of tied only to one file, atomizer sidecars can be tied to the whole dataset or a subset (e.g. for a dataset consisting of a filesystem directory, a subset could be a subdir or a tarball, or for a dataset consisting of a website, a subset could be an upper path or a page). For the format of a sidecar, see RDF-EASE for a syntax tailored for web and XML content reusing CSS for semantic hinting. Possibly a more modern and popular approach for a sidecar format might involve SHACL (but that's just a stray idea, not thought through yet!)

jonassmedegaard avatar Jan 29 '22 11:01 jonassmedegaard

Ideas for types of atoms:

  • web activities: bookmarks, email/chat addresses, email/chat posts
    • multiple agents and/or multiple media mentioning the same topic at nearly the same time is a vague indicator of an "event" taking place
    • an explicitly declared CalDAV or iCalendar event with a topic the same (or nearly the same) as other web activity topics is a strong (or medium) indicator that they are tied to that same event
  • desktop activities: OSCAF (and related NEPOMUK and di.me)
  • music: MusicBrainz (and related ListenBrainz and more...)
  • copyright and licensing: SPDX (and related license identifiers and shortname identification syntax and REUSE)

jonassmedegaard avatar Jan 29 '22 12:01 jonassmedegaard

For coding the atomizer engine, I recommend doing it as a library usable both as a standalone tool and directly integrated with Atomic Server, and abstracting away all knowledge of specific "atoms" into topic-specific libraries, each also usable either standalone or integrated with atomizer.

  • librust-atomizer - Rust project, containing multiple Rust-centric but atom-agnostic crates

    • libatomizer - main atom-agnostic library
    • atomizer-cli - standalone command-line tool for unix-style pipeline- and FIFO-oriented use
    • atomizer-gtk - standalone graphical tool integrating with GTK
      • non-default features included as git submodule, to limit default dependency stack
    • atomizer-qt - standalone graphical tool integrating with Qt and QML
      • non-default feature included as git submodule, to limit default dependency stack
  • node-atomizer - NodeJS project, containing a single Node module

    • atomizer - NodeJS module, acting as thin wrapper for libatomizer (compiled as WASM code)
      • usable server-side with NodeJS
      • usable for in-browser processing
      • enables e.g. Electron-based tools or webapps to atomize data, either for internal use or using Atomic Server as data store via its lightning-fast JSON-AD websocket endpoint
  • librust-event-atomizer - Rust project, containing a single library and optional executable binary

    • event-atomizer - library and optional command-line tool to detect and normalize event data
      • input:
        • default: URIs denoting data objects to scan
          • default recursion: none
        • alternative (and default as tool): filesystem paths to data objects to scan
          • default recursion: none for files, 1 for directories (i.e. scanning a directory also scans contained files but not contained directories)
        • alternative: raw contents of a single data object to scan
        • [when built] as tool, input is all non-option arguments, or STDIN if none provided
      • output:
        • default: RDF serialized as JSON-AD
          • alternative serialization: Turtle
        • alternative format: iCalendar file
        • [when built] as tool, output is option --output, or STDOUT if not provided
      • includes CWL spec

By having the general library/tool load multiple *-atomizer helper libraries and extending recursion depth, relations across objects might be detected - e.g. image files with embedded EXIF hints and addresses (vCard data) with matching identifiers.
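
To make the modular layout concrete, here is a rough sketch of how the atom-agnostic core might call topic-specific helpers through one interface. The Atomize trait, DetectedItem type, and dummy helper are all invented names, not existing atomizer APIs.

```rust
// One semantic item detected by a topic-specific helper, with a certainty score
// so the caller can apply the threshold logic described in the "modes" above.
struct DetectedItem {
    class: String,        // e.g. "event", "contact"
    subject_hint: String, // where / how the item was found
    certainty: f32,
}

// Hypothetical interface each *-atomizer helper library would implement.
trait Atomize {
    // Human-readable name, e.g. "event-atomizer".
    fn name(&self) -> &str;
    // Scan one decoded object and return whatever this helper recognizes.
    fn scan(&self, content: &str) -> Vec<DetectedItem>;
}

// The atom-agnostic core fans out over all registered helpers and applies the threshold.
fn scan_with_all(helpers: &[Box<dyn Atomize>], content: &str, threshold: f32) -> Vec<DetectedItem> {
    helpers
        .iter()
        .flat_map(|helper| helper.scan(content))
        .filter(|item| item.certainty >= threshold)
        .collect()
}

// Toy helper standing in for a real event-atomizer library.
struct DummyEventAtomizer;

impl Atomize for DummyEventAtomizer {
    fn name(&self) -> &str {
        "event-atomizer (dummy)"
    }
    fn scan(&self, content: &str) -> Vec<DetectedItem> {
        if content.contains("BEGIN:VEVENT") {
            vec![DetectedItem {
                class: "event".to_string(),
                subject_hint: "inline iCalendar block".to_string(),
                certainty: 0.9,
            }]
        } else {
            Vec::new()
        }
    }
}

fn main() {
    let helpers: Vec<Box<dyn Atomize>> = vec![Box::new(DummyEventAtomizer)];
    let found = scan_with_all(&helpers, "BEGIN:VEVENT\nSUMMARY:Meetup\nEND:VEVENT", 0.5);
    for item in &found {
        println!("{} found a {} ({}) at certainty {}", helpers[0].name(), item.class, item.subject_hint, item.certainty);
    }
}
```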

jonassmedegaard avatar Jan 29 '22 14:01 jonassmedegaard

The process of "scanning" for data can (commonly, maybe always?) be divided into several steps:

  1. select
  • for simple uses without recursion, selection equals input, but when recursing through multiple objects, selecting which to bother examining and which to completely skip needs to be decided rule-based - e.g. using include/exclude regexes or globs, include/exclude MIME types, or conneg expressions
  2. decode
  • might be as simple as UTF-8 -> strings, but could be a legacy string encoding (maybe only for some objects - see recursion notes for select above), or could involve legacy double-encoded strings-in-strings (e.g. ISO 8859-1 data in otherwise UTF-8 data)
  • some data contains interleaved strings and binary parts - e.g. Postscript can contain comments as strings but also binary parts unparseable as UTF-8 (and either irrelevant, e.g. when scanning rights information; fatal, if scanning images; or important, if scanning rights information in EXIF data embedded in images embedded in encapsulated Postscript)
  • non-decodable characters might be skipped, replaced (e.g. transliterated when decoding as ASCII), or treated as failure to parse the whole object
  3. parse
  • traverse according to the rules of the data format, noting both vague and strong hints of semantically describable data
  4. qualify
  • assess collected hints and conclude what to "keep" based on format-specific qualifier rules
    • example: parsed emails may strongly qualify lots of addresses, but to exclude spam may qualify as contacts only senders from trusted domains, and recipients of emails sent from a trusted domain or by an otherwise trusted sender
    • example: copyright holders with the same name but a different or missing email address may be treated as separate agents using a strict rule but as the same agent using a loose rule - where a simple format-agnostic "threshold" level translates to applying either the loose or the strict rule
  5. structure
  • map onto the output model
  • maybe normalize (e.g. a specific order of semantically unordered items)
  6. serialize

NB! One object may contain other objects, not only by embedding (which can be handled with a variable "recursion" level), but also through multiple interleaved encodings - i.e. each object potentially needs to be decoded + parsed more than once. An SVG file may contain unstructured comments about rights, and structured RDF data about rights, and strings within rendered SVG data about rights. All three might resolve to the same semantic information, which should then (except at the utmost strict threshold) be merged into one set of semantic data; but if ambiguous or conflicting, several sets of rules may offer varying conflict resolution.
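
To make the six stages concrete, here is a toy end-to-end sketch where each stage is a trivial placeholder (the "parse" step just looks for @-containing words); a real implementation would swap in format-specific logic per stage, and the function names are only illustrative.

```rust
// A deliberately skeletal version of the six scan stages described above.

// 1. select: decide whether an object should be examined at all.
fn select(path: &str) -> bool {
    !path.ends_with(".tmp") // e.g. exclude by glob / MIME type / conneg rules
}

// 2. decode: bytes -> text; a real implementation handles legacy encodings.
fn decode(raw: &[u8]) -> Option<String> {
    String::from_utf8(raw.to_vec()).ok()
}

// 3. parse: traverse the format, collecting hints. Here: naive word scan.
fn parse(text: &str) -> Vec<(String, f32)> {
    text.split_whitespace()
        .filter(|w| w.contains('@'))
        .map(|w| (w.to_string(), 0.6)) // vague hint: "looks like an address"
        .collect()
}

// 4. qualify: keep hints that pass the format-specific rules / threshold.
fn qualify(hints: Vec<(String, f32)>, threshold: f32) -> Vec<String> {
    hints.into_iter().filter(|(_, c)| *c >= threshold).map(|(h, _)| h).collect()
}

// 5. structure: map onto the output model (here just a label + value pair).
fn structure(kept: Vec<String>) -> Vec<(String, String)> {
    kept.into_iter().map(|v| ("emailAddress".to_string(), v)).collect()
}

// 6. serialize: emit the structured result (here as one JSON-ish line each).
fn serialize(resources: &[(String, String)]) -> String {
    resources.iter().map(|(k, v)| format!("{{\"{k}\": \"{v}\"}}")).collect::<Vec<_>>().join("\n")
}

fn main() {
    let path = "mailbox.txt";
    let raw = b"contact me at ada@example.com or via the forum".to_vec();
    if select(path) {
        if let Some(text) = decode(&raw) {
            let out = serialize(&structure(qualify(parse(&text), 0.5)));
            println!("{out}");
        }
    }
}
```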

jonassmedegaard avatar Jan 29 '22 15:01 jonassmedegaard

The Notion API has a concept of a block https://developers.notion.com/reference/block, common to both pages and databases. Notion.so is also trying to collect different types of data into their repository. I think the model can be something similar to a block, with an object type ("object": "block", or contact, or bookmark) and hierarchy support ("children": [{ ... of different types). I am not sure how to capture the source repository, i.e. Twitter / Google contacts; there are "collections" in the Notion API which may serve the same purpose.

AlexMikhalev avatar Jan 31 '22 10:01 AlexMikhalev

To me, the Notion list of objects looks useful as classes for detected semantic objects.

I don't see, however, how Notion is helpful in designing atomizer processes.

I think that as atomizer input, a source repository is a piece of data that can potentially be decoded into multiple pieces of data. As atomizer output, a source repository is multiple semantic objects, depending on what is being scanned for (either by choice or by the limits of the scanners implemented):

  • a project
  • a collection of projects
    • possibly none if virtually empty
    • strongly several, if main project contains embedded projects
    • vaguely several, if project borrows or takes inspiration from other projects
  • a collection of project assets (PDFs, code files, documentation files, license grants, license texts, etc.)

Similarly, a PDF file is as atomizer input one thing, but as atomizer output it is potentially multiple things:

  • a PDF file
  • a collection of PDF metadata
    • a set of licensing and other rights constraints
  • a collection of embedded objects (images, text strings, interaction code, URIs, etc.)
    • each embedded object may itself be multiple things - e.g. an embedded image may contain an embedded ICC color calibration object which may contain rights constraints

In my work on streamlining Licensecheck I found the notion of "intermediaries" quite helpful. The problem in Licensecheck is that it scans human text where (at least with my capabilities as a programmer) there is no finite grammar, so the result of scanning is not certainly "copyright statements" and "license statements" but instead what I call "traits" - word compositions that factually exist in the text but only potentially hold the exact meaning that I am looking for.

A simpler example of the same kind of dilemma is scanning for URIs in plaintext - if only humans would always include the protocol at the beginning and wrap them all in <...>, then it would be a piece of cake to scan for them. Since humans are sloppy (leaving typos aside, even), some corner cases are ambiguous - e.g. whether trailing punctuation belongs to the URI or to the surrounding text.

Essentially we want atomizer to resolve information contained in data. What I describe with "traits" above seems to be the same thing that the proposed DAEK pyramid does differently from the more commonly used DIKW pyramid - making the reasoning step between information and knowledge explicit.

jonassmedegaard avatar Jan 31 '22 12:01 jonassmedegaard

Really interesting answer; let me think about it and come back to you after reading the references. The immediate challenge I see in such behaviour for atomizer: how do you check whether the input changed or stayed the same (like an ETag in HTTP)? Converting and extracting data will require making sure duplicates are not imported - bookmarks/contacts etc. I would approach atomizer differently: ontology - what types of objects atomizer supports, what the taxonomy between object types is - and then go into entity types. It seems to fit what you mean by "make the reasoning step between information and knowledge explicit"; that step requires a knowledge graph, which is an ontology/taxonomy/dictionary (thesaurus).

AlexMikhalev avatar Jan 31 '22 17:01 AlexMikhalev

By "a reasoning step" I don't mean formal semantics, and I think that is overkill here: What I mean is more casually some kind of internal qualifier rules (as described earlier).

I see now that my use of those pyramid models to describe how to get from atomizer input to atomizer output may be confusing: I take inspiration from formalized models of a larger world, but apply it to an internalized world of what happens inside atomizer - which I envision can be mostly internalized even further into specialized libraries for each output model. I think it is not necessary to design formal language for how those internal processes behave.

...but, as I introduced my previous post, I think it might make sense to define formal language for the output models.

jonassmedegaard avatar Jan 31 '22 17:01 jonassmedegaard

The reason I want atomizer modularized yet not formalized further internally is ~~that I expect that~~ to allow for relatively quickly covering a larger set of output models - and it allows for implementing competing libraries covering the same output models - e.g. I could write a dirty, crappy library to detect legal rights in source code, and you, @AlexMikhalev, could write a competing one applying formal reasoning and AI knowledge and whatever.

Or - as loosely discussed in a videomeeting recently - @joepio might write some competing libraries that were more secure (executing within a sandboxed environment) but more difficult to compile or execute (involving bleeding-edge Rust code compiled into wasm binaries).

jonassmedegaard avatar Jan 31 '22 17:01 jonassmedegaard

I think one of the most important requirements for me is to prevent the double import of bookmarks and contacts. Also, I found https://github.com/vaimee/dasi-breaker - what is the difference, and can we learn from them?

AlexMikhalev avatar Feb 02 '22 10:02 AlexMikhalev

Atomizer targets Atomic Data, which is more generic than the Titanium JSON-LD targeted by DASI Breaker - i.e. this project is far more generic, as it targets all atoms, not only the relatively rare titanium. </joke>

Thanks for sharing DASI Breaker - seems they are also in an initial speculation phase with no concrete code yet. From a (far too) quick view it seems to be network-based - and I fear that it will be heavy (I have seen too many EU-sponsored projects become too huge and inaccessible for my liking). I will certainly have a closer look, and if nothing else it might help clarify how this project is different.

jonassmedegaard avatar Feb 02 '22 10:02 jonassmedegaard

I think one of the most important requirements for me is to prevent the double import of bookmarks and contacts.

Great point. As an ideal I agree, but I don't expect it to be possible to ensure - or more accurately, if ensured, we limit ourselves from processing certain types of sources.

I think that atomizer should not try to detect duplicates across scans at all, because that would require access to existing knowledge: it would change from a simple "data in -> information out" flow with a write-only connection to an Atomic Server, into a "data + pre-existing knowledge in -> new knowledge out" flow which (apart from the complexity bloat within atomizer itself) would require a read-write connection and a two-way dialogue with the Atomic Server backend.

Instead, atomizer should try to collect relevant data points allowing consumers of the atomizer output (notably Atomic Server) to recognize and evaluate how to deal with duplicates (e.g. through semantic reasoning); a rough sketch of such hints follows the examples below.

Examples:

  • UUID:
    • iCalendar and vCard data identify objects using a UUID, where ideally a newer instance of the same UUID replaces any older ones.
      • timestamps might be wrong, missing, or using a "floating" timezone -> the notion of "older" is unreliable
    • some files optionally contain UUIDs in metadata (e.g. MusicBrainz IDs for music files), where ideally a newer instance with the same UUID replaces any older ones.
      • there is no guarantee that the newest scan is of the newest edit of the content, or that the content really is the same (e.g. two persons can have identical core music files but with different additional metadata added).
  • path:
    • some files evolving over time (e.g. a text-processing document) could be identified by path, where ideally a newer instance at the same path replaces any older ones.
      • filesystem timestamps might be wrong, and execution time might or might not be reliable as a fallback, depending on whether the scans are known to be done in order (is it a weekly scan of the same live system, or scans of old storage disks done in potentially random order?)
      • sometimes only relative path is stable - e.g. when stored on a USB thumb drive that may appear at some random sequential mount point.
  • content checksum:
    • some files may move around, but as long as their content is exactly the same, metadata about them stays the same
      • beware of files that may change content without changing semantic meaning - e.g. different serialization
      • beware of files whose location in a filesystem may (partly) affect information about them
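
A rough sketch of what such data points could look like on an emitted resource; the IdentityHint and EmittedResource names (and the example values) are invented for illustration, not existing atomizer or Atomic Server types.

```rust
// Identity hints the atomizer could attach to an emitted resource, so the
// consumer (e.g. Atomic Server) can decide what counts as a duplicate.
#[derive(Debug)]
enum IdentityHint {
    // UUID taken from the source data (iCalendar UID, MusicBrainz ID, ...).
    Uuid(String),
    // Path the object was found at, plus whether it is absolute or relative.
    Path { path: String, absolute: bool },
    // Checksum of the raw content (algorithm named so it can be re-verified).
    ContentChecksum { algorithm: String, hex: String },
    // Source timestamp, explicitly marked unreliable when it might be wrong.
    Timestamp { unix_seconds: i64, reliable: bool },
}

#[derive(Debug)]
struct EmittedResource {
    subject: String,
    hints: Vec<IdentityHint>,
}

fn main() {
    let event = EmittedResource {
        subject: "https://example.com/event/team-meeting".to_string(),
        hints: vec![
            IdentityHint::Uuid("4f9e6c2a-0000-0000-0000-000000000000".to_string()),
            IdentityHint::Timestamp { unix_seconds: 1_643_811_600, reliable: false },
        ],
    };
    // The atomizer only reports; recognizing this as a duplicate of an existing
    // resource is left to the consumer, as argued above.
    println!("{event:?}");
}
```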

jonassmedegaard avatar Feb 02 '22 14:02 jonassmedegaard

Also, I found https://github.com/vaimee/dasi-breaker

I was mistaken that DASI Breaker lacks code: they have a minimal viable product, available in git submodules.

I guessed correctly that DASI Breaker is implemented as a service, and a relatively heavy one at that. But regardless, it is quite an interesting and inspiring case!

what is the difference and if we can learn from them?

The difference is that DASI Breaker is multiple storage-backed and networked agents exchanging a multitude¹ of information, whereas atomizer is a single information exchange from one² constrained³ agent producing output consumable by an Atomic Server.

Interesting, I think, to try turning it around: what would it look like if the developers of DASI Breaker had prioritized resource constraints and built their framework around Atomic Server and Atomizer? How many of their intermediate agents might then be sensibly absorbed into either of those, and which would still remain?

I guess @joepio would excitedly want to implement SEPA into Atomic Server, and I would try to convince him that it should be a separate (or at least separable) component - with the reasoning that such a service might run under different agency (e.g. a public service, where Atomic Server might be behind a firewall) and also might not need the same storage resources as a full-blown Atomic Server (where "full-blown" is noticeable within our measure of constraints but dismissible for someone operating multiple dockerized Java virtual machines with PostGIS, Redis, and MySQL backends).

¹ Some DASI Breaker information exchanges use SPARQL, some SQL, and some NGSI-LD (which seems derived from but incompatible with JSON-LD - as a competing implementation describes it: "an extended subset of JSON-LD"). For comparison, Atomizer exchanges JSON-AD, which is a subset of JSON-LD.

² I encourage organizing atomizer code as multiple semi-reusable libraries, but that is an implementation detail.

³ I want it possible to compile atomizer as a host-integrated binary executed as a unix-style command-line shell tool; @joepio wants it possible to build atomizer in a WASM binary executed in a JavaScript sandbox.

jonassmedegaard avatar Feb 02 '22 21:02 jonassmedegaard

Related project on curating interoperable shapes: https://shaperepo.com/

jonassmedegaard avatar Feb 22 '22 14:02 jonassmedegaard

Related collection of foo-to-RDF tools: https://www.w3.org/wiki/ConverterToRdf

jonassmedegaard avatar Mar 04 '22 10:03 jonassmedegaard

We now have a functioning Importer class and a new JSON-AD publishing spec, which definitely helps realize the rest of the ambitions.

joepio avatar Aug 24 '22 10:08 joepio