Ontogen

DCAT-R, Gno, and RDF.ex 3.0

2026-03-19T10:00:00+00:00

It has been almost a year since the last project update, and a lot has happened behind the scenes. The most substantial part of the current development phase is not finished yet - but in the course of this work, several sub-projects have emerged that are independently useful and ready for release today.

As described in the previous roadmap update, the development required extracting and generalizing several foundational components. Today, I am pleased to announce three of these:

Gno - a library for managing RDF datasets in SPARQL triple stores
DCAT-R - a specification, vocabulary, and Elixir implementation for describing RDF repositories
RDF.ex 3.0 - with the new RDF.Data.Source protocol for polymorphic RDF data access

Each of these grew out of Ontogen’s internals but has been designed to stand on its own. In the following sections, I will introduce each project, explain where it comes from, and highlight what is new.

Gno

Gno is a library for managing RDF datasets in SPARQL triple stores. The name “Gno” comes from the Greek root for “knowledge” (as in gnosis). It provides a unified API that abstracts the differences between storage backends, so you can work with your data the same way regardless of the underlying store. Built-in adapters are available for Apache Jena Fuseki, Oxigraph, QLever, and Ontotext GraphDB, and any other SPARQL 1.1-compatible store can be configured with explicit endpoint URLs. Gno also normalizes behavioral differences between stores - for instance, it transparently handles the divergent default graph semantics (isolated vs. union) across backends.

Readers of the Repository and Service Model article will recognize the store-related parts of Gno. That article introduced the og:Store concept and the overall service architecture that connects a repository with a storage backend. Gno is, in essence, an extraction of this store adapter system and the surrounding data management operations into an independent library. What was previously only available as part of Ontogen - the store adapter abstraction, the SPARQL operation API, the changeset and configuration system - is now usable on its own, without any dependency on Ontogen’s versioning machinery.

Gno covers all standard SPARQL operations - SELECT, ASK, CONSTRUCT, DESCRIBE queries as well as INSERT, DELETE, and graph management operations (CREATE, DROP, CLEAR, COPY, ADD, MOVE). Beyond raw SPARQL, it provides two higher-level systems:

A changeset system for expressing structured changes through four actions: add (insert new statements), update (property-level overwrite), replace (subject-level overwrite), and remove (delete statements). Before applying changes, a changeset can be converted to an effective changeset that queries the current state and computes only the minimal changes actually needed - statements that already exist are not added again, and statements that do not exist are not removed.
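To make the idea of an effective changeset more tangible, here is a minimal Python sketch. It is not Gno’s actual (Elixir) API: statements are plain tuples, `Changeset` and `effective_changes` are hypothetical names, and the overwrite semantics are my reading of the four actions (update drops other values of the same subject and property, replace drops all other statements about the same subject).

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Changeset:
    """Hypothetical stand-in for a changeset with the four actions."""
    add: set[Triple] = field(default_factory=set)
    update: set[Triple] = field(default_factory=set)
    replace: set[Triple] = field(default_factory=set)
    remove: set[Triple] = field(default_factory=set)


def effective_changes(changeset: Changeset, current: set[Triple]) -> tuple[set[Triple], set[Triple]]:
    """Return the minimal (insertions, deletions) relative to the current state."""
    wanted = changeset.add | changeset.update | changeset.replace

    # update: property-level overwrite, so other objects of the same (s, p) go away
    updated_props = {(s, p) for s, p, _ in changeset.update}
    # replace: subject-level overwrite, so all other statements about s go away
    replaced_subjects = {s for s, _, _ in changeset.replace}

    overwritten = {
        t for t in current
        if (t[0], t[1]) in updated_props or t[0] in replaced_subjects
    }
    # Only delete what actually exists, and never delete a re-asserted statement.
    deletions = ((changeset.remove & current) | overwritten) - wanted
    # Only insert what does not exist yet.
    insertions = wanted - current
    return insertions, deletions
```

In a real store, `current` would be the result of a query against the affected subjects rather than an in-memory set.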

The commit system

The main addition that Gno brings beyond what existed in Ontogen is an extensible commit system. While the store operations and changeset system were extracted largely unchanged, the commit system is a new layer designed from the start to support middleware-based extensibility.

The commit processor implements a state machine that orchestrates the application of changes through well-defined phases:

graph LR
    A[init] --> B[preparing]
    B --> C[prepared]
    C --> D[starting\ntransaction]
    D --> E[applying\nchanges]
    E --> F[changes\napplied]
    F --> G[ending\ntransaction]
    G --> H[finalizing]
    H --> I[completed]

    B -.->|error| J[rollback]
    C -.->|error| J
    D -.->|error| J
    E -.->|error| J
    F -.->|error| J
    G -.->|error| J

    style A fill:#e8e8e8,stroke:#333,stroke-width:2px
    style I fill:#c2f0c2,stroke:#333,stroke-width:2px
    style J fill:#f0c2c2,stroke:#333,stroke-width:2px

At each state transition, the configured middleware pipeline is invoked. Middleware components can participate in every phase of the commit lifecycle - they can validate changes before they are applied, enrich the commit with additional metadata, add supplementary changes to other graphs, or perform cleanup after completion. If an error occurs at any point during the transactional phases, the processor automatically rolls back all changes.
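The interplay of phases and middleware can be sketched in a few lines of Python. This is only an illustration of the pattern, not Gno’s implementation: the phase names follow the diagram, while `run_commit`, the dict-based commit, and the two example middlewares are assumptions. The sketch also only signals a rollback; a real processor would undo the changes in the store.

```python
from typing import Callable

# Phase names taken from the diagram; everything else is illustrative.
PHASES = [
    "init", "preparing", "prepared", "starting_transaction", "applying_changes",
    "changes_applied", "ending_transaction", "finalizing", "completed",
]
# Per the diagram, errors in these phases lead to a rollback.
ROLLBACK_PHASES = set(PHASES[1:7])

Middleware = Callable[[str, dict], None]  # called as middleware(phase, commit)


def run_commit(commit: dict, pipeline: list[Middleware]) -> str:
    """Walk through all phases, invoking every middleware at each transition."""
    commit.setdefault("log", [])
    for phase in PHASES[1:]:
        try:
            for middleware in pipeline:
                middleware(phase, commit)
        except Exception as error:
            if phase in ROLLBACK_PHASES:
                commit["log"].append(f"rollback after error in {phase}: {error}")
                return "rollback"
            raise
    return "completed"


def logger(phase: str, commit: dict) -> None:
    # Logging middleware: record every phase that is entered.
    commit["log"].append(f"entered {phase}")


def reject_empty(phase: str, commit: dict) -> None:
    # Validation middleware: refuse commits without changes before anything is applied.
    if phase == "preparing" and not commit.get("changes"):
        raise ValueError("no changes")
```

Because every middleware sees every phase, a single component can validate early, add metadata in the middle, and clean up at the end.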

Middleware is configured declaratively in the service manifest. For example, to enable commit logging:

@prefix gno:  .

 a gno:CommitOperation
    ; gno:commitMiddleware (  )
.

 a gno:CommitLogger
    ; gno:commitLogLevel "debug"
    ; gno:commitLogChanges true
.

This middleware architecture is the primary extension point through which higher-level systems build on Gno. Ontogen, for instance, implements its entire versioning logic - creating commit objects, updating the history graph, advancing the repository HEAD - as Gno commit middleware. This means Ontogen’s versioning is not a separate mechanism layered on top; it participates directly in Gno’s transactional commit lifecycle, with full rollback support.

For more details, see the Gno User Guide or its API documentation.

DCAT-R

The Repository and Service Model article introduced the idea of modeling Ontogen repositories as DCAT catalogs and Ontogen instances as DCAT services. The og:Repository was defined as a DCAT catalog containing the user dataset and the history graph; the og:Service combined a repository with a store backend.

During the subsequent development, this pattern kept recurring. Gno, the store management library introduced above, needed the same kind of structure. So did other projects in the pipeline. In each case, the application was an RDF infrastructure service - providing generic capabilities like store access, versioning, or identity management over RDF datasets - and in each case, the same organizational questions arose: How are graphs organized? Which are user data, which are configuration, which are operational infrastructure?

What these applications share is that they leverage RDF’s universality not just for the user data they manage, but also for their own configuration and metadata. The repository description, the service settings, the graph organization - it is all RDF, stored as named graphs alongside the user data. When application structure and user data coexist in the same dataset, the need for principled organization naturally arises.

DCAT-R (Data Catalog Vocabulary for RDF Repositories) addresses this need. It is a language-independent specification of a vocabulary extending the W3C’s Data Catalog Vocabulary (DCAT) 3, alongside an Elixir implementation (DCAT-R.ex).

Where DCAT focuses on an external perspective - cataloging datasets for discovery, describing service endpoints for consumers - DCAT-R adds an intra-service perspective: vocabulary for how a service organizes its data internally. It models this internal structure using DCAT’s own concepts: a repository is a dcat:Catalog, each graph is a dcat:Dataset, the service remains a dcat:DataService. This means existing DCAT tooling can process DCAT-R descriptions without any knowledge of the DCAT-R vocabulary - it simply sees catalogs containing datasets served by data services.

DCAT-R works on two levels. At its simplest, it provides vocabulary for describing RDF datasets at the graph level - classifying graphs by purpose, organizing them into directories, attaching metadata. But it is also designed as a foundation for application frameworks: applications extend DCAT-R by subclassing dcatr:Service with their own operations, adding application-specific dcatr:SystemGraph subclasses for operational data, and extending the manifest with application-specific configuration. DCAT-R provides the organizational skeleton; applications fill it with their operations.

The four-level hierarchy

DCAT-R models RDF repositories through a four-level hierarchy, each level refining a DCAT 3 concept:

Service         (what you can do)
 └── Repository (what you have - distributable)
      └── Dataset   (the user data)
           └── Graph     (individual RDF graphs)

Service (dcatr:Service, extends dcat:DataService): The operations layer. A service provides access to a repository and defines what operations are available.
Repository (dcatr:Repository, extends dcat:Catalog): A managed collection that bundles an RDF dataset with operational infrastructure and catalog metadata. Analogous to a software repository that combines content with build scripts, configuration, and metadata.
Dataset (dcatr:Dataset, extends dcat:Catalog): The actual RDF 1.1 dataset - the user data that the repository manages, modeled as a catalog of its constituent data graphs.
Graph (dcatr:Graph, extends dcat:Dataset): An individual RDF graph carrying its own metadata.

Multi-graph support

The original Ontogen model only supported a single graph. DCAT-R supports two patterns for organizing data graphs within a repository:

Multi-graph pattern: dcatr:repositoryDataset links to a dcatr:Dataset catalog containing multiple data graphs. An optional dcatr:repositoryPrimaryGraph designates one graph as the main entry point.
Single-graph shortcut: dcatr:repositoryDataGraph links directly to a single data graph, which also serves as the primary graph.
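How a consumer might resolve the data graphs and the primary graph under both patterns can be sketched in Python. The repository description is reduced to a plain dict keyed by the property names mentioned here (plus dcatr:dataGraph from the diagram further below); the function names are assumptions for illustration and not part of DCAT-R.ex.

```python
from typing import Optional


def data_graphs(repository: dict) -> list[str]:
    """List the data graphs for either pattern."""
    if "repositoryDataGraph" in repository:            # single-graph shortcut
        return [repository["repositoryDataGraph"]]
    dataset = repository.get("repositoryDataset", {})  # multi-graph pattern
    return list(dataset.get("dataGraph", []))


def primary_graph(repository: dict) -> Optional[str]:
    """The single data graph doubles as primary graph; otherwise it is optional."""
    if "repositoryDataGraph" in repository:
        return repository["repositoryDataGraph"]
    return repository.get("repositoryPrimaryGraph")
```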

This lays the groundwork for multi-graph support in Ontogen.

Distribution boundary

A key architectural addition is the clear separation between distributed data and local data:

graph TD
    A[dcatr:Service] -->|dcatr:serviceRepository| B(dcatr:Repository)
    A -->|dcatr:serviceLocalData| C(dcatr:ServiceData)

    B -->|dcatr:repositoryDataset| D[dcatr:Dataset]
    B -->|dcatr:repositoryManifestGraph| E[dcatr:RepositoryManifestGraph]
    B -->|dcatr:repositorySystemGraph| F[dcatr:SystemGraph\n- distributed -]

    D -->|dcatr:dataGraph| G[dcatr:DataGraph 1]
    D -->|dcatr:dataGraph| H[dcatr:DataGraph n]

    C -->|dcatr:serviceManifestGraph| I[dcatr:ServiceManifestGraph]
    C -->|dcatr:serviceWorkingGraph| J[dcatr:WorkingGraph]
    C -->|dcatr:serviceSystemGraph| K[dcatr:SystemGraph\n- local -]

    style A fill:#ccd1e0,stroke:#333,stroke-width:4px
    style B fill:#d1c2f0,stroke:#333,stroke-width:2px
    style C fill:#f0e6c2,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    style D fill:#f0e6ff,stroke:#333,stroke-width:2px
    style E fill:#f0e6ff,stroke:#333,stroke-width:2px
    style F fill:#f0e6ff,stroke:#333,stroke-width:2px
    style G fill:#fff,stroke:#333,stroke-width:1px
    style H fill:#fff,stroke:#333,stroke-width:1px
    style I fill:#fff5e0,stroke:#333,stroke-width:1px
    style J fill:#fff5e0,stroke:#333,stroke-width:1px
    style K fill:#fff5e0,stroke:#333,stroke-width:1px

The Repository contains everything that is part of the distribution: the dataset with its data graphs, the repository manifest graph with DCAT catalog metadata, and distributed system graphs (e.g., version history, provenance). When the repository is replicated or shared, all of this travels together.

ServiceData contains everything local to a particular service instance: the service manifest graph with instance-specific configuration, working graphs for temporary data, and local system graphs (caches, logs). Service data is never distributed.

This separation enables multi-instance deployments where different service instances serve the same repository with different configurations or storage backends.
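As a rough illustration of the boundary, the following Python sketch models the two sides as separate data classes and derives the set of graphs that would travel with a repository. The class and field names mirror the concepts in the diagram but are otherwise hypothetical; this is not how DCAT-R.ex represents them.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Everything that is part of the distribution."""
    data_graphs: list[str]
    manifest_graph: str
    system_graphs: list[str] = field(default_factory=list)  # e.g. version history


@dataclass
class ServiceData:
    """Everything local to one service instance."""
    manifest_graph: str
    working_graphs: list[str] = field(default_factory=list)
    system_graphs: list[str] = field(default_factory=list)  # e.g. caches, logs


@dataclass
class Service:
    repository: Repository
    local_data: ServiceData


def graphs_to_distribute(service: Service) -> list[str]:
    """Only repository graphs cross the distribution boundary."""
    repo = service.repository
    return [*repo.data_graphs, repo.manifest_graph, *repo.system_graphs]
```

Two `Service` values can share the same `Repository` while carrying different `ServiceData`, which is the multi-instance scenario described here.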

Graph naming

In an RDF dataset, graph names are dataset-local identifiers - they are not inherently suited as global identifiers in a distributed context. When a repository is replicated or shared, this becomes a problem: which names are stable, globally meaningful identities, and which are just local conventions of a particular service instance?

DCAT-R addresses this by consistently separating a graph’s graph ID - its RDF resource URI, serving as a globally stable identifier - from the local graph name under which it appears in a particular service’s RDF dataset. Following the distribution boundary principle, graph IDs belong to the repository (distributed), while local graph names belong to the service configuration (local). By default, DCAT-R uses the graph ID as the graph name. When a different local name is needed, the dcatr:localGraphName property allows defining one in the service manifest.
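The default rule and its override fit in a small Python sketch. Here `local_names` stands in for the dcatr:localGraphName assignments of a service manifest; the function names and the uniqueness check are assumptions made for illustration.

```python
def local_graph_name(graph_id: str, local_names: dict[str, str]) -> str:
    """Return the name under which a graph appears in this service's dataset."""
    # Default: the graph ID is used as the graph name.
    return local_names.get(graph_id, graph_id)


def name_index(graph_ids: list[str], local_names: dict[str, str]) -> dict[str, str]:
    """Map each local graph name back to its graph ID, rejecting ambiguous names."""
    index: dict[str, str] = {}
    for graph_id in graph_ids:
        name = local_graph_name(graph_id, local_names)
        if name in index:
            raise ValueError(f"local graph name {name!r} is used for two graphs")
        index[name] = graph_id
    return index
```

The graph IDs stay the same for every instance; only the `local_names` mapping differs from service to service.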

The same principle underlies the distinction between primary graph (a repository-level concept: the graph that operations target by default) and default graph (a service-level concept: the unnamed graph in the RDF dataset). The dcatr:usePrimaryAsDefault property controls the relationship between these two.

Graph type taxonomy

Every graph in DCAT-R belongs to exactly one of four disjoint types:

DataGraph: User data forming the dataset content
ManifestGraph: DCAT-R configuration and catalog metadata (with subtypes RepositoryManifestGraph and ServiceManifestGraph)
SystemGraph: Application-specific operational data (e.g., version history, indexes, provenance records)
WorkingGraph: Temporary, service-local graphs for drafts, staging, or caches

These four types are defined as pairwise disjoint OWL classes whose union equals dcatr:Graph, ensuring that every graph has an unambiguous classification. This enables applications to reliably distinguish user data from infrastructure without relying on naming conventions.
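What the disjointness buys an application can be shown with a small Python check. It is a sketch under simplifying assumptions: types are given as plain class names, only the two manifest subtypes named here are known, and application-specific subclasses are not resolved.

```python
TOP_LEVEL_TYPES = {"DataGraph", "ManifestGraph", "SystemGraph", "WorkingGraph"}
MANIFEST_SUBTYPES = {"RepositoryManifestGraph", "ServiceManifestGraph"}


def graph_type(declared_types: set[str]) -> str:
    """Return the single top-level type, or raise if the classification is ambiguous."""
    top_level = {
        "ManifestGraph" if t in MANIFEST_SUBTYPES else t
        for t in declared_types
    } & TOP_LEVEL_TYPES
    if len(top_level) != 1:
        raise ValueError(f"expected exactly one graph type, got {sorted(top_level)}")
    return top_level.pop()
```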

Manifest system

Building on Ontogen’s original Ontogen.Config, the configuration system has been formalized as a two-graph manifest system reflecting the distribution boundary: the repository manifest graph carries distributed catalog metadata, while the service manifest graph carries instance-local configuration. Additionally, DCAT-R introduces Manifest Graph Expansion (MGE), a mechanism for automatically including referenced resources from a shared pool into the appropriate manifest graphs. This provides a DRY pattern for shared resources (such as agent descriptions) across multiple manifest files.

Directory support

Real-world RDF repositories can contain dozens or hundreds of named graphs. DCAT-R introduces dcatr:Directory as a hierarchical containment mechanism for organizing graphs into named collections, much like a filesystem organizes files into directories. Directories can be nested to arbitrary depth, and each graph belongs to at most one directory. When graph URIs follow a hierarchical naming scheme, directories can make this structure explicit and navigable.
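For graph URIs that already follow a hierarchical scheme, a directory structure could be derived mechanically. The Python sketch below does this with nested dicts; it is only meant to illustrate the containment idea, uses made-up example URIs, and ignores edge cases such as a graph and a directory sharing a name.

```python
from urllib.parse import urlparse


def directory_tree(graph_uris: list[str]) -> dict:
    """Nest graphs into directories based on the path segments of their URIs."""
    tree: dict = {}
    for uri in graph_uris:
        segments = [s for s in urlparse(uri).path.split("/") if s]
        node = tree
        for directory in segments[:-1]:
            node = node.setdefault(directory, {})  # directories nest to any depth
        node[segments[-1]] = uri                   # each graph sits in one directory
    return tree
```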

DCAT-R.ex

DCAT-R.ex is the Elixir implementation of the DCAT-R specification. It provides Grax-based schemas for all DCAT-R classes and a manifest loading pipeline that resolves environment-specific configurations from Turtle files.

The key design principle of DCAT-R.ex is extensibility through behaviors. Applications define specialized types by implementing:

DCATR.Service.Type - to define a service with custom operations and configuration
DCATR.Repository.Type - to add distributed system graphs
DCATR.ServiceData.Type - to add local system graphs or working graphs
DCATR.Manifest.Type - to register the specialized service type and optionally integrate custom configuration logic (such as the Bog-based interpretation used in Ontogen)

Gno as a DCAT-R service

Gno itself is a concrete example of this extension pattern. A gno:Service is a subclass of dcatr:Service that adds two elements:

graph TD
    A[gno:Service] -->|dcatr:serviceRepository| B(dcatr:Repository)
    A -->|gno:serviceStore| C(gno:Store)
    A -->|gno:serviceCommitOperation| D(gno:CommitOperation)

    B -->|dcatr:repositoryDataset| E[dcatr:Dataset]
    B -->|dcatr:repositoryManifestGraph| F[dcatr:RepositoryManifestGraph]

    C -->|rdf:type| G[gnoa:Fuseki / gnoa:Oxigraph / ...]

    D -->|gno:commitMiddleware| H["( Middleware 1, Middleware 2, ... )"]

    style A fill:#ccd1e0,stroke:#333,stroke-width:4px
    style B fill:#d1c2f0,stroke:#333,stroke-width:2px
    style C fill:#c2f0d1,stroke:#333,stroke-width:2px
    style D fill:#c2f0d1,stroke:#333,stroke-width:2px
    style E fill:#f0e6ff,stroke:#333,stroke-width:2px
    style F fill:#f0e6ff,stroke:#333,stroke-width:2px

A Store (gno:Store) representing the SPARQL triple store backend, with vendor-specific subclasses that know how to construct the correct endpoint URLs (e.g., gnoa:Fuseki constructs Fuseki’s /{dataset}/sparql, /{dataset}/update, etc. from a dataset name).
A CommitOperation (gno:CommitOperation) carrying the middleware pipeline configuration for the commit system.
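The kind of endpoint construction a vendor-specific store performs can be sketched in Python as follows. The function name and the base URL in the usage note are illustrative; the query and update paths are the ones named here, and the `/data` path for the graph store endpoint is my assumption about a default Fuseki setup.

```python
def fuseki_endpoints(base_url: str, dataset: str) -> dict[str, str]:
    """Derive per-dataset endpoint URLs following Fuseki's /{dataset}/... layout."""
    root = f"{base_url.rstrip('/')}/{dataset.strip('/')}"
    return {
        "query": f"{root}/sparql",
        "update": f"{root}/update",
        "graph_store": f"{root}/data",  # assumption: default graph store service name
    }
```

For example, `fuseki_endpoints("http://localhost:3030", "my_dataset")` yields `http://localhost:3030/my_dataset/sparql` as the query endpoint. A generic SPARQL 1.1 store would instead be configured with explicit URLs.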

This means Gno does not introduce its own repository model - it reuses DCAT-R’s repository, dataset, and graph structure and adds only the store-related and commit-related configuration on top. Systems that build on Gno (like Ontogen) can in turn extend the Gno service type further, adding their own system graphs (a history graph in this case), commit middleware, and application-specific configuration - all within the DCAT-R framework.

RDF.ex 3.0

Alongside these higher-level frameworks, RDF.ex 3.0 brings a significant redesign of the RDF.Data API, among other improvements. The previous RDF.Data protocol is now structured in two parts, following Elixir’s Enumerable/Enum pattern:

The RDF.Data.Source protocol defines a minimal set of primitives that RDF data structures implement.
The RDF.Data module builds a rich, user-friendly API on top of these primitives, providing functions for iteration, transformation, navigation, aggregation, and conversion.

Just as implementing Enumerable for a custom data structure gives access to all of Enum’s functions, implementing RDF.Data.Source gives access to the entire RDF.Data API. This enables uniform processing of RDF data regardless of whether it comes from an RDF.Description, an RDF.Graph, an RDF.Dataset, or a custom implementation.
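For readers less familiar with Elixir protocols, the same split can be imitated in Python. The sketch below is an analogy only and does not reflect the actual RDF.Data.Source primitives or RDF.Data functions: `TripleSource` plays the role of the protocol with a single primitive, and the module-level functions play the role of the rich API built on top of it. All names are hypothetical.

```python
from typing import Iterable, Iterator, Protocol

Triple = tuple[str, str, str]


class TripleSource(Protocol):
    """The minimal primitive every data structure has to provide."""
    def triples(self) -> Iterator[Triple]: ...


class Description:
    """Statements about a single subject."""
    def __init__(self, subject: str, pairs: Iterable[tuple[str, str]]):
        self.subject, self.pairs = subject, list(pairs)

    def triples(self) -> Iterator[Triple]:
        for predicate, obj in self.pairs:
            yield (self.subject, predicate, obj)


class Graph:
    """A set of statements about arbitrary subjects."""
    def __init__(self, triples: Iterable[Triple]):
        self._triples = set(triples)

    def triples(self) -> Iterator[Triple]:
        return iter(self._triples)


# The generic API: written once, works for every TripleSource.
def statement_count(source: TripleSource) -> int:
    return sum(1 for _ in source.triples())


def subjects(source: TripleSource) -> set[str]:
    return {s for s, _, _ in source.triples()}


def objects_of(source: TripleSource, predicate: str) -> set[str]:
    return {o for _, p, o in source.triples() if p == predicate}
```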

For a comprehensive overview, see the new RDF.Data section in the user guide or the API documentation.

Ontogen status

Ontogen has already been fully migrated to DCAT-R and Gno. The architecture now forms a clean three-layer stack:

DCAT-R provides the structural vocabulary
Gno adds store operations
Ontogen adds versioning semantics

In concrete terms: Ontogen services are now realized as DCAT-R services (via Gno), and Ontogen’s versioning logic - creating commit objects, writing to the history graph, updating the repository HEAD - is implemented as Gno commit middleware.

However, completing Ontogen’s next version also depends on two other projects that are not release-ready yet. The new version of Ontogen is planned for release this summer, together with the release of these other projects.

One item from the original roadmap that will not be realized is the planned DID integration by Patrick, whose other commitments did not leave him enough time to pursue this work.

As always, I would like to express my sincere gratitude to the NLnet Foundation for their continued support through the NGI Zero Core fund, which makes all of this work possible.

Project Update: New Roadmap for 2025 and release announcements

2025-04-09T10:00:00+00:00

The work on Mud, the extracted and enhanced version of Bog, had been progressing well towards solving the problem of static UUID graph names in Ontogen repositories. However, when it came to extending automatic URI generation and management to the resources of the dataset itself, it became increasingly clear that Bog’s strategy alone was insufficient. So, after going back to the drawing board, I’ve developed a new approach. While requiring significantly more effort than initially planned, it promises a comprehensive solution to the URI generation problem which, as described in the Bog article, I consider one of the fundamental challenges of the Semantic Web.

During the design of this new solution, it also became clear that the project must be split up further: the DCAT-based Repository and Service model (described in this article) now needs to be extracted as well. This separation will make the dataset management capabilities available independently while allowing Ontogen to focus exclusively on its core versioning functionality.

The project’s evolution, particularly additional requirements that emerged for the planned DID implementation, has also led to several updates to the RDF on Elixir libraries, some of which are being released today:

JSON-LD.ex v1.0 with JSON-LD 1.1 support
RDF.ex v2.1 with rdf:JSON literal support
Grax v0.6 with rdf:JSON support and ordered lists based on rdf:Lists

Please refer to the respective CHANGELOGs for a comprehensive list of the changes.

Further planned library updates include:

Support for the upcoming RDF 1.2 specification in RDF.ex
Support for JSON-LD framing
Two patterns used repeatedly to solve problems in the new design will be implemented as Grax extensions:
- Managing temporal values in RDF, tracking how values change over time while maintaining their history
- Handling multiple language-tagged string values as a unique value, enabling functional property behavior for localized strings

Some previously announced features, at least Sagas, will need to be deferred to a future development phase to accommodate this more fundamental work.

While this represents a shift from the original roadmap, I strongly believe that the benefits of the planned comprehensive URI management outweigh the delay of other features.

I’d like to express my sincere gratitude to the NLnet Foundation for their continued support, making all of this possible.

Roadmap for next NLnet funding

2024-10-22T10:00:00+00:00

I’m pleased to announce that the NLnet Foundation has approved follow-up funding for Ontogen from the NGI0 Commons Fund. I’m grateful for NLnet’s trust in this project and their general commitment to advancing open source technologies.

Unfortunately, the future of such funding through the EU is uncertain. There are plans to redirect these funds in questionable directions, potentially affecting this and many other open source projects. For more information on this concerning development, see this article by the Free Software Foundation.

Fortunately, the funding for Ontogen’s development is secured for the coming year. With this support, the plan is to overcome the current limitations mentioned in the introductory articles (and now also clearly stated in the project READMEs) and enhance Ontogen’s functionality. Here’s an overview of the planned work.

Roadmap

Mud

Bog, Ontogen’s configuration language and corresponding precompiler for automatic ID management in RDF, will be extracted from Ontogen and realized as its own project, as its functions can be usefully applied outside the versioning context. In the process, it will be renamed to Mud.

The problem of cryptic graph names will be solved by extending the mechanism for generating URIs for resources to include the ability to later rename them to custom URIs and track the past URIs. The possibility of generating URIs, previously limited to the resources of an Ontogen repository, will be extended to the resources of the version-controlled dataset.

Furthermore, the range of functions will be expanded to include additional typical and extensible operations that one might want to perform when working with RDF datasets, such as URI normalizations, RDF smushing, etc.

Saga Synchronization Protocol

Sagas introduces a synchronization protocol to keep different storage locations in sync by utilizing Ontogen’s version history itself. Just as a VCS can be viewed as a synchronization solution for copies of a repo, “Sagas” can be viewed as “clones” within an Ontogen instance that are kept in sync via the shared version history.

This allows us to turn necessity into virtue and actually duplicate and synchronize not just the repository configuration in the file system, but the entire content of the dataset. This duplication opens up new possibilities. Having a complete copy of the repository in the file system enables the use of the many file-based tools for working with the data. For example, ordinary text editors can be used to edit the data, which is kept in sync with the triple store.

This significantly enhances the flexibility and accessibility of the data, allowing users to leverage their preferred tools and workflows. Furthermore, this approach allows for integration with Git. By leveraging Git’s versioning capabilities alongside Ontogen’s specialized RDF versioning, we can achieve better collaboration, history backups and migrations, and even more flexible workflows in RDF dataset management, especially until Ontogen is more mature and can offer similar functionality itself. Git’s syntactic versioning, however, will never be offered by Ontogen, so for users who want or need it, this integration remains useful in the long term.

Multi-Graph Dataset Versioning

Supporting versioning of datasets with multiple graphs is a larger undertaking, unfortunately, as RDF-star, which forms the basis of Ontogen’s versioning model, only allows annotation of triples and not quads. Therefore, a rather complex tracking of the assignment of individual triples to graphs must be realized at the level of the RDF-star annotations of RTC compounds.

DID Integration

In addition to these core developments, Patrick Oscity will contribute by developing an Elixir implementation of W3C Decentralized Identifiers (DIDs). This implementation will be integrated into Mud and Ontogen to support the automatic generation of DIDs, providing a standardized, interoperable format for persistent identifiers with enhanced privacy features, for datasets where this is needed or useful.

Scalability problem

One limitation that has not been mentioned so far: Ontogen, in its current form, is not suitable for large datasets. While I work on the planned developments, I’m also exploring solutions to this scalability issue.

At least one issue causing timeouts with larger queries was fixed in the recently released version 0.1.3. If you’ve already installed Ontogen, you can upgrade using

brew upgrade ontogen

Ontogen Configuration with Bog

2024-08-14T07:00:00+00:00

This is the fourth and last of a series of four blog posts introducing the different parts of the Ontogen version control system and the ideas behind it.

In the previous article, we explored the interpretation of dcat:Catalog and dcat:DataService concepts within the framework of Ontogen as a Data Control Management (DCM) system. Now, let’s turn our attention to another crucial aspect: the configuration of Ontogen components.

We’ve seen that an Ontogen service represents a concrete instantiation of a repository on a user’s machine, consisting of the repository itself and its associated triple store. Configuring these components presents us with several challenges, particularly when it comes to naming and identifying resources.

In this article, we’ll delve into Ontogen’s configuration system. We’ll explore Bog, a specialized configuration language developed for Ontogen. Bog addresses some of the fundamental challenges in configuring RDF-based systems and offers innovative solutions for resource naming and identification.

Bog

“Naming is hard.” This is especially true in the RDF world with its URIs, and even more so under Linked Data rules, where the preference for HTTP URIs further complicates the issue by introducing DNS as a social authority, thus bringing societal and political aspects into play.

To be clear: URIs as a central component of the RDF model are one of the reasons that make it so powerful. Nevertheless, the additional complexity this creates for the naming problem is, in my opinion, one of the reasons why it still causes reservations compared to other data models. Particularly when combined with the lack of automated solutions for this problem, it makes proponents of the traditional relational data model shake their heads and happily continue to generate records (or models of their ORMs) with automatically generated primary keys.

Bog aims to automate the problem of resource naming (aka URI minting) for a specific class of resources, initially and specifically for the resources of an Ontogen service that need to be minted as part of the configuration.

To this end, Bog introduces some special RDF properties and resources with particular semantics that are processed by a Bog interpreter. The most fundamental of these properties, to which all others are ultimately reduced, is bog:ref. With bog:ref, a resource initially identified by a blank node can be given a locally valid name.

Example:

# we omit this prefix in the following code snippets
@prefix :  .

[ :ref "this-service-instance" ; a og:Service ] .

On first interpretation, the Bog interpreter treats this as minting a new, locally named resource: it generates a random salt, stores it in a file with the specified name, and replaces the blank node with a generated UUIDv5 URI. On each subsequent interpretation, loading the salt reproduces exactly the same UUIDv5 URI, so the configuration is consistently translated to the same graph.

Note: the name itself is not part of this hash. The locally used :ref name can therefore be changed at any time by renaming the salt file and all occurrences of the name in the local configuration files, and it will still lead to the same UUID URI.

To prevent accidental changes, Bog throws an error if the salt file is not present. A missing salt file is only allowed during initial minting.
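The salt-based minting can be approximated with Python’s standard library. This is a sketch under assumptions, not Bog’s actual algorithm: how Bog derives the UUIDv5 from the salt, the namespace used, and the `initial` flag are all my own choices for illustration. What it does reproduce is the described behavior: the URI depends only on the persisted salt, never on the local name.

```python
import secrets
import uuid
from pathlib import Path


def mint_uri(ref: str, salt_dir: Path, initial: bool = False) -> str:
    """Return a stable UUIDv5 URI for the locally named resource `ref`."""
    salt_file = salt_dir / f"{ref}.salt"
    if not salt_file.exists():
        if not initial:
            # Guard against accidentally minting a new identity.
            raise FileNotFoundError(f"missing salt file for {ref!r}")
        salt_dir.mkdir(parents=True, exist_ok=True)
        salt_file.write_text(secrets.token_hex(16))
    salt = salt_file.read_text().strip()
    # The ref name is deliberately not hashed: only the salt determines the URI.
    # Assumption: NAMESPACE_URL is an arbitrary choice for this sketch.
    return uuid.uuid5(uuid.NAMESPACE_URL, salt).urn
```

Renaming `service.salt` to another name and calling `mint_uri` with that new name returns the same URI, whereas calling it with a name that has no salt file raises an error unless `initial=True` is passed.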

Bog offers another solution to avoid name changes: the indexical bog:this property. The concept of indexicality comes from the philosophy of language and refers to linguistic expressions whose meaning depends on the context of their utterance. In Bog, this concept is applied to the referencing of resources. It allows referencing individual instances of classes that represent a unique individual relative to this instance in the context of the executing instance. Consequently, these quasi-relative singletons can also be referenced directly via their class in the context of the executing instance.

Example: In the context of execution on an Ontogen instance, there is exactly one distinguished individual of the class og:Service that represents this very instance as an og:Service. Thus, in Bog, it can also be referenced using the bog:this property as follows:

[ :this og:Service ] .

This is interpreted by the Bog interpreter as follows:

[ :ref "service" ; a og:Service ] .

In fact, almost all elements of an og:Service are unique singletons relative to this instance and thus indexically referenceable via bog:this and the corresponding class name.

Additionally, Bog provides the indexical bog:I resource for referencing the user of the application:

:I ex:p ex:O .

This is interpreted as:

[ :this og:Agent ; ex:p ex:O ] .

Through local interpretation, Bog thus allows us to reference the various components of an Ontogen service without having to name them, but instead getting automatically generated, stable URIs for these resources.
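The two rewriting steps shown in the examples can be mimicked in Python on a simplified representation, where a blank node is a list of (predicate, object) pairs. This is a sketch of the idea rather than of the Bog interpreter: the function names are hypothetical, and deriving the :ref name by lowercasing the class’s local name is an assumption that merely matches the "service" example.

```python
Pairs = list[tuple[str, str]]  # predicate/object pairs of one blank node


def expand_i(pairs: Pairs) -> Pairs:
    """Rewrite statements about :I into a blank node for this instance's agent."""
    return [(":this", "og:Agent"), *pairs]


def expand_this(pairs: Pairs) -> Pairs:
    """Reduce :this to :ref plus a type statement."""
    expanded: Pairs = []
    for predicate, obj in pairs:
        if predicate == ":this":
            local_name = obj.split(":", 1)[-1]  # "og:Service" -> "Service"
            expanded += [(":ref", local_name.lower()), ("a", obj)]
        else:
            expanded.append((predicate, obj))
    return expanded
```

Applying `expand_this(expand_i([("ex:p", "ex:O")]))` reduces the :I example to a :ref-based description in two steps, which reflects the statement that all other properties are ultimately reduced to bog:ref.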

It is planned to spin off Bog as its own project with expanded functionality in the next version of Ontogen. In particular, it should be possible to give the resources managed with Bog proper Linked Data URIs at a later point in time if needed.

In future versions, there are plans to apply the Bog-based minting process to resources of the version-controlled dataset itself, and to include this minting process as a speech act within the version history. The secret salts possessed by the creator of such resources would then give them cryptographic control over the further use and modification possibilities of these resources.

These developments aim to enhance Bog’s capabilities and integrate it more deeply with Ontogen’s versioning system, potentially offering new levels of security and control over resource management.

Bog-based configuration of Ontogen services and repositories

With this mechanism, we have a method for generating persistent and reproducible UUID URIs. Now, let’s look at how the concrete specification of Ontogen services and their components (store, repository, etc.) works in practice with the Bog interpreter.

For a project created with the Ontogen CLI, the directory structure typically looks like this:

my_dataset/  
├── .ontogen  
│    ├── .salts  
│    │   ├── agent.salt  
│    │   ├── dataset.salt  
│    │   ├── fuseki.salt  
│    │   ├── history.salt  
│    │   ├── oxigraph.salt  
│    │   ├── repository.salt  
│    │   ├── service.salt  
│    │   └── store.salt  
│    └── config  
│        ├── agent.bog.ttl  
│        ├── dataset.bog.ttl  
│        ├── fuseki.bog.ttl  
│        ├── oxigraph.bog.ttl  
│        ├── repository.bog.ttl  
│        ├── service.bog.ttl  
│        └── store.bog.ttl  
│── ...

Each of the .ontogen/config/*.bog.ttl files configures exactly one singleton instance of the respective class of an Ontogen service instance, to which the bog:this in the respective Bog Turtle file refers. The salt for generating the URI is persisted in the respective .ontogen/.salts/*.salt file of the same name.
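The naming convention between configuration and salt files can be made concrete with a short sketch. It only pairs files by name based on the directory layout shown above; the function name is hypothetical and nothing here reflects how Ontogen reads these files internally.

```python
from pathlib import Path

CONFIG_SUFFIX = ".bog.ttl"


def component_files(project_root):
    """Map each component name to its (config file, salt file) pair.

    Follows the layout above: .ontogen/config/<name>.bog.ttl is paired
    with .ontogen/.salts/<name>.salt. The salt path is derived by name
    only; this function does not check that the salt file exists.
    """
    ontogen_dir = Path(project_root) / ".ontogen"
    pairs = {}
    for config_file in sorted((ontogen_dir / "config").glob("*" + CONFIG_SUFFIX)):
        name = config_file.name[: -len(CONFIG_SUFFIX)]
        pairs[name] = (config_file, ontogen_dir / ".salts" / (name + ".salt"))
    return pairs
```

Called on the my_dataset directory, this would yield entries such as "service" and "store", each pointing at its Bog Turtle file and the salt of the same name.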

Rather than using a single large configuration file, a modular approach is followed here, where each component is configured in its own Bog Turtle file. This is advisable because a component’s file holds not only its configuration (for which there are few options in this first version) but also the general description of the respective resource: the description of the og:Service as a dcat:DataService, of the og:Dataset as a dcat:Catalog, of the og:Agent as a foaf:Agent, and so on. A complete description of all these resources in one file would quickly become very large and confusing.

Here’s an example of the configuration of the og:Dataset with a complete DCAT description:

@prefix :  .
@prefix og:  .
@prefix dcat:  .
@prefix dcterms:  .
@prefix foaf:  .
@prefix xsd:  .

[ :this og:Dataset
  ; a dcat:Dataset
  ; dcterms:title "My Dataset"@en
  ; dcterms:description "An example dataset"@en
  ; dcterms:created "2023-08-13"^^xsd:date
  ; dcterms:creator :I
  ; dcterms:publisher :I
  ; dcat:contactPoint :I
  ; dcat:keyword "RDF"@en, "Ontology"@en, "Versioning"@en
  ; dcat:theme 
  ; dcterms:license 
]

(You can find more examples of the configuration in the User Guide.)

Additionally, as in Git, global Bog Turtle files can specify more general default values for certain configurations and metadata, for example about the user agent in general:

/etc/ontogen.conf.bog.ttl
~/.ontogen.conf.bog.ttl

The structure and interpretation of the description of an Ontogen service and its components thus looks as follows:

When Ontogen starts, all configuration files are loaded into a graph, starting with the global files. During this incremental construction of the graph, it should be noted that values for properties of the same resource are overwritten. This ensures that the values from the global configuration files only serve as default values and the repository-specific configuration files can completely overwrite these values if necessary.

The following Bog Turtle fragment is then added to this graph to ensure the basic static linking of the resources of the og:Service aggregate:

 [ :this og:Service  
     ; og:serviceRepository [ :this og:Repository  
         ; og:repositoryDataset [ :this og:Dataset ]  
         ; og:repositoryHistory [ :this og:History ]  
     ]  
      
     ; og:serviceOperator :I  
 ] .

Then the Bog precompiler is applied to this graph, which resolves blank nodes to URIs according to the salts or mints them.
Finally, the resulting graph is loaded using Grax (the RDF data mapper for Elixir) into a deeply nested, native Elixir structure for the Ontogen service and all its components, which then serves as the state of the Ontogen singleton GenServer.
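The override rule from the first step can be sketched in a few lines. This is a simplified Python model, not Ontogen’s Elixir code: triples are plain tuples, resources are identified by hypothetical names, and it is an assumption that overriding happens per subject and property.

```python
def merge_config(layers):
    """Merge config layers in load order (global files first, repository last).

    Each layer is an iterable of (subject, property, value) tuples. When a
    later layer states any value for a (subject, property) pair, it replaces
    all values that earlier layers gave for that pair.
    """
    merged = {}
    for layer in layers:
        layer_values = {}
        for subject, prop, value in layer:
            layer_values.setdefault((subject, prop), set()).add(value)
        # Later layers overwrite earlier ones per (subject, property) pair.
        merged.update(layer_values)
    return {
        (subject, prop, value)
        for (subject, prop), values in merged.items()
        for value in values
    }
```

With a global layer that sets a default name and mailbox for the agent and a repository layer that sets only the name, the merged result keeps the global mailbox and the repository-specific name.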

An issue that needs to be addressed soon is that the repository description is copied to the repository metadata graph in the store when the repository is set up (with og setup). Currently, there’s no way to update this metadata when the configuration has been changed.

This issue, along with the previously mentioned lack of ability to introduce custom URIs for resources, is the reason why users currently have to work with cryptic graph names. Specifically, the Bog-generated URI of the og:Dataset is used as the name of the graph for the version-controlled dataset, the Bog-generated URI of the og:History serves as the name for the history graph, and the Bog-generated URI of the og:Repository is employed as the name of the repository metadata graph.

I consider this a significant drawback of the current implementation. Addressing these two issues - allowing metadata updates and introducing custom URI support - is therefore of the highest priority in the upcoming development work. The goal is to provide users with more intuitive and manageable graph naming conventions.

Store adapters

A careful reader may have noticed that the basic structure of an og:Service described earlier lacks the link to the og:Store. The reason for this is that the configuration of the og:Store differs from that of other components, as different triple stores, although all SPARQL-based, require different configurations. Ontogen addresses this challenge through the implementation of store adapters.

In Ontogen, triple store adapters are implemented as subclasses of og:Store. This solution is not only conceptually very simple but also provides a comprehensive basis for solving this problem thanks to Grax and its support for polymorphic links. (In particular, we overcome the “Walled Gardens within Elixir”, which potentially allows store adapters to be developed and versioned as separate Hex packages if their complexity should increase over time, which might happen quickly when triple store-specific extensions are implemented in a store adapter.)

This should now also make clear why the statically added basic structure does not define a link to the og:Store: It is the user’s decision which triple store their Ontogen service instance should run with. The user makes this decision by specifying an instance of the corresponding og:Store subclass in the service configuration. Consequently, the configuration files generated during the initialization of a repository look like this:

# all triple store adapters have URIs in a dedicated namespace
@prefix oga:  .

[ :this og:Service

    #################################################
    # store selection

    # Select the adapter of your choice and feel free to remove the unused
    # (incl. their respective config files)
    ; og:serviceStore [ :this og:Store ]
    # ; og:serviceStore [ :this oga:Oxigraph ]
    # ; og:serviceStore [ :this oga:Fuseki ]

    #################################################
    # metadata

    ; dcterms:title "Your service name"
    ; dcterms:creator :I
    ; dcterms:publisher :I

] .

When initializing a repository, configuration files are generated for all available adapters. However, only the adapter specified here in the og:Service is actually used during operation.

If, as in the above case, the instance is typed directly as og:Store rather than as one of its subclasses, the generic adapter is used, configured by the og:Store instance in .ontogen/config/store.bog.ttl. In this case, the URLs of the various endpoints of the triple store must be specified:

[ :this og:Store
    ; og:storeQueryEndpoint 
    ; og:storeUpdateEndpoint 
    ; og:storeGraphStoreEndpoint 
] .

In the configurations of the adapters, on the other hand, where the logic for generating the various URLs of the endpoints is implemented, only the corresponding components of these URLs need to be specified. Since the default values of these components for a standard installation of the corresponding triple store are also known in the adapter and are assumed if not present, an adapter configuration could theoretically be empty and still functional:

[ :this oga:Oxigraph ] .

However, during initialization, complete configurations with all properties for the components (using the defaults) are generated to make them explicit and easily adjustable if necessary:

[ :this oga:Oxigraph
    ; og:storeEndpointScheme "http"
    ; og:storeEndpointHost "localhost"
    ; og:storeEndpointPort 7878
] .

In some adapters, there are also components for which no sensible default exists and which therefore must be configured with a suitable value, such as in the case of Fuseki, the name of the dataset in which the repository should be stored (and which was previously created by the user in Fuseki):

[ :this oga:Fuseki
    ; og:storeEndpointScheme "http"
    ; og:storeEndpointHost "localhost"
    ; og:storeEndpointPort 3030
    ; og:storeEndpointDataset "name-of-the-dataset"
] .
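How an adapter might derive the three endpoint URLs from these components can be sketched as follows. This is a Python illustration with hypothetical function names. The paths are those commonly used by default installations of Oxigraph (/query, /update, /store) and Fuseki (/{dataset}/query, /{dataset}/update, /{dataset}/data); that Ontogen’s adapters generate exactly these URLs is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Endpoints:
    query: str        # SPARQL query endpoint
    update: str       # SPARQL update endpoint
    graph_store: str  # SPARQL Graph Store HTTP Protocol endpoint


def oxigraph_endpoints(scheme="http", host="localhost", port=7878):
    """Build endpoint URLs for a standard Oxigraph server (assumed paths)."""
    base = f"{scheme}://{host}:{port}"
    return Endpoints(f"{base}/query", f"{base}/update", f"{base}/store")


def fuseki_endpoints(dataset, scheme="http", host="localhost", port=3030):
    """Build endpoint URLs for a Fuseki dataset (assumed paths).

    The dataset name has no sensible default, so it is a required argument.
    """
    base = f"{scheme}://{host}:{port}/{dataset}"
    return Endpoints(f"{base}/query", f"{base}/update", f"{base}/data")
```

The required dataset argument mirrors the point made above: defaults cover scheme, host, and port, while the Fuseki dataset name must always be supplied.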

At this point, operation via an adapter does not yet differ from that with an analogous configuration of the generic og:Store. However, the implementation of the adapters is structured in such a way that the HTTP requests can be flexibly wrapped and very easily provided with additional logic if needed, for example, to support triple store-specific optimizations or extensions.

In the future, further generic configuration options should be specifiable in the generic og:Store, for example, for the different SPARQL HTTP request forms, so that further adaptation and optimization options exist for triple stores without their own adapter implementation in Ontogen.

Furthermore, adapters for other popular triple stores will of course be offered in the future.

Conclusion

With this exploration of Ontogen’s configuration system and the introduction to Bog, the introductory series on Ontogen comes to an end. The aim of these articles has been to present the ideas behind Ontogen, from its fundamental concepts to its practical implementation and configuration.

Ontogen aims to address some of the challenges in managing and versioning RDF datasets. While it’s still in its early stages, it offers a new approach that combines established semantic web standards with novel ideas in versioning and configuration.

As an open-source project, Ontogen’s future development will greatly benefit from community feedback and contributions. The project can be found on GitHub, where stars, issues, and pull requests are always appreciated.

For those interested in following the project’s progress or reading future articles, updates are available via RSS, Mastodon, or LinkedIn.

Thank you for your interest in Ontogen. I look forward to seeing how it might be used and improved in the future.

Ontogen’s Repository and Service Model

2024-08-12T07:00:00+00:00

This is the third in a series of four blog posts introducing the different parts of the Ontogen version control system and the ideas behind it.

After examining the PROV-based og:SpeechAct and og:Commit model in the previous article, we will now focus on the structure of an Ontogen repository.

Isolated history graph

As we saw in the last article, Ontogen’s version history is based on so-called og:Propositions, which are implemented as RDF Triple Compounds. These propositions form the foundation for the Ontogen speech acts and commits, which represent the actual versioning information. A particular challenge in implementing a version control system for RDF is the question of how to store this versioning information in relation to the actual data.

A distinctive feature of storing versioning information in Ontogen is the strict separation between the actual data and the versioning artifacts. Similar to file-based version control systems like Git, where the version history is encapsulated in a hidden .git directory, Ontogen stores all versioning information in a separate graph, the so-called og:History graph.

This history graph stores the proposition compounds with the RDF-star statements and the assertions of all higher-level resources such as speech acts and commits. This approach ensures that the actual data of the RDF dataset remains completely free of version control artifacts.

This is implemented with RDF-star and the RTC vocabulary. We use the rtc:elements property, the inverse of rtc:elementOf, and store the RDF-star statements as “unasserted” quoted triples in the history graph. This means that the statements themselves are not asserted again in the history graph, but only annotated.

Another advantage of this approach is that we save at least a little storage space: the statements are asserted only once in the graph with the version-controlled user data, while they appear in the history graph merely as quoted triples. This prevents additional duplication that would occur with “asserted” RDF-star annotations, where the statements would need to be stored again as regular RDF triples in the history graph. (Because identical statement sets coincide in a single proposition, some further space is saved whenever the same statements are repeated across speech acts.)

It should be noted that the current version of Ontogen only supports versioning of individual RDF graphs. Managing changes across different graphs of an RDF dataset is planned for future versions.

Ontogen repositories as DCAT catalogs

Regardless of the versioning issue, a general question arises: how do we actually manage the datasets that are versioned but free from the versioning history? Of course, we don’t want to impose any obligations on the user here. Rather, it’s about how exactly we can describe the graphs of our RDF dataset and make their metadata accessible to consumers of our dataset. Moreover, if our dataset consists of a larger number of graphs (when we support this in the future), this can easily become unwieldy and makes further structuring, if not necessary, at least advisable.

According to the standard, an RDF dataset is just a default graph and a set of named graphs; beyond that, the standard makes no further recommendations. There is the SPARQL Service Description Vocabulary that reflects the graph structure of a SPARQL store, but this is primarily designed for technical details of a SPARQL endpoint and offers little room for rich semantic descriptions of the data itself and nested or hierarchical relationships between datasets or graphs.

In fact, there is a general W3C standard for this problem, the Data Catalog (DCAT) Vocabulary, which is suitable for our purposes in many ways:

Flexibility and extensibility: DCAT provides a basic vocabulary that can be easily adapted and extended to specific needs.
Hierarchical structuring: DCAT allows the modeling of nested catalog structures, which is ideal for organizing complex RDF datasets.
Comprehensive metadata: DCAT offers a rich set of properties for describing datasets, including license information, access rights, and temporal aspects, some of which can even be provided automatically in the context of Ontogen, as these can be derived from the speech act and commit history (authors, creation period, data sources, etc.)
Support for versioning and integration with PROV: The soon-to-be-completed version 3 of DCAT introduces a comprehensive versioning concept that is particularly relevant for Ontogen’s use case. This extension also seamlessly integrates the PROV vocabulary, which forms the basis for Ontogen’s RDF speech act and commit history. This integration enables the direct derivation and modeling of provenance information and version metadata from the version history. For example, authors, creation dates, and other relevant metadata for specific revisions of a dataset can be automatically generated and presented in a standardized form. This close intertwining of versioning and provenance tracking makes DCAT 3 an ideal vocabulary for metadata description in Ontogen, as it can precisely map the complex temporal and authorial aspects of versioned RDF datasets.
Standardization and interoperability: As a W3C standard, DCAT enjoys wide acceptance and support in the data management community. This promotes Ontogen’s compatibility with other systems and tools in the field of data management. For example, DCAT has gained great importance in the European Union, where it serves as the basis for DCAT-AP (DCAT Application Profile for data portals in Europe) and is supported in prominent data catalog platforms. The use of DCAT in Ontogen thus allows seamless integration into existing data ecosystems and facilitates the exchange of metadata with a variety of platforms and services.

These properties make DCAT an ideal basis for modeling and describing Ontogen repositories, as it covers both the technical and organizational aspects of RDF data management.

So how do we organize our Ontogen repository with DCAT?

First of all, it should be noted that a dataset in the sense of DCAT is a more general, much broader class than an RDF dataset:

dcat:Dataset := “A collection of data, published or curated by a single agent, and available for access or download in one or more representations.”

– https://www.w3.org/TR/vocab-dcat-3/#Class:Dataset

In Ontogen, however, the subject of versioning is an RDF dataset, that is, a specific subclass. Therefore, we define an og:Dataset as a subclass of dcat:Dataset. In fact, we can define it even more specifically as a subclass of dcat:Catalog: under the broader definition above, a single graph of an RDF dataset can itself be considered a dcat:Dataset, which makes the RDF dataset a collection of these.

While og:Datasets are now DCAT catalogs of the pure user-defined graphs of an RDF dataset versioned with Ontogen, we define an og:Repository as a DCAT catalog around such an og:Dataset, supplementing it with two additional entries, so that an og:Repository as a DCAT catalog consists of exactly two explicit DCAT dataset entries and one implicit graph:

graph TD
    A[og:Repository] -->|og:repositoryDataset| B(og:Dataset)
    A -->|og:repositoryHistory| C(og:History)
    A -->|implicit dcat:dataset| D(Repository Metadata Graph)
    
    B -->|dcat:dataset| E[Graph 1]
    B -->|dcat:dataset| F[Graph 2]
    B -->|dcat:dataset| G[Graph n]
    
    C -->|contains| H[SpeechActs]
    C -->|contains| I[Commits]
    C -->|contains| J[PROV Entities]
    C -->|contains| K[PROV Agents]
    
    D -->|describes| A
    D -->|describes| B
    D -->|describes| C
    
    style A fill:#d1c2f0,stroke:#333,stroke-width:4px
    style B fill:#f0e6ff,stroke:#333,stroke-width:2px
    style C fill:#f0e6ff,stroke:#333,stroke-width:2px
    style D fill:#f0e6ff,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5

One entry for the og:Dataset DCAT catalog with the pure user-defined graphs (which is defined using the og:repositoryDataset property, a sub-property of dcat:dataset).
One entry for the og:History graph with the provenance history of the speech acts and commits as PROV activities, including the linked PROV entities and PROV agents (which is defined using the og:repositoryHistory property, a sub-property of dcat:dataset).
A repository graph that contains the DCAT metadata description of the og:Repository itself, including the DCAT metadata description of the og:Dataset catalog. This graph poses a particular challenge as it is both part of the repository and its description. This self-referencing leads to a conceptual ambiguity: on the one hand, the graph is a dcat:Dataset, on the other hand, it contains the description of the entire repository including itself. In the current DCAT specification, there is no clear solution for this problem of self-description as an explicit part of a dcat:Catalog. Therefore, in the current version of Ontogen, this graph is treated implicitly as part of the definition of an og:Repository, without an explicit dcat:dataset entry for it in the catalog. This solution is pragmatic, but ultimately not really satisfactory. Better suggestions for solving this problem would be very welcome.

Ontogen instances as DCAT services

Ontogen follows the DCAT model beyond the catalog structure in the implementation of an Ontogen instance. A locally running instance is implemented as a dcat:DataService.

dcat:DataService := “A collection of operations that provides access to one or more datasets or data processing functions.”

– https://www.w3.org/TR/vocab-dcat-3/#Class:Data_Service

An og:Service, which is defined as a subclass of dcat:DataService, is a resource that structurally consists of two elements:

graph TD
    A[og:Service] -->|og:serviceRepository| B(og:Repository)
    A -->|og:serviceStore| C(og:Store)
    
    B -->|og:repositoryDataset| D[og:Dataset]
    B -->|og:repositoryHistory| E[og:History]
    B -->|implicit| F[Repository Metadata Graph]
    
    C -->|rdf:type| G[Specific Triple Store Implementation]
    
    style A fill:#ccd1e0,stroke:#333,stroke-width:4px
    style B fill:#d1c2f0,stroke:#333,stroke-width:2px
    style C fill:#c2f0d1,stroke:#333,stroke-width:2px

The og:Repository linked via the dcat:servesDataset sub-property og:serviceRepository
An og:Store linked via the property og:serviceStore, which represents the locally running SPARQL triple store in which the repository is stored

While the same Ontogen repository can exist on different computers, the various Ontogen instances on these computers operate as different Ontogen services with different stores but the same repository.

The main module Ontogen is a GenServer over such an og:Service as state, which executes the Ontogen operations on the repository specified therein, in the triple store specified therein.

How exactly such an og:Service looks and is configured, especially its og:Store using triple store vendor-specific subclasses, will be the subject of the next and, for the time being, last article in this series. This configuration is done in a special language specifically created for Ontogen, which needs to be introduced first.

Future developments

In upcoming versions of Ontogen, it is planned to expand the use of the DCAT integration. The interpretation of og:Dataset as dcat:Catalog is intended to form the basis for supporting versioning of datasets with multiple graphs, where DCAT’s capabilities for structuring complex datasets should be utilized.

The implementation of automatic generation of DCAT metadata from the PROV history is also envisioned, which will require developing a concept for dataset revisions in Ontogen based on the versioning concepts in DCAT 3.

These developments aim to enhance Ontogen’s compatibility with DCAT standards and provide more comprehensive dataset management features.

Ontogen’s Versioning Model

2024-08-08T09:00:00+00:00

This is the second in a series of four blog posts introducing the different parts of the Ontogen version control system and the ideas behind it.

In the previous article of our introduction series to Ontogen, we introduced RDF triple compounds (RTC). These triple compounds now serve as the foundation for Ontogen as a Data Control Management (DCM) system for RDF datasets. So, how exactly do the triple compounds used in Ontogen for versioning RDF datasets look?

Ultimately, our goal is to annotate sets of statements with metadata to automatically organize them in a version history. However, the challenge with triple compounds is that the changes we want to make and commit atomically are not necessarily a simple set with a single semantic meaning. Instead, they are sets with different change semantics that can potentially occur simultaneously in any combination: sets of statements to be added to a dataset, updated, or deleted.

Let’s consider the update of personal data after a marriage involving a name change. This complex but atomically related process requires different types of changes that should be encapsulated in a single, atomic entity. Imagine Sarah Miller gets married and takes her partner’s last name. Her new name is now Sarah Johnson. This scenario could involve various sets of statements with different change semantics. While we want to update the family name and marital status, we might simply want to add a new email address or wedding date in this context without overwriting old values.

Therefore, we need a potentially complex entity that encompasses these sets of changes.

RDF SpeechActs

In the Ontogen vocabulary (prefixed in the following with og:), these entities are called og:SpeechActs. We draw on the concept of speech acts from philosophy, which might seem unusual in this context at first glance. However, upon closer examination and when limited to RDF statements as the subject of speech acts, it perfectly models the situation in our versioning problem for RDF datasets.

A speech act, a term coined by J.L. Austin and further developed by John Searle, refers in linguistics and philosophy of language to an utterance that not only conveys information but also performs an action. Central to speech act theory is also the consideration of the context in which an utterance takes place. By including context, speech act theory extends the analysis of utterances beyond the purely semantic level to the pragmatic dimension.

To bridge the gap between the general concept of speech acts and their specific application in Ontogen, it’s important to understand how we’ve adapted this philosophical idea to our technical context. An og:SpeechAct represents a very specific form of speech act: the utterance or modification of RDF statements. This adaptation allows us to capture not just the content of RDF data, but also the act of asserting or changing that data, along with all the contextual information that surrounds that act.

In the context of Ontogen and RDF data, we apply this concept by treating each utterance of RDF statements as an og:SpeechAct - an action that does not represent the actual addition or modification of the dataset (we will continue to call this action a commit, following the usual versioning terminology). Instead, it represents the act of the original utterance of the statements in this dataset or subsequent acts that supplement, revise, confirm, etc. the original statements. This is because the central questions of provenance, i.e., the origin of the data, revolve around these acts. Some examples:

Who or what uttered, derived, or generated the data and when?
In what context was the data collected or modified?
With what intention or for what purpose was the data created or modified?
How was the data validated or verified?

With og:SpeechActs, our aim is to provide a model that allows us to capture information related to all possible questions about these utterances and record them as metadata.

Fortunately, we don’t need to develop a new ontology from scratch to model these metadata. There is already an excellent and standardized basis: the W3C PROV vocabulary. The W3C defines Provenance as follows:

“Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. The PROV Family of Documents defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments […].”

– https://www.w3.org/TR/prov-overview/

This definition fits perfectly with our approach of using speech acts as a basis for capturing provenance information in RDF datasets. In Ontogen, we therefore define our og:SpeechActs as a special manifestation, i.e., a subclass of prov:Activity.

“An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.”

– https://www.w3.org/TR/prov-o/#Activity

This allows us to leverage the extensive semantics of the PROV vocabulary while modelling our specific concept of speech acts for RDF data.

Propositions

What characterizes og:SpeechActs as special prov:Activitys, i.e., how exactly are they defined? To answer this, we return to the issue mentioned at the beginning: how can we express potentially complex sets of statements with different “change semantics” or, as we can now more precisely say, with different pragmatics using triple compounds? The recently introduced og:SpeechActs now take on the role of the carrier resource with which the triple compounds are associated. An og:SpeechAct is thus a prov:Activity that can be associated with triple compounds with different pragmatics via four properties, the so-called action properties (which are all defined as sub-properties of prov:used).

og:add: A triple compound, i.e., a set of statements that is simply asserted without any further intentions. When persisted to a dataset in a triple store, these statements should be added.
og:update: A triple compound, i.e., a set of statements that is asserted with the intention to overwrite all previous statements with the same subject and predicate. When persisted to a dataset in a triple store, these statements should be added and the existing statements with the same subject and predicate should be overwritten.
og:replace: A triple compound, i.e., a set of statements that is asserted with the intention to overwrite all previous statements with the same subject. When persisted to a dataset in a triple store, these statements should be added and the existing statements with the same subject should be overwritten.
og:remove: A triple compound, i.e., a set of statements that should be retracted (“negated”). When persisted to a dataset in a triple store, it should remove these statements.
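The effect of the four action properties on a stored dataset can be sketched as a single function over a set of triples. This is a simplified Python model with a hypothetical function name, not Ontogen’s implementation. The order of application, with removals applied last, is an assumption, since the text does not define how conflicting change sets within one speech act are resolved.

```python
def apply_changes(dataset, add=(), update=(), replace=(), remove=()):
    """Return a new set of (subject, predicate, object) triples.

    add:     statements are simply added.
    update:  statements are added; existing statements with the same
             subject and predicate are overwritten.
    replace: statements are added; existing statements with the same
             subject are overwritten.
    remove:  statements are deleted (applied last, by assumption).
    """
    update, replace = set(update), set(replace)
    overwritten_sp = {(s, p) for s, p, _ in update}
    overwritten_s = {s for s, _, _ in replace}
    kept = {
        (s, p, o)
        for s, p, o in dataset
        if (s, p) not in overwritten_sp and s not in overwritten_s
    }
    return (kept | set(add) | update | replace) - set(remove)
```

In the name change scenario, passing the new family name and marital status as update and the new email address as add overwrites the old name while leaving any previous email address in place.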

Let’s first look at how our initial example of a name change would look as an og:SpeechAct with normal triple compounds, if we initially use simple blank nodes as identifiers for the og:SpeechAct and their triple compounds:

@prefix :  .
@prefix og:  .
@prefix rtc:  .
@prefix xsd:  .

_:SarahMillerMarriageUpdate
    a og:SpeechAct ;
    og:update [
        a og:Proposition ;
        rtc:elements 
            << :employee39 :familyName "Johnson" >>,
            << :employee39 :maritalStatus :Married >>
    ] ;
    og:add [
        a og:Proposition ;
        rtc:elements 
            << :employee39 :emailAddress "sarah.johnson@example.com" >> ,
            << :employee39 :marriageDate "2023-07-25"^^xsd:date >>
    ] .

Regarding these action properties, it should be noted that it is the properties that define the pragmatics of the triple compound. The rdfs:range of these properties is always a triple compound set of the same type.

However, the action properties do not associate the og:SpeechAct with triple compounds in general, but with og:Propositions, a subclass of rtc:Compound, i.e., a special kind of triple compound that should exhibit some particular properties in this versioning context. We achieve this by using URI-encoded SHA256 hashes of the statement set for the URIs of the og:Propositions.

This approach allows us to achieve some important properties for our use case. On the one hand, we have a method to automatically generate the URIs of the og:Propositions. On the other hand, the authenticity of the statement set of the og:Proposition is verifiable, as we can detect whether the statement set is unchanged. Each og:Proposition can be verified by recalculating the hash of its canonicalized triple sets. This ensures that the integrity of the data is guaranteed throughout the entire version history. Changes or manipulations to the data would inevitably lead to a change in the hash and thus an inconsistent URI, which would be easily detectable.

Furthermore, the og:Propositions exhibit an interesting identity property: propositions with the same set of statements receive the same URI. This means that if the og:Proposition statement set appears in different og:SpeechActs, for example, because it was og:added once and og:removed once, the same statement set should not be duplicated a third, fourth, or fifth time, etc., in the rtc:elements of different og:Proposition compounds, but should reference the same og:Proposition.

Thus, og:Propositions represent immutable, abstract sets of statements that can be used in different, independent contexts and always have the same URI. However, this is only true to a limited extent without further measures, as there is still one problem: if two og:Proposition statement sets contain blank nodes and differ only in the local names used for the blank nodes, but are otherwise isomorphic, they are actually the same abstract statement set, which should therefore lead to the same URI. This can be achieved by bringing the RDF dataset into a canonical form before hashing, using the W3C-standardized RDF Dataset Canonicalization Algorithm, in which isomorphic graphs with different blank nodes receive the same blank node identifiers.
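The content-addressing idea can be illustrated with a minimal sketch. It is not Ontogen’s implementation: it assumes that the statements are already serialized as N-Triples lines and contain no blank nodes, so that sorting the lines can stand in for the RDF Dataset Canonicalization Algorithm. The module name, the example.com IRIs, and the urn:hash::sha256: prefix are hypothetical choices made for illustration.

```elixir
defmodule PropositionIdSketch do
  # Simplified stand-in for canonicalization: trim, deduplicate and sort the
  # N-Triples lines. This is only sufficient when no blank nodes are involved.
  def canonical_form(ntriples_lines) do
    ntriples_lines
    |> Enum.map(&String.trim/1)
    |> Enum.uniq()
    |> Enum.sort()
    |> Enum.map_join(&(&1 <> "\n"))
  end

  # Derives an identifier from the SHA256 hash of the canonical form.
  # The URI prefix is an assumption of this sketch.
  def id(ntriples_lines) do
    digest =
      :crypto.hash(:sha256, canonical_form(ntriples_lines))
      |> Base.encode16(case: :lower)

    "urn:hash::sha256:" <> digest
  end

  # A proposition is intact if its URI can be recomputed from its statements.
  def valid?(uri, ntriples_lines), do: uri == id(ntriples_lines)
end

statements = [
  ~s(<http://example.com/employee39> <http://example.com/familyName> "Johnson" .),
  ~s(<http://example.com/employee39> <http://example.com/maritalStatus> <http://example.com/Married> .)
]

uri = PropositionIdSketch.id(statements)

# The same set in a different order yields the same identifier ...
same? = uri == PropositionIdSketch.id(Enum.reverse(statements))

# ... while a manipulated statement set no longer matches the URI.
tampered = [~s(<http://example.com/employee39> <http://example.com/familyName> "Smith" .)]
intact? = PropositionIdSketch.valid?(uri, tampered)
```

Here `same?` is true and `intact?` is false, which corresponds to the identity and verifiability properties described above.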

So, og:Propositions are an abstraction over concrete statement sets: the same statement set in different triple stores, with potentially different blank nodes, is always identified by the same URI. We identify the resulting equivalence class with a unique hash, so to speak. These og:Proposition compounds are thus, like the propositions of logic, abstract entities that are not bound to any utterance. Only through the utterance within an og:SpeechAct do they become time- and context-bound. As abstract entities, unlike RTC compounds in general, the og:Proposition compounds usually do not contain metadata, as the interesting metadata is utterance-related and therefore belongs to the og:SpeechAct.

However, we are still missing the URI of the og:SpeechAct itself. In Ontogen, this is the URI-encoded SHA256 hash of:

all og:Propositions linked through its action properties (i.e., their SHA256 URIs),
the prov:endedAtTime timestamp,
and the og:speaker (a subproperty of prov:wasAssociatedWith) of the og:SpeechAct, or the og:dataSource (a subproperty of prov:used) if the speaker is unknown.
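As a rough sketch of how these three components could be combined: the following assumes a simple serialization (one component per line, proposition URIs sorted) that is not taken from Ontogen, and the module name, the placeholder proposition URIs, and the urn:hash::sha256: prefix are hypothetical.

```elixir
defmodule SpeechActIdSketch do
  # Hypothetical input layout: sorted proposition URIs, then the end
  # timestamp, then the speaker (or the data source if there is no speaker).
  def id(proposition_uris, ended_at, speaker_or_source) do
    payload =
      Enum.sort(proposition_uris) ++ [DateTime.to_iso8601(ended_at), speaker_or_source]

    digest = :crypto.hash(:sha256, Enum.join(payload, "\n")) |> Base.encode16(case: :lower)
    "urn:hash::sha256:" <> digest
  end
end

{:ok, ended_at, 0} = DateTime.from_iso8601("2023-07-26T09:31:15Z")

speech_act_id =
  SpeechActIdSketch.id(
    # placeholders for the SHA256 URIs of the og:add and og:update propositions
    ["urn:example:proposition-add", "urn:example:proposition-update"],
    ended_at,
    "http://example.com/JaneSmith"
  )
```

Because the timestamp and the speaker enter the hash, the same propositions uttered at another time or by another speaker produce a different og:SpeechAct identifier, while the propositions themselves keep their URIs.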

Now we can provide the actual Ontogen form of our above example og:SpeechAct of a name change by adding the previously missing SHA256 URIs of the og:Propositions and the og:SpeechAct:

@prefix :  .
@prefix og:  .
@prefix rtc:  .
@prefix xsd:  .
@prefix prov:  .
@prefix dc:  .


    a og:SpeechAct ;
    dc:description "Update of Sarah Miller's personal information following her marriage" ;
    og:add  ;
    og:update  ;
    og:speaker :JaneSmith ;
    prov:startedAtTime "2023-07-26T09:30:00Z"^^xsd:dateTime ;
    prov:endedAtTime "2023-07-26T09:31:15Z"^^xsd:dateTime ;
    prov:wasInformedBy :MarriageCertificateSubmission ;
    prov:used :MarriageCertificate20230725  .


    rtc:elements 
        << :employee39 :emailAddress "sarah.johnson@example.com" >>, 
        << :employee39 :marriageDate "2023-07-25"^^xsd:date >> .


    rtc:elements 
        << :employee39 :familyName "Johnson" >>, 
        << :employee39 :maritalStatus :Married >> .

:JaneSmith a prov:Person ;
    foaf:name "Jane Smith" ;
    foaf:mbox  .

:MarriageCertificate20230725 a prov:Entity ;
    dc:title "Marriage Certificate for Sarah Miller and Michael Johnson" ;
    prov:generatedAtTime "2023-07-25"^^xsd:date .

:MarriageCertificateSubmission a prov:Activity ;
    prov:wasAssociatedWith :employee39 ;
    prov:generated :MarriageCertificate20230725 ;
    prov:endedAtTime "2023-07-26T09:00:00Z"^^xsd:dateTime .

Commits

In Ontogen, commits represent the actual changes made to a repository, resulting from applying an og:SpeechAct to this repository with its existing data. Like an og:SpeechAct, og:Commits are a prov:Activity. However, they represent the act of adding or modifying data in a dataset within a triple store in a specific state, rather than the act of uttering these statements, which may have occurred at a much earlier time and by a different speaker.

Like an og:SpeechAct, an og:Commit is a structure composed of og:Propositions linked through various action properties. However, since they have a slightly different pragmatics here and a different rdfs:domain, different properties and an additional one are used for this purpose. The semantics of this set of properties is characterized by encoding repository-relative changes, i.e., expressing the minimal changes relative to the current state of the dataset. This is crucial to ensure that og:Commits are revertible, as otherwise ambiguities in the history would arise:

Can we simply remove every statement added by a commit from the triple store during a revert? If we cannot assume minimal changesets, it cannot be automatically decided whether this deletion can be performed. If the statement didn’t already exist, it must be removed to reproduce the old state. If a statement already existed before, it must not be removed.
The same applies to removals of statements that do not actually exist in the current dataset and therefore should not be restored during a revert.
Additionally, the specific statements implicitly deleted by og:updates and og:replace must be explicitly recorded in an additional proposition so that they can be restored during a revert.

To ensure the reversibility of commits, the internally so-called “effective changeset” must be determined. From this, corresponding propositions are then generated and linked to the og:Commit via the following action properties, along with the og:SpeechAct that this commit reproduces on the repository:

og:committedAdd, og:committedRemove, og:committedUpdate, og:committedReplace: These action properties represent the minimal sets of statements as propositions necessary to change the state of the repository to match the corresponding og:Propositions of the og:SpeechAct. They contain only the actually required changes.
og:committedOverwrite: This property contains a set of statements as a proposition that represents the statements to be implicitly deleted due to updates or replacements.
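The determination of the effective changeset can be sketched with plain sets. This is a simplified model and not Ontogen’s actual code: statements are bare tuples, og:replace is left out, and og:update is assumed to implicitly delete existing statements with the same subject and predicate. The module and function names are hypothetical, as is the prior familyName value "Miller".

```elixir
defmodule EffectiveChangesetSketch do
  # A dataset state and every proposition is a MapSet of
  # {subject, predicate, object} tuples.
  def compute(state, changes) do
    add = Map.get(changes, :add, MapSet.new())
    update = Map.get(changes, :update, MapSet.new())
    remove = Map.get(changes, :remove, MapSet.new())

    # Assumed og:update semantics: existing statements that share subject and
    # predicate with an updated statement are implicitly deleted.
    updated_keys = MapSet.new(update, fn {s, p, _o} -> {s, p} end)

    overwrite =
      state
      |> Enum.filter(fn {s, p, _o} = statement ->
        MapSet.member?(updated_keys, {s, p}) and not MapSet.member?(update, statement)
      end)
      |> MapSet.new()

    %{
      add: MapSet.difference(add, state),
      update: MapSet.difference(update, state),
      remove: MapSet.intersection(remove, state),
      overwrite: overwrite
    }
  end

  def apply_changes(state, effective) do
    state
    |> MapSet.difference(effective.remove)
    |> MapSet.difference(effective.overwrite)
    |> MapSet.union(effective.add)
    |> MapSet.union(effective.update)
  end

  # Because the changeset is minimal, a revert is simply its inverse.
  def revert(state, effective) do
    state
    |> MapSet.difference(effective.add)
    |> MapSet.difference(effective.update)
    |> MapSet.union(effective.remove)
    |> MapSet.union(effective.overwrite)
  end
end

state =
  MapSet.new([
    {:employee39, :familyName, "Miller"},
    {:employee39, :maritalStatus, :Married}
  ])

changes = %{
  add:
    MapSet.new([
      {:employee39, :emailAddress, "sarah.johnson@example.com"},
      {:employee39, :marriageDate, "2023-07-25"}
    ]),
  update:
    MapSet.new([
      {:employee39, :familyName, "Johnson"},
      {:employee39, :maritalStatus, :Married}
    ])
}

effective = EffectiveChangesetSketch.compute(state, changes)
new_state = EffectiveChangesetSketch.apply_changes(state, effective)
restored = EffectiveChangesetSketch.revert(new_state, effective)
```

In this run, the effective update contains only the familyName statement, since :maritalStatus :Married already exists, the overwrite set records the old "Miller" statement, and `restored` equals the original `state`.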

If none of the changes conflict with existing triples (or in the case of deletions, with non-existing triples) in the triple store, the og:Propositions of the og:Commit are the same as the og:Propositions in the og:SpeechAct and do not require additional space for dedicated modified og:Propositions. Theoretically, we could rely on two simple sets for additions and deletions to determine the statement sets of the effective changes. However, to increase the chance of reusability of og:Propositions, we continue to use the same actions for commits, so that only in case of overlap with existing data, a separate, dedicated og:Proposition for the commit must exist. (To further optimize this case, support for og:SpeechAct-relative commits is planned for the future, where only og:Propositions of the overlap are necessary, which is much more efficient in many cases.)

Beyond this composition of og:Propositions, commits form, as in other version control systems, a singly linked list: each commit points to its predecessor. The predecessor commit is specified in Ontogen via the og:parentCommit property.

The automatically generatable identifiers of an og:Commit are, like those of og:SpeechActs and og:Propositions, again URI-encoded SHA256 hashes, in this case of:

the hash URI of the parent commit
the hash URIs of the propositions of the commit
the URI of the committer
the timestamp of the commit
and the commit message
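Since the parent’s hash URI is part of the hashed input, the commit identifiers form a hash chain. The following sketch illustrates this under assumptions: the serialization (one component per line, sorted proposition URIs, an empty string for a missing parent), the module name, the placeholder URIs, and the second timestamp and first message are hypothetical and not taken from Ontogen.

```elixir
defmodule CommitIdSketch do
  # Hypothetical serialization of the five components listed above.
  def id(parent_id, proposition_uris, committer, timestamp, message) do
    payload = [parent_id || ""] ++ Enum.sort(proposition_uris) ++ [committer, timestamp, message]
    digest = :crypto.hash(:sha256, Enum.join(payload, "\n")) |> Base.encode16(case: :lower)
    "urn:hash::sha256:" <> digest
  end
end

# A root commit without a parent ...
root =
  CommitIdSketch.id(
    nil,
    ["urn:example:proposition-a"],
    "http://example.com/JohnDoe",
    "2023-07-27T09:31:15Z",
    "Initial import"
  )

# ... and a second commit whose identifier depends on the first one.
second =
  CommitIdSketch.id(
    root,
    ["urn:example:proposition-b"],
    "http://example.com/JohnDoe",
    "2023-07-28T10:00:00Z",
    "Update of Sarah Miller's personal information following her marriage"
  )
```

Any change to an earlier commit changes its identifier and therefore the identifiers of all of its successors, so a manipulated history cannot keep the same head identifier.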

Finally, let’s look at an og:Commit for our above example og:SpeechAct of the name change. Let’s assume that Sarah was previously married once, but the divorce was not recorded in our data, so the :maritalStatus is already set to :Married and the corresponding change effectively does not need to be made.

@prefix :  .
@prefix og:  .
@prefix rtc:  .
@prefix xsd:  .
@prefix prov:  .
@prefix dc:  .


    a og:Commit ;
    og:committedAdd  ;
    og:committedUpdate  ;
    og:committer :JohnDoe ;
    prov:endedAtTime "2023-07-27T09:31:15Z"^^xsd:dateTime ;
    og:commitMessage "Update of Sarah Miller's personal information following her marriage" .


    rtc:elements 
        << :employee39 :familyName "Johnson" >> .



    a og:SpeechAct ;
    dc:description "Update of Sarah Miller's personal information following her marriage" ;
    og:add  ;
    og:update  ;
    og:speaker :JaneSmith ;
    prov:startedAtTime "2023-07-26T09:30:00Z"^^xsd:dateTime ;
    prov:endedAtTime "2023-07-26T09:31:15Z"^^xsd:dateTime ;
    prov:wasInformedBy :MarriageCertificateSubmission ;
    prov:used :MarriageCertificate20230725  .


    rtc:elements 
        << :employee39 :emailAddress "sarah.johnson@example.com" >>, 
        << :employee39 :marriageDate "2023-07-25"^^xsd:date >> .

  
    rtc:elements 
        << :employee39 :familyName "Johnson" >>, 
        << :employee39 :maritalStatus :Married >> .

:JaneSmith a prov:Person ;
    foaf:name "Jane Smith" ;
    foaf:mbox  .

:MarriageCertificate20230725 a prov:Entity ;
    dc:title "Marriage Certificate for Sarah Miller and Michael Johnson" ;
    prov:generatedAtTime "2023-07-25"^^xsd:date .

:MarriageCertificateSubmission a prov:Activity ;
    prov:wasAssociatedWith :employee39 ;
    prov:generated :MarriageCertificate20230725 ;
    prov:endedAtTime "2023-07-26T09:00:00Z"^^xsd:dateTime .

In conclusion, an interesting possibility should be mentioned that is not yet used in these early versions of Ontogen but is planned for future versions. The speech act-based commit model outlined here allows for useful validity checks of changes that are not possible in conventional version control systems. For example, a deletion uttered at an earlier point in time can be detected as obsolete if a later-uttered insert of the same statement has already been committed. Analogously, an insert uttered at an earlier point in time is obsolete if a later-uttered deletion of the same statement has already been committed, or was not effectively committed because the statement did not exist yet. Imagine, for example, that we had imported the very latest version of a dataset X and later import a more comprehensive dataset Y that includes X, but in an older version. From the more recent deletion of a statement in the already imported dataset, we can recognize that its addition in the older version now being imported is obsolete. Similarly, conflicts can be detected that arise, for example, when updating a dataset on whose old version we have performed some data cleansing.

Summary and Outlook

In this article, we have delved into the core of Ontogen’s versioning model, exploring how it leverages RDF Triple Compounds (RTC) to create a robust and flexible system for managing changes in RDF datasets. We introduced the concept of og:SpeechActs, a novel approach that applies the philosophical notion of speech acts to the domain of RDF data versioning. These og:SpeechActs, implemented as subclasses of prov:Activity, allow us to capture not just the content of changes, but also the context and intention behind them.

We then examined og:Propositions, which serve as immutable, abstract sets of statements within our model. By using SHA256 hashes for their URIs, we ensure the integrity and verifiability of our data throughout the versioning process. This approach also allows for efficient storage and referencing of repeated statement sets across different og:SpeechActs.

Finally, we discussed how Ontogen implements commits through og:Commits, which represent the actual application of og:SpeechActs to the repository. We explored the challenges of ensuring reversibility in commits and how Ontogen addresses these through careful management of change sets.

In the next article, we will expand our focus from the internal versioning mechanisms to the broader architecture of Ontogen. We’ll explore how Ontogen organizes and manages repositories as DCAT catalogs and implements Ontogen instances as DCAT services.

Introducing Ontogen

2024-08-08T08:00:00+00:00

After a year of intensive development, I am pleased to introduce Ontogen - a version control system for RDF datasets. My sincere thanks go to the NLnet Foundation for their support through the NGI Assure fund, which enabled me to dedicate myself full-time to this extensive project.

It’s important to emphasize that Ontogen is still in an early stage of development. Although the system is equipped with a comprehensive test suite and is regularly tested with two different triple stores (Fuseki and Oxigraph), it lacks extensive real-world testing. Therefore, I cannot yet recommend using it in production for critical data. In particular, some important extensions are still pending:

Full RDF dataset support (currently only versioning of individual graphs is supported)
Branching support

These and other extensions will require fundamental changes that will likely invalidate existing version histories.

In the coming weeks, I plan to publish the project’s future roadmap. An application for another round of funding from the NLnet Foundation for at least one year is currently in progress.

Ontogen offers some novel approaches to versioning RDF data (at least to my knowledge). To adequately explain these complex concepts, I have published a series of four blog posts that introduce the different parts of the system and the ideas behind them.

In this first post, I’d like to introduce the technical foundations of Ontogen’s approach to versioning RDF datasets.

Source Control Management vs. Data Control Management

First, however, I’d like to take a step back and discuss the versioning problem with a particular focus on datasets. This distinct perspective is, to my knowledge, rarely taken, and in practice, software versioning solutions, i.e., SCMs like Git, are too often used for versioning datasets.

Datasets, however, are a different type of versioning subject compared to software. While both datasets and software may ultimately always be text, no one would dispute the claim that data is not the same as software. But why then is there no popular versioning solution for datasets of structured data, especially in the age of “data as the new oil,” where data management and analysis play central roles in almost all industries?

Some examples illustrate the inadequacies of an SCM system as a Data Control Management (DCM) system:

Roles: In an SCM, the committer is the crucial role. While SCMs recognize the difference between author and committer, in practice, this is usually of little importance. For a dataset, however, authorship, i.e., the exact source of datasets, is of greater importance, and many other roles are relevant and should be differentiable, such as data processors (people or systems that transform, clean, or enrich raw data), data curators (experts who organize, categorize, and enrich data with metadata), data protection officers, etc.
Lack of metadata: Datasets often require extensive metadata (e.g., origin, license, timestamps) that are not natively supported in SCMs.
Granularity of changes: SCMs often work at the file level, while for datasets, individual records or fields may be relevant.
Database integration: DCMs should ideally be able to interact directly with database systems, which is not provided for in SCMs.

Although increasing attention has been focused on the dataset versioning problem in recent years, it must be noted that no mainstream solution has yet emerged. Instead, SCMs are still too often resorted to for dataset versioning (if they are versioned at all).

The situation is particularly precarious in the Knowledge Graph community. Here, there is a lack of mature, specialized solutions for versioning.

Problems with Previous Versioning Systems for RDF

Attempts to develop solutions for versioning RDF data have existed for a long time. Unfortunately, all solutions developed so far are either academic proof-of-concepts or approaches that have not found broader acceptance in the community for various reasons. To better understand why, I want to briefly survey the most significant approaches and the common limitations they share.

Named-graph-based approaches

The majority of previous RDF versioning systems relied on named graphs as their primary mechanism for organizing versioned data. R43ples (Graube et al., 2014) acts as a SPARQL proxy that stores addition and deletion sets as separate named graphs for each revision. Quit Store (Arndt et al., 2018) maps each named graph to a file in a Git repository, delegating the actual versioning mechanics (history, branching, merging) to Git. R&Wbase (Vander Sande et al., 2013) stores deltas as quads, again using named graphs to organize different changesets. Stardog offered a VCS feature using named graphs for its internal history database, but removed it entirely in version 7.0.

The main problem all of these approaches share is that parts of a graph simply cannot be addressed directly within the RDF model. This forces the creation of a separate named graph for every small group of triples that needs to be versioned or annotated with change metadata, which can quickly lead to a flood of graphs and thus an unwieldy RDF dataset. This becomes particularly problematic when named graphs are also used for content purposes, which then become difficult to distinguish among this flood. Even Quit Store, which elegantly avoids named graphs as the versioning mechanism by delegating to Git, still depends on them as the unit of granularity - one cannot version parts of a graph independently without splitting them into separate named graphs first.

Alternative approaches

Other approaches tried to avoid the named graph problem through different mechanisms, but each came with its own significant trade-offs.

Early systems like SemVersion (Völkel et al., 2005) and the ChangeSet vocabulary (Tunnicliffe & Davis, 2005) used standard RDF reification to make individual triples addressable for change tracking. This solves the addressability problem but at severe cost: each described triple requires at least four additional triples, causing significant storage overhead and complex SPARQL queries with multiple joins per matched statement.
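To make the overhead concrete, here is a small sketch of what standard reification produces for a single triple. The tuple representation, the module name, and the example terms are assumptions made for illustration; only the four rdf: properties come from the RDF vocabulary itself.

```elixir
defmodule ReificationSketch do
  @rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

  # Returns the four triples needed to describe one {subject, predicate, object}
  # triple through the statement node `id`, before any change metadata is added.
  def reify({s, p, o}, id) do
    [
      {id, @rdf <> "type", @rdf <> "Statement"},
      {id, @rdf <> "subject", s},
      {id, @rdf <> "predicate", p},
      {id, @rdf <> "object", o}
    ]
  end
end

triples = [
  {"ex:employee38", "ex:firstName", "John"},
  {"ex:employee38", "ex:familyName", "Smith"},
  {"ex:employee38", "ex:jobTitle", "Assistant Designer"}
]

reified =
  triples
  |> Enum.with_index()
  |> Enum.flat_map(fn {triple, i} -> ReificationSketch.reify(triple, "_:stmt#{i}") end)

overhead = length(reified)
```

For these three triples, `overhead` is 12, and a query that wants to match one described statement has to join over its rdf:subject, rdf:predicate, and rdf:object triples.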

Delta and patch-based approaches like RDF Patch (part of Apache Jena) provide compact change formats, but focus on change logging and replication rather than version querying - there is no built-in way to query the state at version N or compute diffs between arbitrary versions.

Archiving-focused systems like OSTRICH (Taelman et al., 2018) use sophisticated hybrid storage strategies for efficient versioned queries over large RDF archives. However, their linear storage model is optimized for archival query scenarios rather than the collaborative workflows (branching, merging, provenance tracking) that a Data Control Management system requires.

At the HTTP/resource level, the Memento protocol (RFC 7089) and TailR (Meinhardt et al., 2015) provide temporal access to Linked Data resources, but operate at the document level with no triple-level awareness.

RDF-star as a new foundation

With RDF-star, which is currently being standardized as RDF 1.2, a new tool is now available that addresses the fundamental addressability problem without the overhead of reification or the constraints of named graphs. RDF-star is an extension of the RDF data model that allows direct annotation of RDF triples by allowing triples to be used as subjects or objects in other triples. This simplifies the representation of metadata about statements and enables a more natural modeling of complex relationships without having to resort to reification or named graphs.

The possibility of making RDF-star meta-statements with other statements as the subject opens up significant new possibilities. In particular, it is now trivial to define virtual, URI-identifiable sets of statements, i.e., partial graphs within a graph, by assigning the statements to a common resource using a property. This makes RDF-star an ideal foundation for versioning RDF graphs.

RTC as a Foundation for RDF Data Versioning

To provide a well-known property with this assignment semantics for partial graphs, and thus a foundation for tools that exploit this semantics, I published the RDF Triple Compounds (RTC) vocabulary last year. It consists of just one RDFS class, rtc:Compound, and a few properties, in particular rtc:elementOf with the aforementioned semantics and its inverse rtc:elements.

rtc:Compound
    a rdfs:Class, owl:Class ;
    rdfs:label "Compound" ;
    rdfs:comment "A compound is a set of triples as an RDF resource." .

rtc:elementOf
    a rdf:Property, owl:ObjectProperty ;
    rdfs:label "element of" ;
    rdfs:comment "Assigns a triple to a compound as an element. The subject must be a RDF triple." ;
    rdfs:range rtc:Compound ;
    owl:inverseOf rtc:elements .

rtc:elements
    a rdf:Property ;
    rdfs:label "elements" ;
    rdfs:comment "The set of all triples of a compound. The objects must be RDF triples." ;
    rdfs:domain rtc:Compound ;
    owl:inverseOf rtc:elementOf .

A triple compound of this vocabulary is thus a set of triples assigned to a common resource. The triples are assigned to a compound with an RDF-star statement using the rtc:elementOf property.

PREFIX : 
PREFIX rtc: 
 
:employee38 
    :firstName "John" {| rtc:elementOf :compound1 |} ;
    :familyName "Smith" {| rtc:elementOf :compound1 |} ;
    :jobTitle "Assistant Designer" {| rtc:elementOf :compound1 |} .

:compound1 a rtc:Compound ;
    :statedBy :bob ; 
    :statedAt "2022-02-16" .

Alternatively, the rtc:elements property can be used as the inverse of rtc:elementOf.

PREFIX : 
PREFIX rtc: 
 
:employee38 
    :firstName "John" ;
    :familyName "Smith" ;
    :jobTitle "Assistant Designer" .

:compound1 a rtc:Compound ;
    :statedBy :bob ; 
    :statedAt "2022-02-16" ;
    rtc:elements        
       << :employee38 :firstName "John" >> ,
       << :employee38 :familyName "Smith" >> ,
       << :employee38 :jobTitle "Assistant Designer" >> .

Together with the vocabulary, I also released RTC.ex, an Elixir implementation that provides a structure for working with these virtual graphs. Its API is largely compatible with that of real graphs, as implemented by the RDF.Graph structure of RDF.ex.

# create a new compound with a couple of triples
# (EX is assumed to be an RDF.ex vocabulary namespace defined elsewhere)
virtual_graph =
  [
    {EX.Employee38, EX.firstName(), "John"},
    {EX.Employee38, EX.familyName(), "Smith"},
    {EX.Employee38, EX.jobTitle(), "Assistant Designer"}
  ]
  |> RTC.Compound.new(EX.Compound, prefixes: [ex: EX])

# add some triples to the compound
virtual_graph =  
  RTC.Compound.add(virtual_graph,   
    EX.Employee39  
    |> EX.firstName("Jane")  
    |> EX.familyName("Doe")  
    |> EX.jobTitle("HR Manager")  
  )  

# add some statements about the compound itself
virtual_graph =  
  RTC.Compound.add_annotations(virtual_graph,  
    %{EX.dataSource() => EX.DataSource}  
  )

With the RTC.Compound.graph(virtual_graph) function, the pure set of statements can be produced as an RDF.Graph at any time, which in this case generates this graph:

@prefix ex:  .

ex:Employee38
    ex:familyName "Smith" ;
    ex:firstName "John" ;
    ex:jobTitle "Assistant Designer" .

ex:Employee39
    ex:familyName "Doe" ;
    ex:firstName "Jane" ;
    ex:jobTitle "HR Manager" .

Whereas RTC.Compound.to_rdf(virtual_graph) provides the complete RDF-star graph with the RTC annotations for the compounds.

@prefix ex:  .
@prefix rtc:  .

ex:Compound
    ex:dataSource ex:DataSource .

ex:Employee38
    ex:familyName "Smith" {| rtc:elementOf ex:Compound |} ;
    ex:firstName "John" {| rtc:elementOf ex:Compound |} ;
    ex:jobTitle "Assistant Designer" {| rtc:elementOf ex:Compound |} .

ex:Employee39
    ex:familyName "Doe" {| rtc:elementOf ex:Compound |} ;
    ex:firstName "Jane" {| rtc:elementOf ex:Compound |} ;
    ex:jobTitle "HR Manager" {| rtc:elementOf ex:Compound |} .

The RTC property used in the output (rtc:elementOf or rtc:elements) can be configured either globally via the application configuration or per call as an option of the RTC.Compound.to_rdf/2 function.

Summary and Outlook

In this article, we have laid the foundations for Ontogen’s approach to versioning RDF datasets. We have discussed the inadequacies of existing solutions and introduced RDF Triple Compounds (RTC) as the technical basis of Ontogen. RTC utilizes the capabilities of RDF-star to enable flexible and efficient groupings of RDF triples without having to accept the disadvantages of named graphs.

In the next article, we will take a closer look at the application of RTC in Ontogen. We will show how RTC compounds serve as building blocks for a versioning system that takes into account the various roles and aspects of data management. We will explain how Ontogen uses RTC to enable fine-grained version control while maintaining the clarity of the dataset.

Updated February 2026: Added a survey of specific previous RDF versioning approaches in the “Problems with Previous Versioning Systems for RDF” section.