What’s in a Pod? A knowledge graph interpretation for the Solid ecosystem

Ruben Dedecker, Wout Slabbinck, Jesse Wright, Patrick Hochstenbach, Pieter Colpaert, Ruben Verborgh: "What’s in a Pod? A knowledge graph interpretation for the Solid ecosystem", Proceedings of the QuWeDa 2022 : 6th Workshop on Storing, Querying and Benchmarking Knowledge Graphs co-located with 21st International Semantic Web Conference (ISWC 2022) (2022).

Biblio entry: 01GK3W89A9Y0B4B3Y73QGM5P7R.

Abstract

The Solid vision aims to make data independent of applications through technical specifications, which detail how to publish and consume permissioned data across multiple autonomous locations called “pods”. The current document-centric interpretation of Solid, wherein a pod is a single hierarchy of Linked Data documents, cannot fully realize this independence. Applications are left to define their own APIs within the Solid Protocol, leading to fundamental interoperability problems and the need for associated workarounds. The long-term vision for Solid is confounded with the concrete HTTP interface to pods today, leading to a narrower solution space to address core issues. We examine the mismatch between the vision and its prevalent document-centric interpretation, and propose a reconciliatory graph-centric interpretation wherein a pod is a hybrid, contextualized knowledge graph. In
this article, we contrast the existing and proposed interpretations in terms of how they support the Solid vision. We argue that the graph-centric interpretation can improve pod access through different Web APIs that act as views into the knowledge graph. We show how the latter interpretation provides improved opportunities for storage, publication, and querying of decentralized data in more flexible and sustainable ways. These insights are crucial to reduce the dependency of Solid apps on implicit API semantics and local assumptions about the shape and organization of data and the resulting performance. The suggested broader interpretation can guide Solid through its evolution into a heterogeneous yet interoperable ecosystem that better supports the diverging read/write data access patterns of different use cases.

1. The Solid Vision Of Data Interoperability And Control

Data privacy
and control have lost ground on today’s Web. User-generated data is stored in centralized data silos, in which people have neither the control nor the knowledge to manage how their data is being used [1]. As a response, the Solid project [2, 3] was created with the aim of revitalizing the Web. Where the current system of centralized data silos creates an ecosystem of limited integration, availability and innovation, Solid brings a course correction for the Web. Based on the separation of data and applications, the vision defines an ecosystem that facilitates the integration of data in different applications, while keeping people in direct control of their data.

To this end, Solid introduces the concept of a pod as an online data space for an individual to control and manage their data on the Web. Together, these pods form a decentralized Solid ecosystem, from which applications can
directly integrate data from people’s Solid pods, after receiving their permission. This contrasts with current Web applications, where this data first has to be collected in a centralized location, after which the platform-specific API has to be integrated, and where user control remains at the mercy of the platforms maintaining the data.

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In order to separate data from applications, the semantic contents of the data must be captured such that they can be accurately interpreted and reused in different contexts. Semantics allow applications to interpret data without requiring specific knowledge encoded in the API over which the data is retrieved. A key driver is the use of the Resource Description Framework (RDF) [4], which provides an infrastructure
for capturing this semantic information. This again contrasts with current Web APIs, where data is served in formats that require additional semantics to be described in the API’s documentation. By shifting the focus from the API to the data, the Solid ecosystem aims to transition the Web from an ecosystem of API integration towards an ecosystem of data integration [5].

Unfortunately, we observe a significant gap between theory and practice: current Solid apps do not succeed in API-independent reuse of data across use cases. Rather than only relying on the data and its semantics, apps resort to implicit knowledge about how this data is structured across documents in a pod’s Web API. Furthermore, different use cases impose conflicting requirements on that structure in order to satisfy their own constraints. As such, app developers struggle to make sustainable decisions on how to structure
data for reuse, since their individual choices impact the interoperability of the entire ecosystem.

In this article, we identify the root cause of this interoperability problem as the mismatch between Solid’s current single hierarchical API and the modeling requirements of real-world use cases. We describe the properties of this document-centric interpretation of a pod, and introduce a graph-centric interpretation that can bridge differences between use cases. We compare both interpretations, explain the consequences for concrete Solid implementations, and argue why we consider the graph-centric interpretation a more sustainable candidate to realize the Solid vision of data and application independence under control of the user.

2. Motivating Use Cases

In order to pinpoint the concrete differences between interpretations, we introduce two small use cases that we will carry
throughout the paper.

2.1. Contacts Use Case

The contacts use case is a rather trivial example, but we introduce it to show that even very simple use cases can expose issues in the interpretation of a pod. The implication is thus that, if a certain interpretation cannot adequately handle the contacts use case, then it is likely to break more complex cases as well. The case is as follows:

- The data consists of a set of contacts, each of which has associated attributes such as name, address, email, phone number, and date of birth.
- An address book app provides read and write access to each attribute of a contact, and allows creating new contacts.
- A birthday app shows daily reminders of contacts with upcoming birthdays, and allows editing birthdays and adding new ones.

2.2. Medical Use Case

The medical use case is conceptually simple, but it involves highly sensitive data. Its
purpose is to demonstrate that issues identified in the contacts use case easily generalize to more complex data and real-world problems. In this use case, the user is a patient storing the following data:

- A set of medical records reflecting blood test results, with records containing various vitamin levels as well as HIV status results.
- A set of heart rate and blood pressure measurements, captured by the user’s wearable device.

3. Preliminary Definitions

Before describing the interpretations of a Solid pod, we start with a couple of definitions that we will use as building blocks throughout the article.

- We consider a protocol to be a generic set of rules for data transmission between systems.
  - The HyperText Transfer Protocol (HTTP) [6] structures the exchange of data between a server and a client as resources identified by a URI.
  - The Linked Data Platform (LDP) [7] constrains HTTP with interaction rules for recursive containers of RDF and non-RDF documents.
  - The Solid Protocol [8] constrains HTTP with authentication and authorization, and with interaction rules for recursive containers of RDF and non-RDF documents (inspired by LDP).
- A Web API is a specific structuring of resources on top of HTTP (or a specialization thereof, such as the Solid Protocol).
- Authentication means identifying the agent issuing a request to a Web API.
  - The WebID [9] is an HTTP URL that identifies an agent. When dereferenced, it leads to a profile document describing various agent details.
  - Solid-OIDC [9] establishes some authoritative identification of an agent by a specific WebID.
- Authorization means determining to what extent a server can respond to a certain Web API request from a specific agent.
  - Web Access Control (WAC) [10] is an Access Control List (ACL) mechanism that allows assigning inheritable permissions to documents and containers through so-called ACL documents.
  - Access Control Policies (ACP) [11] is a policy-based mechanism that allows assigning inheritable permissions to documents and containers through so-called Access Control Resources (ACR).

Let us exemplify some of these definitions through our use cases:

- An HTTP interface at https://sasha.pod/ implements the Solid Protocol when its containers and documents follow the interaction rules, and when it correctly authenticates users using their WebID and applies authorization to each resource.
- A Web API within https://sasha.pod/ structures documents in containers.
  - Contacts are stored in https://sasha.pod/people/ as individual documents:
    - https://sasha.pod/people/sasha.ttl
    - https://sasha.pod/people/lucian.ttl
  - Medical records are stored in https://sasha.pod/private/acme-hospital/ by date, such as:
    - https://sasha.pod/private/acme-hospital/2022/10/15/test-results.ttl
- The WebID https://sasha.pod/people/sasha#me identifies a person named Sasha.
- The agent identified by https://sasha.pod/people/sasha#me is allowed to access all documents on https://sasha.pod/.

4. Document-centric Interpretation Of A Pod

This section discusses the currently prevalent interpretation of a pod, which is document-centric.

4.1. Definition

As described in Section 3, the Solid Protocol models interactions with data as recursive containers with RDF and non-RDF documents. When a server offers this protocol, clients of this server can iteratively define a Web API by creating containers and documents.

The document-centric interpretation assumes that the structure and contents of the Web API, which a pod exposes through the Solid Protocol, constitute that pod in its entirety. Within this
interpretation, the complete state of the pod is thus equivalent to the single Web API through which it is available; the source of truth is solely that specific Web API. That brings us to the following definition:

In the document-centric interpretation, each Solid pod is a single specific hierarchical structure of containers and documents exposed through the Solid Protocol, where data and access control rules are stored in specific RDF and non-RDF documents within that hierarchy.

4.2. Example

For example, a Solid pod would be fully defined by the following hierarchy and the contents of its documents:

- container https://sasha.pod/
  - RDF document .acl (for access control)
  - container people/
    - RDF document .acl (for access control)
    - RDF document amal.ttl
    - RDF document lucian.ttl
    - …
  - container private/
    - RDF document .acl (for access control)
    - container medical-records/
    - non-RDF document 2022-09-15.pdf
    - non-RDF document 2022-10-15.pdf
    - …
  - …

In the above example, the access control document https://sasha.pod/private/.acl could contain WAC rules such that only the agent https://sasha.pod/people/sasha#me is allowed to access the container https://sasha.pod/private/ and below.

4.3. Practical Usage

The above document-centric definition of a pod leaves several degrees of freedom as to how the pod is structured and how the resulting structure is interpreted. We now describe how today’s Solid apps handle those degrees of freedom in practice.

4.3.1. Structure Of The Main Web API

Importantly, the current Solid technical reports [12] do not impose any specific container structure onto a pod beyond the presence of a root container /. Therefore, there is no such thing as the Solid Web API; there is only the Solid Protocol, with which an API is created for each pod. Some past suggestions are nonetheless
present in certain server implementations as defaults (such as /profile/, /inbox/, and /settings/ containers). Since these are not standardized across the ecosystem, their presence is not server-enforced, and as such cannot be relied upon.

As a consequence, client-side applications have to invent their own (sub-)API within the pod’s URL space available through the Solid Protocol, by defining a certain container structure and data distribution across documents within this structure. We observe two kinds of behavior:

- Some apps use hardcoded paths to certain containers (e.g., /contacts/) or documents (e.g., /profile/card).
- Some apps use link traversal, which means they find the URLs of documents and containers by following links from the user’s WebID profile document and/or via another index [13].

We also observe hybrid behavior, for instance where an initial path is obtained via
traversal (e.g., /private/medical/), but deeper relative paths are hardcoded (e.g., /private/medical/2022/10/). In particular, link traversal is bootstrapped via hardcoded paths: if no link exists to a certain kind of data, then a specific document is created at a hardcoded path and then linked from a profile or index for future usage.

4.3.2. Aspects Of RDF Document Boundaries

From the way current apps organize data in RDF documents, we can observe the meaning they ascribe to such a document. Given that Solid typically uses RDF 1.0 documents (so only triples, and not quads as in RDF 1.1), the occurrence of an RDF triple in a document seems to carry various degrees of meaning with regard to the following aspects:

- (implicit) Context: the occurrence of certain triples within the same document often implies that they are somehow interrelated, and that these triples somehow relate to the document. This topical relation is sometimes visible within certain triples, whose subject (e.g., https://sasha.pod/people/sasha#me) defines a URL fragment (e.g., #me) on the document identifier (e.g., https://sasha.pod/people/sasha).
- (explicit) Policy: both the WAC [10] and the ACP [11] specifications assign authorizations on a document level of granularity. Either the document can be accessed by a given agent in its entirety or not at all, thus resulting in all triples within a document sharing the same authorization rules.
- (implicit) Provenance: the document somehow captures the notion that its triples originate from a specific source or event, of which the document was a result.
- (implicit) Trust: the document defines a single boundary of trust for all of its triples. For example, a user’s profile document is typically fully trusted by the user (because they are usually the only party with write access to it), whereas inbox documents created by third parties might contain triples that are not trusted.
- (implicit) Performance: the document groups together a number of triples because it improves the performance of certain use cases. For instance, triples that are often used together might be in the same document, and triples that are less needed might be in extension documents, in order to optimize the number of HTTP requests and the used bandwidth.

We remark that of these five aspects, only the policy on the document is modelled explicitly. The context is implicitly assumed because triples occurring in the same place were usually created by the same or related write operations, and because those triples are necessarily read together by apps. The provenance and trust are similarly derived from implicit assumptions about a shared origin, and the knowledge of
specific policies and thus agents that could have written to the document. Notably, the fact that an identifier (e.g., https://sasha.pod/people/) is contained within a certain pod root container (e.g., https://sasha.pod/) does encode some explicit provenance about the document and its triples (e.g., “they were found in Sasha’s pod”), but not necessarily about its creator or level of trust (e.g., multiple actors might have write access to the document). Finally, the performance is typically based on educated guesses, but seldom the result of actual performance measurements.

4.3.3. Alternative Web APIs To The Pod

Many applications encounter practical limitations when the data they require happens to be structured across multiple documents in the main API. In an attempt to address such cases, alternative Web APIs were proposed. One proposal [2] suggests exposing a server-side SPARQL
endpoint [14] over all RDF data in a pod, enabling fully server-side SPARQL query [15] processing. Another proposal [16] suggests exposing this data through a read-only Quad Pattern Fragments (QPF) [17] interface, to speed up the client-side processing of SPARQL queries over the entire pod. Whereas these alternative APIs can alleviate part of the context and performance aspects of the main API, they come with challenges to implement policy and to adequately model provenance and trust in their responses.

Crucially, such alternative APIs are always derived from the main API, which is equivalent to the pod in the document-centric interpretation. The derived APIs thereby unavoidably inherit some of the explicit and implicit modeling aspects from the document-based main API. Concretely, the direct derivation from the main API manifests itself in the choice of the data model for the
SPARQL and QPF interfaces. The RDF 1.1 quads they expose are constructed by loading the triples from each document, adding as a graph component the URL of that document. The following example quad reflects this:

- subject: https://sasha.pod/people/sasha#me
- predicate: https://example.org/ontology#birthDate
- object: "1984-04-03"
- graph: https://sasha.pod/private/medical/2022/10/15.ttl

Its components signify that there exists a triple with that specific subject, predicate, and object in the document with URL https://sasha.pod/private/medical/2022/10/15.ttl. In other words, the document-centric interpretation of a pod considers the birthdate statement’s occurrence in this specific document on the pod to be an integral part of the statement itself.

4.4. Consequences Of The Single Hierarchy

In this section, we will examine and critique the consequences of the document-centric interpretation
of a pod. Specifically, we study the limitations of the single hierarchy it causes (Subsection 4.1), and the effects of the implicit semantics in its structure (Subsubsection 4.3.1) and documents (Subsubsection 4.3.2). While some consequences could in theory be mitigated by alternative APIs (Subsubsection 4.3.3), the effectiveness thereof is hindered by the necessity of those alternatives to derive from the main API structure.

4.4.1. Single-app Modeling Mismatches

Each app needs to have a single consistent hierarchy to serialize its data, which does not reflect the complex nature of real-world organization. For instance, the address book app could organize people in categories such as /contacts/work/ and /contacts/sports/, which leads to duplication when a person’s colleague is also a member of their badminton team. A similar situation occurs when we need to decide whether to group
health measurements by date (/medical/2022/10/15.ttl) or by topical evolution over time (/medical/vitamine-d-levels.ttl).

Hierarchical organizations are thus either constrained by their necessity to commit to a single representation of the real world, or in need of mechanisms to cope with the effects of data duplication or virtualization. An alternative API circumvents some of these limitations when it comes to reading, although the provenance and trust of the resulting responses are even less explicitly defined than in the main API. Writing is still fully coupled to the destination documents in the main API, since the graph component of each written quad needs to contain a specific document URL.

4.4.2. Cross-app Modeling Mismatches

The document-centric view of the Solid Protocol does not inherently provide interoperability, because apps are still responsible for determining the specific API
that defines the document and container structure they will access. To make matters worse, different apps and use cases can have competing interests that lead them to prefer one API structure over another.

For example, interoperability requires the address book and birthday apps to use the same data, and hence to store it in the same place, such as the /people/ container. However, their preferences regarding the organization of that container vary. The address book app, which lets the user edit contacts one by one, has a context and performance incentive to place all contact attributes in a single RDF document per contact, leading to an organization such as:

- /people/work/dani.ttl
- /people/work/kiran.ttl
- /people/personal/kai.ttl
- /people/personal/luka.ttl
- …

In contrast, the birthday app aims to quickly determine which celebrations are coming up, probably preferring a structure
more like:

- /people/work/birthdays.ttl
- /people/personal/birthdays.ttl

Or possibly even:

- /people/birthdays/january.ttl
- /people/birthdays/february.ttl
- …

Similarly, whether heart rate and blood pressure measurements are organized by date or by evolution over time depends on the specifics of the use case at hand.

4.4.3. Policy Modeling Mismatches

Context- and performance-based grouping are trade-offs that can be overcome with compromises, such as accepting that certain use cases will be slower than others. Unfortunately, the imposed grouping of multiple different aspects in the same document can also lead to more sensitive and insurmountable conflicts for the policy, provenance, and trust aspects. Since the coupling of policies to document organization provides the only mechanism of control in the document-centric view, some use cases with conflicting requirements cannot effectively be
realized today.

For example, assume that the address book app indeed organizes contacts as one person per document:

- /people/dani.ttl
- /people/kiran.ttl
- …

In that case, the birthday app would be able to read (and would need to parse) people’s personal details such as addresses and phone numbers, whereas the expectation is that it should only access names and birthdays.

Insurmountable conflicts become even more apparent with the medical use case. The results of a given blood test might be stored in a single document, and thus have a single policy boundary associated with it. If that test result contains both vitamin levels and an HIV status, then the document-based access control prevents users from only giving access to their vitamin levels.

The fact that conflicting requirements between aspects would necessitate the complexity of copies flags a strong limitation of any single
hierarchical API.

One partial solution is to split these pieces of data into different documents, but this results in suboptimal boundaries for purposes of context, provenance, and trust. Users thus find themselves torn between giving apps too much access, or having to deal with overly granular control, in the extreme case causing situations that necessitate managing micro-documents with only one or a handful of triples. Whilst the degrees of freedom in the Solid Protocol allow for any such structures, the resulting API would be highly impractical for humans and machines alike. Another solution involves creating and maintaining a copy of the document with a subset of the data, which (in addition to the overhead of managing such copies) would also generate a different associated context, provenance, and trust, especially if writing to such derived documents is needed. Furthermore, all
these aspects would necessarily be reflected in any derived APIs, which are tied to the main API’s document-based structure and boundaries.

We conclude that document-centric pods inherently contain a large amount of implicit semantics in their API structure, hindering the realization of the data and application independence that is paramount to the Solid vision. Some semantics that are supposed to be entangled with the data are in practice assumed by the API, the construction of which happens in an uncoordinated way over time. The resulting spontaneous contracts are not made explicit by a single app, nor shared across multiple apps, meaning that interoperating apps have to be coded against each other rather than against the data, creating undesired inter-application coupling.

5. Graph-centric Interpretation Of A Pod

In this section, we give the concept of a Solid pod a new
interpretation, which is graph-centric.

5.1. Design Considerations

We start from the limitations of the document-centric pod interpretation, which essentially assumes the Web API exposed by a pod to be equivalent to the pod itself. On the one hand, we acknowledge the universality and simplicity of document-based APIs, and in particular of the Solid Protocol, which offers the building blocks to construct such APIs with the appropriate authentication and authorization. On the other hand, we showed concrete evidence in Subsection 4.4 that no single such hierarchy is able to reconcile the conflicting constraints of different use cases, especially given that core aspects such as policies and provenance can only be applied at a document-level granularity.

While we recognize the importance of document-based Web APIs, we also observe that the simultaneous support for multiple use cases clearly
requires multiple perspectives into the same data, each satisfying the constraints of particular cases. Even though the creation of multiple views has been attempted for Solid pods with SPARQL and QPF interfaces, their direct derivation from the main API still leads them to inherit that API’s mismatched constraints on the modeling of policies, provenance, and trust.

In other words, aiming to derive richer views from the main pod API is akin to using a real-world object’s two-dimensional projection to derive alternate two-dimensional projections of that same object. Because any two-dimensional projection is inherently designed to discard information of the original, the creation of complementary alternative projections actually requires the object’s underlying three-dimensional reality. The two-dimensional projection was only ever meant as a helpful approximation of the three-dimensional object.

Translating this dimensionality metaphor to the world of pods, we conclude that today’s single hierarchical API to a pod serves as a proxy for the underlying knowledge graph formed by the union of the pod’s interlinked RDF documents, except that this union cannot adequately be reproduced because significant parts of its semantics are being discarded during its exposure in a specific API. Solid applications looking to leverage the potential of this larger Linked Data knowledge graph will thus always be hindered by the limitations of one arbitrarily formed document API acting as its sole access gateway.

5.2. Definition

Given the above observations within the document-centric interpretation, we create a new interpretation that shifts the function of a pod’s main API from being the pod itself to acting as one of many possible interfaces to an underlying knowledge
graph, which is the pod. The source of truth is a knowledge graph consisting of documents as well as RDF statements, from which multiple Web APIs can be derived. Hence, no particular API is more prominent than any other. We define this as follows:

In the graph-centric interpretation, each Solid pod is a hybrid, contextualized knowledge graph, wherein “hybrid” indicates first-class support for both documents and RDF statements, and “contextualized” the ability to associate each of its individual documents and statements with metadata such as policies, provenance, and trust.

5.3. Example

For example, the pod https://sasha.pod/ could be a hybrid knowledge graph consisting of:

- RDF triples expressing contact details of https://sasha.pod/people/amal#me
- RDF triples expressing contact details of https://sasha.pod/people/lucian#me
- a PDF document containing medical images dated 2022-09-15
- RDF triples representing a blood test result dated 2022-10-15
- …

Examples of associated metadata within this pod are:

- The RDF triples about Amal form a shared context with specific trust and provenance.
- A policy states that professional contact names can be publicly readable.
- A policy states that contacts’ phone numbers are only visible to Sasha.
- The provenance of Lucian’s phone number is a specific email.
- We trust the test result of 2022-10-15 is unmodified and accurate, because it is certified by a medical professional.

Below is one possible Web API on top of this pod that conforms to the Solid Protocol:

- container https://sasha.pod/
  - container contacts/work/
    - RDF document .acl (for access control)
    - RDF d
