A SPARQL endpoint makes terminology data accessible

The Journey to a Multilingual SPARQL Endpoint

Time to get technical. How did we add a SPARQL endpoint to the Coreon Multilingual Knowledge System?

The idea of accessing a Multilingual Knowledge System through the means and methods of the Semantic Web brings two keywords immediately to mind: SPARQL and LOD (Linked Open Data). In a previous post, we’ve already talked about the benefits of a SPARQL endpoint and how it enables your enterprise to handle all its data via one centralized hub. But how did we actually achieve it?

The Coreon MKS is powered by a RESTful Web API, sending its data as JSON data structures over the wire. Everyone can develop extensions and custom solutions based on this. We recently did so ourselves, for instance by creating a plug-in for SDL Trados Studio so that linguists can directly access information stored in Coreon.

However, this required the developer of the plug-in to get familiar with the API and its data structures.

In the world of the Semantic Web (aka ‘the web of data’), we no longer see proprietary APIs. Developers and integrators instead access all the resources through the same method – SPARQL. Wouldn’t it be great to also access the Coreon repositories via a SPARQL endpoint?

We will outline how we did it with Coreon, but the process is not only relevant for our own MK system – it could easily act as a blueprint or guideline for those working with similar tools.

JSON Structures to RDF Graph

The first step was to analyze how Coreon’s data model could be mirrored in an RDF graph. What were the information objects? What were the predicates between them? What showed up as a literal?

In RDF, all elements or pieces of information you want to “talk about” are good candidates for becoming objects or, technically speaking, OWL classes. There were obvious candidates for classes, namely Concept or Term, but how about the concept relations such as “broader” or custom associative ones like “is-complementary-to”? How about descriptive information such as a Definition or Term Status value? Concretely, we had to go from the JSON data structure to an RDF graph model.

Before we dive in deeper, here’s a sample concept (with ID: 607ed17b318e0c181786b545) in Coreon that has two terms, English screen and German Bildschirm. Notice also the individual IDs of each of the terms – they will become important later on.

SPARQL Endpoint: Coreon concept example with concept and term IDs shown

In the original JSON data structure, this concept is represented as follows (only relevant code lines shown):

{
    "created_at": "2021-04-20T13:04:59.816Z",
    "updated_at": "2021-04-20T13:05:25.856Z",
    "terms": [
        {
            "lang": "en",
            "value": "screen",
            "created_at": "2021-04-20T13:04:59.816Z",
            "updated_at": "2021-04-20T13:04:59.816Z",
            "id": "607ed17b318e0c181786b549",
            "concept_id": "607ed17b318e0c181786b545",
            "properties": [],
            "created_by": {
                "email": "michael.wetzel@coreon.com",
                "name": "Michael Wetzel"
            },
            "updated_by": {
                "email": "michael.wetzel@coreon.com",
                "name": "Michael Wetzel"
            }
        },
        {
            "lang": "de",
            "value": "Bildschirm",
            "created_at": "2021-04-20T13:05:25.856Z",
            "updated_at": "2021-04-20T13:05:25.856Z",
            "id": "607ed195318e0c181786b55e",
            "concept_id": "607ed17b318e0c181786b545",
            "properties": [],
            "created_by": {
                "email": "michael.wetzel@coreon.com",
                "name": "Michael Wetzel"
            },
            "updated_by": {
                "email": "michael.wetzel@coreon.com",
                "name": "Michael Wetzel"
            }
        }
    ],
    "id": "607ed17b318e0c181786b545",
    "coreon_type": null,
    "alias": false,
    "referent_id": null,
    "branch_root": false,
    "properties": [],
    "parent_relations": [
        {
            "concept_id": "606336dab4dbcf018ed99308",
            "type": "SUPERCONCEPT_OF"
        }
    ],
    "child_relations": []
}

When we transform this into an RDF graph, the concept and its two terms are bound together in statements (so-called triples), each consisting of a subject, a predicate and an object. The concept will act as the subject, the term(s) act as the object(s), and the required predicate could be named in this case: hasTerm. This gives us the following triple:

coreon:607ed17b318e0c181786b545 coreon:hasTerm coreon:607ed17b318e0c181786b549 .

The triple shows that the resource with the ID 607ed17b318e0c181786b545 contains a term, and the term’s ID is 607ed17b318e0c181786b549. It doesn’t yet say anything about the value or the language of the term. It simply states that the term with the given ID is a member of that concept.

The next triple states that the resource with ID 607ed17b318e0c181786b549 has a literal value in English, namely the string screen:

coreon:607ed17b318e0c181786b549 coreon:value "screen"@en .
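
The mapping from the JSON structure to these two kinds of triples can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Coreon's actual export code: the function name concept_to_triples is hypothetical, and json.dumps is used as a simple stand-in for proper Turtle string escaping.

```python
import json

COREON = "coreon:"  # prefix as used in the article's triples


def concept_to_triples(concept):
    """Turn one concept dict into hasTerm and value statements (hypothetical helper)."""
    triples = []
    concept_ref = COREON + concept["id"]
    for term in concept.get("terms", []):
        term_ref = COREON + term["id"]
        # The concept is the subject, the term resource is the object.
        triples.append(f"{concept_ref} coreon:hasTerm {term_ref} .")
        # The term's string becomes a language-tagged literal.
        literal = json.dumps(term["value"], ensure_ascii=False)
        triples.append(f'{term_ref} coreon:value {literal}@{term["lang"]} .')
    return triples


# Reduced version of the sample concept shown in the JSON above.
concept = {
    "id": "607ed17b318e0c181786b545",
    "terms": [
        {"id": "607ed17b318e0c181786b549", "lang": "en", "value": "screen"},
        {"id": "607ed195318e0c181786b55e", "lang": "de", "value": "Bildschirm"},
    ],
}

for line in concept_to_triples(concept):
    print(line)
```

Each term contributes one statement that links it to its concept and one that carries its language-tagged value, so the sample concept yields four statements.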

Such a set of triples, i.e. many atomic statements bound together via predicates, makes up the RDF graph. If we visualize some of these triples, the resulting RDF graph looks like this:

Representing concepts and terms as an RDF graph

Concepts and terms are classes (in green and blue), predicates are graph edges (above the lines).

The complete set of triples would be serialized as follows in RDF / Turtle:

@prefix coreon: <https://www.coreon.com/coreon-rdf#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<https://www.coreon.com/coreon-instance> a owl:Ontology;
  owl:imports <https://www.coreon.com/coreon-rdf>;
  owl:versionInfo "Created through Coreon export" .
coreon:607ed17b318e0c181786b547 a coreon:Edge;
  coreon:edgeSource coreon:606336dab4dbcf018ed99308;
  coreon:edgeTarget coreon:607ed17b318e0c181786b545;
  coreon:type "SUPERCONCEPT_OF" .
coreon:606336dab4dbcf018ed99307 a coreon:Term;
  coreon:value "peripheral device"@en .
coreon:606336dab4dbcf018ed99308 a coreon:Concept;
  coreon:hasTerm coreon:606336dab4dbcf018ed99307 .
coreon:607ed17b318e0c181786b545 a coreon:Concept;
  coreon:hasTerm coreon:607ed195318e0c181786b55e,
    coreon:607ed17b318e0c181786b549 .
coreon:607ed17b318e0c181786b549 a coreon:Term;
  coreon:value "screen"@en .
coreon:607ed195318e0c181786b55e a coreon:Term;
  coreon:value "Bildschirm"@de .

You may also have noticed in the above syntax that two or more statements about the same subject are serialized together, separated via semicolons and ending with a full stop. The block for the resource with the ID 606336dab4dbcf018ed99308 first indicates that it is of OWL class coreon:Concept, and then that it contains a term which has the ID 606336dab4dbcf018ed99307.
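
To make the grouping rule concrete, here is a small Python sketch that collects statements sharing a subject and joins them with semicolons (and objects sharing a predicate with commas). The function group_statements is a hypothetical helper written for this illustration; a real Turtle serializer handles many more cases.

```python
def group_statements(triples):
    """Group (subject, predicate, object) tuples into Turtle-style blocks."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {}).setdefault(p, []).append(o)
    blocks = []
    for s, preds in by_subject.items():
        # Objects of the same predicate are separated by commas.
        parts = [f"{p} {', '.join(objs)}" for p, objs in preds.items()]
        # Predicates of the same subject are separated by semicolons,
        # and the whole block ends with a full stop.
        blocks.append(f"{s} " + " ;\n  ".join(parts) + " .")
    return "\n".join(blocks)


triples = [
    ("coreon:606336dab4dbcf018ed99308", "a", "coreon:Concept"),
    ("coreon:606336dab4dbcf018ed99308", "coreon:hasTerm",
     "coreon:606336dab4dbcf018ed99307"),
]
print(group_statements(triples))
```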

No RDF without URIs

Now all the pieces of information are bound together via RDF statements: the triples. They have a pretty atomic, isolated nature. This is quite different to how XML and other standard formats organize information. In RDF and LOD all data is stored in this atomic manner, uniquely identifiable through the URI.

Via the URIs and the predicates such as hasTerm, the resources are bound together. Only then does it become meaningful for an application or a human, as the URIs are an indispensable prerequisite. All information elements that are represented as classes have unique identifiers. The namespace coreon:, together with the unique IDs, unambiguously identifies a given resource. This is regardless of whether it is a concept, term, property, or even a concept relation. Fortunately we stored all data with URIs when we created the fundamental design of Coreon. Phew.
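
How a prefixed name such as coreon:607ed17b318e0c181786b545 resolves to a full URI can be shown with a short Python sketch. The namespace is the one declared in the Turtle export above; the helper names expand and shorten are hypothetical.

```python
# Namespace taken from the @prefix declaration in the Turtle export.
NAMESPACES = {"coreon": "https://www.coreon.com/coreon-rdf#"}


def expand(prefixed_name):
    """Resolve a prefixed name such as 'coreon:<id>' to its full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return NAMESPACES[prefix] + local


def shorten(uri):
    """Turn a full URI back into a prefixed name when the namespace is known."""
    for prefix, base in NAMESPACES.items():
        if uri.startswith(base):
            return prefix + ":" + uri[len(base):]
    return uri  # unknown namespace: leave the URI untouched


concept_uri = expand("coreon:607ed17b318e0c181786b545")
print(concept_uri)
print(shorten(concept_uri))
```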

Build the Coreon RDF Vocabulary

After researching the basic approach described above, we analyzed all elements of the Coreon data structure and rethought them as a member of our RDF vocabulary. The following table lists the most important ones:

Category | OWL Type | Coreon RDF Vocabulary
Classes | owl:Class | coreon:Admin, coreon:Edge, coreon:Concept, coreon:Flagset, coreon:Property, coreon:Term
Predicates | owl:ObjectProperty | coreon:hasAdmin, coreon:hasFlagset, coreon:hasProperty, coreon:hasTerm
Values | owl:AnnotationProperty | coreon:edgeSource, coreon:edgeTarget, coreon:id, coreon:name, coreon:type, coreon:value

For the predicates we also specified what kind of information can be used, defining rdfs:domain and rdfs:range. For instance, the predicate hasTerm can only accept resources of type coreon:Concept as a subject (rdfs:domain) and resources of type coreon:Term as an object (rdfs:range). The full specification of the predicate hasTerm looks as follows:

coreon:hasTerm
  rdf:type owl:ObjectProperty ;
  rdfs:comment "makes a term member of a concept" ;
  rdfs:domain coreon:Concept ;
  rdfs:label "has term" ;
  rdfs:range coreon:Term .
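
Read as a constraint, the domain and range of hasTerm can be checked with a few lines of Python. This is a sketch for illustration only: check_statement and the types lookup are hypothetical, and note that an RDFS reasoner would use domain and range to infer types rather than to reject statements.

```python
# Domain and range of coreon:hasTerm, copied from the specification above.
SCHEMA = {
    "coreon:hasTerm": {"domain": "coreon:Concept", "range": "coreon:Term"},
}


def check_statement(subject, predicate, obj, types):
    """Return a list of problems; an empty list means the triple fits the schema."""
    spec = SCHEMA.get(predicate)
    if spec is None:
        return []  # no constraints known for this predicate
    problems = []
    if types.get(subject) != spec["domain"]:
        problems.append(f"subject {subject} is not a {spec['domain']}")
    if types.get(obj) != spec["range"]:
        problems.append(f"object {obj} is not a {spec['range']}")
    return problems


# Resource types as declared in the Turtle export.
types = {
    "coreon:607ed17b318e0c181786b545": "coreon:Concept",
    "coreon:607ed17b318e0c181786b549": "coreon:Term",
}

ok = check_statement("coreon:607ed17b318e0c181786b545", "coreon:hasTerm",
                     "coreon:607ed17b318e0c181786b549", types)
swapped = check_statement("coreon:607ed17b318e0c181786b549", "coreon:hasTerm",
                          "coreon:607ed17b318e0c181786b545", types)
print(ok)       # no problems for concept -> term
print(swapped)  # both subject and object are flagged
```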

Publish as an Offline Resource

Once our RDF vocabulary was ready, the first step to implement it into Coreon was to add an RDF publication mechanism to the export engine. Equipped with this, Coreon can now export its repositories in RDF, including various syntaxes (Turtle, N3, JSON-LD and more).

Real-Time Access via a SPARQL Endpoint

The final yet most complicated step was to equip the Coreon cloud service with a real-time accessible SPARQL endpoint. We chose Apache Jena Fuseki. It runs as a secondary index in parallel to a repository’s data store and is updated in real time. Thus any change a data maintainer makes is immediately accessible via the SPARQL endpoint!

Let me illustrate the ease and power of SPARQL with some examples…

Example 1: Query All the Definitions via SPARQL:

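PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX coreon: <https://www.coreon.com/coreon-rdf#>

SELECT *
WHERE {
    ?p rdf:type coreon:Property .
    ?p coreon:key "Definition" .
    ?p coreon:value ?v .
}

We are querying all the objects that are of type coreon:Property (first triple pattern) that also have a key with the name Definition (second triple pattern). These are bound to the result variable p, and for all of them we retrieve the values, which are bound to the variable v. The two PREFIX lines declare the same namespaces as the Turtle export shown earlier.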

A typical result table (here from a repository dealing with wine grape varieties) looks as follows:

p | v
https://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8a1 | Chardonnay is the most famous and most elevated grape in the region of Northern Burgundy in …
https://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8ab | A white grape variety which originated in the Rhine region. Riesling is an aromatic grape variety …
https://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8cd | Pinot noir (French: [pino nwaʁ]) is a red wine grape variety of the species Vitis vinifera. The …

The first column, representing the variable p, holds the URI of the property; the second column holds the literal value.
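
For readers who want to run such a query from code, the following Python sketch builds a request according to the standard SPARQL 1.1 Protocol and flattens a response in the standard SPARQL JSON results format. The endpoint URL is a placeholder, not Coreon's real address, and the helper names are hypothetical; authentication, which a real service will likely require, is left out.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint URL

QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX coreon: <https://www.coreon.com/coreon-rdf#>
SELECT * WHERE {
    ?p rdf:type coreon:Property .
    ?p coreon:key "Definition" .
    ?p coreon:value ?v .
}
"""


def build_request(endpoint, query):
    """Build a GET request that passes the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def rows_from_results(payload):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    variables = payload["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in payload["results"]["bindings"]
    ]


request = build_request(ENDPOINT, QUERY)
# urllib.request.urlopen(request) would send it; omitted here because
# the endpoint above is only a placeholder.

# Shape of a response, using the first row of the result table above.
sample = {
    "head": {"vars": ["p", "v"]},
    "results": {"bindings": [{
        "p": {"type": "uri",
              "value": "https://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8a1"},
        "v": {"type": "literal",
              "value": "Chardonnay is the most famous and most elevated grape in the region of Northern Burgundy in …"},
    }]},
}
rows = rows_from_results(sample)
print(rows[0]["p"])
```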

Example 2: Query all terms and – if present – also print their usage value:

A more realistic query compared to Example 1: get me all the terms and if they have a usage flag, such as preferred, print it, too.

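PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX coreon: <https://www.coreon.com/coreon-rdf#>

SELECT ?t ?termvalue ?usagevalue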
    WHERE {
        ?t rdf:type coreon:Term .
        ?t coreon:value ?termvalue .
        OPTIONAL {
            ?t coreon:hasProperty ?p .
            ?p coreon:key "Usage" .
            ?p coreon:value ?usagevalue .
        }
    }

A typical result might look as follows:

t | termvalue | usagevalue
https://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8aa | Riesling |
https://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8bb | Cabernet Sauvignon | Preferred
https://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8be | CS | Alternative
https://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8c2 | Merlot |

This output table shows the term’s URI, then its value, and – if available – the usage recommendation.
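
The effect of OPTIONAL can be thought of as a left join: every term is kept, and the usage value is filled in only where one exists. The Python sketch below mimics that behaviour with the rows from the table above; the two dictionaries and the function left_join are a simplification made up for this illustration, not how the endpoint stores its data.

```python
BASE = "https://www.coreon.com/coreon-rdf#"

# Term URIs and values, taken from the result table above.
terms = {
    BASE + "5f9ee3609323c01c4728b8aa": "Riesling",
    BASE + "5f9ee3609323c01c4728b8bb": "Cabernet Sauvignon",
    BASE + "5f9ee3609323c01c4728b8be": "CS",
    BASE + "5f9ee3609323c01c4728b8c2": "Merlot",
}

# Only some terms carry a usage value.
usage = {
    BASE + "5f9ee3609323c01c4728b8bb": "Preferred",
    BASE + "5f9ee3609323c01c4728b8be": "Alternative",
}


def left_join(terms, usage):
    """Keep every term; the usage value is None when nothing matches."""
    return [(uri, value, usage.get(uri)) for uri, value in terms.items()]


for uri, value, usage_value in left_join(terms, usage):
    print(value, usage_value)
```

Without OPTIONAL, the query would behave like an inner join and drop Riesling and Merlot from the result.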

Example 3: How many Definitions are in your repository?

A last one to share…do you know how many Definitions or Comments are in your repository, or which are the most used properties? Well, how about this…

PREFIX coreon: <https://www.coreon.com/coreon-rdf#>

SELECT ?k (COUNT(?k) AS ?count)
{
	?uri coreon:key ?k.
}
GROUP BY ?k
ORDER BY DESC(?count)

…which delivers a table looking like this…nice!

k | count
concept status | 13806
usage status | 10532
part of speech | 10408
term type | 10353
definition | 5996
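
What GROUP BY, COUNT and ORDER BY DESC do here corresponds to a simple frequency count. The Python sketch below shows the same idea on a handful of made-up (resource, key) pairs; the data is a toy example, not taken from a real repository.

```python
from collections import Counter


def count_keys(pairs):
    """Count how often each key occurs, most frequent first."""
    counts = Counter(key for _uri, key in pairs)
    return counts.most_common()


# Toy data: each pair stands for one '?uri coreon:key ?k' match.
pairs = [
    ("p1", "definition"),
    ("p2", "usage status"),
    ("p3", "usage status"),
    ("p4", "concept status"),
    ("p5", "usage status"),
    ("p6", "concept status"),
]

for key, count in count_keys(pairs):
    print(key, count)
```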

European Language Grid and Outlook

We thank the European Language Grid (ELG) for funding substantial parts of this development. It is a significant step and showcases how to complement software for multilingual knowledge with an open SPARQL / LOD access mechanism. The SPARQL endpoint is available to all Coreon customers. A selected set of demo repositories will also be accessible with the SPARQL endpoint through the ELG hub by summer 2021.

We are sharing our experiences with ISO / TC37 SC3 working groups, as a draft for a technical recommendation of how to represent TBX (TermBase eXchange) as RDF. Many of our findings in this journey towards a SPARQL endpoint can be used as a base for an international standard.

Michael Wetzel

Michael has a deep knowledge of multilingual problem solving and long-term experience in product management. For years, Michael was product manager of TRADOS MultiTerm. He is an active contributor to the ISO TC37/SC3 and DIN NA 105 standards.