Interlinking data with alignments and link keys
This version:
https://moex.gitlabpages.inria.fr/alignapi/tutorial/tutorial4/
Author:
Jérôme Euzenat, INRIA & Univ. Grenoble Alpes
This tutorial explains how to generate links between
datasets from EDOAL alignments with link keys.
It optionally illustrates similar tasks with Silk.
Requirements
The tutorial works with the software embedded in the Alignment API alone.
However, making it work with Silk or a triple store requires
additional software:
Silk for generating links from similarity specification
Sesame or Virtuoso as a triple store (optional)
As usual, the whole tutorial is performed from the command line.
The evaluation of such queries under a triple store, using named graphs, is illustrated
here.
Set up
First start by cleaning up your environment:
$ cd tutorial6
$ mkdir results
Data sets
We use two different data sets, each stored in a file: regions-2010.rdf and nuts2008_complete.rdf, the two files passed to the commands below.
Of course, the tutorial can be adapted with your own data sets.
Generating links from link keys
From an alignment comprising link keys, it is possible to generate
sameAs links.
We have several such alignments here:
The goal of the tutorial is to apply them one after the other, i.e., to replace the number in
the instructions below, and to see what these link keys do.
This corresponds to linkkey3.rdf:
<map>
  <Cell>
    <entity1>
      <edoal:Class rdf:about="&insee;Departement"/>
    </entity1>
    <entity2>
      <edoal:Class>
        <edoal:and rdf:parseType="Collection">
          <edoal:Class rdf:about="&eurostat;NUTSRegion"/>
          <edoal:AttributeValueRestriction>
            <edoal:onAttribute>
              <edoal:Property rdf:about="&eurostat;level"/>
            </edoal:onAttribute>
            <edoal:comparator rdf:resource="&edoal;equals"/>
            <edoal:value><edoal:Literal edoal:type="&xsd;integer" edoal:string="3" /></edoal:value>
          </edoal:AttributeValueRestriction>
          <edoal:AttributeValueRestriction>
            <edoal:onAttribute>
              <edoal:Relation>
                <edoal:compose rdf:parseType="Collection">
                  <edoal:Relation rdf:about="&eurostat;hasParentRegion" />
                  <edoal:Relation rdf:about="&eurostat;hasParentRegion" />
                  <edoal:Relation rdf:about="&eurostat;hasParentRegion" />
                </edoal:compose>
              </edoal:Relation>
            </edoal:onAttribute>
            <edoal:comparator rdf:resource="&edoal;equals"/>
            <edoal:value><edoal:Instance rdf:about="&esdata;FR" /></edoal:value>
          </edoal:AttributeValueRestriction>
        </edoal:and>
      </edoal:Class>
    </entity2>
    <relation>equivalence</relation>
    <measure>1.0</measure>
    <edoal:linkkey>
      <edoal:Linkkey>
        <edoal:binding>
          <edoal:Intersects>
            <edoal:property1>
              <edoal:Property rdf:about="&insee;nom" /><!-- xml:lang="fr"-->
            </edoal:property1>
            <edoal:property2>
              <edoal:Property rdf:about="&eurostat;name" />
            </edoal:property2>
          </edoal:Intersects>
        </edoal:binding>
      </edoal:Linkkey>
    </edoal:linkkey>
  </Cell>
</map>
The full alignment is available at: linkkey3.rdf
This is processed by:
$ java -cp $CLASSPATH fr.inrialpes.exmo.align.cli.ParserPrinter file:linkkey1.rdf -r fr.inrialpes.exmo.align.impl.renderer.SPARQLLinkkerRendererVisitor -o results/query.sparql
to generate a set of SPARQL queries.
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ns1:<http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>
PREFIX owl:<http://www.w3.org/2002/07/owl#>
PREFIX ns2:<http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/>
PREFIX ns0:<http://rdf.insee.fr/geo/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
CONSTRUCT { ?s1 owl:sameAs ?s2 }
WHERE {
  ?s1 rdf:type ns0:Departement .
  ?s2 rdf:type ns1:NUTSRegion .
  ?s2 ns1:level ?o2 .
  FILTER (?o2=3)
  ?s2 ns1:hasParentRegion ?o4 .
  ?o4 ns1:hasParentRegion ?o5 .
  ?o5 ns1:hasParentRegion ?o6 .
  FILTER (?o6=ns2:FR)
  ?s1 ns0:nom ?o7 .
  ?s2 ns1:name ?o8 .
  FILTER( lcase(str(?o7)) = lcase(str(?o8)) )
}
Think about what you could do to improve this query.
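The last FILTER of the query above is where the link key itself is applied: two resources are linked as soon as one value of the first property equals one value of the second, once both are lowercased. The following Python sketch mimics that join on toy data, to make the semantics concrete. The identifiers and names in it (ex:dep/A, ex:nuts/1, ...) are hypothetical and do not come from the tutorial data sets; the function names are illustrative only, and this is not how the Alignment API or a SPARQL engine evaluates the query.

```python
def normalise(value):
    """Mimic lcase(str(?o)) from the generated query."""
    return str(value).lower()


def link_key_links(source, target):
    """Return the (s1, s2) pairs whose value sets share a normalised value.

    source and target map a resource identifier to the set of values
    of the compared property (nom on one side, name on the other).
    """
    # Index target resources by normalised value to avoid a nested scan.
    index = {}
    for s2, names in target.items():
        for name in names:
            index.setdefault(normalise(name), set()).add(s2)

    links = set()
    for s1, noms in source.items():
        for nom in noms:
            for s2 in index.get(normalise(nom), ()):
                links.add((s1, s2))
    return sorted(links)


# Hypothetical toy data, not taken from regions-2010.rdf or nuts2008_complete.rdf.
departements = {
    "ex:dep/A": {"Ain"},
    "ex:dep/B": {"Bas-Rhin"},
}
regions = {
    "ex:nuts/1": {"AIN"},
    "ex:nuts/2": {"Bas-Rhin", "Unterelsass"},
    "ex:nuts/3": {"Aisne"},
}

for s1, s2 in link_key_links(departements, regions):
    print(s1, "owl:sameAs", s2)
```

The class restrictions of the alignment (level, parent region) are deliberately left out: in the query they only narrow down which resources enter this comparison.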
Processing any of these SPARQL queries will generate links:
$ java -cp $CLASSPATH arq.query --query results/query.sparql --data regions-2010.rdf --data nuts2008_complete.rdf > results/links.ttl
@prefix geo: <http://rdf.insee.fr/geo/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix : <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ns0: <http://rdf.insee.fr/geo/> .
@prefix ns1: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://rdf.insee.fr/geo/2010/DEP_67>
owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR421> .
<http://rdf.insee.fr/geo/2010/DEP_39>
owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/CH025> , <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR432> .
<http://rdf.insee.fr/geo/2010/DEP_2A>
owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR831> .
<http://rdf.insee.fr/geo/2010/DEP_61>
owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR253> .
<http://rdf.insee.fr/geo/2010/DEP_33>
owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR612> .
<http://rdf.insee.fr/geo/2010/DEP_05>
owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR822> .
<http://rdf.insee.fr/geo/2010/DEP_74>
owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR718> .
...
The full link set is available at: results/links.ttl
Can you spot a problem? Where does it come from? How can it be solved?
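One way to inspect a link set is to count how many targets each source resource is linked to. The sketch below does this for a Turtle file laid out like the excerpt above, with a subject IRI on one line and its owl:sameAs targets on the next. This layout is an assumption: the code is a line-based reading of that particular output, not a Turtle parser, and the function names are illustrative only.

```python
import re
from collections import defaultdict

IRI = re.compile(r"<([^>]+)>")


def same_as_targets(turtle_text):
    """Map each subject IRI to the list of its owl:sameAs targets.

    Assumes a subject IRI alone on a line, followed by a line
    starting with owl:sameAs (the layout of the excerpt).
    """
    targets = defaultdict(list)
    subject = None
    for line in turtle_text.splitlines():
        line = line.strip()
        if not line or line.startswith("@prefix"):
            continue
        if line.startswith("<"):
            subject = IRI.match(line).group(1)
        elif line.startswith("owl:sameAs") and subject is not None:
            targets[subject].extend(IRI.findall(line))
    return dict(targets)


def multiple_targets(targets):
    """Keep only the subjects linked to more than one target."""
    return {s: t for s, t in targets.items() if len(t) > 1}


# Two statements copied from the excerpt of results/links.ttl.
EXCERPT = """\
<http://rdf.insee.fr/geo/2010/DEP_67>
owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR421> .
<http://rdf.insee.fr/geo/2010/DEP_39>
owl:sameAs <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/CH025> , <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR432> .
"""

for subject, found in multiple_targets(same_as_targets(EXCERPT)).items():
    print(subject, len(found))
```

On the full file, the text could be read with open("results/links.ttl").read(), provided the file follows the same layout.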
Generating links from similarity measures
We use Silk 2.6.1 to generate links based on the similarity
between resources.
Silk is driven by scripts which express such similarity.
The scripts are written in the Silk Link Specification Language.
Several linkage rules are available; they are all in the same
script, script.xml (identified by no1...no6).
Again, your goal is to process the linkage rules provided in this
script, from no1 to no6, and to understand what they do.
Here is part of script.xml:
<?xml version="1.0" encoding="utf-8" ?>
<Silk>
  <Prefixes>
    <Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
    <Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#" />
    <Prefix id="id2" namespace="http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#" />
    <Prefix id="id1" namespace="http://rdf.insee.fr/geo/" />
  </Prefixes>
  <DataSources>
    <DataSource id="id1" type="file">
      <Param name="file" value="regions-2010.rdf"/>
      <Param name="format" value="RDF/XML" />
    </DataSource>
    <DataSource id="id2" type="file">
      <Param name="file" value="nuts2008_complete.rdf"/>
      <Param name="format" value="RDF/XML" />
    </DataSource>
  </DataSources>
  <Interlinks>
    <Interlink id="no1">
      <LinkType>owl:sameAs</LinkType>
      <SourceDataset dataSource="id1" var="e1">
        <RestrictTo>
          ?e1 rdf:type id1:Departement .
        </RestrictTo>
      </SourceDataset>
      <TargetDataset dataSource="id2" var="e2">
        <RestrictTo>
          ?e2 rdf:type id2:NUTSRegion .
        </RestrictTo>
      </TargetDataset>
      <LinkageRule>
        <Aggregate type="max">
          <Compare metric="jaccard">
            <TransformInput function="tokenize">
              <Input path="?e1\id1:subdivision/id1:nom" />
            </TransformInput>
            <TransformInput function="tokenize">
              <Input path="?e2/id2:name" />
            </TransformInput>
          </Compare>
          <Compare metric="jaccard">
            <TransformInput function="tokenize">
              <Input path="?e1/id1:nom" />
            </TransformInput>
            <TransformInput function="tokenize">
              <Input path="?e2/id2:name" />
            </TransformInput>
          </Compare>
        </Aggregate>
      </LinkageRule>
      <Outputs>
        <Output type="file">
          <Param name="file" value="results/Round1.rdf"/>
          <Param name="format" value="alignment"/>
        </Output>
      </Outputs>
    </Interlink>
  </Interlinks>
</Silk>
$ java -DconfigFile=script.xml -DlinkSpec=no1 -jar silk.jar
The result is provided as a set of links in a format which is supposed to be the
Alignment format. However, the output is not correct. This is fixed here by
applying the patch:
$ sh bin/fix.sh results/Round1-accepted.rdf
on the resulting file (here results/Round1-accepted.rdf).
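The no1 linkage rule in script.xml above aggregates, with max, two Jaccard comparisons over tokenized names: one on the name reached through the ?e1\id1:subdivision/id1:nom path (the backslash presumably denoting a reverse traversal), one on the nom of ?e1 itself. The Python sketch below shows the underlying computation. It rests on several assumptions: tokens are split on whitespace, scores are similarities between 0 and 1, and thresholds are ignored. Silk's own tokenizer, metric normalisation and thresholding are not reproduced, and the sample strings are hypothetical.

```python
import re


def tokenize(text):
    """Split a string into a set of whitespace-separated tokens (assumed tokenizer)."""
    return {token for token in re.split(r"\s+", text.strip()) if token}


def jaccard(a, b):
    """Jaccard similarity of two token sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_score(source_values, target_values):
    """Highest Jaccard similarity over all pairs of values, 0.0 if a side is empty."""
    return max(
        (jaccard(tokenize(s), tokenize(t))
         for s in source_values for t in target_values),
        default=0.0,
    )


def rule_no1(nom_values, reverse_subdivision_nom_values, name_values):
    """Mimic Aggregate type="max" over the two Compare elements of rule no1."""
    return max(
        best_score(reverse_subdivision_nom_values, name_values),
        best_score(nom_values, name_values),
    )


# Hypothetical strings: one token out of three is shared.
print(jaccard(tokenize("Territoire de Belfort"), tokenize("Belfort")))
# Either comparison reaching 1.0 is enough for the max aggregation.
print(rule_no1(["Jura"], ["Franche-Comté"], ["Jura"]))
```

Because max keeps the better of the two comparisons, a pair scores high as soon as either name matches, which is worth keeping in mind when comparing the rules of the script.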
It is possible to count the number of answers provided by the
evaluation through:
$ grep entity1 results/Round1-accepted.rdf | wc -l
103
Link quality can be tested by comparison with the reference alignment
reflinks.rdf:
$ java -cp $CLASSPATH fr.inrialpes.exmo.align.cli.EvalAlign -i fr.inrialpes.exmo.align.impl.eval.PRecEvaluator file:reflinks.rdf file:results/Round1-accepted.rdf
which answers:
<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
xmlns:map='http://www.atl.external.lmco.com/projects/ontology/ResultsOntology.n3#'>
<map:output rdf:about=''>
<map:input1 rdf:resource="http://rdf.insee.fr/geo/"/>
<map:input2 rdf:resource="http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#"/>
<map:precision>0.9611650485436893</map:precision>
<map:recall>1.0</map:recall>
<map:fMeasure>0.9801980198019802</map:fMeasure>
<map:oMeasure>0.9595959595959596</map:oMeasure>
<result>1.0404040404040404</result>
</map:output>
</rdf:RDF>
It provides all valid links (recall=100%) but not all the links
it found are correct (precision=96%). Could you improve on this?
Try the other proposed linkage rules and/or try to improve the linkage
rule used.
rule | ≡ | comparison | #links | prec. | rec.
no1 | dpt ≡ NR | name=nom | 103 | .96 | 1.0
no2 | dpt ≡ NR&level=3 | tok(name)~tok(nom) | 100 | .99 | 1.0
no3 | dpt ≡ NR | AVG(tok(name)~tok(nom), tok(hasParentRegion/name)~tok(\subdivision/name)) | 89 | 1.0 | .90
no4 | dpt ≡ NR&level=3&hasParentRegion³=FR | AVG(tok(name)~tok(nom), tok(hasParentRegion/name)~tok(\subdivision/name)) | 89 | 1.0 | .90
no5 | dpt ≡ NR&level=3&hasParentRegion³=FR | name=nom | 99 | 1.0 | 1.0
no6 | dpt ≡ NR&level=3&hasParentRegion³=FR | MAX(tok(name)~tok(nom), tok(hasParentRegion/name)~tok(\subdivision/name)) | 439 | .23 | 1.0
no7 | dpt ≡ NR&level=3&hasParentRegion³=FR | MIN(tok(name)~tok(nom), tok(hasParentRegion/name)~tok(\subdivision/name)) | 89 | 1.0 | .90
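The figures in the evaluation output above can be recomputed from three counts: the number of links found, the number of those that are correct, and the number of links in the reference. The Python sketch below does so for rule no1. The counts of 99 correct and 99 expected links are not stated in the tutorial; they are inferred from the 103 links found, the reported precision (99/103) and the recall of 1.0. The formula used for the overall measure, recall × (2 − 1/precision), is an assumption that happens to reproduce the reported oMeasure; it is not taken from the PRecEvaluator source.

```python
def evaluate(found, correct, expected):
    """Return (precision, recall, f_measure, overall) from link counts.

    Assumes at least one correct link, so that no division by zero occurs.
    """
    precision = correct / found
    recall = correct / expected
    f_measure = 2 * precision * recall / (precision + recall)
    # Assumed formula, consistent with the oMeasure value reported above.
    overall = recall * (2 - 1 / precision)
    return precision, recall, f_measure, overall


# Rule no1: 103 links found; 99 correct and 99 expected are inferred values.
precision, recall, f_measure, overall = evaluate(found=103, correct=99, expected=99)
print(f"{precision:.4f} {recall:.4f} {f_measure:.4f} {overall:.4f}")
# prints: 0.9612 1.0000 0.9802 0.9596
```

Under the same inferred reference size of 99 links, evaluate(439, 99, 99) gives a precision of about 0.23 and evaluate(89, 89, 99) a recall of about 0.90, in line with the other rows reported for the linkage rules.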
Extra work
For the curious, we have a larger example, between the
French communes in insee-communes.ttl and
those of GeoNames (communes_gn.ttl).
A starting script is geo-script.xml.
This sample data comes from the LinkKeyDisco
system experiments.
Further exercises
More info: https://moex.gitlabpages.inria.fr/alignapi/tutorial/
https://moex.gitlabpages.inria.fr/alignapi/tutorial/tutorial6/