Using the Alignment API: a small tutorial on the Alignment API

Here is a small tutorial for the Alignment API. Most of the tutorial is based on command-line invocations. Of course, this is not the natural way to use the API: it is made to be embedded in an application program, and we are working towards implementing an alignment server that can help programs use the API remotely. The complete tutorial is also available as a self-contained script.sh or script.bat. We make no guarantee on the MS-DOS script; it is only provided as a convenience.

A companion tutorial has been designed for the Alignment Server. It follows, as much as possible, the reasoning of this tutorial but provides input and output through a web browser.

Preparation

First you must download the Alignment API and check that it works as indicated here.

You can modify the Alignment API and its implementation. In this tutorial, we will simply learn how to use it.

The goal of this tutorial is only to help you realize the possibilities of the Alignment API and its implementation. You can follow it by running each command from a command-line interpreter. The examples use sh syntax; this only affects the export VARIABLE=VALUE command, which can be rewritten as setenv VARIABLE VALUE in C shells.

The data

Your mission, if you accept it, will be to find the best alignment between two bibliographic ontologies. They can be seen here:

Matching

For demonstrating the use of our implementation of the Alignment API, we implemented a particular processor (fr.inrialpes.exmo.align.cli.Procalign) which:

Let's try to match these two ontologies ($CWD is a variable that has been set up to the directory just above this one):

The result is displayed on the standard output. Since the output is too long we send it to a file by using the -o switch:

Hence, it is possible to display the alignment in HTML by using the adequate renderer:

The result is expressed in the Alignment format. This format, in RDF/XML, is made of a header containing "metadata" about the alignment:

<?xml version='1.0' encoding='utf-8' standalone='no'?>
<rdf:RDF xmlns='http://knowledgeweb.semanticweb.org/heterogeneity/alignment#'
         xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:xsd='http://www.w3.org/2001/XMLSchema#'
         xmlns:align='http://knowledgeweb.semanticweb.org/heterogeneity/alignment#'>
<Alignment>
  <xml>yes</xml>
  <level>0</level>
  <type>**</type>
  <method>fr.inrialpes.exmo.align.impl.method.StringDistAlignment</method>
  <time>18</time>
  <onto1>
    <Ontology rdf:about="https://moex.gitlabpages.inria.fr/alignapi/tutorial/myOnto.owl">
      <location>file:///Java/alignapi/html/tutorial/myOnto.owl</location>
      <formalism>
        <Formalism align:name="OWL1.0" align:uri="http://www.w3.org/2002/07/owl#"/>
      </formalism>
    </Ontology>
  </onto1>
  <onto2>
    <Ontology rdf:about="https://moex.gitlabpages.inria.fr/alignapi/tutorial/edu.mit.visus.bibtex.owl">
      <location>file:///Java/alignapi/html/tutorial/edu.mit.visus.bibtex.owl</location>
      <formalism>
        <Formalism align:name="OWL1.0" align:uri="http://www.w3.org/2002/07/owl#"/>
      </formalism>
    </Ontology>
  </onto2>

Each correspondence is made of two references to the aligned entities, the relation holding between the entities (=) and a confidence measure (1.0) in this correspondence. Because the default method used for aligning the ontologies is very simple (it only compares the labels of the entities and finds a correspondence when their labels are equal), the correspondences are always that simple. This is too simple, so we will use a more sophisticated method based on an edit distance:

$ java -jar ../../../lib/procalign.jar -i fr.inrialpes.exmo.align.impl.method.StringDistAlignment -DstringFunction=levenshteinDistance file://$CWD/myOnto.owl file://$CWD/edu.mit.visus.bibtex.owl -o results/levenshtein.rdf

This is achieved by specifying the class of Alignment to be used (through the -i switch) and the distance function to be used (-DstringFunction=levenshteinDistance).

We can see that the correspondences now contain confidence factors other than 1.0. They also match strings that are not identical, and far more correspondences are available.
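To see why an edit distance yields confidence values below 1.0, here is a minimal Python sketch. It is not the code of StringDistAlignment: the function names are hypothetical, and the lower-casing and the normalization by the length of the longer label are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical labels, 0.0 for entirely different ones."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(similarity("Techreport", "Techreport"))
print(similarity("copyright", "hasCopyright"))
```

With such a measure, labels that differ by a few characters get a confidence between 0 and 1 instead of being rejected outright, which is why more correspondences appear.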

More work: you can apply other available alignment classes. Look in the ../../../src/fr/inrialpes/exmo/align/impl/method directory for more simple alignment methods. Also look in the StringDistances class for the possible values of stringFunction (they are the names of methods).

Advanced: You can also follow the instructions for installing WordNet and its Java interface, and then use a WordNet-based distance provided with the API implementation ($WNDIR is the directory where WordNet 3.0 is installed):

$ java -cp ../../../lib/procalign.jar:../../../lib/jwnl/jwnl.jar fr.inrialpes.exmo.align.cli.Procalign -Dwndict=$WNDIR -i fr.inrialpes.exmo.align.ling.JWNLAlignment file://$CWD/myOnto.owl file://$CWD/edu.mit.visus.bibtex.owl -o results/jwnl.rdf

See the output in RDF/XML or HTML (if rendered as before).

Manipulating

As can be seen, some correspondences do not really make sense. Fortunately, they also have very low confidence values. It is thus worthwhile to use a threshold to eliminate them. Let's try a threshold of 0.33 over the alignment (with the -t switch):

$ java -jar ../../../lib/procalign.jar file://$CWD/myOnto.owl file://$CWD/edu.mit.visus.bibtex.owl -i fr.inrialpes.exmo.align.impl.method.StringDistAlignment -DstringFunction=levenshteinDistance -t 0.33 -o results/levenshtein33.rdf
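The effect of such a threshold can be pictured with the following Python sketch. It does not use the Alignment API: the Correspondence type and the hard_threshold function are hypothetical, the confidence values are invented for illustration, and keeping values equal to the threshold is an assumption.

```python
from typing import NamedTuple


class Correspondence(NamedTuple):
    entity1: str
    entity2: str
    relation: str
    confidence: float


def hard_threshold(alignment, threshold):
    """Keep only the correspondences whose confidence reaches the threshold."""
    return [c for c in alignment if c.confidence >= threshold]


# Invented confidence values, for illustration only.
alignment = [
    Correspondence("myOnto#Techreport", "bibtex#Techreport", "=", 1.0),
    Correspondence("myOnto#copyright", "bibtex#hasCopyright", "=", 0.75),
    Correspondence("myOnto#Journal", "bibtex#Conference", "=", 0.2),
]

kept = hard_threshold(alignment, 0.33)
for c in kept:
    print(c.entity1, c.relation, c.entity2, c.confidence)
```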

As expected, we have suppressed some of these inaccurate correspondences. But did we also suppress accurate ones?

This operation has helped eliminate a number of inaccurate correspondences like Journal-Conference or Composite-Conference. However, some inaccurate correspondences remain, like Institution-InCollection and Published-UnPublished!

$ java -jar ../../../lib/procalign.jar -i fr.inrialpes.exmo.align.impl.method.StringDistAlignment -DstringFunction=smoaDistance file://$CWD/myOnto.owl file://$CWD/edu.mit.visus.bibtex.owl -t 0.5 -o results/SMOA5.rdf

Other manipulations: It is possible to invert an alignment with the following command:

See the output in RDF/XML or HTML (if rendered as before). The result is an alignment from the former target ontology to the former source ontology: inverting an alignment only exchanges the order of the elements in the alignment file. This can be useful when you have an alignment from A to B and an alignment from C to B, and you want to go from A to C. The solution is then to invert the second alignment and to compose the first one with it.
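The A, B, C scenario can be sketched in Python as follows. This is only an illustration on tuples of the form (entity1, entity2, relation, confidence), not the API's own operations: the function names and entity names are hypothetical, only the = relation is handled, and multiplying the confidences when composing is an assumption (taking the minimum would be another reasonable choice).

```python
def invert(alignment):
    """Swap the two entities of every correspondence (enough for the '=' relation)."""
    return [(e2, e1, rel, conf) for (e1, e2, rel, conf) in alignment]


def compose(first, second):
    """Chain first (A to B) with second (B to C) through their shared B entities.

    Assumption: the confidence of a composed correspondence is the product
    of the two confidences being chained.
    """
    by_source = {}
    for (b, c, rel, conf) in second:
        by_source.setdefault(b, []).append((c, rel, conf))
    result = []
    for (a, b, rel1, conf1) in first:
        for (c, rel2, conf2) in by_source.get(b, []):
            if rel1 == "=" and rel2 == "=":
                result.append((a, c, "=", conf1 * conf2))
    return result


# Hypothetical entities and confidences.
a_to_b = [("A#Paper", "B#Article", "=", 0.9)]
c_to_b = [("C#Publication", "B#Article", "=", 0.8)]

a_to_c = compose(a_to_b, invert(c_to_b))
print(a_to_c)
```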

More work: There is another switch (-T) in Procalign that specifies the way a threshold is applied (hard|perc|prop|best|span) the default being "hard". The curious reader can apply these and see the difference in results. How they work is explained in the Alignment API documentation.

Output

Once a good alignment has been found, only half of the work has been done. To actually use the result, it is necessary to transform it into some processable format. For instance, if one wants to merge two OWL ontologies, the alignment can be changed into a set of OWL "bridging" axioms. This is achieved by "rendering" the alignment in OWL (through the -r switch):

<owl:Class rdf:about="https://moex.gitlabpages.inria.fr/alignapi/tutorial/myOnto.owl#Techreport">
  <owl:equivalentClass rdf:resource="https://moex.gitlabpages.inria.fr/alignapi/tutorial/edu.mit.visus.bibtex.owl#Techreport"/>
</owl:Class>
<owl:ObjectProperty rdf:about="https://moex.gitlabpages.inria.fr/alignapi/tutorial/myOnto.owl#copyright">
  <owl:equivalentProperty rdf:resource="https://moex.gitlabpages.inria.fr/alignapi/tutorial/edu.mit.visus.bibtex.owl#hasCopyright"/>
</owl:ObjectProperty>
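The shape of these bridging axioms can be reproduced with a few lines of Python. This is a sketch for illustration, not the renderer shipped with the API: the render_equivalence function is hypothetical, it covers only equivalence between two classes or two object properties, and it assumes the caller already knows which kind of entity is involved.

```python
from xml.sax.saxutils import quoteattr


def render_equivalence(kind, entity1, entity2):
    """Return one OWL axiom stating that entity1 and entity2 are equivalent.

    kind is "class" or "property"; any other value is rejected.
    """
    if kind == "class":
        element, axiom = "owl:Class", "owl:equivalentClass"
    elif kind == "property":
        element, axiom = "owl:ObjectProperty", "owl:equivalentProperty"
    else:
        raise ValueError(f"unsupported entity kind: {kind}")
    # quoteattr escapes the URI and adds the surrounding quotes.
    return (f"<{element} rdf:about={quoteattr(entity1)}>\n"
            f"  <{axiom} rdf:resource={quoteattr(entity2)}/>\n"
            f"</{element}>")


base1 = "https://moex.gitlabpages.inria.fr/alignapi/tutorial/myOnto.owl"
base2 = "https://moex.gitlabpages.inria.fr/alignapi/tutorial/edu.mit.visus.bibtex.owl"

class_axiom = render_equivalence("class", base1 + "#Techreport", base2 + "#Techreport")
property_axiom = render_equivalence("property", base1 + "#copyright", base2 + "#hasCopyright")
print(class_axiom)
print(property_axiom)
```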

If one wants to use the alignments only for inferring on instances without actually merging the classes, one can generate SWRL rules:

<ruleml:imp>
  <ruleml:_body>
    <swrl:classAtom>
      <owllx:Class owllx:name="https://moex.gitlabpages.inria.fr/alignapi/tutorial/myOnto.owl#Techreport"/>
      <ruleml:var>x</ruleml:var>
    </swrl:classAtom>
  </ruleml:_body>
  <ruleml:_head>
    <swrl:classAtom>
      <owllx:Class owllx:name="https://moex.gitlabpages.inria.fr/alignapi/tutorial/edu.mit.visus.bibtex.owl#Techreport"/>
      <ruleml:var>x</ruleml:var>
    </swrl:classAtom>
  </ruleml:_head>
</ruleml:imp>
<ruleml:imp>
  <ruleml:_body>
    <swrl:individualPropertyAtom swrlx:property="https://moex.gitlabpages.inria.fr/alignapi/tutorial/myOnto.owl#copyright">
      <ruleml:var>x</ruleml:var>
      <ruleml:var>y</ruleml:var>
    </swrl:individualPropertyAtom>
  </ruleml:_body>
  <ruleml:_head>
    <swrl:datavaluedPropertyAtom swrlx:property="https://moex.gitlabpages.inria.fr/alignapi/tutorial/edu.mit.visus.bibtex.owl#hasCopyright">
      <ruleml:var>x</ruleml:var>
      <ruleml:var>y</ruleml:var>
    </swrl:datavaluedPropertyAtom>
  </ruleml:_head>
</ruleml:imp>

Exchanging data can also be achieved more simply through XSLT transformations which will transform the OWL instance files from one ontology to another:

Evaluating

We will evaluate alignments by comparing them to some reference alignment which is supposed to express what is expected from an alignment of these two ontologies. The reference alignment is refalign.rdf (or HTML, if rendered as before).

For evaluating, we use a class other than Procalign, called EvalAlign, which must be specified to java. By default, it computes precision, recall and associated measures. It can be invoked this way:

The first argument is always the reference alignment, the second one is the alignment to be evaluated. The result is given here:

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:map='http://www.atl.external.lmco.com/projects/ontology/ResultsOntology.n3#'>
  <map:output rdf:about=''>
    <map:input1 rdf:resource="https://moex.gitlabpages.inria.fr/alignapi/tutorial/myOnto.owl"/>
    <map:input2 rdf:resource="https://moex.gitlabpages.inria.fr/alignapi/tutorial/edu.mit.visus.bibtex.owl"/>
    <map:precision>1.0</map:precision>
    <map:recall>0.3541666666666667</map:recall>
    <fallout>0.0</fallout>
    <map:fMeasure>0.5230769230769231</map:fMeasure>
    <map:oMeasure>0.3541666666666667</map:oMeasure>
    <time>22</time>
    <result>0.3541666666666667</result>
  </map:output>
</rdf:RDF>

Of course, since this method only matches objects with the same name, it is accurate, yielding a high precision. However, it has poor recall.
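The relation between these figures can be checked with a short Python sketch of the usual definitions of precision, recall and F-measure. The evaluate function is hypothetical and is not the PRecEvaluator code. The counts used below (17 correspondences found, all of them correct, against 48 in the reference) are not stated in this tutorial; they are an assumption chosen because they are consistent with the reported precision of 1.0 and recall of 0.354.

```python
def evaluate(found, correct, expected):
    """Compute precision, recall and F-measure from three counts.

    found:    number of correspondences in the evaluated alignment
    correct:  number of those that also appear in the reference alignment
    expected: number of correspondences in the reference alignment
    """
    if found == 0 or expected == 0:
        raise ValueError("precision and recall are undefined for empty alignments")
    precision = correct / found
    recall = correct / expected
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Assumed counts, consistent with the ratios reported above.
precision, recall, f_measure = evaluate(found=17, correct=17, expected=48)
print(precision, recall, f_measure)
```

The F-measure is the harmonic mean of precision and recall, so a perfect precision cannot compensate for a low recall.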

We can now evaluate the edit distance. What to expect from the evaluation of this alignment?

Since it returns more correspondences by loosening the constraints for being a correspondence, it is expected that the recall will increase at the expense of precision.

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:map='http://www.atl.external.lmco.com/projects/ontology/ResultsOntology.n3#'>
  <map:output rdf:about=''>
    <map:input1 rdf:resource="https://moex.gitlabpages.inria.fr/alignapi/tutorial/myOnto.owl"/>
    <map:input2 rdf:resource="https://moex.gitlabpages.inria.fr/alignapi/tutorial/edu.mit.visus.bibtex.owl"/>
    <map:precision>0.6486486486486487</map:precision>
    <map:recall>1.0</map:recall>
    <fallout>0.35135135135135137</fallout>
    <map:fMeasure>0.7868852459016393</map:fMeasure>
    <map:oMeasure>0.4583333333333335</map:oMeasure>
    <result>1.5416666666666665</result>
  </map:output>
</rdf:RDF>

It is possible to summarize these results by comparing them to each other. This can be achieved with the GroupEval class. This class can output several formats (HTML by default) and takes all the alignments in the subdirectories of the current directory. Here we only have the results directory:

The results are displayed in the results/eval.html file whose main content is the table:

algo      refalign            equal               SMOA                SMOA5               levenshtein         levenshtein33
test      Prec. Rec.  FMeas.  Prec. Rec.  FMeas.  Prec. Rec.  FMeas.  Prec. Rec.  FMeas.  Prec. Rec.  FMeas.  Prec. Rec.  FMeas.
results   1.00  1.00  1.00    1.00  0.35  0.52    0.57  0.98  0.72    0.72  0.98  0.83    0.55  1.00  0.71    0.65  1.00  0.79
H-mean    1.00  1.00  1.00    1.00  0.35  0.52    0.57  0.98  0.72    0.72  0.98  0.83    0.55  1.00  0.71    0.65  1.00  0.79

n/a: result alignment not provided or not readable
NaN: division by zero, likely due to an empty alignment.

More work: As you can see, the PRecEvaluator provides not only precision and recall but also F-measure. F-measure is usually used as an "absolute" trade-off between precision and recall (i.e., the optimum F-measure is considered to give the best combination of precision and recall). Can you establish this point for SMOA and levenshtein and tell which algorithm is better suited?
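One possible way to approach this exercise is to sweep over candidate thresholds and keep the one giving the highest F-measure. The Python sketch below shows the idea on invented toy data: the function names, the entity pairs, the confidence values and the list of candidate thresholds are all hypothetical, and none of it comes from the tutorial's ontologies or from the API.

```python
def f_measure(found, reference):
    """F-measure of a set of found pairs against a set of reference pairs."""
    correct = len(found & reference)
    if correct == 0:
        return 0.0
    precision = correct / len(found)
    recall = correct / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored, reference, thresholds):
    """Return the (threshold, F-measure) pair with the highest F-measure."""
    best = (None, -1.0)
    for t in thresholds:
        # Apply a hard threshold, then evaluate what remains.
        found = {pair for pair, conf in scored.items() if conf >= t}
        f = f_measure(found, reference)
        if f > best[1]:
            best = (t, f)
    return best


# Invented toy data, for illustration only.
scored = {("a", "x"): 1.0, ("b", "y"): 0.6, ("c", "z"): 0.4, ("d", "x"): 0.2}
reference = {("a", "x"), ("b", "y"), ("c", "w")}

threshold, f = best_threshold(scored, reference, [0.1, 0.3, 0.5, 0.7])
print(threshold, f)
```

Running the same sweep on the output of each method and comparing the best F-measures obtained is one way to compare the methods at their respective optimum.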