Bibliography on Data interlinking/Liage de données (2017-06-06)
Manel Achichi, Michelle Cheatham, Zlatan Dragisic, Jérôme Euzenat, Daniel Faria, Alfio Ferrara, Giorgos Flouris, Irini Fundulaki, Ian Harrow, Valentina Ivanova, Ernesto Jiménez-Ruiz, Elena Kuss, Patrick Lambrix, Henrik Leopold, Huanyu Li, Christian Meilicke, Stefano Montanelli, Catia Pesquita, Tzanina Saveta, Pavel Shvaiko, Andrea Splendiani, Heiner Stuckenschmidt, Konstantin Todorov, Cássia Trojahn dos Santos, Ondřej Zamazal, Results of the Ontology Alignment Evaluation Initiative 2016, in: Pavel Shvaiko, Jérôme Euzenat, Ernesto Jiménez-Ruiz, Michelle Cheatham, Oktie Hassanzadeh, Ryutaro Ichise (eds), Proc. 11th ISWC workshop on ontology matching (OM), Kobe (JP), pp73-129, 2016
Ontology matching consists of finding correspondences between semantically related entities of two ontologies. OAEI campaigns aim at comparing ontology matching systems on precisely defined test cases. These test cases can use ontologies of different nature (from simple thesauri to expressive OWL ontologies) and use different modalities, e.g., blind evaluation, open evaluation, or consensus. OAEI 2016 offered 9 tracks with 22 test cases, and was attended by 21 participants. This paper is an overall presentation of the OAEI 2016 campaign.
Mustafa Al-Bakri, Manuel Atencia, Jérôme David, Steffen Lalande, Marie-Christine Rousset, Uncertainty-sensitive reasoning for inferring sameAs facts in linked data, in: Gal Kaminka, Maria Fox, Paolo Bouquet, Eyke Hüllermeier, Virginia Dignum, Frank Dignum, Frank van Harmelen (eds), Proc. 22nd european conference on artificial intelligence (ECAI), The Hague (NL), pp698-706, 2016
Discovering whether or not two URIs described in Linked Data -- in the same or different RDF datasets -- refer to the same real-world entity is crucial for building applications that exploit the cross-referencing of open data. A major challenge in data interlinking is to design tools that effectively deal with incomplete and noisy data, and exploit uncertain knowledge. In this paper, we model data interlinking as a reasoning problem with uncertainty. We introduce a probabilistic framework for modelling and reasoning over uncertain RDF facts and rules that is based on the semantics of probabilistic Datalog. We have designed an algorithm, ProbFR, based on this framework. Experiments on real-world datasets have shown the usefulness and effectiveness of our approach for data linkage and disambiguation.
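The entry above does not detail ProbFR itself; the following is a minimal Python sketch, under the assumption that uncertain rules propagate probabilities to derived sameAs facts by multiplying the probabilities of a derivation's premises with the rule weight and combining alternative derivations with noisy-or, in the spirit of probabilistic Datalog. The rule, weights and data below are hypothetical.

    from itertools import product

    # Toy forward chaining over uncertain facts, in the spirit of probabilistic
    # Datalog: a derivation's probability is the product of its premises and the
    # rule weight; alternative derivations are combined with noisy-or.
    # (Illustrative only; this is not the ProbFR algorithm from the paper.)

    facts = {                      # (predicate, subject, object) -> probability
        ("name", "d1:p1", "J. Smith"): 1.0,
        ("name", "d2:q7", "J. Smith"): 0.9,
        ("birthYear", "d1:p1", "1970"): 0.8,
        ("birthYear", "d2:q7", "1970"): 1.0,
    }

    # Hypothetical uncertain rule: same name and same birth year -> sameAs (weight 0.85)
    RULE_WEIGHT = 0.85

    def noisy_or(p, q):
        return 1 - (1 - p) * (1 - q)

    same_as = {}
    names = [(f, p) for f, p in facts.items() if f[0] == "name"]
    years = [(f, p) for f, p in facts.items() if f[0] == "birthYear"]
    for (n1, pn1), (n2, pn2) in product(names, names):
        if n1[1] >= n2[1] or n1[2] != n2[2]:
            continue  # keep only ordered pairs of distinct resources with equal names
        for (y1, py1), (y2, py2) in product(years, years):
            if (y1[1], y2[1]) != (n1[1], n2[1]) or y1[2] != y2[2]:
                continue  # keep only matching birth years for the same pair of resources
            prob = RULE_WEIGHT * pn1 * pn2 * py1 * py2   # one derivation
            key = (n1[1], n2[1])
            same_as[key] = noisy_or(same_as.get(key, 0.0), prob)

    print(same_as)   # {('d1:p1', 'd2:q7'): 0.612}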
Michelle Cheatham, Zlatan Dragisic, Jérôme Euzenat, Daniel Faria, Alfio Ferrara, Giorgos Flouris, Irini Fundulaki, Roger Granada, Valentina Ivanova, Ernesto Jiménez-Ruiz, Patrick Lambrix, Stefano Montanelli, Catia Pesquita, Tzanina Saveta, Pavel Shvaiko, Alessandro Solimando, Cássia Trojahn dos Santos, Ondřej Zamazal, Results of the Ontology Alignment Evaluation Initiative 2015, in: Pavel Shvaiko, Jérôme Euzenat, Ernesto Jiménez-Ruiz, Michelle Cheatham, Oktie Hassanzadeh (eds), Proc. 10th ISWC workshop on ontology matching (OM), Bethlehem (PA US), pp60-115, 2016
Ontology matching consists of finding correspondences between semantically related entities of two ontologies. OAEI campaigns aim at comparing ontology matching systems on precisely defined test cases. These test cases can use ontologies of different nature (from simple thesauri to expressive OWL ontologies) and use different modalities, e.g., blind evaluation, open evaluation and consensus. OAEI 2015 offered 8 tracks with 15 test cases followed by 22 participants. Since 2011, the campaign has been using a new evaluation modality which provides more automation to the evaluation. This paper is an overall presentation of the OAEI 2015 campaign.
Jérôme Euzenat, Extraction de clés de liage de données (résumé étendu), in: Actes 16e conférence internationale francophone sur extraction et gestion des connaissances (EGC), Reims (FR), (Bruno Crémilleux, Cyril de Runz (éds), Actes 16e conférence internationale francophone sur extraction et gestion des connaissances (EGC), Revue des nouvelles technologies de l'information E30, 2016), pp9-12, 2016
Large quantities of data are published on the web of data. Linking them consists of identifying the same resources in two data sets, which allows the joint exploitation of the published data. But link extraction is not an easy task. We have developed an approach that extracts link keys. Link keys extend the notion of key from relational algebra to several data sources. They are based on sets of pairs of properties that identify objects when these objects have the same values, or common values, for these properties. We present a way to automatically extract candidate link keys from data. This operation can be expressed in formal concept analysis. The quality of candidate keys can be evaluated depending on whether a sample of links is available (supervised case) or not (unsupervised case). The relevance and robustness of such keys are illustrated on a real-world example.
Maroua Gmati, Manuel Atencia, Jérôme Euzenat, Tableau extensions for reasoning with link keys, in: Pavel Shvaiko, Jérôme Euzenat, Ernesto Jiménez-Ruiz, Michelle Cheatham, Oktie Hassanzadeh, Ryutaro Ichise (eds), Proc. 11th ISWC workshop on ontology matching (OM), Kobe (JP), pp37-48, 2016
Link keys allow for generating links across data sets expressed in different ontologies. But they can also be thought of as axioms in a description logic. As such, they can contribute to infer ABox axioms, such as links, or terminological axioms and other link keys. Yet, no reasoning support exists for link keys. Here we extend the tableau method designed for ALC to take link keys into account. We show how this extension enables combining link keys with terminological reasoning, with and without ABox and TBox, and generating non-trivial link keys.
Link key, Tableau method, Description logics, Semantic web
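As a reading aid, here is a hedged LaTeX rendering of the link key axiom assumed in the entry above. It follows the weak link key reading spelled out in the Atencia, David and Euzenat (ECAI 2014) entry further down (shared values on every paired property entail identity); the exact conditions encoded by the paper's tableau rules may differ.

    % A link key over property pairs between classes C and D, read as an axiom:
    % if x is a C, y is a D, and they share a value on every paired property,
    % then x and y denote the same individual (weak link key reading).
    \[
    \{\langle p_1,q_1\rangle,\dots,\langle p_n,q_n\rangle\}\ \mathrm{linkkey}\ \langle C,D\rangle
    \]
    \[
    C(x) \wedge D(y) \wedge \bigwedge_{i=1}^{n} \exists v_i\,\big(p_i(x,v_i) \wedge q_i(y,v_i)\big)
    \;\Rightarrow\; x = y
    \]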
Tatiana Lesnikova, Jérôme David, Jérôme Euzenat, Cross-lingual RDF thesauri interlinking, in: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis (eds), Proc. 10th international conference on Language resources and evaluation (LREC), Portoroz (SI), pp2442-2449, 2016
Various lexical resources are being published in RDF. To enhance the usability of these resources, identical resources in different data sets should be linked. If lexical resources are described in different natural languages, then techniques to deal with multilinguality are required for interlinking. In this paper, we evaluate machine translation for interlinking concepts, i.e., generic entities named with a common noun or term. In our previous work, the evaluated method was applied to named entities. We conduct two experiments involving different thesauri in different languages. The first experiment involves concepts from the TheSoz multilingual thesaurus in three languages: English, French and German. The second experiment involves concepts from the EuroVoc and AGROVOC thesauri in English and Chinese respectively. Our results demonstrate that machine translation can be beneficial for cross-lingual thesauri interlinking independently of the dataset structure.
Cross-lingual data interlinking, owl:sameAs, Thesaurus alignment
Tatiana Lesnikova, RDF data interlinking: evaluation of cross-lingual methods, Thèse d'informatique, Université de Grenoble, Grenoble (FR), May 2016
The Semantic Web extends the Web by publishing structured and interlinked data using RDF. An RDF data set is a graph where resources are nodes labelled in natural languages. One of the key challenges of linked data is to be able to discover links across RDF data sets. Given two data sets, equivalent resources should be identified and linked by owl:sameAs links. This problem is particularly difficult when resources are described in different natural languages. This thesis investigates the effectiveness of linguistic resources for interlinking RDF data sets. For this purpose, we introduce a general framework in which each RDF resource is represented as a virtual document containing text information from neighboring nodes. The context of a resource is made of the labels of its neighboring nodes. Once virtual documents are created, they are projected into the same space in order to be compared, either by using machine translation or multilingual lexical resources. Once documents are in the same space, similarity measures are applied to find identical resources; the similarity between elements of this space is taken as the similarity between RDF resources. We experimentally evaluated different cross-lingual methods for linking RDF data within the proposed framework, exploring two strategies in particular: applying machine translation or using references to multilingual resources. Overall, the evaluation shows the effectiveness of cross-lingual string-based approaches for linking RDF resources expressed in different languages. The methods have been evaluated on resources in English, Chinese, French and German. The best performance (over 0.90 F-measure) was obtained by the machine translation approach. This shows that the similarity-based method can be successfully applied to RDF resources independently of their type (named entities or thesaurus concepts). The best experimental results, involving just a pair of languages, demonstrate the usefulness of such techniques for interlinking RDF resources cross-lingually.
Semantic web, Cross-lingual data treatment, Artificial intelligence
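A minimal sketch of the virtual document construction described in the thesis abstract above, assuming rdflib and a context limited to the resource's literal values plus the rdfs:label values of its direct neighbours; the thesis's exact context definition may include more information, and the input file name is a placeholder.

    from rdflib import Graph, URIRef, Literal, RDFS

    def virtual_document(graph: Graph, resource: URIRef) -> str:
        """Collect the textual context of a resource: its literal values (labels
        included) and the labels of the resources it points to."""
        tokens = []
        for _, _, obj in graph.triples((resource, None, None)):
            if isinstance(obj, Literal):
                tokens.append(str(obj))
            else:
                for _, _, label in graph.triples((obj, RDFS.label, None)):
                    tokens.append(str(label))
        return " ".join(tokens)

    g = Graph()
    g.parse("dataset.ttl", format="turtle")   # hypothetical input file
    doc = virtual_document(g, URIRef("http://example.org/resource/42"))
    print(doc)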
Pavel Shvaiko, Jérôme Euzenat, Ernesto Jiménez-Ruiz, Michelle Cheatham, Oktie Hassanzadeh (eds), Proc. 10th ISWC workshop on ontology matching (OM), Bethlehem (PA US), 239p., 2016
Pavel Shvaiko, Jérôme Euzenat, Ernesto Jiménez-Ruiz, Michelle Cheatham, Oktie Hassanzadeh, Ryutaro Ichise (eds), Proc. 11th ISWC workshop on ontology matching (OM), Kobe (JP), 252p., 2016
Mustafa Al-Bakri, Manuel Atencia, Steffen Lalande, Marie-Christine Rousset, Inferring same-as facts from linked data: an iterative import-by-query approach, in: Blai Bonet, Sven Koenig (eds), Proc. 29th conference on Artificial Intelligence (AAAI), Austin (TX US), pp9-15, 2015
In this paper we model the problem of data linkage in Linked Data as a reasoning problem on possibly decentralized data. We describe a novel import-by-query algorithm that alternates steps of sub-query rewriting and of tailored querying of the Linked Data cloud in order to import data that are as specific as possible for inferring or contradicting given target same-as facts. Experiments conducted on a real-world dataset have demonstrated the feasibility of this approach and its usefulness in practice for data linkage and disambiguation.
LOD, Data interlinking
Tatiana Lesnikova, Jérôme David, Jérôme Euzenat, Interlinking English and Chinese RDF data using BabelNet, in: Pierre Genevès, Christine Vanoirbeek (eds), Proc. 15th ACM international symposium on Document engineering (DocEng), Lausanne (CH), pp39-42, 2015
Linked data technologies make it possible to publish and link structured data on the Web. Although RDF is not about text, many RDF data providers publish their data in their own language. Cross-lingual interlinking aims at discovering links between identical resources across knowledge bases in different languages. In this paper, we present a method for interlinking RDF resources described in English and Chinese using the BabelNet multilingual lexicon. Resources are represented as vectors of identifiers and then similarity between these resources is computed. The method achieves an F-measure of 88%. The results are also compared to a translation-based method.
Cross-lingual instance linking, Cross-lingual link discovery, owl:sameAs
Manuel Atencia, Jérôme David, Jérôme Euzenat, Data interlinking through robust linkkey extraction, in: Torsten Schaub, Gerhard Friedrich, Barry O'Sullivan (eds), Proc. 21st european conference on artificial intelligence (ECAI), Praha (CZ), pp15-20, 2014
Links are important for the publication of RDF data on the web. Yet, establishing links between data sets is not an easy task. We develop an approach for that purpose which extracts weak linkkeys. Linkkeys extend the notion of a key to the case of different data sets. They are made of a set of pairs of properties belonging to two different classes. A weak linkkey holds between two classes if any resources having common values for all of these properties are the same resources. An algorithm is proposed to generate a small set of candidate linkkeys. Depending on whether some of the links (valid or invalid) are known, we define supervised and unsupervised measures for selecting the appropriate linkkeys. The supervised measures approximate precision and recall, while the unsupervised measures are the ratio of pairs of entities a linkkey covers (coverage) and the ratio of entities from the same data set it identifies (discrimination). We have experimented with these techniques on two data sets, showing the accuracy and robustness of both approaches.
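A small Python sketch of the candidate linkkey idea from the abstract above: a candidate is a set of property pairs, and it links two resources when they share a value for every pair. The coverage and discrimination functions below are simplified stand-ins for the paper's unsupervised measures, and the data are hypothetical.

    # Toy candidate linkkey evaluation (simplified; not the paper's algorithm or exact measures).
    d1 = {"a1": {"title": {"Moby Dick"}, "year": {"1851"}},
          "a2": {"title": {"Emma"}, "year": {"1815"}}}
    d2 = {"b1": {"name": {"Moby Dick"}, "date": {"1851"}},
          "b2": {"name": {"Emma"}, "date": {"1815"}},
          "b3": {"name": {"Emma"}, "date": {"1816"}}}

    def links(candidate):
        """Pairs of resources sharing a value for every property pair of the candidate."""
        out = set()
        for x, px in d1.items():
            for y, py in d2.items():
                if all(px.get(p, set()) & py.get(q, set()) for p, q in candidate):
                    out.add((x, y))
        return out

    def coverage(ls):
        """Share of resources (from both data sets) involved in at least one link."""
        covered = {x for x, _ in ls} | {y for _, y in ls}
        return len(covered) / (len(d1) + len(d2))

    def discrimination(ls):
        """1.0 when links are one-to-one; lower when a resource is linked several times."""
        return (len({x for x, _ in ls}) + len({y for _, y in ls})) / (2 * len(ls)) if ls else 0.0

    for cand in [{("title", "name")}, {("title", "name"), ("year", "date")}]:
        ls = links(cand)
        print(sorted(cand), sorted(ls), round(coverage(ls), 2), round(discrimination(ls), 2))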
Manuel Atencia, Michel Chein, Madalina Croitoru, Jérôme David, Michel Leclère, Nathalie Pernelle, Fatiha Saïs, François Scharffe, Danai Symeonidou, Defining key semantics for the RDF datasets: experiments and evaluations, in: Proc. 21st International Conference on Conceptual Structures (ICCS), Iasi (RO), (Graph-Based Representation and Reasoning (Proc. 21st International Conference on Conceptual Structures (ICCS)), Lecture notes in artificial intelligence 8577, 2014), pp65-78, 2014
Many techniques have recently been proposed to automate the linkage of RDF datasets. Predicate selection is the step of the linkage process that consists in selecting the smallest set of relevant predicates needed to enable instance comparison. We call this set of predicates a key, by analogy with the notion of key in relational databases. We formally explain the different assumptions behind two existing key semantics. We then experimentally evaluate the keys by studying how discovered keys can help with dataset interlinking or cleaning. We discuss the experimental results and show that the two different semantics lead to comparable results on the studied datasets.
semantics of a key, data interlinking
Manuel Atencia, Jérôme David, Jérôme Euzenat, What can FCA do for database linkkey extraction?, in: Proc. 3rd ECAI workshop on What can FCA do for Artificial Intelligence? (FCA4AI), Praha (CZ), pp85-92, 2014
Links between heterogeneous data sets may be found by using a generalisation of keys in databases, called linkkeys, which apply across data sets. This paper considers the question of characterising such keys in terms of formal concept analysis. This question is natural because the space of candidate keys is an ordered structure obtained by reduction of the space of keys and that of data set partitions. Classical techniques for generating functional dependencies in formal concept analysis indeed apply for finding candidate keys. They can be adapted in order to find database candidate linkkeys. The question of their extensibility to the RDF context would be worth investigating.
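A hedged sketch of the formal context this abstract alludes to, under the assumption that objects are pairs of instances (one per data set) and attributes are the property pairs on which they agree; closed attribute sets then play the role of candidate linkkeys. This is an illustration only, not the paper's construction or algorithm, and the data are hypothetical.

    from itertools import product

    # Formal context: objects = pairs of instances, attributes = property pairs
    # on which the two instances share a value. Concept intents (closed attribute
    # sets) are taken as candidate linkkeys.
    d1 = {"a1": {"title": "Moby Dick", "year": "1851"},
          "a2": {"title": "Emma",      "year": "1815"}}
    d2 = {"b1": {"name": "Moby Dick", "date": "1851"},
          "b2": {"name": "Emma",      "date": "1816"}}

    context = {}   # (instance of d1, instance of d2) -> set of agreeing property pairs
    for (x, px), (y, py) in product(d1.items(), d2.items()):
        context[(x, y)] = {(p, q) for p, v in px.items() for q, w in py.items() if v == w}

    # Candidate linkkeys: for each object's attribute set, intersect the attribute
    # sets of all objects that contain it (a naive closure computation).
    candidates = set()
    for attrs in context.values():
        closure = frozenset.intersection(
            *[frozenset(a) for _, a in context.items() if attrs <= a])
        if closure:
            candidates.add(closure)
    print(candidates)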
Zhengjie Fan, Concise pattern learning for RDF data sets interlinking, Thèse d'informatique, Université de Grenoble, Grenoble (FR), April 2014
Many data sets are being published on the web with Semantic Web technology. These data sets contain analogous data which represent the same resources in the world. If these data sets are linked together by correctly building links, users can conveniently query data through a uniform interface, as if they were querying a single data set. However, finding correct links is very challenging because there are many instances to compare. Many existing solutions have been proposed for this problem. (1) One straightforward idea is to compare the attribute values of instances for identifying links, yet it is impossible to compare all possible pairs of attribute values. (2) Another common strategy is to compare instances according to attribute correspondences found by instance-based ontology matching, which can generate attribute correspondences based on instances. However, it is hard to identify the same instances across data sets, because identical instances may have unequal values for some of the corresponded attributes. (3) Many existing solutions leverage Genetic Programming to construct interlinking patterns for comparing instances, but they suffer from long running times. In this thesis, an interlinking method is proposed to interlink the same instances across different data sets, based on both statistical learning and symbolic learning. The input is two data sets, class correspondences across the two data sets and a set of sample links that are assessed by users as either "positive" or "negative". The method builds a classifier that distinguishes correct links from incorrect links across the two RDF data sets using the set of assessed sample links. The classifier is composed of attribute correspondences across corresponding classes of the two data sets, which help compare instances and build links. The classifier is called an interlinking pattern in this thesis. On the one hand, our method discovers potential attribute correspondences of each class correspondence via a statistical learning method, the K-medoids clustering algorithm, using instance value statistics. On the other hand, our solution builds the interlinking pattern by a symbolic learning method, Version Space, using all discovered potential attribute correspondences and the set of assessed sample links. Our method can fulfill interlinking tasks for which no conjunctive interlinking pattern covers all assessed correct links, while keeping the pattern concise. Experiments confirm that our interlinking method with only 1% of sample links already reaches a high F-measure (around 0.94-0.99). The F-measure quickly converges and improves on other approaches by nearly 10%.
Interlinking, Ontology Matching, Machine Learning
Zhengjie Fan, Jérôme Euzenat, François Scharffe, Learning concise pattern for interlinking with extended version space, in: Dominik Ślęzak, Hung Son Nguyen, Marek Reformat, Eugene Santos (eds), Proc. 13th IEEE/WIC/ACM international conference on web intelligence (WI), Warsaw (PL), pp70-77, 2014
Many data sets on the web contain analogous data which represent the same resources in the world, so it is helpful to interlink different data sets for sharing information. However, finding correct links is very challenging because there are many instances to compare. In this paper, an interlinking method is proposed to interlink instances across different data sets. The input is class correspondences, property correspondences and a set of sample links that are assessed by users as either "positive" or "negative". We apply a machine learning method, Version Space, to construct a classifier, called an interlinking pattern, that distinguishes correct links from incorrect links across the two data sets. We improve the learning method so that it resolves the no-conjunctive-pattern problem, and call the result Extended Version Space. Experiments confirm that our method with only 1% of sample links already reaches a high F-measure (around 0.96-0.99). The F-measure quickly converges and improves on other comparable approaches by nearly 10%.
Tatiana Lesnikova, Jérôme David, Jérôme Euzenat, Interlinking English and Chinese RDF data sets using machine translation, in: Johanna Völker, Heiko Paulheim, Jens Lehmann, Harald Sack, Vojtěch Svátek (eds), Proc. 3rd ESWC workshop on Knowledge discovery and data mining meets linked open data (Know@LOD), Hersonissos (GR), 2014
Data interlinking is a difficult task, particularly in a multilingual environment like the Web. In this paper, we evaluate the suitability of a machine translation approach to interlink RDF resources described in English and in Chinese. We represent resources as text documents, and the similarity between documents is taken as the similarity between resources. Documents are represented as vectors using two weighting schemes, and cosine similarity is then computed. The experiment demonstrates that TF*IDF with a minimal amount of preprocessing already yields good results.
Semantic web, Cross-lingual link discovery, Cross-lingual instance linking, owl:sameAs
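A minimal sketch of the TF*IDF and cosine similarity step described in the entry above, using scikit-learn; the documents stand for resources whose text has already been brought into the same language by machine translation, and all texts are hypothetical placeholders.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Resources from each data set represented as (already translated) text documents;
    # TF*IDF weighting, then cosine similarity between the two sides.
    english_docs = ["red wine grape beverage", "great wall monument china"]
    translated_docs = ["wine beverage made from grape", "wall fortification china monument"]

    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(english_docs + translated_docs)
    n = len(english_docs)
    similarities = cosine_similarity(matrix[:n], matrix[n:])

    # Each English-side resource linked to its most similar counterpart.
    for i, row in enumerate(similarities):
        print(english_docs[i], "->", translated_docs[row.argmax()])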
Tatiana Lesnikova, Interlinking RDF data in different languages, in: Christophe Roche, Rute Costa, Eva Coudyzer (eds), Proc. 4th workshop on Terminology and Ontology: Theories and applications (TOTh), Bruxelles (BE), 2014
Semantic web, Cross-lingual resource discovery, Multi-lingual instance matching, owl:sameAs
Tatiana Lesnikova, Interlinking cross-lingual RDF data sets, in: Proc. ESWC PhD symposium, Montpellier (FR), (Philipp Cimiano, Óscar Corcho, Valentina Presutti, Laura Hollink, Sebastian Rudolph (eds), The semantic web: research and applications (Proc. 10th European semantic web conference (ESWC)), Lecture notes in computer science 7882, 2013), pp671-675, 2013
Linked Open Data is an essential part of the Semantic Web. More and more data sets are published in natural languages, comprising not only English but other languages as well. It becomes necessary to link the same entities distributed across different RDF data sets. This paper is an initial outline of the research to be conducted on cross-lingual RDF data set interlinking, and it presents several ideas on how to approach this problem.
Multilingual Mappings, Cross-Lingual Link Discovery, Cross-Lingual RDF Data Set Linkage
Tatiana Lesnikova, NLP for interlinking multilingual LOD, in: Proc. ISWC Doctoral consortium, Sydney (NSW AU), (Lora Aroyo, Natalya Noy (eds), Proceedings of the ISWC Doctoral Consortium, Sydney (NSW AU), 2013), pp32-39, 2013
Nowadays, there are many natural languages on the Web, and we can expect that they will stay there even with the development of the Semantic Web. Though the RDF model enables structuring information in a unified way, resources can be described using different natural languages. To find information about the same resource across different languages, we need to link identical resources together. In this paper we present an instance-based approach for resource interlinking. We also show how a graph matching problem can be converted into a document matching problem for discovering cross-lingual mappings across RDF data sets.
Multilingual Mappings, Cross-Lingual Link Discovery, Cross-Lingual RDF Data Set Linkage
Manuel Atencia, Jérôme David, François Scharffe, Keys and pseudo-keys detection for web datasets cleansing and interlinking, in: Proc. 18th international conference on knowledge engineering and knowledge management (EKAW), Galway (IE), (Annette ten Teije, Johanna Voelker, Siegfried Handschuh, Heiner Stuckenschmidt, Mathieu d'Aquin, Andriy Nikolov, Nathalie Aussenac-Gilles, Nathalie Hernandez (eds), Knowledge engineering and knowledge management, Lecture notes in computer science 7603, 2012), pp144-153, 2012
This paper introduces a method for analyzing web datasets based on key dependencies. The classical notion of a key in relational databases is adapted to RDF datasets. In order to better deal with web data of variable quality, the definition of a pseudo-key is presented. An RDF vocabulary for representing keys is also provided. An algorithm to discover keys and pseudo-keys is described. Experimental results show that even for a big dataset such as DBpedia, the runtime of the algorithm is still reasonable. Two applications are further discussed: (i) detection of errors in RDF datasets, and (ii) dataset interlinking.
Data Interlinking, Semantic Web, RDF Data Cleaning
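A naive Python sketch of the key and pseudo-key notions used in the entry above: a key identifies every instance uniquely, while a pseudo-key tolerates a small proportion of exceptions. The 0.6 threshold and the data are arbitrary choices for the example, and the paper's discovery algorithm avoids this exhaustive enumeration.

    from itertools import combinations

    # Key / pseudo-key check over a small property-value table (illustration only).
    data = {"r1": {"name": "Anna", "zip": "38000"},
            "r2": {"name": "Anna", "zip": "75000"},
            "r3": {"name": "Marc", "zip": "38000"},
            "r4": {"name": "Lise", "zip": "69000"},
            "r5": {"name": "Hugo", "zip": "13000"}}

    def uniqueness(props):
        """Share of instances that no other instance duplicates on all given properties."""
        signatures = {r: tuple(v.get(p) for p in props) for r, v in data.items()}
        colliding = {r for r, s in signatures.items()
                     if sum(1 for t in signatures.values() if t == s) > 1}
        return 1 - len(colliding) / len(data)

    for size in (1, 2):
        for props in combinations(["name", "zip"], size):
            score = uniqueness(props)
            kind = "key" if score == 1 else ("pseudo-key" if score >= 0.6 else "not a key")
            print(props, score, kind)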
Jérôme David, François Scharffe, Détection de clefs pour l'interconnexion et le nettoyage de jeux de données, in: Actes 23e journées francophones sur Ingénierie des connaissances (IC), Paris (FR), pp401, 2012
This article proposes a method for analysing RDF data sets published on the web, based on key dependencies. This particular kind of functional dependency, widely studied in database theory, makes it possible to assess whether a set of properties constitutes a key for the data set under consideration. If it does, no two instances can have the same values for these properties. After giving the necessary definitions, we propose an algorithm for detecting minimal keys in an RDF data set. We then use this algorithm to detect keys in several data sets published on the web and apply our approach to two applications: (1) reducing the number of properties to compare in order to detect identical resources across two data sets, and (2) detecting errors within a data set.
Semantic web, Web of data, Interlinking, Ontologies, Keys, Functional dependencies, Data cleansing, RDF
Jérôme Euzenat, A modest proposal for data interlinking evaluation, in: Pavel Shvaiko, Jérôme Euzenat, Anastasios Kementsietsidis, Ming Mao, Natalya Noy, Heiner Stuckenschmidt (eds), Proc. 7th ISWC workshop on ontology matching (OM), Boston (MA US), pp234-235, 2012
Data interlinking is a very important topic nowadays. It is sufficiently similar to ontology matching that comparable evaluations can be undertaken. However, it differs enough that specific evaluations may be designed. We discuss such variations and their design.
Data interlinking, Evaluation, Benchmark, Blocking, Instance matching
Zhengjie Fan, Data linking with ontology alignment, in: Proc. 9th European semantic web conference (ESWC), Heraklion (GR), (Elena Simperl, Philipp Cimiano, Axel Polleres, Óscar Corcho, Valentina Presutti (eds), The semantic web: research and applications (Proc. 9th European semantic web conference poster session), Lecture notes in computer science 7295, 2012), pp854-858, 2012
Publishing RDF data on the web is a growing trend, allowing users to share information semantically. Linking isolated data sets together is therefore highly needed. I aim to reduce the comparison scale by isolating the types of resources to be compared, which enhances the accuracy of the linking process. I propose a data linking method for linked data on the web that can interlink linked data automatically by referring to an ontology alignment between the data sets; the alignment provides the entities to compare.
François Scharffe, Ghislain Atemezing, Raphaël Troncy, Fabien Gandon, Serena Villata, Bénédicte Bucher, Fayçal Hamdi, Laurent Bihanic, Gabriel Képéklian, Franck Cotton, Jérôme Euzenat, Zhengjie Fan, Pierre-Yves Vandenbussche, Bernard Vatant, Enabling linked data publication with the Datalift platform, in: Proc. AAAI workshop on semantic cities, Toronto (ONT CA), 2012
As many cities around the world provide access to raw public data along the Open Data movement, many questions arise concerning the accessibility of these data. Various data formats, duplicate identifiers, heterogeneous metadata schema descriptions, and diverse means to access or query the data exist. These factors make it difficult for consumers to reuse and integrate data sources to develop innovative applications. The Semantic Web provides a global solution to these problems by providing languages and protocols for describing and accessing datasets. This paper presents Datalift, a framework and a platform helping to lift raw data sources to semantic interlinked data sources.
François Scharffe, Jérôme David, Manuel Atencia, Keys and pseudo-keys detection for web datasets cleansing and interlinking, Deliverable 4.1.2, Datalift, 18p., 2012
This report introduces a novel method for analysing web datasets based on key dependencies. This particular kind of functional dependency, widely studied in database theory, makes it possible to evaluate whether a set of properties constitutes a key for the data set considered. When this is the case, no two instances can have identical values for these properties. After giving the necessary definitions, we propose an algorithm for detecting minimal keys and pseudo-keys in an RDF dataset. We then use this algorithm to detect keys in datasets published as web data and apply this approach in two applications: (i) reducing the number of properties to compare in order to discover equivalent instances between two datasets, and (ii) detecting errors inside a dataset.
data linking, instance matching, record linkage, co-reference resolution, ontology alignment, ontology matching
Jérôme Euzenat, Nathalie Abadie, Bénédicte Bucher, Zhengjie Fan, Houda Khrouf, Michael Luger, François Scharffe, Raphaël Troncy, Dataset interlinking module, Deliverable 4.2, Datalift, 32p., 2011
This report presents the first version of the interlinking module for the Datalift platform as well as strategies for future developments.
data interlinking, linked data, instance matching
François Scharffe, Jérôme Euzenat, MeLinDa: an interlinking framework for the web of data, Research report 7641, INRIA, Grenoble (FR), 21p., July 2011
The web of data consists of data published on the web in such a way that they can be interpreted and connected together. It is thus critical to establish links between these data, both for the web of data and for the semantic web that it helps to feed. We consider here the various techniques developed for that purpose and analyze their commonalities and differences. We propose a general framework and show how the diverse techniques fit into it. From this framework we consider the relation between data interlinking and ontology matching. Although they can be considered similar at a certain level (they both relate formal entities), they serve different purposes but would mutually benefit from collaborating. We thus present a scheme under which it is possible for data linking tools to take advantage of ontology alignments.
Semantic web, Data interlinking, Instance matching, Ontology alignment, Web of data
François Scharffe, Jérôme Euzenat, Linked data meets ontology matching: enhancing data linking through ontology alignments, in: Proc. 3rd international conference on Knowledge engineering and ontology development (KEOD), Paris (FR), pp279-284, 2011
The Web of data consists of publishing data on the Web in such a way that they can be connected together and interpreted. It is thus critical to establish links between these data, both for the Web of data and for the Semantic Web that it helps to feed. We consider here the various techniques which have been developed for that purpose and analyze their commonalities and differences. This provides a general framework that the diverse data linking systems instantiate. From this framework we consider the relation between data linking and ontology matching activities. Although they can be considered similar at a certain level (they both relate formal entities), they serve different purposes: one acts at the schema level and the other at the instance level. However, they would mutually benefit from collaborating. We thus present a scheme under which it is possible for data linking tools to take advantage of ontology alignments. We present the features of expressive alignment languages that allow linking specifications to reuse ontology alignments in a natural way.
Semantic web, Linked data, Data linking, Ontology alignment, Ontology matching, Entity reconciliation, Object consolidation
François Scharffe, Zhengjie Fan, Alfio Ferrara, Houda Khrouf, Andriy Nikolov, Methods for automated dataset interlinking, Deliverable 4.1, Datalift, 34p., 2011
Interlinking data is a crucial step in the Datalift platform framework. It ensures that the published datasets are connected with others on the Web. Many techniques have been developed on this topic to automate the task of finding similar entities in two datasets. In this deliverable, we first clarify the terminology in the field of linking data. Then we classify and overview many techniques used to automate data linking on the web. We finally review 11 state-of-the-art tools and classify them according to the techniques they use.
François Scharffe, Jérôme Euzenat, Méthodes et outils pour lier le web des données, in: Actes 17e conférence AFIA-AFRIF sur reconnaissance des formes et intelligence artificielle (RFIA), Caen (FR), pp678-685, 2010
The web of data consists of publishing data on the web in such a way that they can be interpreted and connected together. It is thus vital to establish links between these data, both for the web of data and for the semantic web that it helps to feed. We propose a general framework encompassing the different techniques used to establish these links and show how they fit into it. We then propose an architecture for combining the various data linking systems and making them collaborate with the systems developed for ontology matching, which shares many commonalities with link discovery.
Semantic web, Data interlinking, Instance matching, Ontology alignment, Web of data