
Turning Big useless Data into useful Big Data

The S3Model approach can help organizations transform large amounts of disparate, unconnected data into meaningful data. This will save time and costs when attempting to exploit this data across a wide range of applications, write our authors.

Data is being collected everywhere: from the international conglomerate to the shop on the corner, from hospitals to all levels of government. These organizations often store data and do not use it. They often do not even know why they are storing it; there is a general feeling that, because everybody else in the business has a system to collect data, they must do so as well.

Moreover, there are some questions that organizations need to ask about all this data:

  • Are some of the data points collected incomplete?
  • How much time is devoted to massaging data?
  • Do reports come from tables in databases that no one seems to understand?
  • Does data from one department look similar to data from another, but no one is sure if or how it is connected?

If your organization answered yes to any of these questions, it has a data quality problem.

The terms Big Data, Internet of Things, Artificial Intelligence, Blockchain, Data Lakes and other panaceas are tossed around all over the tech industry and throughout the popular press. These technologies are often touted as the next big thing. Everyone talks about the value of collecting all this disconnected data, but nobody talks about its quality. The underlying data quality issues are all but ignored: data quality is the elephant in the room.

When attempting to fix things, there’s a tendency to dump everything into a data lake, without a schema, hoping that it will be sorted out later. This only makes matters worse. It might be easy to save everything, but in the end it won’t be useful if its meaning is unknown.

Towards better data quality

With more and more data being committed to schemaless environments, data quality is getting worse, and we can expect it to continue to do so. It is now more difficult to determine the quality of the data for any particular purpose. There is often little to no governance information and no record of the context in which the data was collected.

The result is that a lot of data is being collected, but nobody does anything with it because of its questionable quality and meaning, and the manual labor required to make it useful. How can we use this data if nobody really knows how good or bad it is? The manual intervention required in data preparation is too time-consuming to be cost-effective.

All of us should be concerned about data quality. Any reasonable proposal for improving it should include a way to properly capture data semantics. The semantics include not only the specific meaning of a given data point but also what it means in relation to the other data points captured at the same time and place.  Linked data can help with retaining these semantics – when properly applied.
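To make the idea concrete, here is a minimal, stdlib-only Python sketch of the linked-data principle: a single data point expressed as subject-predicate-object triples that carry its unit, time and place alongside the raw value. The prefixes, terms and values are purely illustrative, not drawn from any real vocabulary or from S3Model itself.

```python
# Illustrative only: a blood-pressure reading as triples, so the value
# travels together with its semantics (type, unit, time, place).
triples = [
    ("ex:reading42", "rdf:type", "ex:SystolicBloodPressure"),
    ("ex:reading42", "ex:value", '"128"^^xsd:integer'),
    ("ex:reading42", "ex:unit", '"mm[Hg]"'),
    ("ex:reading42", "ex:recordedAt", '"2019-06-01T09:30:00"^^xsd:dateTime'),
    ("ex:reading42", "ex:recordedIn", '"outpatient clinic"'),
]

def to_turtle(triples):
    """Emit the triples as prefix-abbreviated Turtle statements."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

print(to_turtle(triples))
```

Any consumer that understands the vocabulary can now interpret `128` correctly; stripped of the other four triples, it is just a number.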

Semantic graph databases are the new trend, and their utility for knowledge discovery is well known at some of the largest content companies. However, there is currently no implemented, systematic approach to creating semantic graphs that:

  • will ensure data quality
  • will provide the complete context of the data
  • will provide a pathway to integrate existing data

The current process is expensive, slow, manual and error-prone. If you know of a different approach, let us know.

This is our “change my mind” meme challenge.

When determining that certain data should be captured for re-use, there are specific considerations to be made, starting with a very simple question – “What is the datatype of the data?” – and continuing with:

  • For quantitative data, is there a unit of measure?
  • Is this data captured at specific locations only?
  • What are the rules or guidelines in place that govern what this data means?
  • Why do we want to capture it now?

Until we reach the very broad question: “What does this data mean in the context in which it was captured?”
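The considerations above can be recorded in a computable structure. The sketch below is a hypothetical Python model of one data item's capture context (the class and field names are our invention, not S3Model's actual classes):

```python
# Hypothetical sketch: one data item's capture-time context as a dataclass.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataItemModel:
    name: str
    datatype: str                                   # "What is the datatype?"
    unit: Optional[str] = None                      # unit for quantitative data
    locations: list = field(default_factory=list)   # captured only at these places?
    governing_rules: list = field(default_factory=list)  # rules governing meaning
    purpose: str = ""                               # why capture it now?

bp = DataItemModel(
    name="systolic blood pressure",
    datatype="integer",
    unit="mm[Hg]",
    locations=["outpatient clinic"],
    governing_rules=["measured after 5 minutes at rest"],
    purpose="hypertension screening",
)
assert bp.unit is not None  # quantitative data must carry a unit of measure
```

Once the answers live in a structure like this, software (not just the person who designed the collection form) can check them.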

Data needs surrounding context

When these questions are answered and recorded in a computable manner, the data becomes information. This information is what humans use to make decisions. Simple data points are not very useful without the surrounding context. Since we build computers to emulate our decision-making process, we must be able to encode this context in a way that the computer can interpret and process.

It is impractical to encode this context in every existing and future software application that might need to process this information. A sharable, computable model solves this problem. What happens in the current world is that new context and semantics may be added or changed, so that the meaning of the original data is lost or obfuscated. Each time the data is exploited in a new application its meaning may be changed, rather like the Telephone Game: you seldom get back what was originally intended.

But it is possible (although not trivial) to record and share contextual information using standards-based technology. Using standard XML, RDF and OWL, we have defined an approach and process that we call the Shareable, Structured, Semantic Model (S3Model).

Data will be modeled for the user

S3Model is a new foundation for harmonizing linked data across information domains. It consists of a small core ontology of 13 classes and 10 object properties used to organize the components of the information model, which consists of nine base classes used to represent types of information. We also allow the definition of the spatial, temporal and ontological context of each individual data item.

The data modeling process in S3Model consists of arranging items as a document-like component. The content and structure of the component are determined by the person who is defining the data that will be collected. We call this person the Data Model Designer: someone who understands what must be included in the application, either because she is the end user or because she knows the end user's needs very well.

The final result of the Data Model Designer's work is a set of Data Models, which are used to constrain the information in a data instance, expressed as an XML Schema (the canonical expression) or in another programming language such as Python, Go, Ruby, C++ or Java.

The Data Model Designer can express the appropriate semantics using popular open ontologies and vocabularies as well as any internal private vocabularies. Once defined in this model, the semantics and structure can be easily exchanged across all platforms without loss of fidelity. Data instances can be serialized as XML, JSON, and RDF to allow for the maximum cross-platform exchange and ease of analysis capabilities.
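As a rough illustration of multi-format serialization (the field names are made up and this is not S3Model's actual schema), a single in-memory instance can be emitted as both JSON and XML using only the Python standard library:

```python
import json
import xml.etree.ElementTree as ET

# One data instance, serialized two ways. Field names are illustrative.
instance = {"name": "systolic blood pressure", "value": 128, "unit": "mm[Hg]"}

# JSON serialization
as_json = json.dumps(instance)

# XML serialization of the same instance
root = ET.Element("DataInstance")
for key, val in instance.items():
    ET.SubElement(root, key).text = str(val)
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
```

Because both documents are generated from the same structure, the semantics travel with the data regardless of the consumer's preferred format.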

The S3Model

The key to widespread implementation of S3Model-based systems is the tooling: the S3Model Tool Suite was developed to allow domain experts, with minimal training, to become Data Model Designers. With the available online training, a domain expert can design a Data Model that is as technically robust as it is semantically rich.

In addition to building models from scratch, tools are available to convert existing data into semantically rich S3Model data: the open source S3Model Translator tool helps create a model for any Comma Separated Value (CSV) file such as a database extract or spreadsheet export. After being imported into the S3Model Translator, the tool guides the Data Model Designer through creating an S3Model-based Data Model, which can be used to validate data from the imported CSV file or any CSV file that has the same structure. This provides a pathway for your data from the flat, table-like world into the semantically rich linked data world.
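To give a feel for the first step such a tool must perform, here is a toy, stdlib-only sketch of bootstrapping column datatypes from a CSV extract. This is our own illustration, not the actual S3Model Translator logic:

```python
import csv
import io

def infer_column_types(csv_text):
    """Guess a datatype for each CSV column -- a toy version of the kind of
    model bootstrapping a model-from-CSV tool performs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    types = {}
    order = ["integer", "decimal", "string"]  # narrowest to widest
    for row in reader:
        for col, val in row.items():
            try:
                int(val)
                guess = "integer"
            except ValueError:
                try:
                    float(val)
                    guess = "decimal"
                except ValueError:
                    guess = "string"
            # widen the column's type if rows disagree
            prev = types.get(col, "integer")
            types[col] = max(prev, guess, key=order.index)
    return types

sample = "patient_id,weight_kg,city\n17,81.4,Recife\n23,77.0,Geneva\n"
print(infer_column_types(sample))
# → {'patient_id': 'integer', 'weight_kg': 'decimal', 'city': 'string'}
```

In a real tool, the inferred types are only a starting point; the Data Model Designer then attaches units, vocabularies and governance rules that no amount of type inference can recover from a flat file.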

The future of sharable, computable models for all datasets sounds wonderful, but there is no way that a rip-and-replace approach can ever work. That is why part of the design process was to ensure that a gradual migration pathway was built into the S3Model ecosystem; the Translator tool is a great step towards achieving this. The tutorials and examples for the open source tools, as well as our upcoming courses, demonstrate in great detail the advantages of being able to connect data across models and across domains when the models are built by subject matter experts (SMEs) who understand how to use open vocabularies and ontologies to semantically mark up their data.

An additional advantage of S3Model lies in the ability for an SME to model governance requirements, especially in domains such as healthcare and finance where there are specific legal constraints. These constraints may concern privacy, or the recording of all contributors to data sourcing and editing throughout the workflow.

As examples, and to demonstrate the broad capability and robustness of S3Model, we have translated all of the HL7 FHIR resources and all of the unique entries in the US NIH Common Data Element repository. We have also built proof-of-concept models for the NIEM models and XBRL.


We invite you to take a look at our open source offerings and participate in improving these tools so that we can all have better quality data.



An international Knowledge Base for all Heritage Institutions (Part 2*)

Heritage institutions are places in which works of art, historical records, and other objects of cultural or scientific interest are sheltered and made accessible to the public. The equivalent in the digital world is already taking shape, through digitization and the sharing of digital-born or digitized objects on online platforms. In this second part, we describe the different modules of the project in more detail and sketch an avenue for the internationalization of the project. In part 1 of this article, we described how Wikipedia and related Wikimedia projects play a special role in the emerging data and platform ecosystem, and we briefly presented the “Sum of All GLAM” project [1], which proposes to improve the coverage of heritage institutions in Wikidata and Wikipedia.

Curation of existing data

Before ingesting new data, it usually makes sense to analyse the existing data on Wikidata and to correct any instances of bad data modelling. One common problem is Wikidata entries concerning heritage institutions that do not properly differentiate between “building” and “organization”. To avoid extra work later, it is crucial to make this distinction and to correct any other data modelling issues before adding anything to these entries. To coordinate the resolution of data modelling issues, the data cleansing tasks carried out on the Brazilian dataset will be documented and will serve as an example to guide similar data cleansing tasks in other countries. The plan is to have these tasks carried out in a coordinated manner by Wikidataists around the world.
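The "building vs. organization" problem can be surfaced mechanically. The sketch below builds the kind of SPARQL query one would send to the Wikidata Query Service to list items typed as both a museum and a building; the Q and P identifiers are quoted from memory, so verify them against Wikidata before relying on them:

```python
# Sketch: construct a SPARQL query that flags possible dual-typed entries.
# Identifiers assumed (verify!): P31 = instance of, Q33506 = museum,
# Q41176 = building.
def dual_typing_query(class_a="wd:Q33506", class_b="wd:Q41176"):
    """Return a SPARQL query finding items that are instances of both classes."""
    return f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P31 {class_a} .   # instance of museum
      ?item wdt:P31 {class_b} .   # ...and also instance of building
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT 100
    """

query = dual_typing_query()
# To run it, POST `query` to https://query.wikidata.org/sparql
print(query)
```

Each hit is a candidate for cleansing: either the item should be split into two linked items, or one of the type statements should be removed.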

In parallel to the cleansing of existing data, some fundamental questions need to be asked about the data:

  • To what extent is the data complete? – Is there a Wikidata entry for every existing heritage institution in that country? To what extent is all the information needed for the Wikipedia infoboxes already present in Wikidata?
  • How good is the data? – Is the data correct and up-to-date or is it outdated? Is outdated information properly historicized? Are the internal structures of heritage institutions properly represented? Is all the data properly sourced?

After this initial analysis, a strategy for further improvement of the data can be devised on a country-by-country basis. Apart from the manual enhancement of the data by existing members of the Wikidata community, two important avenues need to be pursued to ensure the provision of complete, high-quality data: the integration of existing databases as well as crowdsourcing campaigns targeting both heritage professionals and Wikipedians alike.

Data provision through cooperation with maintainers of GLAM databases

The easiest way to incorporate large quantities of high-quality data into Wikidata and properly reference them to a reliable source is to cooperate with maintainers of official GLAM databases. As the experience in the OpenGLAM Benchmark Survey has shown, it is quite easy in some countries to get access to well-curated and complete databases of heritage institutions, while in other countries, such databases are less complete, not that well curated, or may not even exist. In several countries, such as Brazil, Switzerland, or Ukraine, data about all known heritage institutions have already been incorporated. In several other countries, databases are available, but data has not yet been ingested. It is the project’s goal not only to incorporate data once, but also to establish long-term partnerships with the maintainers of relevant databases to ensure regular updating of the data in Wikidata. At the same time, maintainers of the databases are likely to benefit from many pairs of eyes spotting errors in the data or enhancing existing databases by adding further information.

Data provision and maintenance by means of crowdsourcing campaigns

Where no such databases exist, crowdsourcing campaigns are envisaged that will address heritage professionals and Wikipedians alike. For this purpose, data maintenance and improvement tasks need to be documented and broken down into easily understandable, manageable chunks. This documentation will be developed over the coming months in cooperation with test users, and trials will be carried out both in Brazil and Switzerland. Larger campaigns will be scheduled for 2020.

Implementation of Wikidata-powered Infoboxes

To gain more visibility for the ingested data and to close the feedback loop between data provision and data use, Wikidata-powered infoboxes will be rolled out across Wikipedia. This will require negotiation with various Wikipedia communities, which in the past have adopted differing policies with regard to the use of data from Wikidata in the article namespace. In some Wikipedias, such as the Catalan Wikipedia, Wikidata-powered infoboxes are in widespread use, while other communities, such as the ones on the German or the English Wikipedia, have been more reticent – partly due to quality considerations.

Entering a dialogue with the more demanding communities is therefore important to drive efforts to enhance the data quality on Wikidata. While engaging in these dialogues, the project team will document use cases which will provide an empirical basis for the assessment of data completeness and guide further efforts.

On the Wikipedia side, transcluding data directly from Wikidata will lead to important benefits, as information that currently must be updated separately in a myriad of different language versions will be stored in a central place on Wikidata and maintained in a collaborative effort by the various language communities. For smaller communities, this is the only way to cope with an ever-growing amount of structured data in a Wikipedia environment facing a stagnating or shrinking contributor base. And for larger language communities, it is a good way to help provide up-to-date information about their own geographic areas in other languages. To enhance the chances of buy-in from many communities and to facilitate the roll-out of infoboxes across the various language versions of Wikipedia, it is important to make high-quality and properly sourced data available on Wikidata.
Furthermore, according to the best practice when creating Wikidata-powered infoboxes, it will always be possible for the Wikipedia community to overwrite information in infoboxes locally if necessary. And last but not least, the roll-out will take place across several language communities in a flexible manner, following the pace of the different communities. Currently, Wikidata-powered infobox templates for museums have already been implemented on the Portuguese (see figure 5) and on the Italian Wikipedias; another one for archives has been prepared in the Portuguese version. To spread the practice more quickly at an international level, it would be helpful if the templates could be rolled out on English Wikipedia at an early stage of the project.

Figure 5: Wikidata-powered Wikipedia infobox for Museums on the Portuguese Wikipedia

Mbabel template to support edit-a-thons or editing campaigns

In addition to providing data for infoboxes, the entries on Wikidata can also be used to create article stubs to aid the creation of new articles about heritage institutions. This is where the Mbabel tool comes in; it lets Wikipedia editors automatically create draft articles in their user namespace by providing the structure of an article based on the data contained in Wikidata. This structure includes an introductory sentence and the infobox template prefilled with data from Wikidata. The editors can then complement the draft articles with further information before publishing them in the article namespace. This not only facilitates the work of existing contributors, but also greatly simplifies the job of new editors who participate in edit-a-thons or editing campaigns. By this means, the project team intends to leverage the power of Wikidata to also promote the writing of new Wikipedia articles about heritage institutions that have not yet been covered in a particular language. The tool consists of a template that has so far been implemented on Portuguese Wikipedia for subjects including museums, books, movies, earthquakes, newspapers and the Brazilian elections. In the course of the project, the tool will also be implemented for articles about libraries and archives, before being rolled out in other language versions.
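The core of this stub-generation idea can be sketched in a few lines. The following is our own illustration of prefilling a wikitext draft from structured data, not the actual Mbabel implementation, and the template and field names are assumptions:

```python
# Illustrative sketch: build a draft-article stub (wikitext) with a
# prefilled infobox from Wikidata-like structured data.
def make_stub(item):
    infobox = "\n".join(f"| {k} = {v}" for k, v in item["infobox"].items())
    return (
        "{{Infobox museum\n" + infobox + "\n}}\n\n"
        f"'''{item['label']}''' is a {item['description']}.\n"
    )

draft = make_stub({
    "label": "Museu Paulista",
    "description": "history museum in São Paulo, Brazil",
    "infobox": {"established": "1895", "location": "São Paulo"},
})
print(draft)
```

The editor receives the skeleton (introductory sentence plus prefilled infobox) and only has to add prose before moving the draft to the article namespace.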

Figure 6: Stub-article automatically created by means of the Mbabel tool

Internationalization of the Project

The internationalization of the approaches described in this article will be facilitated by the model project implemented in Brazil and on Portuguese Wikipedia, which is currently funded by the Geneva-based MY-D Foundation and by a private sponsor. As the current project funding is limited to the implementation of the Brazilian model project and the provision of documentation, the deployment of the project in other countries and on other language versions of Wikipedia will rely on the involvement of volunteers in various countries as well as local sponsoring and/or funding through Wikimedia Foundation channels, perhaps taking a form similar to the funding of other international outreach campaigns, such as Wiki Loves Monuments.

Outlook

As illustrated in figure 1, the project provides an important cornerstone for any other activity targeting the other layers of information about heritage institutions. Thus, it could serve as a starting point for a more detailed description of archives and collections, and it extends the work that is already being carried out in other GLAM-Wiki initiatives dedicated to the description of specific heritage objects, such as the Sum of all Paintings project, which systematically inventories and gathers information about all paintings held by heritage institutions. Another logical extension of the project lies in the development of further cooperation with individual heritage institutions to improve the coverage of their collections on Wikipedia. And, last but not least, the project may be expanded to cover other entities, such as performing arts organizations, historical monuments or cultural venues.


*This is Part 2 of this article. Part 1 was published here.


Reference

[1] “Sum of All GLAM” is a working title; GLAM stands for “Galleries, Libraries, Archives, Museums”, an acronym commonly used to refer to heritage institutions.


An International Knowledge Base for all Heritage Institutions (Part 1*)

Heritage institutions are places in which works of art, historical records, and other objects of cultural or scientific interest are sheltered and made accessible to the public. The equivalent in the digital world is already taking shape, through digitization and the sharing of digital-born or digitized objects on online platforms. In this article we shed light on how the issue of structured data about heritage institutions is being tackled by Wikipedia and its sister project Wikidata, through their “Sum of All GLAM” project [1].

Access to these objects, and information about them, is provided and mediated both through platforms maintained by the heritage sector itself and through more general-purpose platforms, which often serve as a first point of entry for the wider public. These platforms include Google, Facebook, YouTube, and Wikipedia, which also happen to be among the most visited websites on the Web. In this emerging data and platform ecosystem, Wikipedia and related Wikimedia projects play a special role as they are community-driven, non-profit endeavours. Moreover, these projects are working hard to make data and information available in a free, connected and structured manner, for anybody to re-use.

There are various layers of information about heritage institutions, ranging from descriptions of institutions themselves and descriptions of their collections, to descriptions of individual items. There may be digital representations of these items, and in some cases even searchable content within the items. Figure 1 illustrates how the top four layers of data and information are currently addressed in Wikipedia, with Wikidata and Wikimedia Commons increasingly focussing on providing structured and linked data alongside the unstructured or semi-structured encyclopaedic information contained in Wikipedia articles.

Figure 1: Heritage data and content in the context of Wikipedia and its sister projects

Structured data about institutions and collections, as well as some item-level data, are maintained on Wikidata, which serves as Wikipedia’s repository for structured data. Wikimedia Commons serves as Wikipedia’s repository for media files and is currently being prepared for the linked data era through the “Structured Data on Wikimedia Commons” project. This project accompanies the transition of Wikimedia Commons to linked open data, foresees the provision of item-level metadata as linked data and monitors the implementation of the IIIF standard, to allow easier cross-platform manipulation and media file sharing. While similar efforts are under way at all the different levels of information, we will focus the remainder of this article on a project that is dedicated to improving data quality and completeness of the top layer, i.e. the data about the heritage institutions themselves. This project lays the foundations for an International Knowledge Base for Heritage Institutions. The project is currently managed by Wiki Movement Brazil in cooperation with OpenGLAM Switzerland and will be expanded to further countries in the near future; it is also being coordinated with “FindingGLAMs”, a project run by Wikimedia Sweden, UNESCO, and the Wikimedia Foundation, which pursues similar goals, but addresses different layers of information, including aspects related to structured data on Wikimedia Commons.

To succeed, the International Knowledge Base for Heritage Institutions needs to address all stages of the linked data value chain, from data provision to data use (figure 2):

Figure 2: Core processes of linked data publication (source: eCH-0205 – Linked Open Data)

Parts of the data have already been ingested into Wikidata and the relevant elements of the ontology have already been implemented, so currently most of the effort is going into data maintenance. The goal is to provide the data in a coherent way that makes it fit for use in Wikidata-powered infoboxes (see figure 3 for an example). However, before it is ready to go, there are various issues to be addressed, such as data quality, correct data modelling, and data completeness.

Figure 3: Wikipedia article with infobox containing structured data

The goal of the “Sum of All GLAM” project is to complete entries for all the heritage institutions of a given country, with all the data that is required for the infoboxes. To monitor progress in achieving this goal, we are currently putting in place various instruments that can be used by community members to focus their efforts on improving existing data entries (see figure 4 for an example). While the issues related to data modelling will be addressed by members of the Wikidata community at an international level, the project team is planning to involve members of the heritage community in the various countries to help improve the completeness of the data and to make sure that all the data entries are properly sourced. While existing Wikidata community members are expected to work on this hand-in-hand with members of the heritage community, the project will heavily rely on the heritage sector to help keep the information about their institutions up-to-date in the longer run. In fact, in the case of medium-sized and larger institutions, regularly updating their existing Wikidata entry should eventually become part of the tasks carried out by an institution’s communication department. For smaller institutions, other solutions need to be found – possibly via the intermediation of umbrella organizations or specialized institutions which take care of coordination at a national level.

Figure 4: Table indicating the completeness of data about museums for different countries
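The completeness figures behind a table like figure 4 boil down to a simple metric: the share of required infobox properties that are actually present across a country's entries. The sketch below is our own illustration; the Wikidata property IDs are quoted from memory and should be verified:

```python
# Sketch of a per-country completeness metric for heritage-institution entries.
# Assumed property IDs (verify on Wikidata): P31 = instance of, P17 = country,
# P625 = coordinate location, P856 = official website, P571 = inception.
REQUIRED = ["P31", "P17", "P625", "P856", "P571"]

def completeness(entries):
    """entries: list of dicts mapping property ID -> value.
    Returns the fraction of required property slots that are filled."""
    if not entries:
        return 0.0
    filled = sum(1 for e in entries for p in REQUIRED if p in e)
    return filled / (len(entries) * len(REQUIRED))

brazil = [
    {"P31": "Q33506", "P17": "Q155", "P625": "(-23.6, -46.6)"},
    {"P31": "Q33506", "P17": "Q155", "P856": "https://example.org"},
]
print(f"{completeness(brazil):.0%}")  # → 60%
```

Computing this per country and per property makes it easy to direct volunteers to exactly the entries and fields that need attention.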

To involve members of the heritage sector in various countries, internationalization is being pursued early in the project: a well-documented model project is currently being implemented in Brazil, which can in turn be implemented in other countries. To make sure that the project documentation is fit for purpose, international partners ready to implement parts of the project in their own countries are currently being recruited. And to facilitate the tailoring of the project to local needs, the model project will be broken down into various modules that can be implemented separately or in combination with other modules, as the local partners see fit. In the second part of this article we will describe some of these modules.


Part 2 of this article is published here.


References

[1] “Sum of All GLAM” is a working title; GLAM stands for “Galleries, Libraries, Archives, Museums”, an acronym commonly used to refer to heritage institutions.
