The reconciliation of data reality and the real world – on the way to a data quality model
It has become a truism that data is growing in importance and that handling it well can give companies a real competitive advantage and public authorities a substantial gain in efficiency. To translate this importance into concrete measures, the eGovernment Institute has laid the foundations of a data quality model.
The challenge of data quality can be illustrated by the use of personal data, which is highly relevant for public authorities in the context of eGovernment, but also for companies. Personal data is collected and stored in one process. Later, the same data is reused in another process, today mostly without any check of its quality. For example, a road traffic office collects data during a vehicle registration and later uses it to send invoices for motor vehicle tax. The quality of the data determines the administration's follow-up costs for returned mail, address corrections and debt collection. In other cases, poor quality not only causes additional costs but also increases the risk of disclosing data to the wrong people. A data quality model is intended to make these risks known and thus controllable. The model should represent the measuring points and influencing variables at the process level and not, like most existing models, at the data level.
1. Generating real-world data
To develop a practice-oriented model, a generic process was assumed that comprises the collection, processing and use of data. In the first step, acquisition, data is generated from the real world. This can be done through measurements in the broader sense, such as the automated transfer of data from sensors or databases. Data can also be collected on the basis of declarations by actors. In addition to the data itself, the metadata collected plays a decisive role in data quality management: only with information about the time and context of the data collection or the origin of the data can well-founded assumptions about the quality of the data be made, even over a longer period of time.
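A minimal sketch of what such an acquisition record could look like, assuming Python and a deliberately small set of metadata fields. The class name `Observation`, the field names and the example values are illustrative assumptions, not part of the model described here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Observation:
    """One collected value together with the metadata recorded at acquisition."""
    attribute: str          # what is described, e.g. "postal_address"
    value: str              # the collected value itself
    source: str             # origin of the data (hypothetical label)
    method: str             # "measurement" or "declaration"
    context: str            # process in which the value was collected
    collected_at: datetime  # time of collection


def age_in_days(obs: Observation, now: datetime) -> int:
    """Age of an observation; older values may warrant more caution."""
    return (now - obs.collected_at).days


# Hypothetical example: an address declared during a vehicle registration.
obs = Observation(
    attribute="postal_address",
    value="Example Street 1, 3000 Bern",
    source="road_traffic_office",
    method="declaration",
    context="vehicle_registration",
    collected_at=datetime(2020, 3, 1, tzinfo=timezone.utc),
)
```

Because time, context and origin are stored with the value, a later process can decide for itself how much to trust a value that was declared years ago in a different context.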
2. Creating isolated data reality
In a second step, the processing and linking of the collected data creates a data reality that is now isolated from the real world and is confirmed or extended by each consistent addition. Inconsistencies in this data reality can be detected and processed, provided that the extent and quality of the metadata allow statements from different sources about the same object to be recognised as such. The processing steps applied to this data reality must also be documented in the metadata, so that it can be traced at any time how discrepancies between raw and processed data arose and which data is the original, unprocessed data.
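The detection of inconsistencies can be sketched as follows, assuming that the metadata provides a shared identifier for the object a statement refers to. The keys `subject_id`, `attribute`, `value` and `source`, and the example sources, are hypothetical.

```python
from collections import defaultdict


def find_inconsistencies(statements):
    """Group statements by (subject, attribute) and return the groups
    in which different sources report different values."""
    grouped = defaultdict(dict)
    for s in statements:
        key = (s["subject_id"], s["attribute"])
        grouped[key][s["source"]] = s["value"]
    return {
        key: by_source
        for key, by_source in grouped.items()
        if len(set(by_source.values())) > 1
    }


# Hypothetical statements from two sources about two persons.
statements = [
    {"subject_id": "p-1", "attribute": "city", "value": "Bern", "source": "registration"},
    {"subject_id": "p-1", "attribute": "city", "value": "Biel", "source": "tax_register"},
    {"subject_id": "p-2", "attribute": "city", "value": "Thun", "source": "registration"},
    {"subject_id": "p-2", "attribute": "city", "value": "Thun", "source": "tax_register"},
]

conflicts = find_inconsistencies(statements)
```

In this example only person `p-1` is reported, because the two sources disagree about the city. Without a reliable `subject_id` in the metadata, the two statements could not be recognised as referring to the same person at all.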
3. Reconciliation of data reality with the real world
In a final process step, actions are triggered from the data reality that allow it to be compared with the real world. For example, invoices are sent. By measuring the results of such an action, metadata can in turn be obtained on the quality of the data used, i.e. on the distance between the data world and the real world. A paid invoice, for instance, can be seen as an indicator that the address is most likely correct.
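One way to feed such outcomes back into the metadata is a simple confidence value per attribute that is adjusted after each observed result. The following is only a sketch: the outcome names, the weights and the update rule are assumptions chosen for illustration, not values proposed by the model.

```python
# Hypothetical effect of each observed outcome on the confidence in an address.
# Positive values move the confidence towards 1.0, negative values towards 0.0.
OUTCOME_EFFECT = {
    "invoice_paid": 0.5,
    "mail_returned": -0.8,
}


def update_confidence(confidence: float, outcome: str) -> float:
    """Adjust a confidence value in [0, 1] after an observed outcome."""
    effect = OUTCOME_EFFECT[outcome]
    if effect >= 0:
        # Close part of the remaining gap to full confidence.
        return confidence + effect * (1.0 - confidence)
    # Remove part of the existing confidence.
    return confidence * (1.0 + effect)


after_payment = update_confidence(0.6, "invoice_paid")
after_return = update_confidence(0.6, "mail_returned")
```

With these assumed weights, a paid invoice raises a confidence of 0.6 to about 0.8, while returned mail lowers it to about 0.12. Suitable scales and weights would have to be worked out for the concrete use case.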
Figure 1 – Phases of the data quality model
Using the example of personal data in an ordering process, the individual phases can be illustrated once more. In the entry step, the data is entered by the person themselves; if necessary, other data sources, e.g. LinkedIn, can be queried for information on the person.
Comparison with data model
In the second phase, the information entered is compared with the existing data model, i.e. checked against existing information on the person, address or other details. If the information is confirmed, the reliability of the new data set increases, although it cannot exceed that of the respective data reference. During the matching process, inconsistencies (e.g. the address does not exist) or deviations from previous information may be found. These must initially be processed and resolved manually.
The data output, as the last phase, is asynchronous to the two preceding steps. A data output can directly follow the first two phases. However, it is also conceivable that the first two phases are run through for countless new data sets before the first data output takes place. The quality requirements that must be met before data is output must be defined in the corresponding process and depend strongly on the intended use of the output.
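Both rules described above, the reliability cap and the use-dependent output requirement, can be sketched in a few lines. The numeric scale, the `gain` parameter and the thresholds per intended use are assumptions for illustration only.

```python
def confirm(reliability: float, reference_reliability: float, gain: float = 0.2) -> float:
    """Raise the reliability of a confirmed data set, but never above
    the reliability of the reference that confirmed it."""
    raised = min(reliability + gain, reference_reliability)
    # A confirmation never lowers an already higher reliability.
    return max(reliability, raised)


# Hypothetical minimum reliability required per intended use of the output.
REQUIRED_RELIABILITY = {
    "newsletter": 0.5,
    "invoice": 0.8,
}


def ready_for_output(reliability: float, intended_use: str) -> bool:
    """Check whether a data set meets the requirement of the intended use."""
    return reliability >= REQUIRED_RELIABILITY[intended_use]


capped = confirm(0.6, reference_reliability=0.7)
```

Here `capped` is 0.7 rather than 0.8, because the confirming reference itself is only rated 0.7. Under the assumed thresholds, such a data set would be released for a newsletter but not yet for an invoice.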
These model considerations show that the question of data quality is not primarily a technical challenge, but can be addressed through the definition of processes and the consistent collection and use of metadata. In a next step, these initial conceptual considerations must be made concrete and further refined on the basis of a use case. This includes defining the individual process steps, the quality measurement points and the data sources used. Particular attention should be paid to the metadata collected and its use for quality measurement. The use of different data sources to confirm attributes must be examined for its feasibility in the concrete use case, because the question of quality also arises with these other data sources. For a quantitative assessment of data quality, meaningful scales suited to the application must be developed on the basis of existing standards. In this way, the processes can be operationalised and automated. Whether traditional algorithms are sufficient for such a step, or whether machine learning approaches are better suited to establishing the necessary contextual reference, must be reviewed regularly, as technological development in this area currently moves in very short cycles.