Data preprocessing

Data preprocessing covers the methods and procedures that transform the data obtained during data acquisition into a form more suitable for further processing and experimentation.

Data Integration

In the first phase of data preprocessing, data from different sources, and thus from different data models, must be integrated into a single one. This includes:

  • Unified data model definition – the data model must be expressive enough to represent all instances and their relations from all input sources. For this purpose, a domain ontology for publication metadata was created by generalizing the relevant parts of the source data models.
  • Data mapping definition – for every data source, a mapping between the source data model and the unified data model must be defined. In our case, this transformation is already executed in the wrappers (see the sketch below).
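
To illustrate the idea, the following minimal Python sketch maps a hypothetical source record into a simplified unified model; the class and field names are illustrative assumptions, not the actual ontology used.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative unified model; the actual domain ontology for publication
# metadata is richer than this sketch.
@dataclass
class Author:
    first_names: str = ""
    surname: str = ""

@dataclass
class Publication:
    title: str = ""
    year: Optional[int] = None
    authors: List[Author] = field(default_factory=list)

def map_source_record(record: dict) -> Publication:
    """Hypothetical wrapper-side mapping from one source schema to the
    unified model; the source field names here are assumptions."""
    authors = []
    for name in record.get("authors", []):
        first, _, last = name.rpartition(" ")
        authors.append(Author(first_names=first, surname=last or name))
    year = record.get("year")
    return Publication(title=record.get("title", ""),
                       year=int(year) if year is not None else None,
                       authors=authors)
```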

Data Cleaning

This part of the process aims at removing inconsistencies, which are present in the data because of:

  • Inconsistencies in the source data model – inconsistencies and other flaws such as duplicate, incomplete, or wrong data can be carried over from the source model.
  • Inconsistencies due to source integration – because data are integrated from many sources, duplicate instances, for example of authors or publications, may occur.
  • Inconsistencies created in the wrapping process – duplicates can also be created during the wrapping process; for example, not all links to authors can be followed and checked, because doing so would greatly increase the process duration.

Single-pass instance cleaning

This phase corrects data flaws within the scope of a single instance, such as:

  • correcting the format of some data types, e.g., names, which should start with a capital letter,
  • separating data fields, e.g., separating first names and surname,
  • filtering out instances with insufficient data to work with,
  • filtering of data fields, e.g., conjunctions from key terms.

Instance cleaning is realized as a set of filters, one for each particular task, following the pipes-and-filters architecture. The whole process requires only one pass through all instances and therefore has linear time complexity. In practice it is effective to run it directly after a wrapper acquires each instance, before storing it, because the processor is only lightly loaded while data are being downloaded from the sources.
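
The following minimal Python sketch illustrates such a pipes-and-filters cleaning pass under the assumption that instances are plain dictionaries; the concrete filters and field names are only examples of the tasks listed above.

```python
from typing import Callable, Iterable, Iterator, Optional

# A filter takes one instance (represented here as a dict) and returns the
# cleaned instance, or None to drop it; the filters form a simple pipeline.
Filter = Callable[[dict], Optional[dict]]

def split_name(instance: dict) -> Optional[dict]:
    # Separate first names and surname stored in a single "name" field.
    if "name" in instance:
        first, _, last = instance.pop("name").rpartition(" ")
        instance["first_names"], instance["surname"] = first, last
    return instance

def capitalize_surname(instance: dict) -> Optional[dict]:
    # Names should start with a capital letter.
    if instance.get("surname"):
        s = instance["surname"].strip()
        instance["surname"] = s[:1].upper() + s[1:]
    return instance

def drop_if_insufficient(instance: dict) -> Optional[dict]:
    # Filter out instances with insufficient data to work with.
    return instance if instance.get("surname") else None

FILTERS: list[Filter] = [split_name, capitalize_surname, drop_if_insufficient]

def clean(instances: Iterable[dict]) -> Iterator[dict]:
    # One pass through all instances -> linear time complexity.
    for instance in instances:
        for f in FILTERS:
            instance = f(instance)
            if instance is None:
                break
        else:
            yield instance
```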

Duplicates Identification

The aim of this phase is to identify instances that describe the same entity (e.g., an author or a publication). Since the acquisition process is based on the semantics of the domain and extracts data together with their meaning into a common domain ontology, no approaches for comparing concepts and discovering mappings between them are needed.

We combine two methods: comparing the data itself (in terms of ontology datatype properties) and working with the relations between data (object properties).

The overall similarity of two instances is computed from the similarities of those of their properties that are not empty. For each property we use a weighting method with a positive and a negative weight; this is more general than simple weighting, which in many cases is not sufficient. For example, consider the country property of two authors: if its similarity is low, the authors are probably not the same person (a strong weight is needed), but if it is high, it does not follow that the authors are likely the same person (a low weight is needed). The parameters are three values per property: a positive weight p, a negative weight n, and a threshold t, which determines where the positive weight is used and where the negative one. The overall similarity S is a value between 0 and 1 and is calculated as

S = \frac{\sum_{i=1}^{n} s_i \cdot F_i}{\sum_{i=1}^{n} F_i} \qquad (1)

where n is the number of compared properties, s_i is the similarity of the i-th property (a value between 0 and 1), p_i stands for its positive weight, n_i for its negative weight (values between 0 and 100), and F_i is a step function calculated as

F_i = \begin{cases} p_i & \text{if } s_i \ge t_i \\ n_i & \text{if } s_i < t_i \end{cases} \qquad (2)

where i is the index of the property, p_i is its positive weight, n_i its negative weight, and t_i the threshold of the i-th property (a value between 0 and 1).

To decide whether two instances are identical, a threshold on the overall similarity is applied: if the similarity exceeds it, the instances are evaluated as identical.
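
A minimal sketch of this weighting scheme, assuming the reconstruction of Equations 1 and 2 above and using an arbitrary placeholder value for the decision threshold, could look as follows.

```python
def step_weight(s: float, p: float, n: float, t: float) -> float:
    # F_i (Equation 2): positive weight above the threshold, negative below.
    return p if s >= t else n

def overall_similarity(similarities, params) -> float:
    """Equation 1 as a weighted mean over the non-empty properties.
    `similarities` holds the s_i values in [0, 1]; `params` holds the
    (p_i, n_i, t_i) triple of each property."""
    num = den = 0.0
    for s, (p, n, t) in zip(similarities, params):
        w = step_weight(s, p, n, t)
        num += s * w
        den += w
    return num / den if den else 0.0

# Decision step: instances are evaluated as identical above a threshold
# (0.85 is only a placeholder value, not the threshold used in the system).
def are_identical(similarities, params, threshold: float = 0.85) -> bool:
    return overall_similarity(similarities, params) > threshold
```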

Every instance is compared with every other instance of its class, which leads to quadratic asymptotic time complexity. To shorten the comparison process, we use a simple clustering method for authors and divide them into groups by the first letter of their surnames.
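
A sketch of this clustering step, assuming authors are plain dictionaries with a surname field, might generate candidate pairs like this.

```python
from collections import defaultdict
from itertools import combinations

def candidate_author_pairs(authors):
    """Cluster authors by the first letter of their surname and compare
    only within a cluster, avoiding the full all-against-all comparison."""
    clusters = defaultdict(list)
    for author in authors:
        clusters[author.get("surname", "")[:1].upper()].append(author)
    for group in clusters.values():
        yield from combinations(group, 2)
```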

Comparison of datatype properties: This method is based on comparing the corresponding datatype properties of two instances of the same class. Several string similarity metrics are used, such as QGrams, Monge-Elkan distance, Levenshtein distance, and others. A different metric can be specified for each property, including a so-called composite metric, a weighted combination of several metrics (for example QGrams with a weight of 0.6 and Monge-Elkan with a weight of 0.4), where the weights are set according to the results of experiments. We also use special metrics for some properties, e.g., names, where we consider abbreviations of their parts. Some properties only need to be compared for identity, e.g., the ISBN of books.
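
The following sketch shows the idea of a composite metric with the 0.6/0.4 weights mentioned above; it uses a simple q-gram Dice coefficient and Python's SequenceMatcher as a stand-in for the Monge-Elkan metric, not the actual metric implementations used in the system.

```python
from difflib import SequenceMatcher

def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    # Dice coefficient over the sets of character q-grams of both strings.
    grams = lambda s: {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def composite_similarity(a: str, b: str) -> float:
    # Weighted combination of two metrics with the 0.6/0.4 weights from the
    # text; SequenceMatcher only stands in for the Monge-Elkan metric here.
    return 0.6 * qgram_similarity(a, b) + 0.4 * SequenceMatcher(None, a, b).ratio()
```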

Comparison of object properties: The principle of this method is to compare the relations to neighboring instances in the ontology. The more instances two entities share, the higher the probability that they are the same. For each object property (e.g., the authors of two books) we compare the datatype properties of the related instances to determine how close they are to each other. Each match is included in computing the overall similarity; thus, if two books have three authors in common, all three partial similarities are counted. The same approach also applies to the authors that differ.
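
One possible reading of this comparison, sketched under the assumption that publications are dictionaries with an authors list and that a name similarity function is supplied, is the following.

```python
def author_relation_similarity(book_a: dict, book_b: dict, name_sim) -> float:
    """For every author of the first book take the best-matching author of
    the second book and count that partial similarity, whether it supports
    identity (a shared author) or speaks against it (a differing author)."""
    partials = [
        max((name_sim(a, b) for b in book_b.get("authors", [])), default=0.0)
        for a in book_a.get("authors", [])
    ]
    return sum(partials) / len(partials) if partials else 0.0
```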

Duplicates Resolving

We identified several possibilities for handling two instances that were identified as identical:

  • mark the instances as identical via a special object property,
  • resolve manually (appropriate when only a small number of duplicates was found),
  • delete the instance with less information,
  • join the instances: take the datatype properties with greater length and merge the non-functional object properties (a minimal sketch of this strategy follows).
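
A minimal sketch of the joining strategy, assuming both instances are plain dictionaries in which list-valued entries represent non-functional object properties, could look as follows.

```python
def join_instances(a: dict, b: dict) -> dict:
    """Keep the longer value of each datatype property and concatenate the
    values of non-functional object properties (representing both kinds as
    plain dict entries is an assumption of this sketch)."""
    merged = {}
    for key in set(a) | set(b):
        va, vb = a.get(key), b.get(key)
        if isinstance(va, list) or isinstance(vb, list):
            # Non-functional object property: join the related instances.
            merged[key] = (va or []) + (vb or [])
        else:
            # Datatype property: prefer the value with greater length.
            merged[key] = max((v for v in (va, vb) if v is not None),
                              key=lambda v: len(str(v)), default=None)
    return merged
```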