A              Clusterer – Method for Graph Clustering

A.1          Basic Information

When the information space consists of objects and relations between these objects (e.g. job-offers and companies which published those offers), graph representation can serve as the basis for easier user navigation in the information space.

Unfortunately, raw graph presentation can be actually quite overwhelming for the user, hence additional structure needs to be created, allowing the user to select him the best-fitting level of granularity. In such cases, clustering is used to partition the dataset into subsets (clusters) of similar objects, supporting easier navigation in the data-set.

A.1.1      Basic Terms

Cluster

Group of objects sharing common features

A.1.2      Method Description

Clusterer composes hierarchy of layers of clusters:

§  Subgraph extraction from ontological repository to relational DB;

§  Hierarchic clustering based on selected clustering method;

§  Graph persistence into relational DB;

§  Communication with ClusterNavigator tool, which visualizes and supports user navigation in the hierarchy of clusters.

Example of a hierarchic composition of clusters is in Fig. 1.

Fig. 1. Example of hierarchic clustering.

 

A.1.3      Scenarios of Use

Clusterer should be used in following conditions:

§  Graph to be clustered is large and its vertices are well-connected, the edges can be either weighted or unweighted;

Typical scenario of use is present in Fig. 2. Note that Clusterer tool is a stand-alone application in the process of graph extraction and clustering, thus ClusterNavigator tool must be present only in the visualization phase.

Fig. 2. Architecture of the clustering tools – from OWL through graph clustering to user navigation.

 

A.1.4      External Links and Publications

§  Suchal J., Vojtek P., Frivolt G.: Interactive Navigation in Large Graphs based on Clustering. In Workshop of Tools for Acquisition, Organization and Maintenance of Knowledge in an Environment of Heterogeneous Information Resources, Poľana, Slovakia, 2007.

§  Frivolt G., Suchal J., Veselý R., Vojtek P., Vozár O., Bieliková M.: Creation, Population and Preprocessing of Experimental Data Sets for Evaluation of Applications for the Semantic Web. In SOFSEM’08, Nový Smokovec, Slovakia. LNCS, 2008.

§  JUNG: Java Universal Network/Graph Framework (http://jung.sourceforge.net/)

§  Sesame: RDF Schema Querying and Storage (http://www.openrdf.org/)

A.2          Integration Manual

Clusterer is developed in Java (Standard Edition 6).

A.2.1      Dependencies

Clusterer uses MySQL database. Visualization of the structure of clusters is dependent on ClusterNavigator tool.

A.2.2      Installation

Running Clusterer as server requires following steps:

  1. Edit clusterer.properties file;
  2. Ensure your database is running and contains database with graph and clusters;
  3. Run clusterer.jar.

A.2.3      Configuration

Edit clusterer.properties file:

  1. Set database type and name, database URL, login and password;
  2. Set port number to be used in communication with ClusterNavigator.

A.2.4      Integration Guide

Clusterer can be used in a following way:

§  Extract subgraph of instances and relations between them from OWL.

§  Create cluster hierarchy;

§  Provide access to clusters stored in relational database (in the visualization and navigation in clusters).

Typical scenario of use of the tool Clusterer is following:

  1. Extract graph from OWL;
  2. Create graph clusters and persistently store the results;
  3. Communicate with ClusterNavigator.

Graph Extraction

The process of graph extraction from ontology uses Sesame in order to access the OWL file. Nodes and edges to be extracted are defined in an XML file. An example of a configuration, where instances of types JobOffer, Settlement and ProfessionClassification produce graph nodes and predicate hasDutyLocation creates edges between them is following:

<nodes>

    <node name="n1" predicateConstraint="rdf:type" objectConstraint="http://nazou.fiit.stuba.sk/nazou/ontologies/v0.6.17/offer-job#JobOffer" label="ofr:name"/>

    <node name="n2" predicateConstraint="rdf:type" objectConstraint="http://nazou.fiit.stuba.sk/nazou/ontologies/v0.6.17/region#Settlement" label="rdfs:label"/>                 

    <node name="n3" predicateConstraint="rdf:type" objectConstraint="http://nazou.fiit.stuba.sk/nazou/ontologies/v0.6.17/classification#ProfessionClassification" label="rdfs:label"/>

</nodes>

<edges>

    <edge srcNode="n1" predicateConstraint="jo:hasDutyLocation" destNode="n2" />

</edges>

An extracted graph is stored in a relational database in two tables:

§  Table nodes stores graph vertices and has following attributes:

o  id – unique node ID;

o  type – type of node;

o  text – text label of node;

o  uri –instance URI in ontology.

§  Table edges stores edges between nodes. Following attributes are present:

o  start_id – ID of starting node;

o  end_id – ID of ending node;

o  weight – edge weight.

Hierarchic Clustering

Result of the hierarchic clustering is stored in the same relational database where the tables nodes and edges are stored in. Following tables are used:

§  Table clusters wraps unifies IDs of nodes and cluster nodes:

o  Attribute ID is unique ID;

o  Attribute node_id is from table nodes.id, if the lowest level node are wrapped, otherwise NULL value is used to determine cluster node.

o  Attribute text is textual annotation of a node.

§  Table cluster_hierarchy stores the hierarchic composition of clusters where attributes cluster_id and parent_id are bound together with following meaning: cluster with parent_id contains vertex cluster_id (which can be a single node or cluster). Note that parent_id contains one but usually more cluster_id’s, but cluster_id is only in one parent_id.

§  Table cluster_closure maps clusters to lowest level nodes (IDs of the nodes are in the table clusters) – see Fig. 3 (c).

Fig. 3. Different approaches to cluster hierarchy representation.

A.3          Development Manual

A.3.1      Tool Structure

Clusterer consists of following packages:

§  Clusterer runner (sk.fiit.nazou.clusterer)

§  Database connection (sk.fiit.nazou.common)

§  Graph extraction from OWL (sk.fiit.nazou.extraction)

§  Clustering methods (sk.fiit.nazou.methods)

§  Communication with ClusterNavigator (sk.fiit.nazou.server)

§  Persistence of clusters using relational DB (sk.fiit.nazou.storage)

A.3.2      Method Implementation

Hierarchic clustering

The process of hierarchic clustering in the ClustererRunner class is following:

  1. Create graph gateway (i.e. database connectivity);
  2. Load the graph from the database (i.e. create JUNG graph from tables nodes and edges);
  3. Choose the clustering method and its parameters;
  4. Initialize HierarchicClusterer object;
  5. Fill Collection of CompositeLayers (the layers of hierarchy of cluster) with values returned by HierarchicClusterer;
  6. Store the clustering results into the database using ClusterGateway.

Adding New Clustering Method

Each clustering method must implement GraphClusterer interface (according to JUNG library) – this step usually consist only of implementation of the extract method, which has following declaration:

public ClusterSet extract(ArchetypeGraph graph)   

Please refer to the source code of DummyGraphClusterer for a simple example of a clustering method.

A.4          Manual for Adaptation to Other Domains

Clusterer is capable to clusterize any input graph. According to the qualities of the input graph (e.g., number of vertices, graph connectivity degree, weighted/unweighted graph), adequate clustering method should be selected.

A.4.1      Configuring to Other Domain

Switching to other domain usually requires:

§  Reconfiguration of the graph extractor;

§  Determining proper clustering method and appropriate parameters of the method;

§  If different database than MySQL is used, database connector must be changed and some SQL statements may be incorrect.

A.4.2      Dependencies

Clusterer depends on MySQL. Ontology to graph extractor depends on Sesame, graph visualization uses ClusterNavigator.