Strukturelle Analyse Web-basierter Dokumente

Dehmer, Matthias (2005)
Strukturelle Analyse Web-basierter Dokumente.
Technische Universität Darmstadt
Ph.D. Thesis, Primary publication

PDF
diss_dehmer.pdf
Copyright Information: In Copyright.

Item Type:

Ph.D. Thesis

Type of entry:

Primary publication

Title:

Strukturelle Analyse Web-basierter Dokumente

Language:

German

Referees:

Mühlhäuser, Prof. Dr. Max ; Mehler, Jun.-Prof. Alexander

Date:

12 October 2005

Place of Publication:

Darmstadt

Date of oral examination:

25 September 2005

Abstract:

With the rise of web-based communication, and given the enormous volumes of data available on the World Wide Web, Web Mining is becoming increasingly important. The goal of Web Mining is to extract information from and analyze web-based data using Data Mining methods. The core problem of Data Mining is the discovery of patterns and structures in large data sets. Web Mining is thus a variant of Data Mining; it can be roughly divided into three areas: Web Structure Mining, Web Content Mining, and Web Usage Mining. The central problem of Web Structure Mining, which is the particular focus of this thesis, is the investigation of structural properties of web-based documents. As is customary, the web is treated as a hypertext in this thesis. In the early phase of hypertext research, graph-based indices were used to measure structural characteristics of hypertexts and to compare their structures. However, these indices are inadequate for the similarity-based grouping of graph-based hypertext structures. The present thesis therefore concentrates on developing new graph-theoretic and similarity-based analysis methods. Similarity-based analysis methods that rest on graph-theoretic models can be used sensibly in the hypertext domain only if they enable meaningful and efficient structural comparisons of graph-based hypertexts. For this reason, this thesis develops a parametric graph similarity model that has many applications in Web Structure Mining. A central challenge here is constructing a procedure for determining the structural similarity of graphs. Classical methods for determining graph similarity are in most cases based on isomorphism and subgraph isomorphism relations. In contrast, this thesis develops a method for determining the structural similarity of hierarchized and directed graphs that does not build on isomorphism relations. Analyses of web-based document structures often take the well-known vector space model as their basis. On the basis of a graph-based representation model, this thesis instead puts forward and substantiates the claim that the graph-based representation is a meaningful starting point for modelling web-based documents. In an experimental part, the developed graph similarity measures are successfully evaluated, and the applications resulting from the evaluation are presented.

Alternative Abstract (English):

In the course of web-based communication, and in view of the huge amount of data on the web, Web Mining is attracting considerable interest. The main goal of Web Mining is to mine information from and analyze web-based hypertext data using well-known Data Mining methods. The core problem of Data Mining is the discovery of patterns and structures in large amounts of data. Web Mining can be divided into three major fields: Web Structure Mining, Web Content Mining, and Web Usage Mining. The main focus of this thesis is Web Structure Mining, that is, the exploration and examination of structural properties of web-based documents. In this thesis the web is considered as a hypertext. In the initial phase of hypertext research, graph-theoretic indices were used to measure structural characteristics of hypertexts. For similarity-based clustering, however, these indices are inadequate. The present thesis therefore deals with the development of new graph-theoretic and similarity-based methods for analyzing hypertext structures. Such methods are reasonably applicable only if they allow a meaningful and efficient structural comparison of hypertexts. Therefore, this thesis develops a parametric graph similarity model with applications in Web Structure Mining. Developing such a model is challenging because classical methods for measuring the structural similarity of graphs are based on isomorphism relations between the underlying graphs or subgraphs, and the subgraph isomorphism problem is well known to be NP-complete. In contrast, this thesis presents a new method for measuring the structural similarity of graphs that is not based on isomorphism relations. Analyses of hypertext structures are frequently based on the vector space model. On the basis of a graph-oriented hypertext representation, this thesis instead presents and substantiates the hypothesis that the graph-oriented representation is a meaningful starting point for modelling web-based hypertexts. The developed graph similarity measures are evaluated successfully in experimental examinations. Furthermore, applications of the new graph similarity measures are presented.

Uncontrolled Keywords:

Hypertext, Graph Theory, Similarity, World Wide Web

Alternative keywords: