On the utility of bytewise approximate matching in computer science with a special focus on digital forensics investigations

Breitinger, Frank (2014)
On the utility of bytewise approximate matching in computer science with a special focus on digital forensics investigations.
Technische Universität Darmstadt
Ph.D. Thesis, Primary publication

Preview

Text
20140717_Dissertation_Breitinger_final.pdf
Copyright Information: CC BY-NC-ND 2.5 Generic - Creative Commons, Attribution, NonCommercial, NoDerivs .
Download (3MB) | Preview

Item Type:

Ph.D. Thesis

Type of entry:

Primary publication

Title:

On the utility of bytewise approximate matching in computer science with a special focus on digital forensics investigations

Language:

English

Referees:

Katzenbeisser, Prof. Dr. Stefan ; Baier, Prof. Dr. Harald ; Freiling, Prof. Dr. Felix

Date:

2014

Place of Publication:

Darmstadt

Date of oral examination:

30 June 2014

Abstract:

Handling hundreds of thousands of files is a major challenge in today’s digital forensics. In order to cope with this information overload, investigators often apply hash functions for automated input identification. Besides identifying exact duplicates, which is mostly solved running cryptographic hash functions, it is also necessary to cope with similar inputs (e.g., different versions of files), embedded objects (e.g., a JPG within a office document), and fragments (e.g., network packets). Thus, the essential idea is to complement the use of cryptographic hash functions, to detect data objects with bytewise identical representation, with the capability to find objects with bytewise similar representations. Unlike cryptographic hash functions, which have a wide range of applications and have been studied as well as tested for a long time, approximate matching algorithms are still in their early development stages. More precisely, currently the community is missing a definition, an evaluation methodology and (additional) fields of application. Therefore, this thesis aims at establishing approximate matching in computer sciences with a special focus on digital forensic investigations. One of our firsts step was to develop a generic definition for approximate matching, in collaboration with the National Institute of Standards and Technology (NIST) which is applicable to the different levels approximate matching, e.g., bytewise and semantic. A subsequent detailed analysis of both existing approaches uncovers different strengths and weaknesses, therefore we present improvements. To extend the range of algorithms, this work introduces three of our new algorithms, that are based on well-known techniques of computer sciences. A core contribution of this thesis is the open source evaluation framework called FRASH which assesses tools on different criteria. Besides traditional properties (borrowed from hash functions) like generation efficiency and space efficiency (compression), we conceive methods to determine precision and recall rates based on synthetic as well as real world data. Since digital investigations are often time critical, we improve the performance of automated file identification by a mechanism we call prefetching. Compared to a straight forward analysis, the performance increases by almost 40% without additional hardware. In this context we also discuss the impact of different hashing/approximate matching algorithms for digital investigations and conclude that it is absolutely reasonable to apply crypto hashing as well as bytewise/semantic approximate matching algorithms in a prosecution. To extend the fields of application, this thesis demonstrates the capabilities of applying approximate matching on network traffic analysis and biometric template protection. Our research shows that approximate matching is perfectly suited for data leakage prevention and can also be applied for biometric template protection, biometric data compression and efficient biometric identification.

Alternative Abstract:

Alternative Abstract

Language

Heutzutage ist das Bearbeiten von hunderttausenden Dateien eine der wesentlichen Herausforderungen der digitalen Forensik. Um diese Datenflut zu bewältigen verwenden Ermittler oft Hashfunktionen zur automatisierten Dateierkennung. Neben exakten Duplikaten, wofür meist kryptographische Hashfunktionen verwendet werden, ist es jedoch auch hilfreich ähnliche Dateien (z.B. verschiedene Versionen einer Datei), eingebettete Objekte (z.B. JPG in einem Office Dokument) und Fragmente (z.B. Netzwerkpakete) zu erkennen. Die wesentliche Idee ist den klassischen Ansatz der kryptographischen Hash- funktionen zu ergänzen und somit auch ähnliche Bytesequenzen zu klassifizieren. Im Gegensatz zu kryptographischen Hashfunktionen, die zahlreiche Anwendungsgebiete haben und ausgiebig getestet werden, sind ähnlichkeitserhaltende Algorithmen (engl. Approximate matching algorithms) noch in einer frühen Entwicklungsphase. Genaugenommen fehlt der Community derzeit eine Definition, eine Evaluations-Methodologie und weitere Anwendungsgebiete.

Diese Dissertation hat daher das Ziel ähnlichkeitserhaltende Algorithmen in der In- formatik, mit besonderem Blick auf die digitale Forensik, zu etablieren. In einem ersten Schritt präsentieren wir eine generische Definition, die in Zusammenarbeit mit dem National Institute of Standards and Technology (NIST) entstanden ist. Im Anschluss werden die beiden existierenden Algorithmen untersucht um Stärken / Schwächen zu analysieren. Um das Sortiment dieser Algorithmen zu erweitern, präsentieren wir drei weitere, eigene Algorithmen, die auf bewehrten Techniken der Informatik basieren.

Ein wesentlicher Bestandteil dieser Arbeit ist das open source Evaluationsframework FRASH welches Tools anhand verschiedener Kriterien bewertet. Neben den traditionellen Eigenschaften wie Laufzeit oder Kompression, haben wir Methoden entwickelt um die Precision & Recall Raten auf synthetischen und realen Daten zu messen.

Da Ermittlungen in der digitalen Forensik oft zeitkritisch sind, haben wir ein Verfahren namens ‘prefetching’ entwickelt, welches die Laufzeit um ca. 40% verbessert, ohne zusätzliche Hardware. In diesem Zusammenhang wird auch der Einfluss unterschiedlicher Verfahren für kriminalistische Ermittlungen untersucht mit dem Entschluss,dass es durchaus Bedarf für unterschiedliche Algorithmen (d.h. auch kryptographische Hashfunktionen, semantische Hashfunktionen) in der Strafverfolgung gibt.

Abschließend zeigen wir weitere Anwendungsgebiete und wenden ähnlichkeitserhaltende Algorithmen zur Netzwerkanalyse und zum Schutze biometrischer Daten an. Die Ergebnisse zeigen, dass diese Techniken sich zum Schutze sensibler Daten eignen (Data leakage prevention) und auch in der Biometrie für Template Protektion eingesetzt werden können.

German

URN:

urn:nbn:de:tuda-tuprints-40559

Classification DDC:

000 Generalities, computers, information > 004 Computer science

Divisions:

20 Department of Computer Science

Date Deposited:

28 Jul 2014 10:42

Last Modified:

09 Jul 2020 00:45

URI: