In this paper, we propose the Hierarchical Document Transformer (HDT), a novel sparse Transformer architecture tailored for structured hierarchical documents. Such documents are central to many domains, including science, law, and medicine. However, most existing solutions are inefficient and fail to make use of the structure inherent to documents. HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy. This approach facilitates information exchange between tokens at different levels while maintaining sparsity, which improves computational and memory efficiency and uses the document structure as an inductive bias. We address the technical challenge of implementing HDT’s sample-dependent hierarchical attention pattern by developing a novel sparse attention kernel that takes the hierarchical structure of documents into account. As demonstrated by our experiments, utilizing the structural information present in documents leads to faster convergence, higher sample efficiency, and better performance on downstream tasks.
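
The idea of anchor tokens and a sparse multi-level attention pattern can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the anchor names `[DOC]`, `[SEC]` and `[SENT]`, the three-level document/section/sentence hierarchy, the helper names `flatten_with_anchors` and `hierarchical_mask`, and the rule that a position attends only to itself, its parent anchor, its children, and its siblings. The sketch builds a dense boolean mask for readability; it is not the sparse attention kernel the abstract refers to.

```python
from typing import List, Tuple

# Hypothetical anchor-token names; the abstract does not name them.
DOC, SEC, SENT = "[DOC]", "[SEC]", "[SENT]"

# A document is modelled as sections -> sentences -> words (an assumption).
Document = List[List[List[str]]]


def flatten_with_anchors(doc: Document) -> Tuple[List[str], List[int]]:
    """Flatten a nested document, inserting one anchor per structural unit.

    Returns the token sequence and, for each position, the index of its
    parent anchor (-1 for the document anchor, which has no parent).
    """
    tokens, parents = [DOC], [-1]
    for section in doc:
        sec_idx = len(tokens)
        tokens.append(SEC)
        parents.append(0)  # section anchors hang off the document anchor
        for sentence in section:
            sent_idx = len(tokens)
            tokens.append(SENT)
            parents.append(sec_idx)  # sentence anchors hang off their section
            for word in sentence:
                tokens.append(word)
                parents.append(sent_idx)  # words hang off their sentence
    return tokens, parents


def hierarchical_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Assumed rule: a position attends to itself, its siblings (same parent),
    its parent anchor, and its children.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            siblings = parents[i] != -1 and parents[i] == parents[j]
            parent_child = parents[i] == j or parents[j] == i
            mask[i][j] = i == j or siblings or parent_child
    return mask


# Toy document: two sections, the first with two sentences.
doc = [
    [["Sparse", "attention"], ["It", "scales"]],
    [["Anchors", "help"]],
]
tokens, parents = flatten_with_anchors(doc)
mask = hierarchical_mask(parents)

# Fraction of allowed query-key pairs; dense attention would give 1.0.
density = sum(map(sum, mask)) / len(tokens) ** 2
print(tokens)
print(f"mask density: {density:.2f}")
```

Under this assumed rule, words in different sentences never attend to each other directly; information crosses sentence and section boundaries only through the anchors. Because the mask depends on each document's own structure, it differs from sample to sample, which is the property the abstract describes as sample-dependent.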

Keywords

compute & memory effi...

sparse attention

encoder-only

encoder-decoder

long-text Transformer...

Language

English

Alternative Abstract

A compute-efficient Transformer using hierarchical sparse attention.

Department/Field

20 Department of Computer Science > Ubiquitous Knowledge Processing

DDC

000 Generalities, computer science, information science > 004 Computer science

Institution

Universitäts- und Landesbibliothek Darmstadt

Place

Darmstadt

Event title

Conference on Language Modeling

Event location

Philadelphia, PA, USA

Event start date

7 October 2024

Event end date

9 October 2024

Place of first publication

[not determinable]

Year of first publication

2024

PPN

53028846X

Additional links (organization)

https://2024.colmweb.org/index.html