A Heterogeneous System Architecture for Low-Power Wireless Sensor Nodes in Compute-Intensive Distributed Applications

Eine heterogene Architektur für energieeffiziente drahtlose Sensorknoten in rechenintensiven verteilten Anwendungen
Zur Erlangung des akademischen Grades Doktor-Ingenieur (Dr.-Ing.) genehmigte Dissertation von Dipl.-Inform. Andreas Engel aus Sonneberg
Tag der Einreichung: 09.11.2015, Tag der Prüfung: 17.12.2015
Darmstadt 2016 — D 17

1. Gutachten: Prof. Dr.-Ing. Andreas Koch
2. Gutachten: Prof. Dr.-Ing. Christian Hochberger
A Heterogeneous System Architecture for Low-Power Wireless Sensor Nodes in Compute-Intensive Distributed Applications

Eine heterogene Architektur für energieeffiziente drahtlose Sensorknoten in rechenintensiven verteilten Anwendungen

Genehmigte Dissertation von Dipl.-Inform. Andreas Engel aus Sonneberg

1. Gutachten: Prof. Dr.-Ing. Andreas Koch
2. Gutachten: Prof. Dr.-Ing. Christian Hochberger

Tag der Einreichung: 09.11.2015
Tag der Prüfung: 17.12.2015

Darmstadt 2016 — D 17

Bitte zitieren Sie dieses Dokument als:
URN: urn:nbn:de:tuda-tuprints-57782
URL: http://tuprints.ulb.tu-darmstadt.de/id/eprint/5778

Dieses Dokument wird bereitgestellt von tuprints,
E-Publishing-Service der TU Darmstadt
http://tuprints.ulb.tu-darmstadt.de
tuprints@ulb.tu-darmstadt.de

Die Veröffentlichung steht unter folgender Creative Commons Lizenz:
Namensnennung – Keine kommerzielle Nutzung – Keine Bearbeitung 4.0 International
https://creativecommons.org/licenses/by-nc-nd/4.0
Erklärung zur Dissertation

Hiermit versichere ich, die vorliegende Dissertation ohne Hilfe Dritter nur mit den angegebenen Quellen und Hilfsmitteln angefertigt zu haben. Alle Stellen, die aus Quellen entnommen wurden, sind als solche kenntlich gemacht. Diese Arbeit hat in gleicher oder ähnlicher Form noch keiner Prüfungsbehörde vorgelegen.

Darmstadt, den 22.11.2016

(Andreas Engel)
Abstract

Wireless Sensor Networks (WSNs) combine embedded sensing and processing capabilities with a wireless communication infrastructure, thus supporting distributed monitoring applications. WSNs have been investigated for more than three decades, and recent social and industrial developments such as home automation, or the Internet of Things, have increased the commercial relevance of this key technology. The communication bandwidth of the sensor nodes is limited by the transportation media and the restricted energy budget of the nodes. To still keep up with the ever increasing sensor count and sampling rates, the basic data acquisition and collection capabilities of WSNs have been extended with decentralized smart feature extraction and data aggregation algorithms. Energy-efficient processing elements are thus required to meet the ever-growing compute demands of the WSN motes within the available energy budget.

The Hardware-Accelerated Low Power Mote (HaLoMote) is proposed and evaluated in this thesis to address the requirements of compute-intensive WSN applications. It is a heterogeneous system architecture, that combines a Field Programmable Gate Array (FPGA) for hardware-accelerated data aggregation with an IEEE 802.15.4 based Radio Frequency System-on-Chip for the network management and the top-level control of the applications. To properly support Dynamic Power Management (DPM) on the HaLoMote, a Microsemi IGLOO FPGA with a non-volatile configuration storage was chosen for a prototype implementation, called Hardware-Accelerated Low Energy Wireless Embedded Sensor Node (HaLOEWEn). As for every multi-processor architecture, the inter-processor communication and coordination strongly influences the efficiency of the HaLoMote. Therefore, a generic communication framework is proposed in this thesis. It is tightly coupled with the DPM strategy of the HaLoMote, that supports fast transitions between active and idle modes. Low-power sleep periods can thus be scheduled within every sampling cycle, even for sampling rates of hundreds of hertz.

In addition to the development of the heterogeneous system architecture, this thesis focuses on the energy consumption trade-off between wireless data transmission and in-sensor data aggregation. The HaLOEWEn is compared with typical software processors in terms of runtime and energy efficiency in the context of three monitoring applications. The building blocks of these applications comprise hardware-accelerated digital signal processing primitives, lossless data compression, a precise wireless time synchronization protocol, and a transceiver scheduling for contention free information flooding from multiple sources to all network nodes. Most of these concepts are applicable to similar distributed monitoring applications with in-sensor data aggregation.

A Structural Health Monitoring (SHM) application is used for the system level evaluation of the HaLoMote concept. The Random Decrement Technique (RDT) is a particular SHM data aggregation algorithm, which determines the free-decay response of the monitored structure for subsequent modal identification. The hardware-accelerated RDT executed on a HaLOEWEn mote requires only 43% of the energy that a recent ARM Cortex-M based microcontroller consumes for this algorithm. The functionality of the overall WSN-based SHM system is shown with a laboratory-scale demonstrator. Compared to reference data acquired by a wire-bound laboratory measurement system, the HaLOEWEn network can capture the structural information relevant for the SHM application with less than 1% deviation.
Drahtlose Sensornetze (Wireless Sensor Networks, WSNs) kombinieren eingebettete Sensorik und Rechenleistung mit einer drahtlosen Kommunikationsinfrastruktur, wodurch räumlich verteilte Überwachungsanwendungen unterstützt werden. WSNs werden seit mehr als drei Jahrzehnten erforscht, aktuelle soziale und industrielle Trends wie intelligentes Wohnen oder das Internet der Dinge haben aber auch die kommerzielle Bedeutung dieser Schlüsseltechnologie verstärkt. Die übertragbaren Datenmengen sind durch das Transportmedium und die verfügbare Energie der Sensorknoten beschränkt. Um die wachsende Anzahl an Sensoren und die steigenden Abtastraten dennoch zu bewältigen, wurden die ursprünglich als einfache Datenerfassungssysteme ausgelegten WSNs um Fähigkeiten zur dezentralen Datenaggregation erweitert. Um den dadurch ständig wachsenden Bedarf an verteilter Rechenleistung mit den beschränkten Energieressourcen zu realisieren, werden energieeffiziente Recheneinheiten benötigt.


Eine Anwendung aus dem Bereich der Strukturüberwachung (Structural Health Monitoring, SHM) wird für die systemische Evaluation des HaLoMote Konzepts verwendet. Die Random Decrement Technique (RDT) ist ein spezieller Aggregationsalgorithmus, welcher das freie Ausschwingverhalten der überwachten Struktur ermittelt, selbst wenn die eigentliche Anregung der Struktur nicht bekannt ist. Dies ermöglicht eine operative Modalanalyse, welche die Voraussetzung für eine autonome Langzeitüberwachung ist. Die Berechnung der RDT auf der HaLOEWEn Plattform benötigt nur 43% der Energie, welche ein aktueller ARM Cortex-M Mikrocontroller für den glei-
chen Algorithmus verbraucht. Um die Funktionsfähigkeit des gesamten WSN-basierten SHM Systems nachzuweisen, wurde ein Demonstrator im Labormaßstab aufgebaut. Im Vergleich zu einem drahtgebundenen Labormesssystem können die wesentlichen Strukturinformationen vom HaLOEWEen Netzwerk mit weniger als 1% Abweichung erfasst werden.

Danksagung


Prof. Dr.-Ing. Christian Hochberger möchte ich für das kurzfristige Erstellen des Zweitgutachtens danken.


Abstract I
Kurzfassung II
Danksagung III
Contents V

1 Introduction 1
  1.1 Importance of Wireless Sensor Networks and Energy-Efficient Computing 1
  1.2 Targeted Applications and Research Problems 4
  1.3 Thesis Contribution 5
  1.4 Thesis Outline 7

2 Technical Fundamentals 9
  2.1 Wireless Sensor Networks and Wireless Sensor Node Architectures 9
  2.2 Reconfigurable Computing 13

3 Related Work 21
  3.1 Wireless Sensor Network Motes 21
    3.1.1 MCU-Based Motes 21
    3.1.2 ASIC-Based Motes 23
    3.1.3 RCU-Based Motes 23
  3.2 Common Wireless Sensor Network Services 24
    3.2.1 Lossless Data Compression 24
    3.2.2 Network Flooding 25
    3.2.3 Wireless Time Synchronization 26
  3.3 Compute-Intensive Wireless Sensor Network Applications 27
    3.3.1 Condition Monitoring 27
    3.3.2 Structural Health Monitoring 27

4 HaLoMote: Hardware-Accelerated Low Power Mote 31
  4.1 Architecture Overview 31
  4.2 HaLOEWEn Implementation 33
  4.3 Inter-Processor Communication 37
  4.4 Power Management 39
    4.4.1 Flash*Freeze Control 39
    4.4.2 Swapping Live Variables to External Memory 40
    4.4.3 Flexible Clock Generation 42
    4.4.4 Evaluation 46

5 Library of Reusable Hardware Accelerator Blocks 49
  5.1 Digital Filters 49
    5.1.1 Finite Impulse Response Filter 50
    5.1.2 Fast Fourier Transformation 50
  5.2 Lossless Data Compression 51
    5.2.1 Adaptive Differential Pulse Code Modulation and Rice-Encoding 51
    5.2.2 Hardware-Accelerated ADPCM 53
    5.2.3 Test Setup for Energy Efficiency Evaluation 55
  5.3 Linear Regression 57
    5.3.1 Rolling Linear Regression with Coordinate Transformation 58
6 Network Communication Primitives
6.1 Transceiver Scheduling for Multi-Source Flooding
6.1.1 Network Model and Cost Function
6.1.2 Integer Linear Program for Optimal Scheduling
6.1.3 Heuristic Scheduling
6.1.4 Evaluation
6.2 Precise Wireless Time Synchronization
6.2.1 Necessity of Clock Drift Compensation
6.2.2 Organization of Timestamp Exchange and Clock Drift Estimation
6.2.3 Evaluation

7 Targeted Applications
7.1 Monitoring Neural Activities of Primates
7.1.1 Compressibility of the Monitored Data
7.1.2 Energy Efficiency of Hardware-Accelerated Data Compression
7.2 Condition Monitoring of Heavy Industrial Machinery
7.2.1 Compressibility of the Monitored Data
7.2.2 Energy Efficiency of Hardware-Accelerated Data Compression
7.3 Distributed Structural Health Monitoring
7.3.1 Random Decrement Technique
7.3.2 Distributed Random Decrement Technique
7.3.3 Hardware-Accelerated Random Decrement Technique
7.3.4 Operational Modal Analysis and Damage Detection Principles

8 System Level Evaluation: Distributed Monitoring of a Railroad Bridge Model
8.1 Demonstrator Setup
8.2 Accuracy of the Wireless Data Acquisition System
8.3 Performance and Energy Evaluation
8.4 Discussion of Results

9 Summary and Future Work
9.1 Lessons Learned
9.2 Remaining Research and Engineering Challenges

Bibliography
Scientific Publications
Hardware Datasheets
Software Tools
Specifications and Standards
Others

Abbreviations

List of Figures

List of Tables

List of Algorithms

List of Code Listings
CHAPTER 1

Introduction

In this chapter, the history and current trends in Wireless Sensor Networks (WSNs) are summarized to motivate the relevance of compute-intensive distributed applications. Furthermore, reconfigurable computing for application-specific and hence energy-efficient computing architectures is briefly introduced. When joining both topics, relevant research problems arise that will be addressed in this thesis. Furthermore, the main contributions and the organization of the thesis are summarized in this chapter.

1.1 Importance of Wireless Sensor Networks and Energy-Efficient Computing

Similar to many fundamental technologies, WSNs originated from military applications, such as the networks of acoustic sensors already deployed in the 1950s for submarine tracking [Silverstein1978]. The United States Defense Advanced Research Projects Agency (DARPA) focused their research on distributed sensing, computation and communication within the Distributed Sensor Networks (DSN) project in the 1980s [CMUPDCS1978]. However, the radio transceivers available at that time (e.g., satellite links and early cell phones) could not achieve the required bandwidth, energy efficiency, reliability and integration density, so the first DSNs were based on wire-bound communication.

In the 1990s, advances in wireless communication technologies facilitated the shift from wired DSNs to actual WSNs. Furthermore, military projects such as the DARPA SensIT [Kumar2001] were complemented by university research projects like the Wireless Integrated Network Sensors (WINS) at UC Los Angeles [Asada1998] or the PicoRadio program at UC Berkeley [Rabaey2000]. This opened the WSN technology to a broad spectrum of civilian applications, such as the monitoring of environmental parameters and industrial processes, the tracking of objects, and the detection of events [Chong2003].

At the beginning of the 21st century, the standardization of many wireless communication technologies improved the interoperability and the reusability of the WSN devices, which are also called motes:

- IEEE [802.11] since 1997, i.e., Wireless Local Area Networks (WLANs),
- IEEE [802.15.1] since 2002, i.e., Wireless Personal Area Networks (WPANs),
- IEEE [802.15.4] since 2006, i.e., Low Rate WPANs (LR-WPANs), and
- IEEE [802.15.6] since 2011, i.e., Wireless Body Area Networks (WBANs)

In addition to further improvements in the energy efficiency and density of the sensing, computing, and communication devices, this standardization was the foundation for WSNs to evolve from a subject of scientific research to the base technology for consumer products with an overall market volume of about $1 billion in 2014 [Rawat2014].
Nowadays, the research efforts on WSNs are merged with embedded systems and the system of systems design methodologies, yielding Cyber Physical Systems (CPSs) [Mosterman2015]. This class of advanced applications requires the embedded devices to interact with their surroundings more intensively than typical WSNs, e.g., by supporting actuation and Human-Machine Interfaces (HMIs). CPS deployments are no longer isolated from one another to serve only a restricted application. Instead, open communication and interaction standards connect the ever growing number of CPS motes, thus contributing to the development of the Internet of Things (IoT) [Mainetti2011].

In 2014, the global IoT market ($238 billion) was dominated by industrial applications (24 %), automotive applications (22 %), consumer electronics (19 %), and healthcare (13 %) [TMR2015-IoT]. Although the IoT market is not restricted to WSN and CPS products, more and more wirelessly communicating embedded devices are contributing to the IoT domain. Some examples are listed in Figure 1.

Smart Home applications automate the illumination, heating, ventilation, air conditioning, and irrigation of apartments and buildings to minimize their running costs. After its acquisition by Google in 2014, Nest is one of the most popular vendors of Smart Home products. Its Learning [Thermostat], shown in Figure 1a, can be connected to other peripherals like cameras or automated light bulbs via WLAN. As the data rates required by these applications are rather small, IEEE [802.15.4]-based communication is well-suited for smart home applications. Nest thus allied with important home automation players such as Samsung, ARM, Osram, and Somfy to define a proprietary wireless network stack [Thread], which is optimized for Smart Home applications.

Industrial monitoring systems are typically sealed in a robust housing to withstand harsh environmental conditions, and are equipped with large batteries to support an unattended operation of several days. For example, the BDI [STS4] wireless structural testing system, shown in Figure 1b, consists of a Stellaris ARM Cortex-M3 microcontroller (MCU), a four channel Analog-to-Digital Converter (ADC) with 24 bit resolution, a IEEE [802.11] transceiver, and a 67 Wh battery. It is used as a temporarily installed data acquisition system for the one-time assessment of the current state of a structure (e.g., bridges). The captured sensor data (up to 96 kbit/s) is transferred to the network operator without further preprocessing, while consuming about 1.5 W on average.
The BIOTRONIK [BioMonitor], shown in Figure 1c, is an implantable medical device used to monitor the heart signals of a patient. It samples three channels at several hundred hertz and automatically detects different types of arrhythmia. Only those critical observations are transferred wirelessly to a base station, where it can be accessed by the attending physician. Due to this in-sensor data preprocessing and feature extraction, the operational live time of the device reaches up to six years.

The [BioMonitor] is a typical example of how the small WSN devices achieve an operational endurance of several months or years on a limited energy storage. As the wireless transceiver typically is the major power consumer of a WSN mote [Jain2015], in-sensor data aggregation must be applied to reduce the required communication rates, and thus the active time of the transceivers. However, data aggregation such as lossless data compression or lossy feature extraction impose additional computational load on the WSN motes. Their compute demand is increased further by the complex encryption algorithms required to secure the communication channels, as privacy is becoming more and more important for all IoT applications.

The importance of wirelessly communicating embedded systems with energy-efficient processing capabilities is also reflected by public research budgets and expected market volumes. The European Commission provides an overall funding of €1068 million in the Horizon 2020 Information and Communication Technology (ICT) program during 2016 and 2017, out of which €90 million (8.4%) are dedicated to Smart CPS, Smart System Integration, Smart Anything Everywhere, and customized and low-energy computing [H2020-ICT]. The overall Horizon 2020 IoT funding in the same period amounts to €139 million [H2020-FA]. With this budget, the European Commission is strengthening the European industry to participate in a fast growing market. While the worldwide WSN market volume is estimated to exceed $3 billion in 2020 [FrostSullivan2015-WSN], the overall IoT market volume is forecast to $925 billion in 2021 [TMR2015-IoT]. Thus, the demand for wirelessly communicating embedded systems with energy-efficient processing capabilities will keep rising in the next years.

Software-programmable processors like MCUs and Digital Signal Processors (DSPs) are application-independent general purpose compute units. However, they achieve their flexibility at the cost of a reduced energy efficiency [Hameed2010]. In contrast, Application-Specific Integrated Circuits (ASICs) provide the most energy-efficient realization of a particular application. While all building blocks of a WSN processor such as the sensor controllers, digital filters and the radio protocol stacks can be implemented in dedicated hardware, the combination and configuration of these elements is highly application-specific. Therefore, building a complete WSN mote as customized silicon is not economically worthwhile in most scenarios.

As will be detailed in Section 2.2, Reconfigurable Compute Units (RCUs) such as Field Programmable Gate Arrays (FPGAs) provide the flexibility of software-programmable processors, while almost approaching the energy efficiency of ASICs. In the WSN context, FPGAs are well suited to accelerate dataflow-dominated and inherently parallel algorithms, such as multi-channel digital filtering of sensor data, but control flow-dominated algorithms and long-term low-intensity tasks, like time-keeping and the radio protocol, typically do not benefit from hardware acceleration.
1.2 Targeted Applications and Research Problems

In the last section, WSNs have been introduced as a key enabler for emerging technologies such as the IoT or CPS-controlled industrial monitoring, both of which will increase the demand for energy-efficient embedded processing. While the number of WSN installations is expected to grow rapidly, the concrete tasks to be solved by every single mote will vary greatly within the different application domains and deployment scenarios. RCU s have been introduced as processor architectures combining the efficiency of ASICs with the flexibility of software processors. It therefore appears to be promising to integrate reconfigurable computing into a WSN mote to create an energy-efficient platform that can be easily adopted to many different application scenarios.

The heterogeneous WSN mote developed in this thesis targets compute-intensive distributed applications with the potential for aggressive Dynamic Power Management (DPM). The characteristics of these applications can be broken down into the following requirements. First of all, a sufficient amount of computation has to be applied to the node-local raw sensor data, either by means of data aggregation, compression, or encryption. In principle, unless sensing random data, every application can benefit at least from lossless data compression. However, the additional complexity and static power drawn by a dedicated hardware accelerator will not pay off, if the in-sensor processing could be handled by a low-power MCU at a low duty cycle. The monitoring of slowly changing environmental parameters (e.g., temperature, moisture, or luminance in the agriculture domain) with sampling periods of several seconds or minutes, is thus not covered by this thesis. Second, a fixed sampling period allowing for DPM within each sampling cycle is required. For applications, that have to run as fast as possible (e.g., video processing with maximum frame rates), continuously running hardware accelerators are more suitable and DPM becomes moot. The same holds true, if the required sampling periods approach the transition times between the active and the low-power states of the computational devices. As these transitions typically require tens of microseconds, the targeted sampling frequency should not exceed 10 kHz to achieve duty cycles of 10% and below.

When combining the often independent research areas of WSNs and RCU s under these application constraints, many new challenges and fundamental questions arise. The following research problems will thus be addressed by this thesis:

1. Which specific RCU is suited best to be integrated into a WSN mote?

2. How to integrate the RCU into the architecture of a WSN mote? To be more specific, how should the sensors, memories, processing and communication modules be arranged, and how should they communicate?

3. How does the RCU affect the DPM strategy of the WSN mote?

4. How does the RCU affect the trade-off between in-sensor preprocessing and wireless data transmission? For which kinds of application do the improved data aggregation capabilities pay off in terms of overall energy consumption?

5. Once the RCU is integrated as application-specific accelerator, which generic WSN services can also benefit from the improved processing capabilities?
Although a specific vibration-based Structural Health Monitoring (SHM) application is used for the system level evaluation, the key findings of this thesis are applicable to many WSN applications of similar complexity.

1.3 Thesis Contribution

This thesis offers solutions for the research problems listed in the previous section. The main contributions cover:

- A heterogeneous system architecture called Hardware-Accelerated Low Power Mote (HaLoMote), which integrates an RCU as a hardware accelerator into a MCU-based WSN mote.
- A HaLoMote implementation called Hardware-Accelerated Low Energy Wireless Embedded Sensor Node (HaLOEWEn), which is used as a proof-of-concept to evaluate the reliability and the efficiency of the heterogeneous system architecture.
- An application-independent communication framework designed to minimize the inter-processor communication overhead of the heterogeneous architecture.
- The hardware and software support for DPM with a special emphasis on fast shutdown and wake-up to benefit from sleep scheduling at the end of each sampling cycle, even at sampling rates of several hundred hertz.
- The hardware acceleration of application independent modules such as DSP primitives (Finite Impulse Response (FIR) and Fast Fourier Transformation (FFT)), a linear regression, as well as a lossless data compression kernel.
- A precise wireless time synchronization protocol, which reduces the computational efforts required to achieve the synchronization accuracy necessary for the targeted sampling rates.
- A contention-free transceiver scheduling for a minimum-effort n-to-all flooding of messages as required by the targeted SHM application.
- A hardware-accelerated processing kernel for the specific SHM application used for the system level evaluation of the HaLoMote architecture.
- A laboratory-scale demonstrator used for the system level evaluation of the HaLoMote architecture. It consists of a truss bridge model equipped with 20 accelerometers controlled by five HaLOEWEn motes, which capture and aggregate sensor signals appropriate to identify relevant modal parameters of the bridge.

Most of these contributions have already been successfully reviewed by the scientific community. The following publications were presented at international conferences in the WSN, RCU, and SHM domain:

[Engel2011] describes the basic HaLoMote architecture as a technology to trade-off node-local processing against data transmission in WSNs. To show its energy efficiency, the power consumption required by the HaLoMote for FIR computations is compared against commercial off-the-shelf (COTS) 8 bit and 16 bit MCUs and DSPs.
[Engel2012a] presents a concrete implementation of the HaLoMote architecture and a generic framework for the communication between its MCU and its RCU. Furthermore, different schemes for driving the RCU clock domain are proposed and their impact on the DPM are evaluated in the context of an SHM application.

[Engel2012b] describes a laboratory-scale demonstrator for a single HaLOEWEEn mote and discusses its energy efficiency achieved by a decentralized hardware-accelerated FFT.

[Engel2014a] presents the hardware-accelerated implementation of a lossless data compression module. More specifically, an Adaptive Differential Pulse Code Modulation (ADPCM) encoder was developed and applied to sensor data gathered from a vibrating machinery and the neurons of primates. The benefits of the HaLoMote architecture over a software encoder could be shown.

[Engel2014b] describes a wireless routing protocol optimized for the n-to-all flooding required by the SHM application targeted in [Engel2012a]. Based on reachability- and interference graphs, an optimization problem is formulated to find a contention-free minimum-effort scheduling of sending and receiving messages. The solutions provided by a novel Integer Linear Program (ILP) model and a fast heuristic are compared with state-of-the-art flooding protocols.

[Engel2014c] presents an improved implementation of the hardware-accelerated SHM kernel described in [Engel2012a]. Furthermore, a laboratory-scale demonstrator for a network of HaLOEWEEn motes is detailed. This demonstrator is used to evaluate the system level operation of the proposed heterogeneous architecture.

[Engel2015a] describes an improved formulation of the Rolling Linear Regression (RLR) used to optimize clock drift compensation within a high-precision wireless time synchronization protocol. While software-processors already benefit from the proposed algorithm, using a hardware accelerator further reduces the computational overhead of the synchronization protocol.

[Engel2015b] describes a laboratory-scale demonstration of the achievable synchronization accuracy with and without clock drift compensation, as well as the benefits of the hardware-accelerated RLR.

[Engel2015c] compares the measurement accuracy of the HaLoMote-based data acquisition system against a baseline captured by wire-bound laboratory equipment. Furthermore, the energy efficiency of the heterogeneous architecture is compared with the most recent software-processors.

The experiences gained with the HaLoMote architecture and its applications also resulted in additional publications not immediately related to the key topics of this thesis:

[Engel2014d] presents a hardware-accelerated controller for an ultrasonic piezo-electric actuator as the foundation of a haptic feedback system.

[Hochberger2014] presents a Coarse-Grained Reconfigurable Architecture (CGRA) implementation of the ADPCM algorithm described in [Engel2014c]. This paper focuses on aspects of high-level synthesis exploiting instruction-level parallelism.
1.4 Thesis Outline

This thesis is structured as follows.

Chapter 1 motivates the relevance of compute-intensive distributed applications in the context of WSNs. It formulates the research problems that arise when integrating RCUs as energy-efficient hardware accelerator into a WSN mote. Furthermore, it summarizes the main contributions of this research.

Chapter 2 summarizes the main concepts of WSNs and RCUs, as well as the terminology that will be used throughout this thesis.

Chapter 3 discusses related work in the fields of WSN hardware architectures and software services, as well as the class of distributed compute-intensive applications addressed by this thesis.

Chapter 4 presents the proposed heterogeneous WSN architecture HaLoMote and discusses the evolution of its implementation steered by technological developments and application requirements. Furthermore, the communication infrastructure between the compute units, as well as the integration of power management techniques is described.

Chapter 5 details the hardware acceleration of application-independent algorithms such as DSP primitives, lossless data compression, and a linear regression kernel. These modules are used as building blocks for different WSN applications described in Chapter 7.

Chapter 6 presents application-independent wireless communication concepts developed for the HaLoMote, such as a precise time synchronization protocol and a multi-source flooding scheme.

Chapter 7 details the HaLoMote-implementation of three different monitoring applications. The energy efficiency of the hardware accelerators previously described in Chapter 5 is evaluated in the context of these applications.

Chapter 8 gives the system level evaluation results for the targeted SHM application. After describing the laboratory-scale demonstration setup, the measurement accuracy and energy efficiency of the HaLoMote-based measurement system is compared with state-of-the-art wireless and wire-bound measurement systems.

Chapter 9 summarizes the major contributions of this thesis and proposes future research topics that could be addressed to further improve the heterogeneous hardware architecture.
CHAPTER 2
Technical Fundamentals

In this chapter, the main concepts of WSNs and RCUs are introduced to summarize the terminology used throughout this thesis.

2.1 Wireless Sensor Networks and Wireless Sensor Node Architectures

A WSN consists of a number of small embedded systems (termed sensor nodes or motes), which communicate wirelessly to solve a collaborative task. They are typically used as easily deployable data acquisition systems to monitor the temporal and spatial characteristics of ambient physical phenomena such as temperature, humidity, luminance, structural vibration, or acoustic pressure. As shown in Figure 2, the data gathered by the WSN has to be collected at a dedicated gateway node, where it can be stored persistently or forwarded to a remote base station over a Wide Area Network (WAN) such as a cellular mobile radio link. In addition to the monitoring of signals, the tracking of objects that can be detected by the motes or which the motes are attached to is a class of typical WSN applications [Rawat2014].

Also the requirements and capabilities of the motes vary strongly between different applications and even within a certain network, the architecture of a typical WSN mote consists of the five generic units shown in Figure 3. Analog or digital sensors (Figure 3a) capture the relevant signals at sampling periods ranging from several seconds in environmental monitoring [Lazarescu2013] down to several milliseconds in vibration-based structural health monitoring [Battista2013] or even below in acoustic localization applications [Astashov2014].

The sampled data is buffered in the memory module (Figure 3b), before it is further processed or transmitted. The processor-internal memory is typically limited to a few kilobytes, which is not sufficient to capture long-term measurements. Therefore,
additional Static RAM (SRAM) modules or even non-volatile memory (NVM) such as Flash or the more recent Ferroelectric RAM (FRAM) are often integrated into the WSN motes [Yick2008; Zhao2015].

The processor (Figure 3c) is the heart of the mote that coordinates the sensors, the memory and the transceiver. The processor realizes the upper layers of the network communication protocol as shown in Figure 4 and performs other administrative tasks such as the time management. Most WSN motes are based on small MCUs such as the 8 bit Atmel [ATmega128] used by the [Mica2] or the 16 bit TI [MSP430] used by the [TelosB]. For more demanding computations like the online compression or encryption of the sampled data, 32 bit ARM MCUs like the Intel [PXA270] integrated in the [Imote2] or powerful DSPs such as the AD [BlackFin] are used as WSN processors [Ravinagarajan2010]. Beyond these typical software processors, the integration of RCUs and ASICs has been proposed by academic WSN projects such as [Shahzad2014; Walravens2014].

The wireless transceiver (Figure 3d) realizes the lower layers of the communication protocol stack, as shown in Figure 4. It determines the maximum length (i.e., the communication range) as well as the gross data throughput rate of each hop, i.e., the direct link between a sender and a receiver. Each hop can be either a unicast (from one transmitter to one explicitly addressed receiver) or a broadcast (from one transmitter to all reachable receivers). The carrier frequencies of the WSN transceivers are typically located in the freely available Industrial, Scientific and Medical (ISM) bands.
at 2450 MHz (worldwide) and 915 MHz (America) or the Short Range Device (SRD) band at 868 MHz (Europe). The IEEE [802.15.4] standard is the foundation of the most popular WSN communication stacks such as [ZigBee], [WirelessHART] or the International Society of Automation (ISA) [100.11a]. It provides a throughput of 250 kbit/s at an outdoor communication range up to 100 m. Alternative transceiver standards include Bluetooth Low Energy (BLE) (1 Mbit/s at 200 m), Z-wave (40 kbit/s at 30 m), or EnOcean (125 kbit/s at 300 m) [Rawat2014]. If the WSN transceiver and processor are combined into one Integrated Circuit (IC), it is called a Radio Frequency System-on-Chip (RF-SoC).

As the communication range of the wireless transceivers is often smaller than the area covered by the WSN, the data from the leaf nodes most distant from the gateway has to be transferred over intermediate routers. The single-hop interconnectivity between the network nodes can be modeled as a directed graph, which is referred to as the WSN topology. If the covered area is smaller than twice the communication range, the star topology (Figure 5a) is most commonly used as it does not require intermediate routers. The chain (Figure 5b) and the tree (Figure 5c) topology can cover larger areas with multi-hop paths without complex routing logic, as the path between each pair of nodes is unique. In contrast, the routing within the mesh topology (Figure 5d) is rather complex, as the optimal path between two nodes has to be determined at runtime based on varying information such as the maximum end-to-end delay or the workload and the residual energy of each router. Only a mesh network can handle the failure of intermediate routers without loosing the connectivity between the remaining nodes. The mesh topology is thus preferable for large, dense, and dynamically changing networks, such as those consisting of mobile motes.

All modules of a WSN mote (i.e., sensor, memory, processor, and transceiver) are powered by a limited energy storage such as primary or secondary battery cells (Figure 3e). Their capacity is limited by physical constraints, as the motes are typically small to ensure tight integration into the target environment. Two Mignon cells (AA size) occupy about 16 cm$^3$ and may store up to 16 Wh when realized as primary cells with Zinc-Air chemistry (see Figure 6a). A WSN mote with 1 mW average power consumption can be driven for more than two years from this buffer. To extend the maintenance-free system lifetime, ambient energy sources such as electro-magnetic radiation, mechanical motion or temperature gradients have to be tapped (Figure 3f). As shown in Figure 6b, the efficiency of different energy harvesters varies over multiple decades. While only 5 µW/cm$^2$ can be extracted inductively from the radio signals emitted by a 4 km distant 1 MW television broadcast tower [Parks2014], 40 µW/cm$^2$...
As the power supply module determines the physical size of the whole WSN mote, the mote’s overall energy consumption must be minimized to enable a compact design, while ensuring the operational life time required by the application. Many well known general power management techniques could be applied, but often turn out to be unsuitable in the WSN context. Dynamic Voltage and Frequency Scaling (DVFS) trades of the execution speed of the computational unit against its power consumption and requires appropriate hardware support. While most MCUs typically used on WSNs are able to slow down their clock frequencies by programmable pre-dividers, lowering the core supply voltage is rarely supported by MCUs (e.g., the TI [CC430]) that already aim to be low-power in the first place. Due to these limitations, a more common approach in the WSN field instead uses DPM, i.e., completely shutting down unused components such as the sensors, the transceiver, or (parts of) the processor. This is supported by MCUs providing multiple sleep modes, which trade-off wake-up time against power savings in the deeper states.

DPM must focus on minimizing the active time of the radio transceiver, as the transceiver draws tens of milliwatts [Akbari2014] and is thus usually the major power consumer of the sensor node [Jain2015]. To reduce the time spent idly listening, i.e., waiting for a message to be received, the in-network time synchronization must be sufficiently accurate to tightly synchronize the activities of senders and receivers. To reduce the time spent actually sending (or receiving) messages, the raw data volume generated by the sensors must be reduced before transmission. In many cases, the WSN operator often is not interested in the raw sensor data itself, but in certain features that can be derived from that data. This may be the long term average, the crossing of a threshold, the position of a tracked object, or the visual recognition of certain patterns. Even if the feature extraction relies on combining information from all motes, and can thus not be decentralized completely, more general approaches of data aggregation and prefiltering such as noise elimination or (lossless) data compression may be applied to reduce the overall communication demands.
For low data rate applications such as environmental monitoring, the powered-on times required for data aggregation (if any), and entering as well as leaving low-power modes, is negligible compared to the energy required for the actual radio transmissions. A COTS mote based on a low-power, but relatively slow 8 bit or 16 bit MCU is well suited to these use cases. For applications requiring higher sampling rates, such as vibration-based SHM, these weak MCUs generally are not able to perform the required preprocessing within the available sampling interval. On the other hand, more powerful 32 bit processors tend to consume more power than the radio transceiver (e.g., 570 mW for the [Imote2] processor [PXA270]) thus limiting the benefits of in-sensor preprocessing. Furthermore, entering and exiting a sub-milliwatt sleep mode takes too long for these more powerful devices (e.g., more than 135 ms for the [PXA270]) to perform DPM in every sampling interval. Active power management on these nodes thus can only occur when measurements are completely suspended.

2.2 Reconfigurable Computing

Computer architects have to trade-off flexibility and usability against performance when engineering a new compute unit. General purpose software processors implementing the von Neumann architecture [Neumann1945] consist of a Central Processing Unit (CPU) with attached memories and peripherals, as shown in Figure 7a. The CPU sequentially retrieves values stored in registers and applies them to an Arithmetic Logic Unit (ALU), thus executing basic mathematical operation (e.g., addition, multiplication, or comparison), or exchanges data between the registers and a component attached to the data bus. This can be a Random Access Memory (RAM), a Read-Only Memory (ROM), or another peripheral such as a timer, a coprocessor, or a communication module. The CPU controller determines the operations to be executed based on instructions loaded from the memory. In this way, the processor can be configured to solve any problem by storing an appropriate sequence of instructions into the program memory. The software processor thus represents the most flexible and easiest-to-use computer architecture.

This flexibility, however, comes at the expense of a significantly reduced efficiency in terms of energy consumption, execution speed and silicon area. [Hameed2010] reports a speedup of over $500\times$ and a reduction in energy consumption by more than $700\times$ when implementing (parts of) a video codec by an ASIC instead of a general

![Figure 7: Two basic principles of programmable processor architectures](image)
purpose software processor. In this concrete case, only 5% of the energy consumed by the software processor was actually spent for the arithmetic operations performing the computation. One reason for the lack of efficiency is the inherently sequential nature of the software processor, as only a single arithmetic operation can be performed at once. Most signal processing problems, however, can be partitioned into independent problems that can be solved in parallel. Therefore, processors specialized for certain application classes tend to include multiple functional units inside the ALU (e.g., DSPs), provide multi-word buses, registers, and ALUs (e.g., Single Instruction Multiple Data (SIMD) vector processors), allow for multiple parallel operations per instruction (e.g., Very Long Instruction Word (VLIW) processors), or even duplicate the entire CPU resulting in multi-core processors. While these architectures can exploit parallelism at instruction, data, or task level, the degree of parallelism is fixed at fabrication time and often does not match the requirements of a specific problem.

The main drawback of software processors is referred to as von Neumann syndrome [Hartenstein2007] and is caused by the instruction stream hitting the memory wall. Every operation, arithmetic as well as data transfer, is accompanied by an instruction fetch from a CPU-external memory. The loading of the address and data bus by instruction fetches can be avoided by adding a dedicated instruction bus (i.e., the Harvard architecture [Aiken1946]). However, the performance gap between processing and memory access speed amounts to about three decades [Patterson2012, Figure 2.2], and complex cache hierarchies are required to provide the instructions fast enough. On the other side, these caches increase the energy consumption and the required silicon area of the processor.

Instead of treating the effects of the von Neumann syndrome (i.e., providing instructions faster), its cause (i.e., required instruction fetches at runtime) should be eliminated. This can be achieved by switching from an instruction stream-driven software processor to a data stream-driven flowware processor architecture, as shown in Figure 7b [Hartenstein2007]. The latter consists of a (systolic) array of hardwired Processing Elements (PEs), each implementing a (arithmetic) function without (explicit) instruction fetches. Each PE is fed with the output of an Auto-Sequencing Memory (ASM) or the results of its neighboring PEs. The results generated by the systolic array are finally written back to a number of ASMs, which combine a distributed memory with a Generic Address Generator (GAG). The latter contains a data counter and a mapping to the appropriate memory addresses. For a given array of PEs, this mapping determines the overall computed program. Thus, the combined address sequence of all GAGs in the flowware concept corresponds to the instruction sequence of the software concept.

While these data stream processors do not suffer from the von Neumann syndrome, and provide inherently parallel arithmetic, they cannot be regarded as general purpose processors. They are best suited for certain compute-bound regular problems such as convolution, correlation, or matrix multiplications [Kung1982]. To allow for more flexibility, a second level of programmability must be introduced below the flowware level. Instead of using hardwired PEs and inter-PE connections, CGRAs provide mechanisms to reconfigure the behavior of each PE as well as the PE interconnection. This reconfigurability can be achieved by means of programmable switches, multiplexers, Lookup Tables (LUTs), or microcode memories (if the PEs themselves are implemented as tiny software processors). In any case, a number of bits has to be written to a dedicated
configuration memory inside the Programmable Logic Device (PLD). This *bitstream* is sometimes referred to as *configware* [Hartenstein2007].

CGRAs are only one possible realization of an RCU. One of the first conceptual descriptions of reconfigurable computing was proposed by [Estrin1960] as an

*...* inventory of high speed substructures and rules for interconnecting them such that the entire system may be temporarily distorted into a problem oriented special purpose computer.

Depending on the granularity and the realization of these substructures, their density as well as the interconnection and routing complexity, different types of PLDs can be distinguished [Kesel2009, Section 3.7]. As shown in Figure 8, the Simple Programmable Logic Devices (SPLDs) realize (multiple) boolean functions in a two stage approach derived from their disjunctive formulation (i.e., sum of products). The Programmable Read-Only Memory (PROM) realization shown in Figure 8a provides reconfigurable switches only in the second stage (i.e., the OR matrix). Each output function is thus realized by its minterm canonical form, which potentially requires many closed switches in the OR matrix resulting in a high leakage current. To allow for logic minimization within each and between multiple output functions, a Programmable Logic Array (PLA) provides configurable switches in both stages, as shown in Figure 8b. The larger number of switches and inputs per logic gate increases the IO-delay and the required silicon of the PLA. As a countermeasure, the Programmable Array Logic (PAL) eliminates the programmability from the OR matrix as shown in Figure 8c, thus still allowing for some logic minimization without the full overhead of a PLA.

Recent PAL devices like the [TIBPAL22V10] support up to 22 inputs and 10 outputs, which are sufficient to realize small decoders or (parts of) state machines. Increasing the number of inputs to support more complex functions is not practical due to the rapidly growing complexity and delay of the switch matrix. Instead, multiple SPLDs are combined with some storage and feedback elements as well as advanced Input/Output (IO) drivers over a central interconnect to form a Complex Programmable Logic Device (CPLD). The largest device of the Xilinx [CoolRunner-II] family (i.e., XC2C512) provides 32 PLAs, each supporting 40 inputs and 16 outputs. CPLDs are thus well

![Figure 8: Simple Programmable Logic Devices realizing boolean functions](image-url)
suited to implement small cost-effective controllers for memories and other peripheral components [CPLDAPP].

While boolean functions with tens of inputs are the basic reconfigurable units of CPLDs, Field Programmable Gate Arrays (FPGAs) are fine-grain reconfigurable. Their smallest functional units are Logic Cells (LCs) including at least a small LUT, a clocked Flip-Flop (FF) and a programmable multiplexer, as shown in Figure 9a. A LC can thus either realize a simple boolean function or a one bit register. Some vendors add additional logic like full-adders and dedicated carry chains into their LCs to enable more efficient implementations of complex arithmetic operations. Furthermore, the LUT sometimes can be reused as shift-register or distributed RAM, if it is not used to realize a boolean function. The LCs are arranged in a regular array and interconnected by some routing resources, thus constituting the FPGA fabric, as shown in Figure 10. Those interconnects may span the whole FPGA (e.g., for distributing global clock and reset signals), or only a few LCs. Programmable Switch Matrices (SMs) at the junctions of these routing resources consist of six transistors (per junction), as shown in Figure 9b. Each lane is thus connectable to each subset of the three other lanes of the junction.
Besides the LCs, a number of Input/Output Blocks (IOBs) is attached to the routing infrastructure to interface off-chip peripheral components. As shown in Figure 9c, each IO-Pad (P) can be used as input or output, or be driven to high-impedance (Z) to support the implementation of bus protocols. In practice, most IOBs can be configured for different IO standards (i.e., voltage levels, slew rates, differential signaling). Finally, hardwired Functional Blocks (FBs) are connected to the FPGA fabric to reduce the number of logic resources required for some common functionality. This includes Block RAMs (BRAMs), DSP operators, clock conditioning modules like Phase Locked Loops (PLLs), fast serial transceivers, or even complete software processors.

Just as no software engineer manually generates a binary instruction stream to implement a certain algorithm, the bitstream required for an FPGA design is also generated by a toolchain from a high-level representation of the algorithm. Figure 11 shows the tools required to transform the design through four different levels of abstraction. At the system level (Figure 11a), the design entry and verification processes do not differ significantly from a software engineering tool flow. The resulting C, SystemC or even MATLAB models are then converted to the Register Transfer Level (RTL) by High-Level Synthesis (HLS) tools such as the proprietary Xilinx Vivado-HLS or the open source ROCCC compiler. These tools can trade-off execution speed against the number of required logic resources for the resulting hardware by constraints, e.g., for loop unrolling, pipelining, operator folding, and the required operator-specific numeric precision at a granularity of 1 bit.

At the RTL shown in Figure 11b, the design is described by Hardware Description Languages (HDLs) such as Verilog or the Very High Speed Integrated Circuit Hardware Description Language (VHDL) in terms of concurrent and sequential operations on signal paths between synchronized registers. The major HDL concepts comprise the structural description of a module hierarchy as well as behavioral descriptions required to model a synchronized data flow. Besides generating HDL modules manually or by the HLS approach, Intellectual Property (IP) catalogs and core generators provide complete modules for many applications domains. To verify the design, a testbench is generated at the RTL. According to the Universal Verification Methodology [UVM], the testbench applies stimuli to the module under test and observes the resulting behavior.

![Figure 11: FPGA tool flow transforming resources between different abstraction levels](image)
After the behavioral verification succeeded, the RTL model is transferred to the gate level (Figure 11c) by synthesizing a list of interconnected gates and registers called the netlist. Therefore, the synthesis tool first compiles the hardware description into a set of boolean functions. To implement these functions, the subsequent technology mapping selects the appropriate gate primitives directly realizable by the LCs or FBs of the targeted FPGA architecture. Constraints on the achievable clock frequency or the automatic inference of memories and DSP macros can steer this process. The functional verification of the gate level netlist can be driven by the same testbench used at the RTL.

Finally, the gates, registers, and macros of the netlist are assigned to dedicated LCs and FBs inside the target device and interconnected with dedicated routing resources. This Place and Route (PAR) process is controlled by user-defined pin assignment and floorplanning constraints to determine which signal to attach to which physical pin, or to limit the maximum delay between certain gates. After the PAR step, all gate and wire delays of the design can be estimated appropriately and are summarized in the Standard Delay Format (SDF), thus allowing for a timing verification after the Back Annotation (BA) of the netlist. Afterwards, the bitstream can be generated. It may also be compressed and encrypted. This bitstream is programmed into the target device for the final in-field verification at the transistor level (Figure 11d).

With decreasing abstraction level, the transformation tools report increasingly precise estimations of the required logic resources and achievable execution speed of the design. As the low-level tools require detailed knowledge about the targeted FPGA architecture, the synthesis, implementation, and programming steps are typically integrated into a vendor specific design environment such as the Xilinx [Vivado], Altera [Quartus], Microsemi [Libero-SoC], or Lattice [iCEcube]. In addition, those Integrated Development Environments (IDEs) typically provide the appropriate simulators, graphical constraint editors, and toolchains for the software-processors integrated into the devices.

Table 1 lists some implementation details of different FPGAs. Their performance cannot be judged solely by the amount of LCs, as the number and size of LUTs and FFs per LC varies across different architectures. Furthermore, no common terminology has been established, so each vendor uses another designation for the basic LC and

<table>
<thead>
<tr>
<th>Vendor</th>
<th>Xilinx</th>
<th>Altera</th>
<th>Lattice</th>
<th>Microsemi</th>
</tr>
</thead>
<tbody>
<tr>
<td>Market Share</td>
<td>52 %</td>
<td>34 %</td>
<td>6 %</td>
<td>3 %</td>
</tr>
<tr>
<td>Architecture</td>
<td>UltraScale</td>
<td>7-Series</td>
<td>HyperFlex</td>
<td>iCE40</td>
</tr>
<tr>
<td>CMem Type</td>
<td>SRAM</td>
<td>SRAM</td>
<td>SRAM+NVM</td>
<td>LC</td>
</tr>
<tr>
<td>Name for LC</td>
<td>Slice</td>
<td>Slice</td>
<td>ALM</td>
<td>VersaTile</td>
</tr>
<tr>
<td>LUTs per LC</td>
<td>8 (6×1) or 5×2</td>
<td>4 (6×1) or 6×1</td>
<td>4 (7×1) or 4×2</td>
<td>1 (3×1) or 2 (5×1)</td>
</tr>
<tr>
<td>FFs per LC</td>
<td>16 8</td>
<td>4</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Largest Device</td>
<td>VU440</td>
<td>XC7V2000T</td>
<td>GX5500</td>
<td>AGL1000</td>
</tr>
<tr>
<td>LCs</td>
<td>316620</td>
<td>305400</td>
<td>1867680</td>
<td>24576</td>
</tr>
<tr>
<td>IOs</td>
<td>1456</td>
<td>1200</td>
<td>1640</td>
<td>300</td>
</tr>
<tr>
<td>BRAM</td>
<td>90720kbit</td>
<td>46512kbit</td>
<td>170356kbit</td>
<td>144kbit</td>
</tr>
<tr>
<td>DSP Blocks</td>
<td>2880</td>
<td>2160</td>
<td>1980</td>
<td>144kbit</td>
</tr>
<tr>
<td>PLLs</td>
<td>60</td>
<td>24</td>
<td>58</td>
<td>2</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1 8</td>
</tr>
</tbody>
</table>

Table 1: Market share [TMR2015-FPGA] of leading FPGAs vendors and logic resources provided by some of their devices.
super structures, such as the Xilinx Configurable Logic Block (CLB) clustering one to four slices with dedicated interconnects, or a Xilinx tile which contains a CLB and its surrounding routing logic. Although most datasheets specify some architecture independent measure to rate the available logic resources (e.g., equivalent LCs, LEs, or system gates), the usability of these metrics is quite limited as each vendor uses its own reference value.

Another important difference between the various FPGA architectures shown in Table 1 is the type of the Configuration Memory (CMem) used to store the configuration bitstream. SRAM is typically used for this purpose (76% market share [TMR2015-FPGA]), as it can be written quickly and at the granularity of a single bit, thus allowing for (partial) dynamic reconfiguration at runtime. However, SRAM is volatile and the bitstream thus has to be provided by an FPGA-external entity to be programmed into the FPGA each time the FPGA is powered up. Due to dedicated memory controllers inside the FPGA, an additional off-chip NVM often is sufficient to provide the bitstream. Some architectures like the Lattice [iCE40] actually include this NVM into the FPGA die next to the CMem. The time required for the configuration process varies due to bitstream compression and optional encryption, but several 100 ms are realistic for the larger devices (e.g., of the Xilinx [7-Series]). Instead of using an additional NVM from which the bitstream is copied, the CMem itself can be realized as non-volatile flash (21% market share) or antifuse (3% market share) memory. The non-volatile NVM significantly increases the time and power required for the configuration process (up to several minutes). Once programmed before the deployment, the devices can start operating immediately after the power-up sequence. This feature is essential for mobile applications relying on DPM. In contrast to the flash memory, antifuses can only be programmed once by physically melting down isolation between the endpoints of connections. They are thus not appropriate for prototyping applications.

As shown in Table 1, the logic density of the available devices ranges from a few thousand to a few million LUTs and FFs. In combination with the different CMem technologies, a wide variety of applications can thus be targeted. In general, FPGAs are preferable, if the performance or efficiency (speed or energy consumption) of a software processor is not sufficient, and either the product volume is too small to justify the non-recurring engineering costs (NRE) of an ASIC design, or the implemented functionality is expected to significantly change throughout the product lifetime (e.g., for use cases such as prototyping boards, general purpose High-Performance Computing (HPC) engines, or communication/encryption/encoding modules updatable to revised standards). The most relevant market segments for FPGAs are telecommunication (33%), industrial (21%) as well as automotive (17%) applications, and consumer electronics (12%) [TMR2015-FPGA]. After Intel bought Altera in June 2015, FPGA technology can be expected to offer an alternative to CPUs and Graphics Processing Units (GPUs) in data center applications, with the aim to significantly reduce the energy consumed by server farms for the fast growing big data analytics and cloud computing markets [Putnam2015].
CHAPTER 3
Related Work

Before detailing the contributions of the thesis, this chapter summarizes relevant related research. The review covers the evolution of WSN mote hardware architectures, as well as specific WSN applications and services. The emphasis is placed on compute-intensive tasks and their hardware-accelerated implementation.

3.1 Wireless Sensor Network Motes

Driven by technological advances and application-specific optimization, a large variety of WSN motes have been developed as research prototypes in recent years. Some of these architectures have also become available as COTS devices. Table 2 details the processing and communication elements of some sample WSN motes. Focusing on hardware-accelerated node architectures, this review summarizes software-programmable motes only briefly. A more comprehensive overview can be found in [Lynch2006; Piedra2012].

This discussion is related to the architectural design considerations detailed in Chapter 4. It already has been partially published in [Engel2015c].

3.1.1 MCU-Based Motes

The probably most widely adopted WSN motes were developed at Berkeley. While the Mica mote [Hill2002] is based on an 8 bit AVR MCU, the Telos mote [Polastre2005] is driven by a 16 bit MSP430 processor. Both were redesigned to meet the latest LR-WPAN standard and are now available as COTS devices from MEMSIC as [MicaZ] and [TelosB].

The Libelium [Wasp mote] provides the same processing capabilities as the Mica family, but its modularized radio subsystem can be adapted to Near Field Communication (NFC), LR-WPAN, WPAN, WLAN, or even cellular networking. The same flexibility is provided for the sensor subsystem, as more than ten sensor modules for different applications are available.

For more compute-intensive applications like digital image processing, the MEMSIC [Imote2] provides a 32 bit ARM processor. As the platform draws about 1 mW when sleeping and 100 mW when active, its application requires a sufficiently strong power supply.

From the huge variety of WSN prototypes proposed by different research groups, the following three examples stand out in the SHM domain. The SHiMMer mote [Dondi2010] utilizes a Blackfin DSP and is thus well suited for high performance digital filtering. It is actually used for ultrasonic SHM with 25 MHz sampling. The power drawn by this mote can be expected to be even larger than for the [Imote2], but details are not provided by [Dondi2010].

Another WSN mote specialized for SHM applications is proposed by [Araujo2012]. It employs two MCUs to separate the controlling of the sensor and the radio interface. In addition to this heterogeneous processing, the mote also uses heterogeneous
<table>
<thead>
<tr>
<th>Reference (Name)</th>
<th>Origin</th>
<th>MCU / DSP</th>
<th>RCU</th>
<th>Transceiver</th>
</tr>
</thead>
<tbody>
<tr>
<td>[Waspmote]</td>
<td>Libelium</td>
<td>based on software processors only</td>
<td>exchangeable modules</td>
<td></td>
</tr>
<tr>
<td>[MicaZ]</td>
<td>Berkeley</td>
<td>8 bit AVR</td>
<td>2.4 GHz IEEE [802.15.4]</td>
<td></td>
</tr>
<tr>
<td>[TelosB; TMoteSky]</td>
<td>Berkeley</td>
<td>16 bit MSP430</td>
<td>2.4 GHz IEEE [802.15.4]</td>
<td></td>
</tr>
<tr>
<td>[Imote2]</td>
<td>MEMSIC</td>
<td>32 bit ARM5</td>
<td>2.4 GHz IEEE [802.15.4]</td>
<td></td>
</tr>
<tr>
<td>[Dondi2010] (SHiMmer)</td>
<td>San Diego</td>
<td>32 bit Blackfin</td>
<td>915 MHz IEEE [802.15.4]</td>
<td></td>
</tr>
<tr>
<td>[Arajujo2012]</td>
<td>Madrid</td>
<td>32 bit PIC + ARM9</td>
<td>2.4 GHz IEEE [802.15.4] + [802.11]</td>
<td></td>
</tr>
<tr>
<td>[Zhao2015] (NFC-WISP)</td>
<td>Washington</td>
<td>16 bit MSP430</td>
<td>13.56 MHz RFID</td>
<td></td>
</tr>
<tr>
<td>[Asada1998] (WINS)</td>
<td>Los Angeles</td>
<td>based on ASICs</td>
<td>Spectrum Analyzer</td>
<td>915 MHz custom design</td>
</tr>
<tr>
<td>[Walrvavens2014]</td>
<td>Leuven</td>
<td>parallel prefix computations</td>
<td>none</td>
<td></td>
</tr>
<tr>
<td>[Kohvakka2006]</td>
<td>Tampere</td>
<td>32 bit Nios (Softcore)</td>
<td>Cyclone</td>
<td>2.4 GHz</td>
</tr>
<tr>
<td>[Raval2010]</td>
<td>Dublin</td>
<td>8 bit AVR</td>
<td>Spartan</td>
<td>2.4 GHz IEEE [802.15.4]</td>
</tr>
<tr>
<td>[Voyles2010] (RecoNode)</td>
<td>Denver</td>
<td>8 bit AVR</td>
<td>Virtex 4</td>
<td>2.4 GHz IEEE [802.15.4] + [802.15.1]</td>
</tr>
<tr>
<td>[Mplemenos2012]</td>
<td>Crete</td>
<td>8 bit AVR</td>
<td>Virtex 5</td>
<td>2.4 GHz IEEE [802.15.4]</td>
</tr>
<tr>
<td>[Lombardo2012] (HiReCookie)</td>
<td>Madrid</td>
<td>8 bit AVR</td>
<td>Spartan 6</td>
<td>2.4 GHz IEEE [802.15.4]</td>
</tr>
<tr>
<td>[Goh2012]</td>
<td>Singapore</td>
<td>16 bit MSP430</td>
<td>Spartan 3</td>
<td>2.4 GHz IEEE [802.15.4]</td>
</tr>
<tr>
<td>[Yi2013]</td>
<td>Chicago</td>
<td>32 bit AVR</td>
<td>Spartan 6</td>
<td>2.4 GHz IEEE [802.15.4]</td>
</tr>
<tr>
<td>[Shahzad2014]</td>
<td>Sundsvall</td>
<td>based on SRAM-FPGAs</td>
<td>Cyclone</td>
<td>2.4 GHz</td>
</tr>
<tr>
<td>[Vera-Salas2010]</td>
<td>Querétaro</td>
<td>16 bit MSP430</td>
<td>IGLOOnano</td>
<td>2.4 GHz IEEE [802.15.4]</td>
</tr>
<tr>
<td>[Berder2010] (PowWow)</td>
<td>Rennes</td>
<td>16 bit MSP430</td>
<td>IGLOO</td>
<td>2.4 GHz IEEE [802.15.4]</td>
</tr>
<tr>
<td>[Grassi2012]</td>
<td>Milan</td>
<td>8 bit 8051 (Softcore)</td>
<td>IGLOO</td>
<td>2.4 GHz Xbee Digimesh</td>
</tr>
<tr>
<td>[Nylaenden2014]</td>
<td>Oulo</td>
<td>IGLOO</td>
<td>2.4 GHz IEEE [802.15.4]</td>
<td></td>
</tr>
</tbody>
</table>
communication. A secondary LR-WPAN transceiver is dedicated to wireless time syn-
chronization, while raw sensor data is transferred by the faster (but mote power hungry)
WLAN transceiver. Again, no power consumption details are reported by [Araujo2012].

The NFC-Wireless Identification and Sensing Platform (WISP), proposed by
[Zhao2015] for cold chain monitoring, is located on the other side of the power and
performance spectrum. It is powered by and communicating via an Radio-frequency
Identification (RFID) circuit and requires 369 nJ per temperature sample.

### 3.1.2 ASIC-Based Motes

At the beginning of the WSN research, no discrete sensing, processing and communi-
cation ICs were available. Early devices like the WINS architecture [Asada1998] were
thus completely designed as ASICs.

Nowadays, ASICs are integrated as specialized hardware accelerators to speed-up a
small, yet critical part of the DSP chain in a WSN processor. [Walravens2014] proposed
an ASIC for parallel prefix computations, that can be used, e.g., for searching elements,
detecting peaks, or evaluating polynomials.

Due to their lack of flexibility and high NRE costs, ASICs are not a viable option
for a research prototype, and are thus not reviewed here in greater detail.

### 3.1.3 RCU-Based Motes

As stated in Section 2.2, RCUs can perform complex computations more efficiently than
MCUs and DSPs. In real-time applications, the use of RCUs often enables computa-
tions that cannot be performed by MCUs or DSPs at all under the given constraints.
Table 2 distinguishes between FPGAs based on volatile and non-volatile configuration
storage, as these device classes seriously differ in terms of computing power and energy
consumption.

SRAM-based FPGAs are attractive for use in sensor nodes performing compute-
-intensive applications, such as video and image compression and analysis. While the
RecoNode [Voyles2010] utilizes the visual information in search and rescue applica-
tions, [Shahzad2014] describes a visual particle detection mechanism for a condition
monitoring system. Another hardware accelerator used for chemical process moni-
toring is proposed by [Goh2012]. Other researchers employ the FPGA to accelerate
application-independent WSN tasks like the sensor controllers [Yi2013], the routing
protocol [Mplemenos2012], or the entire radio interface including fast data forwarding
in a multi-transceiver architecture [Kohvakka2006].

However, the energy efficiency of the proposed architectures is rarely addressed by
these authors. [Raval2010] reports an 48 % energy reduction when accelerating an 8 bit
AVR softcore processor by instruction fusing based on application-specific instruction
tracing. A direct comparison against a discrete (i.e., hardwired) processor is still miss-
ing. The main problem of the SRAM-based FPGA hardware accelerators, i.e., the time
required to load the configuration again after each low-power cycle, is quantified by
[Shahzad2014]. The authors state that shutting down and reconfiguring a Spartan 6 de-
design instead of just entering the suspend mode is worthwhile only after a sleep period of
at least 235 ms. DPM within each sampling cycle is thus not reasonable for this device
in compute-intensive applications with sampling frequencies of several hundred hertz.
To actually reduce the energy overhead of the reconfiguration process of SRAM-based
FPGAs, [Lombardo2012] proposed to manually relocate logic resources such that the bitstream is more compressible. For a specific application, a $2.4 \times$ reduction of the reconfiguration overhead was reported.

Nevertheless, when energy actually becomes a first-class design goal, non-volatile FPGA accelerators are far more suitable. Sensor nodes using a combination of a Flash-based Microsemi IGLOO FPGA and a wireless transceiver have been proposed [Vera-Salas2010; Stelte2010; Nylaenden2014]. However, despite the power advantages of Flash configuration storage, these implementations have turned out to be sub-optimal: All processing is performed on the RCU (even long-term low-intensity tasks), and when powered down, the radio transceiver is required to wake up the FPGA again. Thus, at least one of the two power-hungry devices, either the FPGA or the transceiver, has to be enabled all the time. Even refinements which use very simple timekeeping on the RCU, such as employing inverter ring-based oscillators [Grassi2012] to power up the receiver periodically at pre-agreed times for data reception, are still sub-optimal: Due to the large timing inaccuracy (drift) of these oscillators, the power-down phases have to be conservatively shortened, leading to the system drawing higher power for longer intervals.

A better choice is a heterogeneous architecture combining an RCU and a low-power MCU. The Cookie WSN [Rosello2011] and the PowWow [Berder2010] have joined a small Microsemi IGLOO FPGA with a TI MSP430 MCU and an additional radio transceiver. However, both systems utilize the FPGA only for low-level handling of radio messages, instead of preprocessing the sensor data stream. Furthermore, the use of discrete MCU and radio components carries the burden of slower communication between processor and transceiver, as well as a more complex power management. In Section 4.1, the usage of an RF-SoC next to a Flash-based FPGA is examined in greater detail.

### 3.2 Common Wireless Sensor Network Services

In this section, existing research related to the modules and services presented in Chapter 5 and 6 are summarized. Parts of this review have already been published in [Engel2014a; Engel2014b; Engel2015a].

#### 3.2.1 Lossless Data Compression

While generic data compression is applicable to almost all kinds of sensor data, more specialized methods such as compressive sensing [Donoho2006] assume highly specific properties in the input signals and are not addressed in this work.

Data compression in WSNs was investigated frequently in the last decade [Kimura2005]. For example, an LZW-compressor was implemented on an [MSP430] MCU to analyze the effect of reduced data rates on the end-to-end packet delay in a multi-hop network [Deng2012]. The energy savings achievable by a nonlinear ADPCM running on the ARM processor of a Beagle Board were investigated in [Kasirajan2012]. However, these authors erroneously considered the achieved compression ratios directly as energy savings, completely ignoring the energy required for the encoding itself. This gross simplification was not used in [Reinhardt2009], where run-length and adaptive Huffman encoding were implemented on the AVR MCU of a [Mica2] mote. The energy for encoding was determined solely by simulations and datasheet-specifications, using just synthetic data streams with guaranteed statistical properties as inputs.
Hardware-accelerated data compression in context of WSNs focused mainly on lossy image compression in visual surveillance networks, such as the JPEG compression on an Altera EP2C35 FPGA [Zhiyong2009], or the identification of relevant image sections using a Xilinx Virtex II FPGA [Ngau2012]. In addition to not compressing losslessly, these investigations aimed at reducing the required data rate to the throughput limits of the wireless transmission channel, instead of minimizing the overall system energy consumption. The acceleration of a second order ADPCM compressor on a Xilinx XC4000 device was proposed in [Boonyakitmaitree2004], but did not report any energy requirements.

### 3.2.2 Network Flooding

Congestion-induced packet loss leads to throughput degradation in WSNs and often requires energetically expensive retransmissions. Though collisions can be avoided by Carrier Sense Multiple Access (CSMA) mechanisms, the required carrier-sensing also increases the energy consumption of the sensor nodes. Therefore, routing protocols have a significant impact on distributed applications relying on multi-hop wireless communication. The number of nodes involved to deliver information (a data item) from a source to a sink influences the overall energy consumption of the network, as well as the data throughput. Often, shortest routes cannot be easily discovered, or are undesirable (e.g., when balancing communication load equally over all network nodes, or to provide robust communication over redundant paths). In general, application-tailored routing protocols can exploit special characteristics to improve throughput and energy consumption.

In Section 6.1, a multi-source flooding mechanism is proposed to distribute information from at least two sources to all network nodes at the same time. Flooding information into a WSN has been widely adopted for general tasks such as time synchronization [Maroti2004; Gheorghe2010; Xu2009] or route discovery [Perkins1999]. However, basic blind flooding is quite inefficient, especially in dense networks, as each receiver rebroadcasts the message to be distributed [Ni1999]. Intensive research has been carried out to reduce the number of necessary rebroadcasts. Prior work commonly utilizes knowledge about the nodes locations [Arango2004; Jeong2010] or their local neighborhood [Sheng2005; Lim2001; Lou2002; Agathos2011] to select the forwarding nodes covering the most yet uncovered network nodes. However, all of these flooding protocols assume a 1-to-all data propagation. Unfortunately, unlike physical water or acoustic waves, multiple data waves originating from different source nodes interfere with each other due to channel congestion and signal interferences, such as the hidden terminal problem [Wang2012].

Furthermore, the information from multiple sources should be aggregated as soon as possible to avoid repeated data transfers between the same nodes. Many data aggregation and clustering protocols have been reported [Alnuaimi2013]. By selecting specific nodes as cluster-heads, data is collected locally before it is forwarded to a single sink node. The basic idea of clustering is picked up in the design of the greedy heuristic in Section 6.1.3.

As described in Section 6.1, network routing can be considered as finding a proper transceiver schedule, such that information travels along the links between simultaneously scheduled transmitters and receivers. Using ILPs to solve scheduling problems is a widely adopted approach. In the WSN context, ILPs were used, e.g., for fre-
frequency channel assignment [Ahmed2010], task scheduling [DePauw2010], and routing
[Hoa2012]. The later one is closely related to the mechanism proposed in Section 6.1, as it generates the ILP formulation based on a directed graph representing the network topology. The ILP solution proposed in [Hoa2012] describes a multi-cycle schedule for transmissions and receptions with a minimal amount of overall required energy. However, as the schedule is used to transport information from multiple source nodes just to a single sink, it does not perform flooding. Furthermore, it does not provide information aggregation, and does not consider the transmission ranges and interference between the radio transceivers. All of these aspects are considered in Section 6.1.

### 3.2.3 Wireless Time Synchronization

Wireless time synchronization aims to compensate the clock deviation between all sensor nodes, which are caused by the different start-up times and the slightly different frequencies of the sensor node’s oscillators. Energy-efficient synchronization protocols reduce the communication overhead for synchronization-related timestamp exchanges by estimating the clock drift between different nodes at runtime.

A large variety of wireless time synchronization protocols has been proposed in the last decade [Djenouri2014], with only a few of the employed clock drift estimators not being based on a least squares linear regression. The R⁴Syn protocol [Djenouri2012] uses a Maximum-Likelihood-Estimator, which imposes nearly the same computational complexity as the least squares method. In contrast, the Kalman filter used in [Aoun2008] can be executed 10% faster than a linear regression with a table size of two. However, for synchronization periods of up to 10 s, the average synchronization error of the Kalman filter proved to be larger than for the linear regression-based clock drift estimator.

The Flooding Time Synchronization Protocol (FTSP) [Maroti2004] has become one of the most popular reference implementations for high precision synchronization protocols. It performs the clock drift compensation with a linear regression over the last eight synchronization points, but without justifying the selected regression table size. Furthermore, no details about the regression implementation or its resource requirements are provided. At a 30 s synchronization period, FTSP achieves an average clock jitter of 1.5 $\mu$s.

Several improvements of the FTSP have been suggested. In [Castillo-Secilla2013], a temperature-dependent correction factor is introduced to respond faster to temperature-induced clock drift variations. The authors report an average accuracy of 1.1 $\mu$s after a 20 K temperature variation at a single node. The Recursive Time Synchronization Protocol (RTSP) [Akhlaq2013] is also based on FTSP, but it uses automatic Medium Access Control (MAC) timestamping, i.e., the timestamps corresponding to the transmission or reception of the IEEE [802.15.4] start of frame delimiter (SFD) are captured by the radio transceiver without software intervention. Furthermore, RTSP dynamically adjusts the synchronization period depending on the current clock drift and accuracy at each node. The average accuracy was improved to 0.3 $\mu$s, even when restricting the regression table to two entries to simplify the computational effort for the clock drift estimation.
3.3 Compute-Intensive Wireless Sensor Network Applications

This section provides an overview of WSN applications targeted by this thesis. It is thus related to the use cases described in Chapter 7. Parts of this review have already been published in [Engel2015c]. A more comprehensive and general overview of WSN applications can be found in [Yick2008; Rawat2014].

As outlined in Section 1.2, the applications targeted by this thesis have to provide sufficient potential for node-local computation while running at fixed sampling rates of a few hundred hertz. The demand for in-sensor computation can arise from application-independent services such as encryption, which are not addressed by this review. Instead, the focus here lies on application-specific feature extraction from different application domains.

3.3.1 Condition Monitoring

Condition monitoring systems observe relevant runtime parameters of industrial processes and machinery for early fault detection and predictive maintenance. Most applications observing acoustic signals or structural vibrations require sampling rates of several thousand hertz, such as the 50 kHz bearing fault monitoring described in [Nylanden2014]. The authors propose to either calculate a Root Mean Square (RMS) value in the time domain, or to detect and classify peaks in the frequency spectrum. For the latter, a hardware-accelerated 256 point radix-2 FFT is applied for the time-to-frequency domain transformation. A similar approach is reported by [Shahzad2014] for the 100 kHz monitoring of a rotating machinery.

Both systems far exceed the range of sampling rates targeted in this work. In contrast, the monitoring of induction motors based on current flow measurements are less demanding, as the relevant physical phenomena to be observed are located close to the power line frequency (e.g., 50 Hz). [Philipp2012] used a Zoom-FFT to detect peaks slightly off from the power line frequency to detect broken bars inside an induction motor. To capture other faults such as dynamic eccentricity [Philipp2012] or inter-turn short circuits [Martinez2013], sampling rates of up to 10 kHz are still required.

3.3.2 Structural Health Monitoring

SHM systems observe global dynamic properties (e.g., eigenfrequencies) of large infrastructure objects, such as bridges, poles, or buildings. Compared to the condition monitoring applications, the observed structures are larger and the relevant (sampling) frequencies thus considerably smaller (below 1 kHz). SHM applications thus better fit the DPM requirements of the heterogeneous WSN mote proposed in this thesis.

WSNs have become popular for SHM in the last decades [Lynch2006]. Several research groups have equipped a number of bridge structures with wireless sensor networks, ranging from small models with artificial excitation [Bocca2011], over medium-size pedestrian bridges [Battista2013], up to large automotive traffic bridges [Chae2012; Hu2013; Kim2007; Cho2010; Pakzad2008].

[Kim2007; Pakzad2008] installed 64 [MicaZ] motes (see Table 2) on the 1280 m main span of the Golden Gate bridge in the San Francisco bay. Each mote was equipped with two high precision and two wide range micro-electro-mechanical (MEMS) accelerometers to capture different excitation scenarios (e.g., traffic, wind, and earthquakes).
nodes were synchronized with the FTSP [Maroti2004] resulting in a maximum jitter of 10 µs in the 46-hop network. The sensor data captured at 1 kHz are down-sampled to 50 Hz by averaging, and logged to an external flash memory afterwards. The authors report an average power consumption of 358 mW for this sensor sampling. After about 22 min measurement, the raw sensor data is collected at a central gateway, which takes more than 12 h for all 64 network nodes. Finally, an offline system identification is carried out and the resulting eigenfrequencies and mode shapes are compared to theoretical models of the bridge. The first vertical bending and torsional modes were detected with a relative error of 10% and 19% respectively.

[Bocca2011] instrumented a 4.2 m long wooden bridge model with 8 [MSP430]-based WSN motes, each equipped with a 3-axis MEMS accelerometer with a resolution of 16 bit and a dynamic range of ±2 g. A clock skew-compensating synchronization protocol was used to keep the temporal jitter between the nodes below 5 µs. While exciting the bridge model with random noise generated by an electro-dynamic shaker, the acceleration sensors captured the movement of the structure at 200 Hz sampling rate. During each 60 s measurement period, the acquired samples are buffered in an external flash memory for later readout without any preprocessing or data aggregation. The authors report problems with lost samples (due to the narrow write throughput of the external memory) and an overall energy consumption of 3.2 J for data sampling, buffering and final transmission. This equates to an average power consumption of 53 mW during the 60 s measurement periods. Finally, the gathered data is used for an offline system identification. Compared to a wire-bound reference system, the first 14 structural modes were identified with a maximum error of 1.035%.

[Chae2012] installed 45 motes based on 8 bit AVR MCUs on a 550 m suspension bridge. Each mote was equipped with two different accelerometers, i.e., a high precision force-balance sensor for capturing the truss vibration and a low-power MEMS sensor for capturing cable tension. Both sensors were sampled at 60 Hz and the resulting data was directly transmitted to a base station without any data aggregation. The authors reported packet loss problems caused by high data throughput required for transmitting the unprocessed data. Finally, an offline system identification was carried out, but the comparison with a wire-bound reference system consisting of 360 sensors failed due to broken links caused by cable fatigue.

[Hu2013] instrumented a 296 m highway bridge with 6 [MSP430]-based WSN motes. Each mote was equipped with a high resolution (2 mg) MEMS accelerometer sampled at 100 Hz. The network uses an energy-balanced time synchronization, which claims to achieve a maximum jitter of 10 µs. Again, the sampled data was collected in an external memory for each 250 s measurement period, and then transmitted to a central base station without preprocessing. The authors report an average power draw of 290 mW during sampling and data transmission. After an offline-analysis of the gathered data based on multivariate autoregressive models, the detected eigenfrequencies were compared to theoretical models of the bridge. Between 4% and 18% difference between the detected and the expected eigenfrequencies was reported.

Except some simple down-sampling, none of the aforementioned monitoring systems is based on compute-intensive in-sensor data aggregation. In contrast, the Illinois SHM project [Cho2010; Battista2013] is based on the powerful [Imote2] system (see Table 2) for complex in-sensor computations. [Cho2010] monitored a 484 m cable-stayed span bridge with 70 nodes, each equipped with MEMS acceleration sensors.
A complete decentralized Operational Modal Analysis (OMA) is performed based on Frequency Domain Decomposition and Stochastic Subspace Identification. However, also measuring only 5000 samples per day with a sampling rate of 10 Hz, the power-hungry ARM processor depleted a 21 A h battery in less than two months.

Based on the same WSN platform, [Battista2013] monitored a 120 m pedestrian bridge with 8 nodes, each sampling a MEMS accelerometer at 100 Hz. The sampled data is analyzed immediately based on a Filtered Hilbert-Huang Transformation, which itself is a combination of modal separation, bandpass filtering, and the Random Decrement Technique (RDT) described in Section 7.3.1. The authors report an overall data reduction of 96 %. At an operational duty cycle of 33 % (i.e., 10 min measurement every 30 min, a 15.6 A h-Li-Ion battery depleted after 14 days. This equates to an average power draw of 172 mW from the 3.7 V supply.

These research projects show that SHM applications typically require sampling rates of several hundred hertz, as targeted by this thesis. Furthermore, algorithms for effective in-sensor data aggregation exist, but they have been rarely used so far - probably due to the required computational power of the WSN mote.
CHAPTER 4

HaLoMote: Hardware-Accelerated Low Power Mote

This chapter details the architecture and implementation of the developed heterogeneous sensor node, the communication infrastructure connecting the computational units, as well as the power management capabilities of the platform. Parts of this chapter have already been published in [Engel2012a].

4.1 Architecture Overview

In Section 2.2, the trade-off between the flexibility of general purpose software processors and the energy- and runtime-efficiency of application-specific processing architectures was described. Providing application-specific hardware accelerators for the compute-intensive distributed data aggregation algorithms described in Section 3.3 should thus improve the computation vs. communication trade-off, and reduce the energy consumption of the overall WSN mote. Apart from a few high volume WSN applications such as smart homes, ASIC-based hardware accelerators are not viable for COTS motes that have to be adaptable to a specific application at the time of deployment or even at runtime. Therefore, this work examines the use of RCUs as hardware accelerators for WSNs.

To integrate an RCU into a WSN mote, the design options for the computation and communication architecture shown in Figure 12 have to be traded-off against each other. As an RCU typically does not provide wireless communication capabilities, at least a radio transceiver must be attached (Figure 12a). These transceivers implement the physical layer (e.g., carrier frequency, symbol modulation) and part of the MAC layer (e.g., channel sensing, address filtering, error checking), and provide a limited message buffer that can be accessed by a digital interface. All other tasks of the radio protocols (e.g., multi-hop routing, transport control) must thus be handled by the RCU. As these are typically rather simple and sequential control flow-dominated algorithms, waking up the RCU just to handle the radio protocol would not be energy efficient. For the same reason, handling the radio stack on a softcore MCU inside the RCU (Figure 12b) should be avoided.

![Design options for the computation and communication architecture](image)

Figure 12: Design options for the computation and communication architecture
When adding a dedicated MCU (Figure 12c) for all low priority tasks such as the radio protocol, basic timekeeping, and the top-level control flow of the application (e.g., periodic sensor sampling), the platform can exploit its heterogeneity by selectively shutting down temporarily unused computation and communication units. This basic DPM principle is also applicable when the radio transceiver and the low-power MCU are integrated into a single device (Figure 12d), as these RF-SoCs allow to suspend either of both units separately. Compared to Figure 12c, the combination of an RCU and an RF-SoC improves the data throughput between the hardware accelerator and the wireless transceiver, as more of the limited number of MCU GPIO pins can be dedicated for the inter-processor communication instead of being required for the communication between the MCU and the transceiver. Therefore, the heterogeneous sensor node developed in this thesis is based on the architecture shown in Figure 12d.

A second fundamental architectural design choice deals with the attachment of digital sensors (or analog sensors with a downstream ADC) and additional memory, often required to temporarily buffer raw or aggregated sensor data. Attaching both peripherals to the RF-SoC (Figure 13a) allows to collect the number of samples required for the data aggregation algorithm without ever waking up the hardware accelerator. However, the entire stream of sensor data has to be transferred over the critical link from the RF-SoC to the RCU to be aggregated. As the throughput between the processing units is further limited by the RF-SoC requiring General Purpose Input/Output (GPIO) pins to interface the sensors and the memory, the transfer of the raw data stream becomes the bottleneck of the architecture.

This bottleneck can be mitigated by a shared memory used for the inter-processor communication (Figure 13b and 13c), either by utilizing a dual-port memory, or by synchronizing the memory access of both processors via the remaining direct connections between RCU and RF-SoC. While dual-port memory is typically more expensive than a single-port memory (in terms of chip cost and energy consumption), it enables more parallel operations on the heterogeneous platform such as transmitting one block of the aggregated data while generating the next block. The shared memory approach is most valuable if the sensors are attached to the RF-SoC (Figure 13b) and the data aggregation can be performed on larger blocks of the raw sample data, so the RCU can be kept sleeping until the next block is collected.

However, the sensors may have to be attached to the RCU (Figure 13c and 13d) if the RF-SoC does not have enough GPIO pins to control all peripherals, e.g., if the sensors can not be daisy-chained, or connected to a common bus. Furthermore, many event

![Diagram](image-url)

**Figure 13**: Design options for attaching sensors (S) and memories (M) to the compute units.
detection applications require an immediate processing of the sensor data to minimize the detection delay. In these cases, only the aggregated data stream has to pass the bottleneck from the RCU to the RF-SoC. A dedicated shared memory between both computational units is thus not required and the architecture of Figure 13c is not useful.

From the remaining two reasonable architectures (Figure 13b and 13d), the latter was chosen as the foundation of the HaLoMote: With all peripherals being interfaced by the RCU, the HaLoMote supports a broad range of applications with different sensor and memory requirements. This flexibility is essential for the design space exploration targeted by this work.

### 4.2 HaLOEWEEn Implementation

After evaluating two proof-of-concept HaLoMote implementations based on composing discrete evaluation boards for RCU and RF-SoC, a third revision was designed in cooperation with the Microelectronic Systems Research Group at the Technical University Darmstadt [Philipp2011]. Figure 14 shows the main components of this board called Hardware-Accelerated Low Energy Wireless Embedded Sensor Node (HaLOEWEEn). As motivated in Section 4.1, it heterogeneously combines a software-programmable RF-SoC and an RCU.

The TI CC2531 (i.e., a [CC2530] with USB controller) was chosen as RF-SoC, as it includes a 2.4 GHz IEEE [802.15.4] transceiver and is thus potentially interoperable with many [ZigBee], [6LoWPAN], [WirelessHART] and ISA [100.11a] devices. Furthermore, the small 8 bit MCU core was chosen deliberately to increase the heterogeneity of the platform, as the CC2531 just handles less complex computations, such as the radio protocol and basic (yet precise) time-keeping. It also has access to the HMI peripherals (e.g., LEDs) on the mainboard.

The RCU used as hardware accelerator is realized as a discrete FPGA. As stated in Section 2.2, FPGAs based on non-volatile configuration storage are best suited for energy constrained applications, as they provide deep sleep modes with fast shutdown.
Table 3: FPGA devices with non-volatile configuration storage (P_idle specifies lowest possible power consumption with RAM retention)

<table>
<thead>
<tr>
<th>Vendor</th>
<th>Family</th>
<th>since</th>
<th>CMem</th>
<th>largest Device</th>
<th>LCs</th>
<th>V_CC</th>
<th>P_idle</th>
</tr>
</thead>
<tbody>
<tr>
<td>Actel / Microsemi</td>
<td>[IGLOOplus]</td>
<td>2008</td>
<td>Flash</td>
<td>AGLP125</td>
<td>1,024</td>
<td>1.2 V</td>
<td>17 µW</td>
</tr>
<tr>
<td></td>
<td>[IGLOOnano]</td>
<td>2008</td>
<td>Flash</td>
<td>AGLN250</td>
<td>2,048</td>
<td>1.2 V</td>
<td>24 µW</td>
</tr>
<tr>
<td></td>
<td>[IGLOO]</td>
<td>2008</td>
<td>Flash</td>
<td>AGL1000</td>
<td>24,576</td>
<td>1.2 V</td>
<td>53 µW</td>
</tr>
<tr>
<td></td>
<td>[ProASIC3nano]</td>
<td>2008</td>
<td>Flash</td>
<td>A3PN250</td>
<td>2,048</td>
<td>1.5 V</td>
<td>4500 µW</td>
</tr>
<tr>
<td></td>
<td>[ProASIC3]</td>
<td>2005</td>
<td>Flash</td>
<td>A3P1000</td>
<td>24,576</td>
<td>1.5 V</td>
<td>12000 µW</td>
</tr>
<tr>
<td></td>
<td>[ProASIC3L]</td>
<td>2008</td>
<td>Flash</td>
<td>A3PE3000L</td>
<td>75,264</td>
<td>1.2 V</td>
<td>3300 µW</td>
</tr>
<tr>
<td>Lattice / SiliconBlue</td>
<td>[iCE65]</td>
<td>2008</td>
<td>hybrid</td>
<td>iCE65L08</td>
<td>7,680</td>
<td>1.0 V</td>
<td>54 µW</td>
</tr>
<tr>
<td></td>
<td>[iCE40]</td>
<td>2011</td>
<td>hybrid</td>
<td>LP8K</td>
<td>7,680</td>
<td>1.0 V</td>
<td>300 µW</td>
</tr>
<tr>
<td></td>
<td>[iCE40ultra]</td>
<td>2014</td>
<td>hybrid</td>
<td>iCE5LP4K</td>
<td>3,520</td>
<td>1.2 V</td>
<td>85 µW</td>
</tr>
<tr>
<td></td>
<td>[MachXO3]</td>
<td>2014</td>
<td>hybrid</td>
<td>XO3LF-6900</td>
<td>6,900</td>
<td>1.2 V</td>
<td>4872 µW</td>
</tr>
<tr>
<td></td>
<td>[XP2]</td>
<td>2007</td>
<td>hybrid</td>
<td>XP2-40</td>
<td>40,000</td>
<td>1.2 V</td>
<td>54000 µW</td>
</tr>
</tbody>
</table>

and wake-up times as well as a very low static power draw. At the design time of the HaLOEWEEn mote in 2010, only two FPGA vendors actually targeted ultra-low power applications, as shown in Table 3. Actel, which was acquired by Microsemi in 2010, offered the [IGLOO] and [ProASIC3] device families. By storing the device configuration in on-chip Flash memory, the fabric of the FPGAs can be power-cycled within a few microseconds. This process, named Flash*Freeze, also preserves the content of the registers and BRAMs holding the runtime information. These Flash-based devices are thus well suited for fast DPM. The [ProASIC3] devices, however, were designed for more compute-intensive applications. Their large idle power consumption is not acceptable for a WSN scenario.

SiliconBlue, which was acquired by Lattice Semiconducters in 2011, offered the [iCE65] device family with a non-volatile bitstream storage, that is copied into the SRAM-based configuration storage at startup. Although the (one time programmable) NVM is integrated into the FPGA die (hybrid CMem), the configuration takes at least 50 ms, which renders fast DPM schemes infeasible. Furthermore, the largest [iCE65] provides less than 7000 logic cells, which would restrict the HaLoMote target applications to very simple aggregation mechanisms. In addition to the FPGAs listed in Table 3, some older CPLD devices like the Xilinx [CoolRunner-II] or the Altera [Max5] also use a hybrid CMem and are thus also not suitable for the HaLoMote.

In the end, only the Microsemi [IGLOO] family was a viable option for the HaLOEWEEn implementation in 2010. After Microsemi shifted towards more power-consuming applications with the follow-up [IGLOO2] family and its SoC version (i.e., SmartFusion2), the old [IGLOO] family remains the best choice for ultra-low power, yet demanding reconfigurable computing even today.

The HaLOEWEEn mote is built with the large AGL1000 device, but pin compatibility down to the AGL400 is assured. Target applications that can be implemented with a smaller amount of logic resources can thus use the cheaper devices without the need for redesigning the HaLOEWEEn PCB. As shown in Figure 14, the FPGA is driven by a 20 MHz external oscillator IC and supplied by four voltage rails generated from a common energy source (e.g., a 3 V battery pack). The 1.2 V rail supplies the
FPGA core during normal operation, while the 1.5 V supply is required only during Flash programming. The corresponding Voltage Level Converter (DC/DC) can thus be disabled by a jumper, as soon as the device is programmed. The 2.5 V and 3.3 V rails drive the peripherals, the RF-SoC, and the IO banks of the FPGA.

As described in Section 4.1, application-specific sensors and memories are directly attached to the FPGA. By exposing the corresponding GPIOs pins to expansion headers, HaLOEWEn3 can be easily adopted to different applications requiring specific peripherals. The resulting HaLOEWEn PCB is shown in Figure 15.

While the HaLOEWEn mote already exceeded the performance and power efficiency of homogeneous systems for real applications [Engel2012a], detailed power profiling (see Figure 16) revealed that the 8051-based MCU of the HaLOEWEn3 RF-SoC required significant energy, even with the radio transceiver completely shut-down: With the control software on the MCU just initiating RCU operations (i.e., sensor sampling

![Figure 15: HaLOEWEn version 3 (2010): 100 mm × 62 mm PCB](image)

![Figure 16: Per-component breakdown of the HaLOEWEn3 power consumption](image)
and data accumulation) at 128 Hz, the RF-SoC consumed between 34% and 48% of the overall system energy (depending on the actual compute load on the RCU) [Engel2012a]. More than half of the active time of the MCU (31 µs out of 58 µs) was spent waiting for the clock source of the RCU to become stable after waking it up.

These practical experiences led to design improvements for a fourth refined implementation of the architecture, shown in Figure 17. A improved clocking scheme allows the MCU to provide an auxiliary clock for the RCU until the main oscillator has completely started up (see Section 4.4.3). Furthermore, the new oscillator runs only at 8 MHz and can be scaled further down by a programmable pre-divider. In most cases, the energy consumption of the RCU-internal clock conditioning component, such as a PLL, can thus be eliminated.

For the HaLOEWE4, the TI RF-SoC itself was replaced by a more recent Atmel [ATmega256RFR2] device, which not only has higher radio throughput (in a special non IEEE [802.15.4] compliant mode), but also more GPIO pins for communicating with the RCU. Furthermore, it can be operated with just a 1.8 V supply, (instead of the 2.5 V supply used for the CC2531), thus allowing for more efficient switching regulators.

More precisely, a single step-down regulator (1.8 V) is used by HaLOEWE4 for the peripheral supply instead of two buck-boost converters (2.5 V and 3.3 V) required by HaLOEWE3. Furthermore, a single configurable step-down regulator is used by HaLOEWE4 to either generate the 1.2 V or the 1.5 V core supply for the FPGA. The third DC/DC converter used by the HaLOEWE4 is only required when the FPGA is programmed and can thus be kept down most of the time. By exposing the FPGA Joint Test Action Group (JTAG) interface to the MCU, over-the-air reconfiguration of the whole platform is supported by the HaLOEWE4.

Furthermore, the power and area-consuming HMI-peripherals (e.g., LEDs, switches) were excluded from the mainboard. While they can still be attached for debugging pur-

![Figure 17: HaLOEWE4 version 4 (2014): Basic components](image-url)
poses, unused peripheral pins are now used to increase the communication bandwidth between MCU and RCU.

Most monitoring applications require a significant amount of external memory. By directly integrating four 1 Mbit serial SRAM devices on the mainboard, less demand was imposed on the expansion headers (now 40 pins, down from 148). This significantly shrunk the overall system size down to 46 mm × 30 mm (see Figure 18). Choosing multiple serial instead of a single parallel memory enables parallel independently addressed memory accesses by the RCU. Furthermore, one or more SRAMs can be selectively replaced with pin-compatible non-volatile FRAMs for even more aggressive power management, as described in Section 4.4.2. FRAM was chosen as persistent data storage as it clearly outperforms Flash-based memory in write performance (i.e., access time, granularity, and energy consumption) [Lueders2014]. In particular, the [MB85RS1MT] FRAM modules can be written at 25 Mbit/s while just consuming 5 nJ per byte from the 1.8 V supply rail. As detailed in Table 5, an equally sized Flash device can be written at 260 kbit/s while consuming 332 nJ per byte.

4.3 Inter-Processor Communication

As in every heterogeneous architecture, the communication between the different computing units has a large impact on the system level performance. Commonly, a middleware layer is employed to wrap the hardware/software interface.

In order to achieve this for a very constrained system such as HaLOEWE\textsubscript{n}, a number of issues must be addressed that do not come up for larger desktop or server-class machines. In such systems, discrete RCUs and CPUs usually communicate using memory-mapped IO (MMIO). However, due to pin-count restrictions, the MCU does not externally expose its memory bus, thus necessitating the use of a different interface.

While the Universal Synchronous and Asynchronous Serial Receiver/Transmitter (USART) of the MCU could be used to establish a low-pin count, but rather slow serial link to the RCU, a sufficient number of MCU GPIO pins is available to realize a still narrow, but much faster parallel bus. It consists of a bidirectional 9 bit to 20 bit data, a 3 bit MCU→RCU command, and a 1 bit clock signal, as shown in Figure 19. The actual data width depends on whether additional (HMI) peripherals also require GPIO pins. An additional sleep signal is used for DPM as described in Section 4.4.

Based on this physical communication layer, a message-based application-independent Application Programming Interface (API) was implemented permitting the MCU to control the execution of multiple hardware (HW) kernels on the RCU. HW kernels are typically used to interface sensors, access the external memory (which is only available
to the RCU), implement digital filters, data compression, or other application-specific data aggregation algorithms.

Each HW kernel has input and output data, an internal state, and start/done signals controlling its execution. HW kernels can thus be daisy-chained very easily. To conserve bandwidth on the narrow parallel bus, the HW kernel selection is separated from the data transfers (similar to virtual circuit switching). For example, sensor kernels must be started in every sampling cycle, but a data compression channel is only triggered after a sufficient number of samples has been collected. Thus, the virtual channel only needs to be switched to the appropriate kernel just before and after the data compression. Cyclically executing kernels can be set to auto-restart without MCU intervention, beginning computing again once their previous results have been retrieved.

Table 4 shows the commands implemented by the middleware. On the RCU side, the entire protocol (including marshaling/demarshaling of parameters) is handled in a dedicated HW block shared between the kernels.

The performance of this interface is primarily limited by the MCU-internal memory throughput and GPIO toggle rates. A latency of about 1.5 µs per atomic operation is

<table>
<thead>
<tr>
<th>Command</th>
<th>data driven by</th>
<th>Semantics and Parameters/Results</th>
</tr>
</thead>
<tbody>
<tr>
<td>OBSERVE</td>
<td>RCU</td>
<td>retrieve execution states of all kernels</td>
</tr>
<tr>
<td>READ</td>
<td>RCU</td>
<td>read current output data from virtual channel</td>
</tr>
<tr>
<td>WRITE</td>
<td>MCU</td>
<td>write current input data to virtual channel</td>
</tr>
<tr>
<td>START</td>
<td>MCU</td>
<td>start kernel with given ID</td>
</tr>
<tr>
<td>RESET</td>
<td>MCU</td>
<td>reset kernel with given ID</td>
</tr>
<tr>
<td>SWITCH</td>
<td>MCU</td>
<td>switch virtual channel to given kernel ID</td>
</tr>
<tr>
<td>INC</td>
<td>MCU</td>
<td>value to be added to the selected kernel index</td>
</tr>
<tr>
<td>CONFIG</td>
<td>MCU</td>
<td>reconfigure kernel with selected ID</td>
</tr>
</tbody>
</table>

Table 4: Atomic operations for MCU ↔ RCU communication
achieved for the 8 bit data signals of the HaLOEWEn3, resulting in a throughput of 5.3 Mbit/s. This can be increased by allocating additional GPIO pins not required for peripherals to widen the data bus. An alternative USART-based communication link could achieve up to 4 Mbit/s when utilizing both USART modules of the CC2531 RF-SoC in parallel. The proposed communication framework thus improves the conventional approach by at least 33%.

4.4 Power Management

Since both the RCU and MCU devices support DPM, a proper power management scheme should be straightforward: The MCU is woken up periodically at the beginning of each sampling interval to handle system management tasks such as the wireless communication protocol or the high-precision time synchronization (Section 6.2). It also retrieves previously computed data from the RCU (for possible transmission), offloads new computations (if any) to the RCU, and then goes to sleep itself. The RCU is powered up only if it actually has work to do.

The Microsemi [IGLOO] Flash*Freeze mode preserves the hardware-configuration as well as the content of the on-chip state, while reducing its power draw to just 53 µW. This sleep mode can be quickly entered and exited within a few microseconds transition time, which is negligible even for sampling frequencies of several 100 Hz. The MCU signals the RCU to enter and exit the Flash*Freeze mode by (de-)asserting the sleep signal shown in Figure 19.

However, the interdependency between both compute units of the heterogeneous architecture introduces difficulties for the overall DPM strategy, which are addressed in the following sections.

4.4.1 Flash*Freeze Control

Figure 20a shows a sample schedule for a single output-only HW kernel (e.g., a sensor controller). While the first line represents the active time of the MCU, the command, data, and sleep lines correspond to the inter-processor communication signals shown in Figure 19. The fifth line describes the active time of the RCU (i.e., not in Flash*Freeze mode), followed by the time the RCU is actually executing a HW kernel. The last line defines the time periods during which the RCU knows that it is ready to fall asleep, as

Figure 20: Scheduling of DPM and inter-processor communication for a single output-only HW kernel
all HW kernels are completed. This signal is derived by a simple state machine in the HW kernel controller observing the start and done flags of all kernels.

At the beginning of each sampling cycle, the MCU is woken up by its internal timer. Afterwards, the MCU deasserts the sleep signal, so the RCU is woken up. Now, the execution of the HW kernel is initiated by the MCU issuing the START command of the HW kernel API with the id of the targeted kernel passed in the data signal. Once the kernel has started, the RCU is not ready to enter the Flash*Freeze mode. At this point in time, the MCU performs its management tasks (e.g., updating the timer and handling the radio protocol), which is not shown in Figure 20. Complex HW kernels such as the compression of larger datasets (see Section 5.2), will take longer to complete than the MCU requires for its management operations. By using the OBSERVE command, the MCU determines the execution status of the HW kernel and stalls, until the kernel completes execution. Afterwards, the result of the kernel can be read back to the MCU using the READ command. Finally, the MCU asserts the sleep signal again to shutdown the RCU, before the MCU itself falls asleep.

Stalling the MCU just to wait until the RCU can be shutdown, is not energy efficient. Although the MCU could fall asleep earlier and use a GPIO interrupt to wake up as soon as the HW kernel is completed, the additional state transitions of the MCU increase the mote’s energy consumption. To decouple the shutdown of the RCU from the shutdown of the MCU, the HW kernel controller inside the RCU delays the actual shutdown request until the RCU is ready to fall asleep (i.e., all kernels finished their execution). This is achieved by a dedicated control bit inside the [IGLOO] fabric. The RCU only enters the Flash*Freeze mode, if both the external and the internal shutdown requests are asserted.

Figure 20b again shows a sample schedule for a single output-only HW kernel, now exploiting the improved Flash*Freeze control to get rid of the lengthy OBSERVE command. Due to the auto-start on read feature, even the START command can be avoided thus significantly reducing the active time of the power-hungry MCU. Note that the value read by the MCU corresponds to the output of the HW kernel from the previous sampling cycle, as the kernel is started after its result is fetched. This might be critical for delay-sensitive applications.

### 4.4.2 Swapping Live Variables to External Memory

Although the [IGLOO] FPGA can enter and exit the Flash*Freeze mode very quickly, the 53 µW static power consumption within this mode is at least an order of magnitude larger than the power drawn by typical sleeping WSN software processors (see Table 17). To further reduce the power consumption of the FPGA, its voltage supply would have to be shut down completely, either by disabling the appropriate DC/DC converter, or by opening a power Field-Effect Transistor (FET). Note that these mechanisms are not yet integrated into the HaLoMote implementations.

Due to the non-volatile configuration storage, the hardware accelerator will be operable again immediately after ramping up its power supply. However, all runtime information hold in the volatile (SRAM-based) registers and BRAMs is lost after each power cycle. Before shutting down the FPGA, any information that will be required again after the power cycle (i.e., the live variables) thus have to be swapped to an external memory, from where it can be restored after the restart of the FPGA.
The data swapping requires additional energy for the memory read and write operations and the additional time the FPGA has to be kept awake. The sleep period with reduced power consumption must be long enough to recoup the swapping overhead. To estimate the minimum sleep time per swapped bit, let $P_r$, $P_w$, and $P_s$ denote the power drawn by the memory during reading, writing and when shut down. Furthermore, let $T_r$ and $T_w$ be the achievable throughput for streaming reads and writes, as well as $P_a$ and $P_f$ be the power drawn by the FPGA in active and Flash*Freeze mode. The swapping of $n$ bit live variables for a sleep period $t$ pays off, if

$$t \cdot P_f > \frac{n}{T_w}(P_a + P_w) + (t - \frac{n}{T_r})P_s + \frac{n}{T_r}(P_a + P_r)$$  \quad (1)$$

$$\iff \frac{t}{n} > \frac{1}{P_f - P_s} \left( \frac{P_a + P_w}{T_w} - \left( \frac{1}{T_w} + \frac{1}{T_r} \right)P_s + \frac{P_a + P_r}{T_r} \right)$$  \quad (2)$$

Table 5 lists the power draw and throughput specification of three different types of memory. The SRAM and the FRAM modules are actually integrated into the HaLOEWEm4, while the third memory is a Flash device with the same capacity and interface. The relative break-even time $t/n$ depends on the power drawn by the FPGA and is thus application-specific. However, assuming $P_a = 30 \text{mW}$ for a worst case analysis is reasonable (see Table 17).

The resulting break-even points for the different memory types are also listed in Table 5. Due to its low write throughput, Flash memory is not appropriate for the live variable swapping. The SRAM swapping pays off after 62 $\mu$s/bit, while the FRAM requires 71 $\mu$s/bit to justify the swapping. When directly comparing the overhead of SRAM and FRAM swapping against each other, the latter pays off if

$$\frac{2n}{T_{w,SRAM}}(P_a + P_{w,SRAM}) + (t - \frac{2n}{T_{w,SRAM}})P_{s,SRAM} > \frac{2n}{T_{w,FRAM}}(P_a + P_{w,FRAM})$$  \quad (3)$$

$$\iff \frac{t}{n} > \frac{2}{P_{s,SRAM}} \left( \frac{P_a + P_{w,FRAM}}{T_{w,FRAM}} - \frac{P_a + P_{w,SRAM} - P_{s,SRAM}}{T_{w,SRAM}} \right) = 327 \mu\text{s/bit}$$  \quad (4)$$

Thus, for a sample live variable set of 100 B, swapping to SRAM is worthwhile for sleep periods of at least 50 ms, and using FRAM instead of SRAM is reasonable only for sleep periods of at least 262 ms. For a running measurement, the set of the live variables between subsequent sampling periods is rather large, as it typically contains buffered samples (e.g., taps of digital filters) and intermediate results (e.g., accumulated

<table>
<thead>
<tr>
<th>Device</th>
<th>Type</th>
<th>$T_w$</th>
<th>$T_r$</th>
<th>$P_w$</th>
<th>$P_r$</th>
<th>$P_s$</th>
<th>$t/n$</th>
</tr>
</thead>
<tbody>
<tr>
<td>[23A1024]</td>
<td>SRAM</td>
<td>20 Mbit/s</td>
<td>20 Mbit/s</td>
<td>1.8 mW</td>
<td>1.8 mW</td>
<td>1.8 $\mu$W</td>
<td>62 $\mu$s</td>
</tr>
<tr>
<td>[SST25WF010]</td>
<td>Flash</td>
<td>0.26 Mbit/s</td>
<td>20 Mbit/s</td>
<td>10.8 mW</td>
<td>3.6 mW</td>
<td>-</td>
<td>2993 $\mu$s</td>
</tr>
<tr>
<td>[MB85RS1MT]</td>
<td>FRAM</td>
<td>25 Mbit/s</td>
<td>25 Mbit/s</td>
<td>17.1 mW</td>
<td>17.1 mW</td>
<td>-</td>
<td>71 $\mu$s</td>
</tr>
</tbody>
</table>

Table 5: Power consumption and streaming throughput of different 1 Mbit memories operated at 1.8 V
sensor channel statistics). The swapping is therefore not viable for the DPM between subsequent sampling cycles for most of the applications addressed by the HaLoMote with sampling rates of 100 Hz and above. Instead, such swapping can only be used for supervisory power management when the application is not relying on continuous monitoring.

The estimation of the break-even points assume that the number of bits swapped out equals the number of bits swapped in. A dirty bit mechanism known from conventional cache systems may reduce the number of bits to swap out. Classification algorithms such as rainflow-counting [OConnor2010] are typical examples that benefit from this mechanism, as a larger dataset is continuously updated but only a small subset of its entries might have changed between subsequent DPM cycles.

### 4.4.3 Flexible Clock Generation

Since both the MCU and RCU rely on synchronous digital logic, they require a periodic clock signal. MCUs typically provide internal oscillators based on resonating crystals or simple capacitive circuits, which are automatically turned-off when the MCU is put to sleep. In contrast, the [IGLOO] FPGA requires an additional clock source. Commonly, an external oscillator IC is used for this purpose. The [LTC6930] oscillator clocking the HaLOEWEn4 RCU was primarily selected for its energy efficiency. Nevertheless, it still draws about 580 µW at 8 MHz, which is more than 10× the combined power required by the MCU and RCU in sleep mode. Clearly, the oscillator should also be powered down when the RCU is put to sleep. However, the start-up of the oscillator takes about 45 µs (see Figure 21), which renders the fast Flash*Freeze capability of the RCU (about 1 µs state transition time) a moot point. At higher sampling frequencies, more sophisticated clocking schemes are thus required to support fast and energy-efficient shutdown and wake-up of the hardware accelerator.

In this Section, five different clocking schemes are described and their impact on the DPM scheduling of the HaLoMote is analyzed. The resulting energy efficiency is reported in Section 4.4.4.

![Figure 21: Startup timing of the [LTC6930] external oscillator](image-url)
The most obvious clocking scheme is shown in Figure 22a. An external oscillator IC directly drives the RCU while the MCU is controlling the DPM by shutting down both the RCU and the oscillator via dedicated GPIOs pins. Most external oscillators do not provide low-power modes, but require their supply voltage to be detached by a dedicated transistor, or by shutting down the corresponding DC/DC converter. In the latter case, the oscillator keeps toggling even after the converter is shut down, until the voltage of the converter’s charge pump capacitor drops below the supply threshold of the oscillator. For example, a typical 10 µF DC/DC output capacitor keeps supplying the [LTC6930] oscillator with 321 µA for about 3 ms, before the supply voltage drops from 1.8 V to 1.7 V. A fast shutdown thus has to be realized by a transistor switch. Figure 22b details the power management timing resulting from the external clock supply. After the MCU woke up and activated the power supply of the oscillator, the hardware kernels cannot be started before the oscillator provides a stable clock signal. As stated above, this long oscillator start-up delay is a major drawback. Even worse, the MCU cannot put itself to sleep as long as the hardware kernels are running, as it has to shutdown the oscillator power supply immediately after the hardware accelerator has finished its tasks.

The late shutdown of the MCU can be avoided by directly powering the oscillator with an RCU IO pin (see Figure 23a). The [IGLOO] FPGA can pull an IO pin to ground when entering the Flash*Freeze mode, and back up to the supply voltage level of the IO banks when waking up, even without a clock signal being present. Combined with the sophisticated Flash*Freeze controller described in Section 4.4.1, this clocking scheme unburdens the MCU from monitoring whether the RCU is currently active. Furthermore, as the FPGA output pins are not buffered by a noticeable capacitance, the oscillator stops toggling immediately after the FPGA has entered the Flash*Freeze. The lengthy start-up time of the external oscillator, however, remains a weak point of this clocking scheme.

Figure 22: Clocking the RCU by an external oscillator controlled by the MCU

Figure 23: Clocking the RCU by an external oscillator controlled by the RCU
Instead of using an external oscillator IC, a clock signal generated by the MCU can drive the RCU as shown in Figure 24a. Many MCUs like the Atmel [ATmega256RFR2] can provide clock signals at dedicated output pins. For all other devices, like the TI [CC2530], an unused Serial Peripheral Interface (SPI) module can be repurposed by repeatedly filling the SPI output register to keep the serial clock running (even though no SPI transfers actually take place). This can be achieved by Direct Memory Access (DMA) transfers without actually invoking the processor. As shown in Figure 24b, the major disadvantage of this scheme is the necessity to keep the MCU active until the RCU has finished its own computations. As before, the MCU must be stalled unless useful computations can actually be issued while waiting for the RCU. Thus, this approach of using a MCU clock can only do better than the external oscillator IC if the accelerator execution time is shorter than the oscillator IC startup time.

Figure 25a shows the combination of the two previously described clocking schemes. This dual clocking approach uses the fast-starting auxiliary clock (ACLK) provided by the MCU to drive the RCU while the external oscillator is slowly starting up to generate the main clock (MCLK). The [ATmega256RFR2] can generate an ACLK of up to 16 MHz, so about 720 RCU cycles can already be executed before the MCLK becomes stable. As shown in Figure 25b, this avoids stalling the MCU and the RCU at the beginning of a sampling cycle and allows for a fast shutdown of the MCU, as the ACLK is not required to complete the hardware kernel execution (since the MCLK has taken over). In practice, however, several issues have to be addressed when implementing such a dual-source clocking scheme.

First, the RCU itself has to determine whether the MCLK is sufficiently stable after the start-up. A simple counter on the RCU delays the switch-over from the ACLK to the MCLK by an experimentally-determined number of MCLK cycles. To improve the robustness of the MCLK stability detection under voltage and temperature...
variations, the number of MCLK cycles within each (or multiple) ACLK cycle can be observed. As soon as the expected number of MCLK cycles was counted in two (or more) subsequent ACLK cycles, MCLK-stability can be assumed. This feature is, however, not yet implemented for the HaLoMote architecture.

Second, a valid switching from the ACLK to the MCLK has to be supported. Runt pulses (leading to invalid clock periods shorter than the specified cycle duration) must not reach the hardware kernels, as overclocked hardware would generate unpredictable results. As shown in Figure 25b, switching to the MCLK immediately after it is deemed stable may cause such runt pulses and a faulty clock signal. To avoid this, the RCU-internal system clock must be forced low for at least one MCLK cycle after the MCLK became stable and the auxiliary clock went low (ensuring a valid clock signal after the switch-over).

Finally, if all hardware kernels can be completed on the ACLK before the MCU falls back to sleep, the external oscillator should not even be started. To achieve the best energy savings, the decision about whether the MCLK is necessary should be made by the RCU at the start of each sampling cycle, before actually starting the hardware kernels. This is possible for many applications with fixed or predictable accelerator runtimes, as long as the minimum execution time of the MCU (e.g., required for timekeeping) is known.

The last clocking scheme considered generates the clock on the RCU itself, as described in [AC332], requiring neither an external oscillator IC nor the MCU clock. As shown in Figure 26a, this oscillator is realized as a combination of an inverting NAND gate and a configurable number of delay gates to control the clock frequency. The AO14 cell is used as delay gate as it provides the largest IO delay (≈ 6.3 ns) within the Microsemi cell library for the IGLOO device. By pulling two out of its three inputs to ground (see Figure 26b), the remaining input (C) is not inverted. The delay chain without the first NAND gate is thus not limited to an even length. The oscillator can be controlled by low active signals to immediately force the clock output high (resetN), or only after the next rising clock edge (idleN). The actual clock frequency is sensitive to the number of delay cells as well as the RCU core voltage and operating temperature, since no stable frequency reference (such as a quartz crystal) is involved (see Section 4.4.4). If the associated frequency drift is acceptable for an application, this scheme completely eliminates the flaws of the previous ones, as the MCU is no longer involved in the RCU clock management at all, as shown in Figure 27b.

Figure 26: Implementation of an RCU-internal oscillator, consisting of a configurable number of AO14 cells used as delay elements as described in [AC332]
4.4.4 Evaluation

To analyze the effects of the different clocking schemes on the energy efficiency of the heterogeneous platform, a synthetic benchmark related to the target application described in Chapter 7.3 was executed on the HaLOEWEEn mote. Essentially, a configurable number of computations and accompanied accesses to an RCU-external SRAM were handled by a HW kernel in every sampling cycle. In parallel, the MCU was performing timekeeping and queried status information from the RCU. The same algorithm was implemented on a TI [MSP430] reference system, to compare the energy efficiency of the heterogeneous architecture against a homogeneous software solution. Detailed information about this benchmark running at 128 Hz sampling frequency can be found in [Engel2012a].

The clocking schemes shown in Figures 23, 24, 25, and 27 were arranged by appropriately connecting prototyping boards for the [IGLOO] AGL1000 FPGA, the [CC2530] MCU and the 8 MHz [LTC6930] oscillator. The current flowing into a common 3.7 V supply rail was measured by an Agilent [34411A] multimeter configured for 100 mA range and an integration time of 2 s to average the current spikes into the switching regulators. Thus, all losses caused by voltage regulators and the oscillator are included in the power measurements.

Figure 28 details the measurement results when sweeping the number of accumulations to be performed per sampling cycle from 0 to 100. First of all, the [MSP430] 8 MHz External OSC (Figure 23) 1 MHz MCU Clock (Figure 24) Dual Clock Sources (Figure 25) 8 MHz Internal Oscillator (Figure 27) 8 MHz MSP430 Reference

![Figure 28: Load-Dependent power draw for the considered clocking schemes](image)
reference system is clearly outperformed by the four HaLOEWEn configurations, if more than 15 accumulations per sampling cycle have to be performed. Obviously, a WSN mote can only benefit from reconfigurable computing, if there is a sufficient amount of computation to be offloaded to the hardware accelerator.

For less than 14 accumulations per sampling cycle, the HaLOEWEn driven by the 1 MHz serial MCU clock draws less power than when driven by the 8 MHz external oscillator. In these cases, the runtime of the HW kernel clocked at 1 MHz is 49 µs, as each accumulation takes 3.5 cycles on average. This closely matches the start-up delay of the external oscillator and experimentally confirms the trade-off discussion in Section 4.4.3.

The experiments have also determined that the MCU requires at least least 26 µs to start the hardware kernel and prepare its timer for the next sampling cycle. As each accumulation may require up to four cycles (i.e., 4 µs at 1 MHz SPI clock) on the RCU, a maximum of six accumulations can be performed running just on the ACLK without starting the external oscillator. This is reflected by the power consumption characteristics of the dual oscillator clocking scheme in Figure 28. For up to six accumulations, this scheme draws even less power than the clocking scheme from Figure 22: It does not require the MCU (which supplies the ACLK) to remain awake and check whether the RCU has completed the computation before the MCU can fall asleep again. Instead, the RCU can continue autonomously as soon as the main clock becomes stable, and the MCU can go back to sleep immediately. If more than six accumulations have to be executed, the main clock is started up at the beginning of each sampling cycle, thus raising the power draw to the level of the external oscillator scheme. With increasing computational load, the dual oscillator scheme then benefits by executing at the rate of the faster main clock.

To reduce the system level power draw below that of the dual clocking scheme, all external oscillators and the need to stall any of the computing units before shutdown must be eliminated. As shown in Figure 28, the internal oscillator achieves this goal and outperforms all other clocking schemes. But the usage of such an internal oscillator is quite unconventional, as its frequency is dependent on the operating temperature and the RCU core voltage. As shown in Figure 29a, a ±10 mV core voltage variation translates into a ±13 kHz frequency variation for 400 delay cells. Since the LTC3388 voltage regulator used to generate the RCU core supply voltage is specified to be stable
within ±60 mV on its 1.2 V output voltage, the voltage-dependent frequency variation will be limited to ±78 kHz or ±11% of the 714 kHz center frequency. This far exceeds the temperature-dependent frequency variation, which was observed to be just ±6 kHz or ±0.8% over a 76 °C temperature range (Figure 29b). Thus, the internal oscillator is a feasible power-saving design choice if its target frequency is conservatively configured to be about 15% to 20% less than the maximum clock frequency supported by the specific hardware design (ensuring correct operation even if voltage variations lead to a faster clock).

To selectively control the on-chip oscillator cycle time, two options have been investigated. Increasing the number of delay cells shows a linear effect on the cycle time (Figure 30a). Modifying the distance between the delay cells increases the overall routing delay. To achieve this, each delay cell must be placed manually. Figure 30b shows the effect of longer distances on the oscillator frequency. Due to the segmented nature of the FPGA’s configurable interconnect, the relation between distance and delay is non-linear. However, using longer interconnect delays allows for fewer delay cells without changing the target frequency.

To compare the power efficiency of both design options (i.e., longer interconnects or more delay cells), two 5 MHz oscillators have been implemented using 40 and 72 delay cells placed at a distance of 10 and 1 LUTs respectively. The power draw was measured in the RCU Flash*Freeze mode with the oscillator running. Afterwards, the static power consumption in Flash*Freeze mode (with disabled oscillator) was subtracted to determine the power drawn only by the oscillator. The oscillator using 72 delay cells shows a 35% smaller power consumption (see Table 6). For lower power draw, it is thus advisable to commit more active LUT area for delay cells, than to rely on interconnect segments for controlled delays.

<table>
<thead>
<tr>
<th>Delay Cells</th>
<th>Distance</th>
<th>Frequency</th>
<th>Power Consumption</th>
</tr>
</thead>
<tbody>
<tr>
<td>40</td>
<td>10</td>
<td>5.02 MHz</td>
<td>272 µW</td>
</tr>
<tr>
<td>72</td>
<td>1</td>
<td>5.04 MHz</td>
<td>176 µW</td>
</tr>
</tbody>
</table>

Table 6: Power-Efficiency of on-chip oscillator architectures
CHAPTER 5

Library of Reusable Hardware Accelerator Blocks

The hardware accelerator of the HaLoMote architecture is primarily intended to speed up application-specific data aggregation and feature extraction algorithms. Nevertheless, there are many application-independent algorithms commonly required in WSN scenarios, that might also benefit from hardware acceleration. Potential candidates comprise encryption and authentication, generic data compression, routing, time synchronization, Real-Time Operating Systems (RTOSs), and digital filters. For the following reasons, not all of these tasks are addressed in this work.

Symmetric and asymmetric encryption on RCUs have already been extensively researched [Rodriguez-Henriquez2007]. The corresponding results can be directly applied to the HaLoMote architecture.

Routing algorithms deal with the discovery and maintenance of routes and their overhead is primarily caused by message exchanges (e.g., for neighbor detection) instead of complex computations. Hardware acceleration may be helpful for complex graph problems [Ahmed2009], but the required topology knowledge often prevents a distributed implementation, as targeted by the HaLoMote architecture. Although the hardware acceleration of generic routing protocols is not addressed in this work, a flooding scheme tailored to SHM applications is presented in Section 6.1.

In addition to resource synchronization, event handling, and message passing mechanisms, the central element of an RTOS is the task scheduler. Several attempts were made to realize some [Tang2014] or all [Yan2010] of these elements in hardware. A hardware-accelerated task scheduler, however, must be able to preempt software threads upon occurrence of asynchronous events (i.e., the reception of a radio packet). The RCU thus would remain active most of the time, which is not affordable in energy-constrained WSN scenarios.

Thus, this work is focused on accelerating other functions using dedicated hardware.

5.1 Digital Filters

Digital filters are often required in WSN applications for frequency-dependent signal conditioning (e.g., by FIR filters) or for signal transformations between the time and the frequency domain (e.g., by FFT filters). These digital filters can be implemented as dataflow structures and are thus well suited to be implemented on RCUs. The regular filter structures can also be efficiently generated by HLS tools. In this section, the energy required for hardware-accelerated FIR and FFT filters are compared against appropriate MCU implementations.
5.1.1 Finite Impulse Response Filter

A $n$-tap FIR filter with the filter coefficients $(a_i)_{i=0}^{n}$ transforms a discrete input sequence $(x_i)_{i \in \mathbb{N}}$ into the output sequence

$$y_m := \sum_{k=0}^{n} a_k \cdot x_{m-k} \quad \forall m \geq n. \quad (5)$$

By selecting appropriate coefficients, the frequency response of the filter can be adopted to realize high-pass or low-pass filters, or to model the behavior of another specific structure. With increasing filter order, the desired frequency response can be modeled more accurately. In WSN applications, FIR bandpass filters are often used for input signal conditioning, e.g., to eliminate low- and high-frequency components such as static offsets or sensor noise.

In [Engel2011], a 384-tap FIR filter modeling the behavior of a bearing structure was generated by [SynphonyHLS] (i.e., the predecessor of [SynphonyModelCompiler]) for the Microsemi [IGLOO] M1AGL1000V2 device. Fixed-point arithmetic with a 16 bit data path, 14 bit coefficients and 7 bit multipliers was used resulting from a bit width optimization process. With a $20 \times$ folding of the hardware resources, the filter occupies 71\% of the available logic cells of the FPGA. When running the FPGA at 9 MHz, the FIR execution takes $2.2 \mu s$. At a sampling period of 400 Hz as required by the bearing structure application, the 0.1\% duty cycle results in an average power consumption of 71 $\mu W$ (FPGA core supply without IO banks).

In comparison, a sequential [MSP430] implementation of the same FIR filter also running at 9 MHz draws 5.6 mW on average. The hardware accelerator thus improves the energy efficiency by $79 \times$.

5.1.2 Fast Fourier Transformation

A $n$-tap Discrete Fourier Transformation (DFT) filter transforms a discrete time domain sequence $(x_i)_{i=0}^{n-1}$ captured with the sampling frequency $f_s$ into the frequency domain sequence

$$y_m := \sum_{k=0}^{n-1} x_k e^{-2\pi i \frac{km}{n}} \quad \forall m \leq \frac{n}{2} \quad (6)$$

with $y_m$ representing the complex amplitude of the spectrum at the frequency bin $\frac{m}{n} \cdot f_s$.

As stated in Section 3.3.1, many monitoring systems operate in the frequency domain. The FFT rearranges the structure of the DFT into a self-recursive formulation for a more efficient $O(n \log n)$ implementation. It requires $n$ to be a power of two.

In [Engel2012a], a 256-tap FFT was generated by [SynphonyHLS] for the Microsemi [IGLOO] M1AGL1000V2 device. It uses integer arithmetic with 18 bit accuracy at the input stage. When running the FPGA at 8 MHz, the FFT execution takes $6.3 ms$ and consumes $46 \mu J$ (FPGA core supply without IO banks).

In comparison, a sequential [MSP430] implementation of the same FFT filter also running at 8 MHz takes 124 ms and consumes 259 $\mu J$. The hardware accelerator thus
improves the runtime-efficiency of the software-processor by 20× and the energy efficiency by 5.6×.

5.2 Lossless Data Compression

In this section, a lossless data compression scheme is described, which is based on minimal assumptions about the nature of the data stream to be compressed. Furthermore, a HaLOEWEn hardware accelerator is detailed, whose energy efficiency is evaluated for two concrete examples in Section 7.1 and 7.2. The implementation and evaluation of the compression kernel have already been published in [Engel2014a].

Lossless data compression can be considered as the most generic data aggregation scheme. It should be applied, if all sensor data has to be forwarded to the base station, either because the feature extraction requires information from all network nodes, or because the network operator requires full-spectrum data for a comprehensive failure analysis.

While reducing the communication demands and the related energy consumption of the sensor nodes, data compression comes at the cost of encoder and decoder complexity. As the decoding is typically performed at the (often mains-powered) base station, the decoder complexity is generally not a major concern. This section thus focuses on the encoder complexity.

To improve the compression quality, many encoders first collect a block of data to analyze its statistical nature before compressing the block with the appropriate settings. This two-pass strategy increases the memory capacity required on the node, as well as the latency between data acquisition and transmission. The block size, and thus the gains in compression quality, is therefore limited by the amount of available memory, or tight real-time requirements in latency-sensitive applications. In addition, both encoder passes require computation time and energy, both of which must be amortized by the reduced communication effort.

5.2.1 Adaptive Differential Pulse Code Modulation and Rice-Encoding

Without detailed knowledge about the data characteristics of a specific application, only the temporal correlation between the samples generated by each sensor can be taken into account. Unless observing random data, the absolute difference between successive samples can be expected to be smaller than the dynamic range of the sensor signal with a high probability.

As described in [Sayood2005, Chapter 11], the Differential Pulse Code Modulation (DPCM) exploits these temporal correlations by predicting the next sensor sample as a linear combination of the previous $M$ observed samples.

$$p_i := \sum_{k=1}^{M} a_k \cdot x_{i-k} \quad \forall i > M$$

$a_1, \ldots, a_M \in \mathbb{R}$ are the prediction coefficients for a predictor of order $M$. The prediction order can be used to trade-off the compression quality against the computational complexity of the encoder.
Instead of transmitting the samples, the prediction error
\[ d_i := p_i - x_i \quad \forall i > M \] 
(8)
is forwarded to the downstream symbol encoder. If the samples can be predicted accurately, the prediction error sequence has a zero mean and a small variance. To exploit these features, a Rice symbol encoder is used that translate the prediction error sequence into a compressed bitstream [Yeh1991].

As an initial step for Rice encoding, the sequence of signed prediction errors \( d_i \) is converted to a sequence of unsigned values \( v_i \) by a simple transformation:
\[ v_i := \begin{cases} 2d_i, & \text{for } d_i \geq 0 \\ -2d_i - 1, & \text{for } d_i < 0 \end{cases} \quad \forall i > M \] 
(9)
The relevant property of this transformation is that small absolute values are mapped to small unsigned values. The resulting values \( v_i \) are then encoded into Rice-form bit sequences
\[ R_W(v_i) := \text{concat} \left( U\left(\frac{v_i}{2^W}\right), B_W(v_i \mod 2^W) \right) \quad \forall i > M \] 
(10)
with \( W \) being the Rice parameter that balances the widths of a zero-terminated unary code \( U \) and the binary block code \( B_W \) in a bit-wise concatenation. Precise predictions (i.e., \( v_i < 2^W \)) can thus be represented by \( W + 1 \) bits. Infrequently occurring larger prediction errors, caused by unforeseen spikes, can still be expressed losslessly by exploiting the variable length unary code.

A small and comprehensive example for a third order DPCM and a Rice symbol coder is presented in Table 7. The first three input samples are transmitted in uncompressed form, i.e., as 8-bit block code in this example. The fourth sample is predicted as 17, so the prediction error \(-7\) must be transmitted. After mapping the signed error to the corresponding unsigned value 13, the Rice symbol coder is applied in two steps. First, the run length of the unary code is \( 13/2^4 = 0 \), so only the terminating 0 symbol is generated. Second, the remaining \( 13 \mod 2^4 = 13 \) is represented as 4-bit block code. For the fifth sample, the prediction error is larger. The unary code thus yields a run length of \( 50/2^4 = 3 \), and the remaining \( 50 \mod 2^4 = 2 \) is represented as 4-bit block code.

<table>
<thead>
<tr>
<th>( i )</th>
<th>( x_i )</th>
<th>( p_i )</th>
<th>( d_i )</th>
<th>( v_i )</th>
<th>( v_i/2^W )</th>
<th>( v_i \mod 2^W )</th>
<th>( B_W(v_i/2^W) )</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>15</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>00011111</td>
</tr>
<tr>
<td>2</td>
<td>10</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>00001010</td>
</tr>
<tr>
<td>3</td>
<td>22</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>00010110</td>
</tr>
<tr>
<td>4</td>
<td>24</td>
<td>17</td>
<td>(-7)</td>
<td>13</td>
<td>0</td>
<td>13</td>
<td>0</td>
</tr>
<tr>
<td>5</td>
<td>8</td>
<td>33</td>
<td>25</td>
<td>50</td>
<td>3</td>
<td>2</td>
<td>1110</td>
</tr>
</tbody>
</table>

Table 7: Example for DPCM and Rice coder: \( M = 3, a_1 = 0.5, a_2 = 0.3, a_3 = 0.2, W = 4 \)
code. Compared to encoding all 5 samples as 8 bit block code, the DPCM achieves a compression ratio of $37/40 = 92.5\%$.

As the compression quality of the DPCM scheme depends on the prediction accuracy, the prediction parameters should be adapted to the characteristics of the input data stream. In the ADPCM scheme, this adaption is performed on every block of $N$ samples $x_1, \ldots, x_N$ by calculating the auto-correlation values for the block:

$$r_k := \frac{s_k}{N-k} \quad 0 \leq k \leq M$$

with

$$s_k := \sum_{i=k+1}^{N} x_i \cdot x_{i-k} \quad 0 \leq k \leq M$$

These correlation values are then used to build a system of linear equations

$$r_k = \sum_{i=1}^{M} a_i \cdot r_{|i-k|} \quad 1 \leq k \leq M,$$

whose solution results in prediction coefficients that minimize the variance of the prediction error sequence [Sayood2005, Section 11.5.2]. The resulting coefficients must precede the generated bitstream of each block, as the decoder can not perform the coefficient adaption by itself. The coefficients, as well as the first $M$ samples of each block, are encoded by a fixed-length block code.

Resuming the example from Table 7, the correlation values for the $N = 5$ sample block are $r_0 = 290$, $r_1 = 273$, $r_2 = 249$, and $r_3 = 220$. The resulting system of linear equations

$$
\begin{align*}
290 \cdot a_1 + 273 \cdot a_2 + 249 \cdot a_3 &= 273 \\
273 \cdot a_1 + 290 \cdot a_2 + 273 \cdot a_3 &= 249 \\
249 \cdot a_1 + 273 \cdot a_2 + 290 \cdot a_3 &= 220
\end{align*}
$$

is solved to get the optimized prediction coefficient $a_1 = 1.13$, $a_2 = -0.05$, and $a_3 = -0.16$. With these coefficients, the compression ratio for this specific example is improved to $90\%$.

### 5.2.2 Hardware-Accelerated ADPCM

The computational demand of DPCM is small compared to other predictive compression schemes such as the Free Lossless Audio Codec (FLAC) or Apple Lossless Audio Codec (ALAC) discussed in Section 7.1.1. However, the online-adaption of the prediction coefficients may overburden low-power MCUs, especially if multiple sensor channels have to be processed in parallel (see Section 7.2.2). This section thus describes the implementation of a ADPCM hardware kernel for the HaLoMote.

Figure 31 shows the structure of the ADPCM hardware kernel for compression of one sensor channel. The compression module is generic and can be configured for static or adaptive prediction. The gray modules of Figure 31 are used only for the adaptive prediction. The block buffer is realized by on-chip BRAM and is initially filled with a block of $N$ samples. For the static DPCM compression, the block buffer is read
once to pass the uncompressed data through a shift register. After a new sample was inserted into this shift register, the prediction of the next sample is calculated according to Equation 7. By sequentially multiplexing the appropriate samples and coefficients to the Multiply-Accumulate (MACC) module, \( p_i \) is accumulated in \( M \) clock cycles. Afterwards, the prediction error is calculated and passed to the Rice coder to produce the bit sequence described by Equations 9 and 10. This bit sequence is sliced into bytes and written back to the block-buffer (now in compressed form) by a bit-buffer module. The Rice coder and the bit-buffer are running in parallel to the sequential predictor.

By using the same block-buffer for the uncompressed and the compressed data, an in-place compression approach is achieved. This allows for more ADPCM modules to be instantiated in parallel for multi-channel compression, as the BRAM of the AGL1000 FPGA used is limited to just 144 kbit. However, this approach presumes a compression ratio smaller than 100% (i.e., compression actually reduces the data size) for each prefix of each sample block in the sample stream. If this constraint can not be guaranteed, separated input and output buffers have to be used. As the spikiness of many data streams (and thus the volume of post-compression data) is limited by the inertia of the underlying physical or chemical systems, the possibility to perform in-place compression is often useful.

For the adaptive ADPCM compression scheme, an additional coefficient optimization pass precedes the compression pass. During this pass over the block-buffer, the auto-correlation sums \( s_k \) of Equation 12 are accumulated. Again, the time-multiplexed MACC module is supplied with the appropriate operands from the sample shift register \((x_i, \ldots, x_{i-M})\) and the accumulator set \((s_0, \ldots, s_M)\). At the end of the pass, the accumulated auto-correlation sums \( s_k \) are used to generate the linear equation system that has to be solved to retrieve the prediction coefficients \((a_1, \ldots, a_M)\). As a trade-off between prediction accuracy and the time and energy spent to calculate the coefficients and the prediction values, fixed point arithmetic was chosen. The resulting coefficients
are stored in Q4.12 format. The solver supports first-order predictors \( M = 1 \), which simplifies Equation 13 to

\[
a_1 = \frac{r_1}{r_0} = \frac{s_1 \cdot N}{s_0 \cdot N - s_0},
\]

(14)

By restricting the block size \( N \) to a power of two, the single remaining integer division can be performed sequentially in 16 clock cycles. Higher predictor orders could be realized, but were not necessary for the use cases analyzed in Chapter 7.

### 5.2.3 Test Setup for Energy Efficiency Evaluation

The energy efficiency of the ADPCM kernel is evaluated for two concrete applications in Section 7.1 and 7.2. The test setup used to evaluate the energy efficiency of the in-sensor data compression is, however, application independent in principle, and is thus described in this section.

Figure 32 gives an overview of the HaLoMote test setup for multi-channel ADPCM compression. It is based on the inter-processor communication API described in Section 4.3. The raw data stream to be compressed is held inside the ROM of the MCU. The data stream can thus be replayed to simplify the comparison of different ADPCM configurations. The data stream may consist of interleaved samples from multiple sensor channels, that are then distributed to multiple ADPCM kernels by a small WRITE kernel. The compressed data stream generated by the compression modules is sequentialized by the READ kernel and passed back to the MCU, which transmits it wirelessly. The HaLOEWEn radio stack allows parallelizing the radio transmission with the data transfer from the RCU to the MCU. Thus, the MCU duty cycle is not stretched by the inter-processor communication. This is important for the energy efficiency of the system (shorter duty cycles allow longer ultra-low-power sleep phases). Finally, the COMPRESS kernel controls the execution of the compression modules and tracks the number of generated output bytes.

![Figure 32: Test setup with multiple ADPCM hardware kernels](image)
<table>
<thead>
<tr>
<th>Channels</th>
<th>Scheme</th>
<th>BRAM</th>
<th>Core Cells</th>
<th>max Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>DPCM</td>
<td>8 (25%)</td>
<td>1681 (7%)</td>
<td>19.2 MHz</td>
</tr>
<tr>
<td>1</td>
<td>ADPCM</td>
<td>8 (25%)</td>
<td>6728 (27%)</td>
<td>10.5 MHz</td>
</tr>
<tr>
<td>3</td>
<td>DPCM</td>
<td>24 (75%)</td>
<td>4419 (18%)</td>
<td>22.4 MHz</td>
</tr>
<tr>
<td>3</td>
<td>ADPCM</td>
<td>24 (75%)</td>
<td>14088 (57%)</td>
<td>10.7 MHz</td>
</tr>
</tbody>
</table>

Table 8: Synthesis results for the Microsemi IGLOO AGL1000V2 device

Each hardware-accelerated data compression design was synthesized for the Microsemi [IGLOO] AGL1000V2 FPGA using Synplify Pro H-2013.03M-1 with retiming. The compression modules were configured for 16bit samples, 2048 samples per block buffer and first order prediction, as required by both use cases analyzed in Chapter 7. As shown in Table 8, the design is primarily limited by the available memory and restricted to four channels at the given block size. Even so, some area remains for implementing higher order predictors.

To demonstrate the energy efficiency of the hardware-accelerated data compression, the measurement setup shown in Figure 33 is used. The HaLOEWEn3 mote is supplied by an external 3 V voltage source to power its internal components. The MCU drives the RCU into its low-power Flash*Freeze mode, as long as no hardware-accelerated computations are required. For precise time measurements, an external trigger is asserted by the MCU and recorded by an oscilloscope, as long as a compute task is executed. The average current drawn by the system during the task execution is measured by an Agilent [34411A] multimeter, which provides a resolution of 3 µA at a sampling frequency of 50 kHz.

For comparison with the hardware-accelerated encoders described in Section 5.2.2, the DPCM and ADPCM compression schemes were also implemented on the TI CC2531 MCU of the HaLOEWEn3 mote. The energy required for transmitting the uncompressed data stream is used as the baseline measurement. Each transmitted packet carries a maximum of 116 payload bytes, i.e., the maximum payload defined by IEEE [802.15.4].

As the runtime of the compression modules and hence the energy efficiency of the compression scheme is is affected by the nature of the data stream to be compressed, concrete results are presented in Chapter 7.
5.3 Linear Regression

In this section, a resource-efficient formulation of the least-squares Linear Regression (LR) algorithm, as well as its software and hardware implementation is proposed. This approach has already been published in [Engel2015a].

LR can be applied in different WSN scenarios. As shown in Section 6.2, a wireless time synchronization protocol can use a regression-based clock drift estimation to compensate the slightly different oscillator frequencies of all network nodes. In addition, other WSN applications like node Received Signal Strength Indication (RSSI)-based node localization [Vanheel2011], predictive compression [Carvalho2011], or task scheduling [Ravinagarajan2010] have utilized the LR algorithm.

For a list of \( n \) data points \((x_k, y_k)_{k=1}^n\), the slope of the regression line fitting the points according to the least-squares method is given by

\[
m = \frac{\sum_{k=1}^{n} (n \cdot x_k - sx) \cdot (n \cdot y_k - sy)}{\sum_{k=1}^{n} (n \cdot x_k - sx)^2} \quad \text{(15)}
\]

with

\[
sx := \sum_{k=1}^{n} x_k \quad \text{(16)}
\]

and

\[
sy := \sum_{k=1}^{n} y_k \quad \text{(17)}
\]

Directly implementing this formulation leads to a linear runtime complexity with narrow operators, as the dominating multiplications are performed just on differences between sums of data points.

If the LR is applied as a sliding window over a set of data points, i.e., the oldest data point is discarded as soon as a new data point is inserted, the Rolling Linear Regression (RLR) [Neely1966]

\[
m = \frac{n \cdot sx\cdot sy - sx \cdot sy}{n \cdot sxx - sx \cdot sx} \quad \text{(18)}
\]

with

\[
sxy := \sum_{k=1}^{n} x_k \cdot y_k \quad \text{(19)}
\]

and

\[
sxx := \sum_{k=1}^{n} x_k^2 \quad \text{(20)}
\]

can be executed in constant time (see Algorithm 2). However, the dominating arithmetic operator of Equation 18 is the multiplication of two sums of data points, which requires a larger operator width than the operations required for Equation 15. Furthermore, additional memory is required to store the rolling sums \( sx, sy, sxy \) and \( sxx \) for the RLR approach.

To combine the constant runtime complexity of the RLR (Equation 18) with the reduced operator and memory complexity of the simple LR (Equation 15), a novel formulation of the sliding window LR called Rolling Linear Regression with Coordinate Transformation (RLRCT) is proposed in the next section. Although being application independent in principle, the benefits of the RLRCT compared to the RLR are based on the bit width optimization of the underlying operators. Therefore, applica-
tions employing the RLRCT have to specify an upper limit for the step width between subsequent data points:

\[ L := (L_x, L_y) := \left( \max_{k \in \mathbb{N}} |x_{k+1} - x_k|, \max_{k \in \mathbb{N}} |y_{k+1} - y_k| \right) \]  \hspace{1cm} (21)

### 5.3.1 Rolling Linear Regression with Coordinate Transformation

The main idea of the RLRCT is to move the origin of the considered coordinate system to the most recent data point in the sliding window, as shown in Figure 34. Let \((x_k, y_k)_{k=0}^{P}\) be the list of data points at a certain point in time. The coordinate transformation translates these data points into

\[ (\tilde{x}_{p,k}, \tilde{y}_{p,k}) := (x_p - x_k, y_p - y_k) \quad \forall k \leq p. \]  \hspace{1cm} (22)

This translation does not effect the slope calculated by the LR, but it limits the size of the numerical data to be processed to \(n \cdot L\), where \(n\) still is the size of the sliding window. The individual arithmetic operations for the RLRCT can thus be implemented more efficiently than without applying the coordinate transformation.

The following formulas are stated only for the \(x\) values, but the same statements hold for the \(y\) values. To apply the proposed transformation to the rolling sums of the RLR with minimal effort, the RLRCT caches the translation steps

\[ dx_k := x_k - x_{k-1} \quad \forall p - n < k \leq p \]  \hspace{1cm} (23)

between the last \(n\) data points. The absolute coordinates \(\tilde{x}_{p,k}\) can be recovered for \(p - n \leq k \leq p\) by accumulating the translation steps:

\[
\sum_{i=k+1}^{p} dx_i \quad (Eq. 23) = \sum_{i=k+1}^{p} x_i - \sum_{i=k}^{p-1} x_i -1 = x_p + \sum_{i=k+1}^{p-1} x_i - \sum_{i=k}^{p-1} x_i = x_p - x_k \quad (Eq. 22) \equiv \tilde{x}_{p,k} \]  \hspace{1cm} (24)

Figure 34: Movement of sliding window and coordinate system by RLRCT update
When moving the sliding window to the next data point $x_{p+1}$, the coordinate transformation (i.e., translation by $dx_{p+1}$) must be applied to all coordinates, as

$$\bar{x}_{p+1,k} \overset{(Eq. 24)}{=} \sum_{i=k+1}^{p+1} dx_i = dx_{p+1} + \sum_{i=k+1}^{p} dx_i \overset{(Eq. 24)}{=} dx_{p+1} + \bar{x}_{p,k} \quad (25)$$

As RLR does in Equation 18, RLRCT also relies on updating the rolling sum

$$sx_p := \sum_{k=p-n+1}^{p} \bar{x}_{p,k} \quad (26)$$

$$= \bar{x}_{p,p} - \bar{x}_{p,p-n} + \sum_{k=p-n+1}^{p-1} \bar{x}_{p,k} \quad (Eq. 25)$$

$$= -\bar{x}_{p,p-n} + \sum_{k=p-n+1}^{p-1} dx_p + \bar{x}_{p-1,k} \quad (Eq. 26)$$

$$= n \cdot dx_p - \bar{x}_{p,p-n} + sx_{p-1} \quad (27)$$

Instead of explicitly calculating $\bar{x}_{p,p-n}$, another rolling sum is introduced as

$$sdx_p := \bar{x}_{p,p-n} \overset{(Eq. 24)}{=} 0 \sum_{i=p-n+1}^{p} dx_i = dx_p - dx_{p-n} + sdx_{p-1} \quad (28)$$

The sum of coordinate products

$$sxy_p := \sum_{k=p-n+1}^{p} \bar{x}_{p,k} \cdot \bar{y}_{p,k}$$

$$= \sum_{k=p-n+1}^{p} \bar{x}_{p,k} \cdot \bar{y}_{p,k} - \bar{x}_{p,p-n} \cdot \bar{y}_{p,p-n} + \sum_{k=p-n}^{p-1} \bar{x}_{p,k} \cdot \bar{y}_{p,k} \overset{(Eq. 25,28)}{=} -sdx_p \cdot sdy_p + \sum_{k=p-n}^{p-1} (dx_p + \bar{x}_{p-1,k})(dy_p + \bar{y}_{p-1,k})$$

$$\overset{(Eq. 26)}{=} sxy_{p-1} + dx_p \cdot sy_{p-1} + dy_p \cdot sx_{p-1} + n \cdot dx_p \cdot dy_p - sdx_p \cdot sdy_p \quad (29)$$

is not actually handled as rolling sum in RLRCT. Instead, the entire numerator of Equation 18 is accumulated as
num\textsubscript{p} := n \cdot sxy\textsubscript{p} - sx\textsubscript{p} \cdot sy\textsubscript{p} \tag{Eq. 27}
\Rightarrow n \cdot sxy\textsubscript{p} - (n \cdot dx\textsubscript{p} - sx\textsubscript{p} + sx\textsubscript{p-1}) \cdot (n \cdot dy\textsubscript{p} - sdy\textsubscript{p} + sy\textsubscript{p-1}) 
\tag{Eq. 29}
\Rightarrow n \cdot sxy\textsubscript{p-1} - sx\textsubscript{p-1} \cdot sy\textsubscript{p-1} - (n + 1) \cdot sx\textsubscript{p} \cdot sdy\textsubscript{p} + sdx\textsubscript{p} \cdot (n \cdot dy\textsubscript{p} + sy\textsubscript{p-1}) + sdy\textsubscript{p} \cdot (n \cdot dx\textsubscript{p} + sx\textsubscript{p-1}) 
\Rightarrow num\textsubscript{p-1} - (n + 1) \cdot sx\textsubscript{p} \cdot sdy\textsubscript{p} + sdx\textsubscript{p} \cdot sdy\textsubscript{p} + sy\textsubscript{p} + sdy\textsubscript{p} \cdot sx\textsubscript{p} \tag{30}

The denominator of Equation 18 is accumulated accordingly:

den\textsubscript{p} = den\textsubscript{p-1} - (n - 1) \cdot sdx\textsuperscript{2}\textsubscript{p} + 2 \cdot sdx\textsubscript{p} \cdot sx\textsubscript{p} \tag{31}

Algorithm 1 summarizes the RLRCT update procedure. In Line 1 and 2, the translation vector between the old and the new origin of the coordinate system is calculated, before the origin is actually translated in Line 3 and 4. In Line 5, the sliding window is moved by enqueuing the new data point and dequeuing the oldest entry. As no random access to the intermediate data points is required by the RLRCT, the sliding window is implemented as a First In, First Out (FIFO) structure $F$. The remaining Lines 6 to 12 realize the equations derived above to properly apply the translation vector to all rolling sums.

**Algorithm 1**: Update procedure for Rolling Linear Regression with Coordinate Transformation

**Input**: New data point $(x, y)$

**Input**: Previously received data point $(\hat{x}, \hat{y})$

**Input**: Accumulators $sdx$, $sdy$, $sx$, $sy$, $num$, $den$

**Input**: Sliding Window $F$

**Output**: Slope represented as fraction of $num$ and $den$

1. $dx := x - \hat{x}$;
2. $dy := y - \hat{y}$;
3. $\hat{x} := x$;
4. $\hat{y} := y$;
5. $(\hat{dx}, \hat{dy}) \leftarrow F \leftarrow (dx, dy)$; // move window
6. $sdx := dx - \hat{dx}$; // Equation 28
7. $sdy := dy - \hat{dy}$;
8. $sx := n \cdot dx - sdx$; // Equation 27
9. $sy := n \cdot dy - sdy$;
10. $t := (n - 1) \cdot sdx$;
11. $num := sdx \cdot sy + sdy \cdot sx - t \cdot sdy$; // Equation 30
12. $den := sdx \cdot sx \cdot 2 - t \cdot sdx$; // Equation 31
Algorithm 2: Update procedure for Rolling Linear Regression (without Coordinate Transformation)

Input: New data point point \((x, y)\)

Input: Accumulators \(sx, sy, sxy, sxx\)

Input: Sliding Window \(FT\)

Output: Slope represented as fraction of \(num\) and \(den\)

1. \(xy := x \cdot y;\)
2. \(xx := x \cdot x;\)
3. \((\hat{x}, \hat{y}, \hat{xy}, \hat{xx}) \leftarrow F \leftarrow (x, y, xy, xx);\)
   \(//\) move window
4. \(sx := x - \hat{x};\)
5. \(sy := y - \hat{y};\)
6. \(sxy := xy - \hat{xy};\)
7. \(sxx := xx - \hat{xx};\)
8. \(num := n \cdot sxy - sx \cdot sy;\)
   \(//\) Equation 18
9. \(den := n \cdot sxx - sx \cdot sx;\)

Algorithm 1 consists of 15 additions and nine multiplications (four of them with a small constant). As shown in Algorithm 2, it is assumed that the RLR buffers the additional values

\[
(xy_k, xx_k)^p_{k=p-n+1} := (x_k \cdot y_k, x_k \cdot x_k)^p_{k=p-n+1} \quad (32)
\]

for the sliding window to avoid the recomputation of the products required in Line 6 and 7. The RLR requires fewer operations (10 additions, six multiplications) than the RLRCT, but most of the RLRCT arithmetic operates on differences between subsequent data points, instead of absolute data points. If the step width limitation (Equation 21) is significantly smaller than the absolute data points, the bit width of the RLRCT operations can be reduced. As will be shown in Section 6.2.3, this reduction in operator size can amortize the cost of the larger number of operations, thus leading to a more efficient linear regression implementation. Furthermore, the RLRCT requires less memory than the RLR to buffer the sliding window, as only two translation vectors have to be stored per entry instead of two full data points and two data point products.

After execution of the update procedure, the slope of the linear regression line is represented by \(num\) and \(den\). The actual division is not executed by the update procedure, to avoid losing accuracy at that point. Instead of representing \(m\) as a fixed-point variable that can be multiplied with an value \(v\) later on, the slope division is scheduled after this multiplication, i.e., \(v \cdot m = (v \cdot num)/den\).

5.3.2 Software Implementation

To analyze the benefits of the proposed RLRCT, Algorithm 1 and 2 as well as Equation 15, and an optimized version of Equation 15 for \(n = 2\) was implemented on the CC2531 MCU of the HaLOEWEn3.
Listing 1: 8051 assembler for a sample 24 bit $\times$ 16 bit multiplication $x \cdot y = z$ in the directly addressable memory

<table>
<thead>
<tr>
<th>Line</th>
<th>Instruction 1</th>
<th>Instruction 2</th>
<th>Instruction 3</th>
<th>Instruction 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>MOV A, _x+0</td>
<td>MOV B, _y+0</td>
<td>MUL AB</td>
<td>MOV _z+0.A</td>
</tr>
<tr>
<td></td>
<td>MOV R1.B</td>
<td>MOV R1=A</td>
<td></td>
<td>_z[0]=x[0]*y[0]</td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
<td></td>
<td>MOV R1=B</td>
</tr>
<tr>
<td>3</td>
<td>MOV A, _x+1</td>
<td>MOV B, _y+0</td>
<td>MUL AB</td>
<td>MOV _z+1.A</td>
</tr>
<tr>
<td></td>
<td>CLR A</td>
<td>ADDC A, R1</td>
<td></td>
<td>_z[1]=x[1]*y[0]+carry</td>
</tr>
<tr>
<td>4</td>
<td></td>
<td></td>
<td></td>
<td>MOV R1,A</td>
</tr>
<tr>
<td>5</td>
<td>MOV A, _x+2</td>
<td>MOV B, _y+0</td>
<td>MUL AB</td>
<td>MOV _z+2.A</td>
</tr>
<tr>
<td></td>
<td>CLR A</td>
<td>ADDC A, B</td>
<td></td>
<td>_z[2]=x[2]*y[0]+carry</td>
</tr>
<tr>
<td>6</td>
<td></td>
<td></td>
<td></td>
<td>MOV R1,A</td>
</tr>
<tr>
<td>7</td>
<td>MOV A, _x+0</td>
<td>MOV B, _y+1</td>
<td>MUL AB</td>
<td>MOV _z+3.A</td>
</tr>
<tr>
<td></td>
<td>CLR A</td>
<td>ADDC A, B</td>
<td></td>
<td>_z[3]=carry</td>
</tr>
<tr>
<td>8</td>
<td></td>
<td></td>
<td></td>
<td>MOV R1,A</td>
</tr>
<tr>
<td>9</td>
<td></td>
<td></td>
<td></td>
<td>MOV _z+4.A</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>_z[4]=0</td>
</tr>
<tr>
<td>10</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>11</td>
<td>MOV A, _x+0</td>
<td>MOV B, _y+1</td>
<td>MUL AB</td>
<td>MOV _z+1.A</td>
</tr>
<tr>
<td></td>
<td>MOV A, B</td>
<td>ADDC A, _z+1</td>
<td></td>
<td>_z[1]=x[0]*y[1]</td>
</tr>
<tr>
<td>12</td>
<td>CLR A</td>
<td>ADDC A, _z+2</td>
<td></td>
<td>_z[2]=carry</td>
</tr>
<tr>
<td>13</td>
<td>CLR A</td>
<td>ADDC A, _z+3</td>
<td></td>
<td>_z[3]=carry</td>
</tr>
<tr>
<td>14</td>
<td>CLR A</td>
<td>ADDC A, _z+4</td>
<td></td>
<td>_z[4]=carry</td>
</tr>
<tr>
<td>15</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>16</td>
<td>MOV A, _x+1</td>
<td>MOV B, _y+1</td>
<td>MUL AB</td>
<td>MOV _z+2.A</td>
</tr>
<tr>
<td></td>
<td>CLR A</td>
<td>ADDC A, _z+3</td>
<td></td>
<td>_z[2]=x[1]*y[1]</td>
</tr>
<tr>
<td>17</td>
<td>CLR A</td>
<td>ADDC A, _z+4</td>
<td></td>
<td>_z[3]=carry</td>
</tr>
<tr>
<td>18</td>
<td>CLR A</td>
<td>ADDC A, _z+4</td>
<td></td>
<td>_z[4]=carry</td>
</tr>
<tr>
<td>19</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>20</td>
<td>MOV A, _x+2</td>
<td>MOV B, _y+1</td>
<td>MUL AB</td>
<td>MOV _z+3.A</td>
</tr>
<tr>
<td>21</td>
<td>ADD C A, _z+4</td>
<td></td>
<td></td>
<td>_z[4]=carry</td>
</tr>
</tbody>
</table>

The 8051-ISA compliant compilers (e.g., [SDCC], [IAR] or [Keil]) support up to 64 bit integer arithmetic operations with an operator granularity of 8 bit, 16 bit, 32 bit and 64 bit. As the main benefits of the RLRCT algorithm arise from optimized narrowed operator sizes, these four levels of operator sizes are not sufficiently fine-grained. Thus, the required operators (i.e., shift, inversion, accumulation, subtraction, multiplication, division, and movement between memory regions) where implemented in 8051 assembler with up to 96 bit wide representations at a granularity of 8 bit. Operands and results may be constants or arrays allocated in the directly (data) or indirectly (xdata) addressable memory region of the 8051 MCU. All intermediate variables and accumulators are located in the directly addressable memory and the assembler routines operate on that memory without accessing the stack. The implementation is optimized for execution speed at the expense of a larger code memory footprint. An example for a 24 bit $\times$ 16 bit multiplication is shown in Listing 1.

5.3.3 Hardware Implementation

As stated above, the benefits of the RLRCT arise from fine-grained operator size optimization. Furthermore, the data-dependencies between the subsequent RLRCT-operations allow for parallel executions of Lines 1 and 2, 3 and 4, 6 and 7, 8 and 9, as well as 11 and 12 of Algorithm 1. The RLRCT is thus well-suited for hardware acceleration.

Three different architectures were implemented to trade-off execution speed against resource requirements. All architectures buffer the data points (i.e., the sliding window $F$) in BRAM and perform the slope division as sequential shift-subtract steps. All additions and multiplications are executed as single-cycle combinatorial logic. In the Fully Parallel architecture, every arithmetic operation of Algorithm 1 is implemented with dedicated logic. The data dependencies result in a seven cycle data path for the LR. In the Single MAC architecture, a single multiply-accumulate operator is
implemented and the registers buffering the accumulators and intermediate results are sequentially multiplexed to this operator, resulting in a 21 cycle computation for the RLRCT.

Finally, the micro-coded μArchitecture shown in Figure 35 buffers all accumulators next to the regression table in a dual-port BRAM. To efficiently utilize the BRAM, the larger accumulators (i.e., \textit{num} and \textit{den}) occupy two memory locations (i.e., high and low word). Three double word registers (T0, T1, T2) can be fed with data from the BRAM, inputs of the HW kernels (i.e., \(x\), \(y\)), results from previous ALU operations, specific constants, or some derived values required for the slope division. Two registers are selected as operands (A, B) for the ALU, which only supports the required basic operations. Besides computing the sum, difference, product, and absolute difference of A and B, a fraction \(\frac{A}{B}\) can be reduced from double to single word precision by right-shifting the numerator and denominator by an appropriate (data-dependent) width.

As shown in Table 9, the 26 bit wide μInstructions used to control the μArchitecture are grouped into fields configuring the BRAM addressing (i.e., P0A, P1A), multiplexers (i.e., P0D, P1D, T0D, T1H, T1L, T2D, A, B) as well as the ALU operation. Various efforts have been made to keep the μInstructions compact. For example, not all BRAM addresses are available at both ports, not all registers can be mapped to both ALU operands, and the port / register write enable information is integrated into the related multiplexer configurations. To simplify the control logic of the μArchitecture, the latencies resulting from BRAM and register access have to be considered when coding the μInstructions. Furthermore, predication logic (i.e., T0D = T2D = 101) is used to implement sequential (non-performing restoring) divisions without control flow branches. Each write access to the sliding window \(F\) inside the BRAM increments an internal circular counter thus combining the push and pop operation for \(dx\) and \(dy\) as required by Algorithm 1. Each μInstruction is executed in one RCU clock cycle.

Table 10 lists the 28 μInstructions required to implement Algorithm 1. After an initiation interval of two instructions, the ALU is utilized in every instruction. Ten of
Table 9: 26 bit µInstruction controlling the RLRCT µArchitecture

<table>
<thead>
<tr>
<th>Bits</th>
<th>Field</th>
<th>Description</th>
<th>Interpretation</th>
</tr>
</thead>
</table>
| 25 - 22 | P0A   | RAM P0 address | 0000 : $\hat{x}$ \[1000 : rnum = num/2^K \]
|       |       |              | 0001 : $\hat{y}$ \[1001 : rden = den/2^K \]
|       |       |              | 0010 : sdx \[1010 : diff = |rnum - rden| \]
|       |       |              | 0011 : sdy \[1011 : temp \]
|       |       |              | 0100 : sx \[1100 : FIFO F for dx and dy \]
|       |       |              | 0101 : sy \[1101 : unused \]
|       |       |              | 0110 : num low \[1110 : unused \]
|       |       |              | 0111 : den low \[1111 : unused \]
| 21 - 20 | P0D   | RAM P0 source | 00 : no write to P0 \[10 : x \]
|       |       |              | 01 : RL \[11 : y \]
| 19 - 17 | P1A   | RAM P1 address | 000 : num high \[100 : neg = sgn(rnum - rden) \]
|       |       |              | 001 : den high \[101 : unused \]
|       |       |              | 010 : rden \[110 : unused \]
|       |       |              | 011 : temporary \[111 : unused \]
| 16 - 15 | P1D   | RAM P1 source | 00 : no write to P1 \[10 : RL \]
|       |       |              | 01 : RH \[11 : unused \]
| 14 - 12 | T0D   | REG T0 source | 000 : no write to T0 \[100 : R \]
|       |       |              | 001 : x \[101 : R if R ≥ 0 \]
|       |       |              | 010 : y \[110 : RL \]
|       |       |              | 011 : P0 \[111 : unused \]
| 11 | T1H   | REG T1 high source | 0 : RH \[1 : P1 \]
| 10 - 8 | T1L   | REG T1 low source | 000 : no write to T1 \[100 : n - 1 \]
|       |       |              | 001 : RL \[101 : SHR(T1) \]
|       |       |              | 010 : RH \[110 : unused \]
|       |       |              | 011 : P0 \[111 : unused \]
| 7 - 5 | T2D   | REG T2 source | 000 : no write to T2 \[100 : 0 \]
|       |       |              | 001 : R \[101 : T2 \cdot 2^i if R ≥ 0 \]
|       |       |              | 010 : P0 \[110 : 2^i \]
|       |       |              | 011 : n \[111 : unused \]
| 4 | A     | ALU operand A | 0 : T0 \[1 : T1 \]
| 3 | B     | ALU operand B | 0 : T1 \[1 : T2 \]
| 2 - 0 | OP    | ALU opcode   | 000 : $A + B$ \[100 : (A < B, |A - B|) \]
|       |       |              | 001 : $A - B$ \[101 : T1 < 0 ? A + B : A - B \]
|       |       |              | 010 : $A \cdot B$ \[110 : T1 < 0 ? A - B : A + B \]
|       |       |              | 011 : $(A/2^K, B/2^K)$ \[111 : unused \]

the 28 instructions actually utilize both BRAM ports. The numerator \emph{num} and denominator \emph{den} representing the slope of the regression line are produced by the ALU in Instruction 21 and 26. Two additional instructions are used to prepare the multiplication of the slope with a certain value \emph{v} as required for the time synchronization application described in Section 6.2. Instruction 27 reduces the size of the numerator (to \emph{rnum}) and denominator (to \emph{rden}) by a common factor $2^K$ chosen large enough, such that \emph{rnum} and \emph{rden} fit into single words inside the BRAM. Instruction 28 derives
Table 10: µInstruction sequence implementing Algorithm 1 on the µArchitecture

<table>
<thead>
<tr>
<th>#</th>
<th>P0A</th>
<th>P0D</th>
<th>P1A</th>
<th>P1D</th>
<th>T0D</th>
<th>T1H</th>
<th>T1L</th>
<th>T2D</th>
<th>A</th>
<th>B</th>
<th>OP</th>
<th>R</th>
<th>T0</th>
<th>T1</th>
<th>T2</th>
<th>P0</th>
<th>P1</th>
<th>RAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0000</td>
<td>10</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>P0</td>
<td>P0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>0010</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>P0</td>
<td>sdx</td>
<td>x</td>
<td>P0</td>
<td>x</td>
</tr>
<tr>
<td>3</td>
<td>1100</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>P0</td>
<td>T0</td>
<td>T1</td>
<td>P0</td>
<td>R</td>
<td>F</td>
</tr>
<tr>
<td>4</td>
<td>0100</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>10</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T1</td>
<td>R</td>
<td>P0</td>
<td>P0</td>
<td>sxx</td>
</tr>
<tr>
<td>5</td>
<td>0010</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T2</td>
<td>P0</td>
<td>R</td>
<td>sdx</td>
<td>RL</td>
</tr>
<tr>
<td>6</td>
<td>0001</td>
<td>11</td>
<td>00</td>
<td>00</td>
<td>10</td>
<td>00</td>
<td>00</td>
<td>11</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T2</td>
<td>R</td>
<td>n</td>
<td>y</td>
<td>y = y</td>
</tr>
<tr>
<td>7</td>
<td>0011</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T1</td>
<td>T2</td>
<td>P0</td>
<td>R</td>
<td>sdy</td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>0100</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T2</td>
<td>y</td>
<td>P0</td>
<td>sx</td>
<td>RL</td>
</tr>
<tr>
<td>9</td>
<td>1100</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T1</td>
<td>R</td>
<td>F</td>
<td>P0</td>
<td>syy</td>
</tr>
<tr>
<td>10</td>
<td>0101</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T1</td>
<td>T2</td>
<td>R</td>
<td>P0</td>
<td>P0</td>
<td>sdy</td>
</tr>
<tr>
<td>11</td>
<td>0011</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T2</td>
<td>P0</td>
<td>R</td>
<td>sdy</td>
<td>RL</td>
</tr>
<tr>
<td>12</td>
<td>0110</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>10</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T2</td>
<td>R</td>
<td>n</td>
<td>num</td>
<td>num</td>
</tr>
<tr>
<td>13</td>
<td>0010</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T1</td>
<td>T2</td>
<td>(P1, P0)</td>
<td>R</td>
<td>sdx</td>
<td></td>
</tr>
<tr>
<td>14</td>
<td>0101</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T2</td>
<td>R</td>
<td>P0</td>
<td>P0</td>
<td>sy</td>
</tr>
<tr>
<td>15</td>
<td>0000</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T2</td>
<td>R</td>
<td>n</td>
<td>1</td>
<td>sdy</td>
</tr>
<tr>
<td>16</td>
<td>1011</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T1</td>
<td>T2</td>
<td>R</td>
<td>P0</td>
<td>R</td>
<td>temp = RL</td>
</tr>
<tr>
<td>17</td>
<td>0100</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T1</td>
<td>T2</td>
<td>R</td>
<td>sx</td>
<td></td>
<td></td>
</tr>
<tr>
<td>18</td>
<td>0000</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T1</td>
<td>R</td>
<td>P0</td>
<td></td>
<td></td>
</tr>
<tr>
<td>19</td>
<td>1011</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T1</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>20</td>
<td>0110</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T2</td>
<td>R</td>
<td>sdy</td>
<td></td>
<td></td>
</tr>
<tr>
<td>21</td>
<td>0110</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T2</td>
<td>P0</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>22</td>
<td>0011</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>01</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T1</td>
<td>T2</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>23</td>
<td>1011</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T1</td>
<td>T1</td>
<td>R</td>
<td>P0</td>
<td></td>
<td></td>
</tr>
<tr>
<td>24</td>
<td>0000</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T1</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>25</td>
<td>1011</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T1</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>26</td>
<td>0111</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T1</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>27</td>
<td>1001</td>
<td>01</td>
<td>00</td>
<td>01</td>
<td>00</td>
<td>10</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T1</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>28</td>
<td>1010</td>
<td>01</td>
<td>00</td>
<td>01</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>00</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>x</td>
<td>T0</td>
<td>T1</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

For \( v \geq 0 \), the final sequential division can be executed without explicit sign analysis, as both \( v \cdot \text{diff} \) and \( \text{rden} \) are not negative. Furthermore, for slopes near one, i.e., \( \text{rnum} \approx \text{rden} \) as expected in Section 6.2, \( v \cdot \text{diff} \) is much smaller than \( v \cdot \text{rnum} \), thus further simplifying the sequential division.

For the evaluation of the resource and energy efficiency of the software and hardware implementations of the RLRCT, a concrete implementation defining the actual operator sizes is required. These results can be found in Section 6.2.3.
CHAPTER 6
Network Communication Primitives

In this chapter, primitives related to the wireless network communication are presented. Although motivated by the target application described in Section 7.3, the routing and synchronization mechanisms are reusable for similar applications. Parts of this chapter have already been published in [Engel2014b; Engel2015a].

6.1 Transceiver Scheduling for Multi-Source Flooding

Routing protocols have a significant impact on distributed applications relying on multi-hop wireless communication. The number of nodes involved to deliver an information (a data item) from a source to a sink determines the networks overall energy consumption as well as the data throughput. Often, shortest routes can not be easily discovered, or are undesirable (e.g., when balancing communication load equally over all network nodes or to provide robust communication over redundant paths). In general, application-tailored routing protocols can exploit special characteristics to improve throughput and energy consumption.

Flooding information into a network by multi-hop rebroadcasting is simple and robust, but also very inefficient [Ni1999]. As stated in Section 3.2.2, many application-independent improvements of the basic blind flooding have been proposed during the last decade to reduce the number of required rebroadcasts. Even better results can be achieved by exploiting some (limited) knowledge about the communication needs of the concrete applications. Even so, the approach presented here is applicable for a wide spectrum of use cases.

The following assumptions are derived from the SHM application described in Section 7.3, but they hold true for many monitoring scenarios. First of all, a monitored object is either excited or not at any point in time. If relevant information is detected at a certain position on the object, it is quite likely that nearby sensors are also detecting events. Thus, instead of assuming communication originating from a single source node, parallel multi-source information flooding should be supported. Detected events can often be characterized just by a few bits specifying the time and place where a specific activity threshold was exceeded. It can thus be assumed that information from multiple source nodes can be merged into a single message. Furthermore, if the sensor nodes are attached to a structure, the network topology can be assumed to be static (apart from node failures). And finally, as most distributed monitoring application already rely on a highly accurate time synchronization mechanism for correct data gathering (see Section 6.2), an accurate scheduling of data reception and transmission can also be achieved without additional effort.

As the network topology is assumed to be static and the nodes are time-synchronized to each other, a schedule can be derived that selects appropriate nodes as transmitters and receivers in subsequent time slots (cycles). This static schedule must be known to all nodes and is executed periodically, i.e., the source nodes buffer collected information until the restart of the schedule. The main advantage of this scheme is twofold. First, all nodes not scheduled for reception in a certain cycle can keep their radios turned off
as they do not have to listen for potential incoming data. Second, knowledge of the global topology can be used to deliberately aggregate information from specific source nodes.

Due to the spatial and temporal correlation of the sensor nodes, at each restart of the schedule it is most likely that either all source nodes have buffered some information, or that no source node has buffered any information. In the first case, the schedule can be executed as intended. In the second case, the source nodes actually scheduled for transmission will not actually start sending, which must be recognized at the receivers scheduled for listening by imposing a short timeout. These receivers, in turn, cancel their scheduled forwarding transmission, thus propagating the transmission avoidance until the end of the schedule. The energy overhead for an unused schedule is thus restricted to a small number of clear channel assessments at the scheduled receivers. The same mechanism of transmission avoidance must be applied if only a subset of source nodes provides new information at the start of a new schedule. In this case, the resulting flooding routes may not be optimal for the specific subset of source nodes. However, the probability of those cases is small due to the spatial and temporal correlations between the data sources, as stated above. The overall energy efficiency of the flooding scheme will thus not be significantly affected.

The reminder of this section focuses on calculating an optimal s-to-all flooding schedule for a given network topology.

### 6.1.1 Network Model and Cost Function

An application independent network model $M$ is defined as

$$M = (N, N_0, E_I, E_C, h)$$

(33)

$$N \subseteq \mathbb{N}$$

(34)

$$N_0 \subseteq N$$

(35)

$$E_I \subseteq N^2$$

(36)

$$E_C \subseteq E_I$$

(37)

$$h : N^2 \rightarrow \mathbb{N}.$$  

(38)

This model combines two directed graphs $M_I = (N, E_I)$ and $M_C = (N, E_C)$ with a common set of network nodes $N$. $M_C$ represents the networks single-hop connectivity, i.e., node $b \in N$ may receive information directly from node $a \in N$ iff $(a, b) \in E_C$. $M_I$ represents the interference relationship within the wireless network, i.e., the transmission of node $a$ prevents the successful reception of any other information at node $b$ iff $(a, b) \in E_I$. $M_C$ is thus a subgraph of $M_I$. The information-generating source nodes are denoted as $N_0$. The function $h$ determines the minimum number of hops between two nodes. These hop counts $h$ are precomputed by the all-pairs-shortest-path algorithm [Floyd1962]. For clarity, $a$ will be used for nodes acting as transmitters, $b$ for nodes acting as receivers, $s$ for source nodes and $v$ for arbitrary network nodes throughout this section.

The network model $M$ can be generated from the known locations and the transmission and reception characteristics of all network nodes, as shown in Section 6.1.4. A more accurate model accounting for obstacles can be obtained from well known online
topology discovery algorithms (e.g., recursive neighborhood exchange [Dheap2003]), executed once after network deployment. However, detailed topology detection is outside the scope of this thesis and $M$ is thus assumed to be known and static.

A schedule $S$ and an information assignment function $I$ are defined as

$$S : N \times \mathbb{N} \rightarrow \{RX, TX, IDLE\}$$

$$I : N \times \mathbb{N} \rightarrow 2^{N_0}$$

Thus, a schedule $S$ assigns a radio transceiver status $S(v, t)$ to each network node $v \in N$ in every scheduling cycle $t \in \mathbb{N}$. The information assignment function $I(v, t)$ denotes the subset of source nodes that have already forwarded their information to node $v$ before cycle $t$ such that the information is known to $v$ in cycle $t$. Initially, only the source nodes know about their information:

$$I(v, 0) = \begin{cases} \{v\}, & \text{if } v \in N_0 \\ \emptyset, & \text{otherwise} \end{cases}$$

The information is then propagated within each cycle $t$ from any node $a$ scheduled for transmission to any node $b$ within the transmission range of $a$ if $b$ is scheduled for reception and not disturbed by another transmitter $c$:

$$I(b, t + 1) = I(b, t) \cup \begin{cases} I(a, t), & \text{if } (a, b) \in E_C \land S(a, t) = TX \land S(b, t) = RX \land \forall(c, b) \in E_I \setminus \{(a, b)\} : S(c, t) \neq TX \\ \emptyset, & \text{otherwise} \end{cases}$$

The length $L(S)$ of a schedule $S$ is the minimum number of cycles required to flood all source node information to all nodes in the network:

$$L(S) = \min_{t \in \mathbb{N}} \forall v \in N : I(v, t) = N_0$$

The sum of scheduled receptions and transmissions are defined as the primary optimization goal to minimize the networks overall energy consumption $C(S)$ for a schedule $S$:

$$C(S) = \sum_{t=0}^{L(S)-1} |\{v \in N : S(v, t) \neq IDLE\}|$$

Radio transceivers draw about the same power when receiving and transmitting data. Thus, receiving and transmitting the same amount of data consumes about the same amount of energy. The cost function therefore does not distinguish between transmission and reception costs. However, weighting factors could be integrated as needed.

For two schedules of equal cost, the shorter schedule yields smaller latency and increased data throughput. Therefore, the schedule length is the secondary optimization
goal. Definition 1 summarizes the optimization problem to be solved in Section 6.1.2 and 6.1.3.

**Definition 1 (Optimal Flooding Schedule)**

For any network model $M$ find a schedule $S$ of finite length, such that

$$C(S) < C(\hat{S}) \lor \left( C(S) = C(\hat{S}) \land L(S) \leq L(\hat{S}) \right)$$

for any valid schedule $\hat{S}$.

**6.1.2 Integer Linear Program for Optimal Scheduling**

Finding an optimal solution for the problem stated in Definition 1 requires solving multiple set-cover problems, e.g., finding the smallest number of forwarding nodes to cover the remaining, not yet reached nodes. The optimal scheduling problem is thus NP-complete. To find an optimal solution for non-trivial network sizes in a reasonable time, parallel branch-and-bound algorithms have to be utilized. The network model (Equation 33) and the information propagation rules (Equations 41 to 44) are thus translated into an Integer Linear Program (ILP). A commercial ILP-solver is then utilized to compute the optimal schedule (see Section 6.1.4).

The ILP-solver determines integer values for a set of variables. The solution space is restricted by a set of constraints, which are relations between linear combinations of the variables and constant values. Within this solution space, a single objective is minimized. This objective is also a linear combination of the variables.

When working with binary variables, basic operations like disjunctions and conjunctions have to be replaced by the following constraints:

$$y = \bigvee_{i=1}^{n} x_i \iff y \geq x_1 \land \ldots \land y \geq x_n \land y \leq \sum_{i=1}^{n} x_i$$

$$y = \bigwedge_{i=1}^{n} x_i \iff y \geq \sum_{i=1}^{n} x_i - (n - 1) \land n \cdot y \leq \sum_{i=1}^{n} x_i$$

To formulate the ILP for the optimal scheduling, the following binary variables are used:

$$\alpha_{a,t} \iff S(a,t) = TX$$

$$\beta_{b,t} \iff S(b,t) = RX$$

$$\gamma_t \iff \exists v \in N : S(v,t) \neq IDLE$$

$$\delta_{s,v,t} \iff s \in I(v,t)$$

$$\epsilon_{s,a,b,t} \iff \delta_{s,a,t} \land \alpha_{a,t} \land \beta_{b,t}$$

To define the necessary constraints for each schedule cycle $t$, an upper bound $L_{\text{max}}$ for the length of the resulting schedule is required. This upper bound can be derived from the network model, as a naive sequential blind flooding of all information would not require more then $|N| \cdot |N_0|$ cycles. Alternatively, the outcome of the heuristic
scheduling in Section 6.1.3 can be used to define tighter bounds for the schedule length. This is important, as the ILP formulation requires a total number of \( L_{\text{max}} \cdot (2|N| + 1 + |N_0|(|N| + |E_C|)) \) variables.

Let \( T = \{0, \ldots, L_{\text{max}} - 1\} \) be the set of potential schedule cycles. The objective to be minimized is described as

\[
\text{Minimize} \quad L_{\text{max}} \cdot \sum_{t \in T} \sum_{v \in N} (\alpha_{v,t} + \beta_{v,t}) + \sum_{t \in T} \gamma_t
\]

(53)

which satisfies Equation 45. The right sum calculates the number of actually required cycles \( L(S) \). The complete distribution of information tested in Equation 43 is ensured by constraints (Equation 59). The left sum calculates \( C(S) \) according to Equation 44. It is weighted by \( L_{\text{max}} \) to make \( C(S) \) the primary objective.

The following constraints are required to distinguish between active and idle cycles:

\[
\forall t \in T, v \in N : \gamma_t \geq \alpha_{v,t}
\]

(54)

\[
\forall t \in T, v \in N : \gamma_t \geq \beta_{v,t}
\]

(55)

\[
\forall t \in T : \gamma_t \leq \sum_{v \in N} (\alpha_{v,t} + \beta_{v,t})
\]

(56)

\[
\forall t > 0 : \gamma_t \leq \gamma_{t-1}
\]

(57)

According to Equation 46, \( \gamma_t = \bigvee_{v \in N} \alpha_{v,t} \lor \beta_{v,t} \) is realized by Equations 54 to 56. Thus, a cycle is active, if any node is scheduled for transmission or reception in this cycle. Equation 57 moves all idle cycles to the end of \( T \).

The following constraints ensure correct information propagation:

\[
\forall s \in N_0, t \in T : \delta_{s,v,t} = 1
\]

(58)

\[
\forall s \in N_0, v \in N \setminus \{s\} : \delta_{s,v,L_{\text{max}}-1} = 1
\]

(59)

\[
\forall s \in N_0, v \in N \setminus \{s\}, t < h(s, v) : \delta_{s,v,t} = 0
\]

(60)

\[
\forall s \in N_0, b \in N \setminus \{s\}, t \geq h(s, b), (a, b) \in E_C :
\]

\[
\epsilon_{s,a,b,t-1} \geq \delta_{s,a,t-1} + \alpha_{a,t-1} + \beta_{b,t-1} - 2
\]

(61)

\[
3 \cdot \epsilon_{s,a,b,t-1} \leq \delta_{s,a,t-1} + \alpha_{a,t-1} + \beta_{b,t-1}
\]

(62)

\[
\delta_{s,b,t} \geq \epsilon_{s,a,b,t-1}
\]

(63)

\[
\forall s \in N_0, b \in N \setminus \{s\}, t \geq h(s, b) :
\]

\[
\delta_{s,b,t} \geq \delta_{s,b,t-1}
\]

(64)

\[
\delta_{s,b,t} \leq \delta_{s,b,t-1} + \sum_{(a,b) \in E_C} \epsilon_{s,a,b,t-1}
\]

(65)

All source nodes know about their own information at any time (Equation 58). In the last cycle, all nodes must know about all information (Equation 59). The information \( s \) can not reach node \( v \) before cycle \( h(s, v) \) (Equation 60). According to Equation 47, the Equations 61 and 62 realize \( \epsilon_{s,a,b,t-1} = \delta_{s,a,t-1} \land \alpha_{a,t-1} \land \beta_{b,t-1} \). Therefore, an information \( s \) is transmitted from node \( a \) to node \( b \) iff \( a \) knows about \( s \) and is scheduled
interference

source node 0-2-1 source node

\[
\begin{align*}
\alpha_{0,0} & \quad \alpha_{1,1} & \quad \alpha_{2,2} \\
\beta_{2,0} & \quad \beta_{2,1} & \quad \beta_{0,2}, \beta_{1,2} \\
\gamma_{0} & \quad \gamma_{1} & \quad \gamma_{2} \\
\delta_{0,0,0} & \quad \delta_{0,0,1}, \delta_{0,2,1} & \quad \delta_{0,0,2}, \delta_{0,2,2}, \delta_{0,0,3}, \delta_{0,2,3}, \delta_{0,1,3} \\
\delta_{1,1,0} & \quad \delta_{1,1,2}, \delta_{1,2,2} & \quad \delta_{1,1,3}, \delta_{1,2,3}, \delta_{1,0,3} \\
\epsilon_{0,0,2,0} & \quad \epsilon_{1,1,2,1} & \quad \epsilon_{0,2,1,2}, \epsilon_{1,2,0,2}
\end{align*}
\]

Figure 36: Example network with \( N_0 = \{0, 1\}, \ N = N_0 \cup \{2\}, \ E_C = \{(0, 2), (2, 0), (1, 2), (2, 1)\}, \ E_I = E_C \cup \{(0, 1), (1, 0)\} \) and its ILP solution (variables set to 1)

for transmission and \( b \) is scheduled for reception (Equation 52). According to Equation 46, the Equations 63 to 65 realize \( \delta_{s,b,t} = \delta_{s,b,t-1} \lor \bigvee_{(a,b) \in E_C} \epsilon_{s,a,b,t-1} \). Thus, node \( b \) knows about information \( s \) in cycle \( t \) iff it knew about \( s \) in cycle \( t-1 \), or received the information from any node \( a \) in cycle \( t-1 \).

Finally, the following constraints prevent interference and invalid transceiver usage:

\[
\begin{align*}
\forall t \in T, v \in N : \alpha_{v,t} + \beta_{v,t} & \leq 1 \\
\forall t \in T, b \in N : |N| \cdot \beta_{b,t} + \sum_{(a,b) \in E_I} \alpha_{a,t} & \leq |N| + 1
\end{align*}
\]  

Equation 66 ensures that no node is scheduled for transmission and reception in the same cycle. Due to Equation 67, a receiver \( b \) must not hear signals from multiple transmitters. If \( b \) is not scheduled for reception, Equation 67 does not constrain the solution space as the number of potential transmitters can not be larger than \(|N| - 1\).

After passing the Equations 53 to 67 to the ILP solver, the resulting schedule is constructed from the \( \alpha \) and \( \beta \) variables.

Figure 36 provides a small example network. For \( L_{\text{max}} = 4 \), the generated ILP formulation comprises 68 variables and 141 constraints. In the ILP solution, 29 variables are set to 1 representing an optimal schedule \( (0 \rightarrow \{2\}; 1 \rightarrow \{2\}; 2 \rightarrow \{0, 1\}) \).

### 6.1.3 Heuristic Scheduling

As discussed in Section 6.1.2, finding an optimal schedule is NP-complete and the ILP-solvers will thus not be able to find solutions for larger networks in an acceptable time. To this end, an heuristic algorithm is proposed to find good (see Section 6.1.4 for an evaluation), but not necessarily optimal solutions for larger networks.

The main idea of Algorithm 3 is to select the set of transmitting nodes in each scheduling cycle that transfers the maximum amount of new information to their neighboring nodes. If called with an enabled \( \text{collect} \) flag, the algorithm tries to aggregate the information of all source nodes at a central collector node \( v_{\text{collect}} \) before
Algorithm 3: Heuristic scheduling for multi-source flooding

**Input:** Graph model $M = (N, N_0, E_I, E_C, h)$

**Input:** collect flag

**Output:** Schedule $S$

1. $t := 0$;
2. $S(v, t) := IDLE \quad \forall v \in N$;
3. $I(v, t) := \emptyset \quad \forall v \in N$;
4. $I(s, t) := \{s\} \quad \forall s \in N_0$;
5. if $collect$ then
   6. \[
   d := \min_{v \in N} \sum_{s \in N_0} h(s, v);
   \]
   7. \[
   D := \{v \in N : \sum_{s \in N_0} h(s, v) = d\};
   \]
   8. \[
   a := \max_{a \in D} |\{(a, b) \in E_C\}|;
   \]
   9. \[
   v_{collect} := a \in D : |\{(a, b) \in E_C\}| = d;
   \]
10. while $\exists v \in N : I(v, t) \neq N_0$ do
11. \[
   T := \{v \in N : \exists (a, b) \in E_C : I(a, t) \not\subseteq I(b, t)\};
   \]
12. \[
   t := \max_{v \in T} |I(v, t)|;
   \]
13. \[
   T := \{v \in T : |I(v, t)| = d\};
   \]
14. \[
   w_{\text{max}} := -\infty;
   \]
15. foreach $P \in 2^T$ do
16. \[
   E := \emptyset;
   \]
17. foreach $b \in N \setminus P$ do
18. \[
   D := \{a \in P : (a, b) \in E_I\};
   \]
19. if $|D| \neq 1$ then next;
20. if $(a, b) \not\in E_C$ then next;
21. if $I(a, t) \subseteq I(b, t)$ then next;
22. \[
   E := E \cup \{(a, b)\};
   \]
23. if $collect \land I(v_{collect}, t) \neq N_0$ then
24. \[
   w := 0;
   \]
25. foreach $s \in N_0$ do
26. \[
   D := \{v \in N : s \in I(v, t)\} \cup \{b \in N : \exists (a, b) \in E : s \in I(a, t)\};
   \]
27. \[
   w := w - \min_{v \in D} h(v, v_{collect});
   \]
28. else
29. \[
   w := \sum_{(a, b) \in E} |I(a, t) \setminus I(b, t)|;
   \]
30. if $w > w_{\text{max}}$ then
31. \[
   w_{\text{max}} := w;
   \]
32. \[
   E_{\text{max}} := E;
   \]
33. $I(v, t + 1) := I(v, t) \quad \forall v \in N$;
34. foreach $(a, b) \in E_{\text{max}}$ do
35. \[
   S(a, t) := TX;
   \]
36. \[
   S(b, t) := RX;
   \]
37. \[
   I(b, t + 1) := I(b, t) \cup I(a, t);
   \]
38. \[
   t := t + 1;
   \]
39. \[
   S(v, t) := IDLE \quad \forall v \in N;
   \]
40. postProcessing($M, S, I$);
flooding the whole network. Not all network topologies benefit from this aggregation. Thus the algorithm should be run twice (with and without the \textit{collect} flag) to obtain best results.

In Lines 1 to 4, the cycle counter $t$, the schedule $S$ and the information assignment $I$ are initialized. In Lines 6 to 9, the central node $v_{\text{collect}}$ for the optional \textit{collect} mode is selected as the node with minimal distance to all source nodes and maximum number of reachable neighbors. Each execution of the while-loop in Line 10 generates another schedule cycle, until all information is distributed to all nodes. In Line 11, a list $T$ of potential transmitting nodes is selected. To keep the subsequent search-space as small as possible, $T$ is restricted to those nodes carrying the most information (Line 13). Due to possible mutual interferences, it is not always the best solution to activate all selected transmitters in the same cycle. Therefore, the subset $P$ of $T$ actually maximizing a performance metric $w_{\text{max}}$ is searched by the loop in Line 15. The performance metric varies depending on operating mode (\textit{collect true} or false), and over the course of the search (see below).

In Lines 17 to 22, a list of edges $E$ from the transmitters $P$ to appropriate receivers is determined. A node $b$ is an appropriate receiver if it is not included in the transmitters list (Line 17), if it is influenced by exactly one transmitter $a \in P$ (Line 19) and if it can actually receive additional information from $a$ (Lines 20, 21).

In non-greedy mode (or in greedy mode, but with all information already aggregated at the collector node $v_{\text{collect}}$), the algorithm aims to maximize the overall appearance of new (previously unknown) information at receiver nodes. This computation is performed in Line 30 for the currently examined subset $P$. In greedy mode, the heuristic initially drives all information towards the collector node. Thus, until all information has arrived there ($I(v_{\text{collect}}, t) \neq N_0$), the current subset $P$ is rated with regard to the minimal hop distance of each information $s$ to the collector node $v_{\text{collect}}$, summed in the loop at Line 25 across all $s$. Once all information has arrived at $v_{\text{collect}}$ ($I(v_{\text{collect}}, t) = N_0$), the quality metric falls back to aiming for maximum information distribution.

The best-so-far solution is maintained in $w_{\text{max}}$ and $E_{\text{max}}$ (Line 31). This solution is used to schedule the nodes in the current cycle and expand the information assignment for the next cycle (Lines 34 to 40).

The resulting schedule is further optimized in a post-processing step (Line 41) by eliminating redundant data transfers. A data transfer from node $a$ to node $b$ in cycle $t$ is redundant, if it is dominated by a later data transfer from node $c$ to node $b$ in cycle $k > t$, i.e., $\Delta I := I(b, t + 1) \setminus I(b, t) \subseteq I(c, k)$ and $b$ is not scheduled for transmission between cycles $t$ and $k$. In this case, $S(b, t)$ is set to IDLE and $I(b, i)$ is reduced by $\Delta I$ for $t < i \leq k$. Afterwards, the transmission $S(a, t)$ may have become superfluous if $b$ (now set to IDLE) was its only receiver in cycle $t$. Then, $S(a, t)$ is set to IDLE. This may result in a completely idle schedule cycle $t$, which has to be removed. Furthermore, removing a transmission for node $a$ may make another reception of $a$ redundant. Thus, these steps are performed repeatedly until no more redundant transfers can be found.

6.1.4 Evaluation

The difficulty to find an optimal schedule heavily depends on the network topology. In sparse networks, the number of possible information routes to choose from is smaller than in dense networks. However, in totally connected graphs the optimal solution is
stranfoward. To evaluate the proposed scheduling algorithms on a large number of topologies, random test-cases are generated. For each test-case, a number of $|N|$ nodes is randomly placed in a two dimensional plane of a certain size $l^2$ by assigning $x, y : N \rightarrow [0, l]$. Without loss of generality, the first $|N_0|$ nodes are selected as the source nodes. Two thresholds $d_C, d_I \in \mathbb{R}$ are used to derive the network-topology from the euclidean distance between two nodes $(a, b) \in N^2$:

\begin{align}
  d(a, b) &= \sqrt{(x(a) - x(b))^2 + (y(a) - y(b))^2} \\
  (a, b) \in E_C \iff d(a, b) \leq d_C \\
  (a, b) \in E_I \iff d(a, b) \leq d_I
\end{align}

The inference and connectivity thresholds are derived from the 2-ray-ground radio propagation model [Rappaport2001]:

\begin{equation}
  \frac{P_{RX}}{P_{TX}} = \frac{G_{TX}G_{RX}}{L} \frac{h_{TX}^2 h_{RX}^2}{d(TX, RX)^4}
\end{equation}

where $P_{TX}$ is the transmitted signal power, $P_{RX}$ is the received signal power, $d(TX, RX)$ is the distance between sender and receiver, $h_{TX}$ and $h_{RX}$ are the vertical height of the transmitter and receiver antennas over ground. The system loss factor $L$ and the antenna gains $G_{TX}, G_{RX}$ are usually disregarded.

As a practical example, the TI [CC2530] RF-SoC of the HaLOEWE:n3 provides a maximum $P_{TX}$ of 4.5 dBm (2.82 mW) and may successfully receive $-82$ dBm (6.31 pW) signals while being subject to another $-85$ dBm (3.16 pW) noise signal (co-channel rejection). With an assumed antenna height of 0.24 m, the connectivity and interference ranges are calculated as

\begin{align}
  d_C &= 0.24 \, \text{m} \cdot \sqrt{\frac{2.82 \, \text{mW}}{6.31 \, \text{pW}}} = 35 \, \text{m} \\
  d_I &= 0.24 \, \text{m} \cdot \sqrt{\frac{2.82 \, \text{mW}}{3.16 \, \text{pW}}} = 41 \, \text{m}
\end{align}

With these fixed settings, the density of the network topology can be adjusted by the number of network nodes and the size $l^2$ of the deployment area.

To evaluate the efficiency of the proposed optimal and heuristic scheduling algorithm, the resulting cost for various network configurations is compared against traditional and advanced flooding algorithms. These reference implementations share some common characteristics. In each cycle, a set of potential forwarding nodes $F \subseteq N$ is selected by a protocol specific mechanism (see below). Out of these, a subset $T \subseteq F$ is selected by CSMA collision avoidance and a hidden terminal detection, i.e., $\exists a, \hat{a} \in T : (a, \hat{a}) \in E_I \lor (\exists b \in N : (a, b) \in E_C \land (\hat{a}, b) \in E_I)$. All non-transmitting nodes in the interference range of any transmitter are counted as receivers $R = b \in N \setminus T : \exists a \in T : (a, b) \in E_I$. All other nodes will turn off their radio after clear channel assessment and thus do not consume a considerable amount of energy when compared to actually receiving data.
The main difference between the various reference implementations is the selection of the potential forwarding nodes $F$. In blind flooding, all nodes that have received, but not yet forwarded an information are included in $F$. This does not require any knowledge about the network topology. In contrast, the self-pruning ($sp$) [Lim2001], dominant-pruning ($dp$) [Lim2001], partial dominant-pruning ($pdp$) [Lou2002], total dominant-pruning ($tdp$) [Lou2002] and the history-based 2-hop dominant-pruning ($h2dp$) [Agathos2011] try to reduce $F$ based on knowledge about the one- or two-hop neighborhood of each node. Please refer to the original work for further details.

For this evaluation, CPLEX 12.5.1 is used as ILP solver running up to 8 threads. The ILP formulation is passed to CPLEX in the LP file format. The solver is supplied with an upper bound for the objective function derived from a preceding run of the scheduling heuristic. All computations are executed on a compute server equipped with 128 GB RAM and 32 AMD Opteron 6134 CPUs running at 2.3 GHz.

For networks of $|N| = 20$ nodes, 10 network configurations are considered by varying the number of source nodes $|N_0| \in \{1, \ldots, 5\}$ and the size of the deployment area $l^2 \in \{100 \text{ m} \times 100 \text{ m}, 150 \text{ m} \times 150 \text{ m}\}$. For each configuration, 50 networks are generated at random (within the given bounds). For each network, the various algorithms (flooding, pruning, heuristic and ILP) are applied to generate a schedule. To simplify comparison, the schedule costs (Equation 44) of the enhanced algorithms are normalized to the outcome of the naive blind flooding. The resulting relative costs are averaged over the 50 random test-cases per network configuration.

Figure 37 shows the results for the larger deployment area (i.e., the sparse topology). For a single source node, the existing pruning algorithms reduce the communication costs by up to 35% compared to the blind flooding baseline costs. These pruning algorithms assume a certain preferred direction of information expanding from a single source throughout the network. With a growing number of source nodes, this assumption is no longer valid, resulting in the degraded improvement of less than 30% cost reduction for five source nodes. In contrast, the novel heuristic and ILP scheduler are designed to manage multiple source nodes interfering with each other. Thus, their performance advantage over blind flooding increases with the number of source nodes up to 83% for five source nodes.

![Figure 37: Averaged scheduling costs relative to blind flooding for 20 nodes in an 150 m × 150 m deployment area](image-url)
Similar observations can be made in Figure 38. Here, the smaller deployment area results in denser networks. The number of redundant transmissions and receptions performed in blind flooding thus increases and the relative cost of all other algorithms decreases. Furthermore, the knowledge of the pruning algorithms about the two-hop neighborhood becomes more valuable, as it now covers a larger part of the network. The pruning algorithms achieve up to 55% cost reduction for a single source node, while the heuristic and optimal schedules computed by the novel methods achieve up to 90% cost reduction for five source nodes.

Figure 39 examines the heuristic schedule costs (i.e., the required number of transmissions/receptions) normalized to the optimal schedule costs. The percentiles describe the range of heuristic schedule costs achieved by a certain percentage of the randomly generated topologies. For example, the 60% to 95% percentile of the 150 m × 150 m
Figure 40: Averaged runtime of heuristic relative to runtime of ILP-solver (both executed on the same compute server)

topologies with 4 source nodes ranges from 102.5% to 107.7%. That is, for 60% out of the 50 test topologies, the heuristic required not more than 102.5% of the minimum achievable transmissions and receptions, while 95% out of the 50 test topologies required not more the 107.7% of the minimum schedule cost. For networks with up to three source nodes, the average heuristic results are less than 1% worse compared to the optimal results. Furthermore, the heuristic found an optimal solution in at least 60% of the generated test cases with less than four source nodes. With an increasing number of source nodes, the gap between the heuristic and the ILP solutions becomes larger, but does not exceed 4% on average for five source nodes. This slight quality loss is the price for the highly accelerated computation, detailed in Figure 40. For four sources in the dense topology, the heuristic (run in two passes, with collect both enabled and disabled) executes in just 0.1% of the time of the ILP solver. A speedup of more than 1000× can be observed for networks with five source nodes. Note that

Figure 41: Averaged schedule length relative to blind flooding for 20 nodes in an 150 m × 150 m deployment area
all execution times are measured as pure wall-clock time, not counting the eight-core processing of the ILP solver as separate execution times.

The length of a schedule is the secondary optimization target defined by Equation 45. Figures 41 and 42 detail the average length of the schedules determined by the considered flooding schemes. As for the transceiver activity, the proposed ILP and heuristics clearly outperform the conventional pruning algorithms, especially for an increasing number of source nodes and denser networks.

### 6.2 Precise Wireless Time Synchronization

Time synchronization is required in WSN applications mainly for two reasons. First, data sampled from spatially distributed sensors cannot be properly interpreted without knowledge of the exact sampling time. The acceptable uncertainty is typically a small percentage of the application’s sampling period, which may range from several seconds in environmental monitoring [Lazarescu2013] down to several milliseconds in vibration-based structural health monitoring (Section 7.3), or even shorter in acoustic localization applications [Astapov2014]. Thus, synchronization protocols with an accuracy of a few microseconds are required for the more demanding applications.

A second reason for precise time synchronization derives from the large power consumption of the radio transceiver when idly waiting for incoming messages. If the sender and receiver are synchronized, the radio protocol can define short periods of time in which transmissions can be initiated and received. Outside these windows, the power-hungry transceivers can be shut down.

#### 6.2.1 Necessity of Clock Drift Compensation

The local time at a sensor node $a$ at a certain absolute time $t$ is represented as a timestamp $t_a(t) := f_a \cdot (t - t_{a,0})$, which is the value of an oscillator driven counter with oscillation frequency $f_a$ and start-up time $t_{a,0}$. If two nodes $a$ and $b$ exchanged...
their timestamps \((l_a(t_e), l_b(t_e))\) at a time \(t_e\), then node \(b\) can calculate the timestamp of node \(a\) at any subsequent time \(t_s > t_e\) as

\[
\begin{align*}
l_a(t_s) &= l_a(t_e) + f_a \cdot (t_s - t_e) \\
&= l_a(t_e) + \frac{f_a}{f_b} (l_b(t_s) - l_b(t_e)) \\
&= l_b(t_s) + l_a(t_e) - l_b(t_e) + \frac{f_a - f_b}{f_b} (l_b(t_s) - l_b(t_e)) \\
&= l_b(t_s) + l_a(t_e) - l_b(t_e) + \left( f_a - f_b \right) \cdot (t_s - t_e)
\end{align*}
\]

Even if both nodes run at the same nominal frequency, the temperature and voltage dependency of the oscillators result in small relative frequency deviations \(\frac{f_a - f_b}{f_b}\), typically in the range of a few parts per million (ppm), as shown in Figure 44. Manufacturers specify the maximum frequency uncertainty over the entire operating temperature range at even higher values (e.g., ±40 ppm for the TI CC2530). Without explicit drift compensation, the timestamps used for offset compensation have to be exchanged at a rate of at least \(\frac{f_a - f_b}{\Delta t \cdot f_b}\) to keep the synchronization uncertainty below \(\Delta t\). Even if the timestamp exchange can be piggybacked onto the application’s regular communication payloads, the remaining additional communication effort is still undesirable for small \(\Delta t\). The next section therefore shows how to efficiently integrate the clock drift compensation into a wireless synchronization protocol.

### 6.2.2 Organization of Timestamp Exchange and Clock Drift Estimation

As in the FTSP [Maroti2004], a single node \(r\) is selected as time reference, i.e., its local time \(l_r\) represents the global time \(g\) all other nodes try to synchronize with. More precisely, in addition to its local time \(l_a\), every node \(a\) derives an assumption \(g_a\) about the global time. Under perfect synchronization, \(g(t) = g_a(t)\) is achieved for any time \(t\) and all nodes \(a\). As this section focuses on the LR-based drift compensation, dynamic reselection of the reference node, as it is described by [Maroti2004], is not considered here further.

The synchronization layer should be inserted between the MAC and the network layer to ensure that all nodes get synchronized, even if they are just routing packets. This violates the compliance with standardized protocols like [ZigBee]. However, interoperability with COTS [ZigBee] devices is not the primary design goal of the HaLOEWEEN mote. The synchronization layer adds up to 6 B to the radio packet. The first byte indicates whether or not the sender has already been synchronized to the reference, i.e., it is the reference or it received a sufficient number of timestamps to perform the offset and drift compensation between its local clock and the reference clock, as described below. If the sender \(a\) is synchronized, a 5 B timestamp \(g_a(t_{TX})\) is appended to represent the senders assumption of the global time of the current start of frame delimiter (SFD) transmission.

To calculate this timestamp, the sender \(a\) first inserts the PHY and the MAC header and the first byte of the synchronization header into the radio buffer and initiates the transmission. After the SFD was transmitted at time \(t_{TX}\), the sender calculates \(g_a(t_{TX})\) from the captured local time \(l_a(t_{TX})\). At the reference node, the local time is just
Algorithm 4: Conversion from local to global time

**Input:** Local timestamp \( l \)

**Input:** Last exchanged synchronization point \((\hat{g}, \hat{l})\)

**Input:** Clock drift represented as \(num\) and \(den\)

**Output:** Global timestamp \( g \) corresponding to \( l \)

\[
\begin{align*}
  t & := l - \hat{l}; \\
  g & := \hat{g} + t + (t \cdot (num - den))/den; \quad \text{// Equation 74}
\end{align*}
\]

This algorithm realizes Equation 74 based on the outcome of the linear regression (Section 5.3) applied to the last exchanged synchronization points. More specifically, the ratio between \(num\) and \(den\) equals the ratio between the oscillator frequencies of the reference node and the local node, i.e., \(num/den = f_r/f_a\). To avoid a radio buffer underrun, Algorithm 4 must be finished before the previous header data was transmitted (i.e., 8 B/250 kbit/s = 256 \(\mu\)s).

Upon reception of the SFD of a radio packet by node \(b\) at time \(t_{RX}\), the receiver timer captures the local timestamp \(l_b(t_{RX})\). If the sender \(a\) was synchronized to the reference, the global timestamp \(g_a(t_{TX})\) is included in the synchronization header of the packet. Node \(b\) than is provided with its local and a global timestamp nearly generated at the same time. The remaining difference between \(t_{RX}\) and \(t_{TX}\) results from the propagation delay between sender and receiver (i.e., 3.3 ns/m) and should not be significant for node distances of a few meters. In practice, however, \(t_{RX} - t_{TX} = (3.51 \pm 0.04) \mu\)s was measured by observing the SFD sampling events automatically generated by the transceiver hardware with an oscilloscope. A similar systemic offset was observed by [Akhlaq2013]. It is compensated for by subtracting \(\Delta_{SFD} := 3.51 \mu\)s \cdot 32 MHz = 112 from \(l_b(t_{RX})\), where 32 MHz is the timer frequency of the HaLOEWEn3 motes. After this error compensation, the receiver \(b\) uses the global and the local timestamp as new data point for the sliding window linear regression to update its regression slope with Algorithm 1.

In this manner, the synchronization information is flooded over multiple hops into the network. To perform a certain action (e.g., sensor sampling) simultaneously, all synchronized nodes must agree on a certain global time, convert this time into their local time by applying Algorithm 5, and configure their timers to generate an interrupt at the calculated local time.

Algorithm 5: Conversion from global to local time

**Input:** Global timestamp \( g \)

**Input:** Last exchanged synchronization point \((\hat{g}, \hat{l})\)

**Input:** Clock drift represented as \(num\) and \(den\)

**Output:** Local timestamp \( l \) corresponding to \( g \)

\[
\begin{align*}
  t & := g - \hat{g}; \\
  l & := \hat{l} + t - (t \cdot (num - den))/num; \quad \text{// Equation 74}
\end{align*}
\]
6.2.3 Evaluation

In this section the synchronization accuracy of the protocol described in Section 6.2.2 is analyzed for different configurations of the synchronization period and the regression table size. The synchronization accuracy is affected by hardware details of the utilized TI CC2531 radio transceiver (e.g., oscillator stability, timestamp accuracy and radio-triggered timestamp capturing) as well as the network topology and other environmental conditions (e.g., temperature stability). After restricting the design space to reasonable configurations for the synchronization period and the regression table size, the resource requirements of the various LR implementations are compared against each other.

The network setup shown in Figure 43 was used to evaluate the achievable synchronization accuracy. Five receiver nodes are located in the broadcast range of a gateway node. The latter dumps measurement results to the PC over a serial connection and acts as the time reference in the synchronization protocol. All nodes are not more than 1 m apart from each other, so the propagation delay is smaller than 3 ns and can be ignored. Note that the synchronized nodes are required to be within the broadcast range of the gateway only to simplify the test of the synchronization accuracy. The synchronization protocol itself relies on exchanging timestamps along the linear multi-hop chain and does not require any broadcasts.

Once per second, the gateway initiates a new measurement consisting of two phases. The test phase is started by a broadcast from the gateway to all other nodes (Message 1 in Figure 43a). The gateway captures the broadcast transmission time \( g(t_{TX}) \). Each receiving node \( i \) captures its local broadcast reception time \( l_i(t_{RX}) \), derives the local broadcast transmission time as \( l_i(t_{TX}) \approx l_i(t_{RX}) - \Delta_{SFD} \), and calculates the corresponding assumed global time \( g_i(t_{TX}) \) using offset and drift compensation. These timestamps are reported back to the gateway in a linear chain starting at node 5 (Messages 2 to 6 in Figure 43a). The actual synchronization accuracy \( A_i \) and the actual clock drift \( D_i \) of node \( i \) relative to the reference are derived from the received timestamps as

\[
A_i(t_{TX}) := g_i(t_{TX}) - g(t_{TX})
\]
\[
D_i(t_{TX}) := \frac{g(t_{TX}) - g(t_{TX} - 1 \text{s})}{l_i(t_{TX}) - l_i(t_{TX} - 1 \text{s})} - 1
\]

where \( t_{TX} - 1 \text{s} \) denotes the broadcast event of the previous test phase.

![Figure 43: Network setup for timestamp capturing](image-url)
Immediately after the test phase, the update phase is started by the gateway transmitting a unicast request (Message 7 in Figure 43b) to node 1, which is forwarded in a linear chain (Message 8 to 11) to node 5. Each node captures its local request reception time and the subsequent local request transmission time and reports them back to the gateway (Message 12 to 16). The actual synchronization update (i.e., insertion of the exchanged timestamps into the regression table and update of the slope parameters) is only performed in every tenth update phase. Thus, the effective synchronization period of the whole network is 10 s. However, the timestamps captured during the update phase allow for an offline simulation of the whole synchronization protocol. This simplifies the analysis of the impact of the regression table size and the synchronization period on the synchronization accuracy. All data presented below for a regression table size other than two, or a synchronization period other than 10 s, result from these simulations.

The left side of Figure 44 shows the clock drift of the five receiver nodes relative to the reference clock. For the first 55 min, all nodes were kept at a fixed temperature of about 21°C. Under this condition, the standard deviation of the drift amounts to 0.1 ppm at all nodes. Afterwards, node 4 was cooled down to 7°C resulting in a drift drop of about 3 ppm. In the same time, node 3 was heated up to 52°C before cooling down to 40°C again. This resulted in a drift step of 4.5 ppm and 2.5 ppm respectively. The right side of Figure 44 aggregates the clock drift characteristic into a histogram for each receiver node. The insights gained from this measurement are twofold. First, without clock drift compensation, node 1 would have to exchange timestamps with the reference node at least every 200 ms to keep the synchronization inaccuracy below 1 µs. Thus, high precision time synchronization has to provide drift compensation to keep the communication overhead manageable. Second, even if the sensor nodes in typical WSN applications will not be exposed to sudden large temperature changes in an outdoor scenario, a significant variation of the clock drift can be expected if the subset of nodes, that are exposed to full sunlight, is changing.

Figure 44: Observed clock drift relative to gateway under temperature variation (0 min to 55 min: 21°C, 55 min to 80 min: 7°C at N4, 55 min to 65 min: 52°C at N3, 65 min to 75 min: 40°C at N3)
The influence of the node’s supply voltage on the clock drift was also investigated as the supply of battery-powered sensor nodes cannot be assumed to be stable. However, due to the voltage converters inside the MCU, the reduction of the supply voltage from 3 V to 2 V did not significantly influence the clock drift at the corresponding node.

Figure 45 shows the actual clock drift of node 1 relative to the reference (captured at 1 Hz as described above) and its estimation about its own clock drift derived by LR with a synchronization period ranging from 10 s to 100 s and a fixed regression table size of two. As expected, the estimation becomes more inaccurate with longer synchronization periods. Figure 46 shows how this inaccuracy in drift estimation translates into synchronization inaccuracy. The maximum absolute clock deviations for the different synchronization periods are 0.7 µs, 2.4 µs, 5.7 µs and 8.3 µs respectively. In general, the
appropriate synchronization period depends on the concrete accuracy demands of the application.

Figure 47 shows the actual clock drift of node 1 relative to the reference and its estimation about its own clock drift derived by LR with a regression table size ranging from 2 to 8 elements at a fixed synchronization period of 10 s. Larger regression tables have an effect similar to the smoothing of the drift estimation occurring for larger synchronization periods. Indeed, larger regression tables actually do not improve the average synchronization accuracy, as shown in Figure 48. However, even for a table size of 8, the maximum absolute clock deviation does not exceed 1 µs.

The real benefit of larger regression tables becomes obvious only if the actual clock drift is more spiky, e.g., due to temporary temperature fluctuations, as shown in Figure 49. While the average clock deviation is not improved by the larger regression tables
as shown in Figure 50, the maximum absolute error is reduced from 2.8 µs to 1.7 µs when choosing a table size of 8 instead of 2.

Finally, the synchronization accuracy over multiple hops is shown in Figure 51a. Even at the fifth hop, the maximum absolute clock deviation can be kept below 1 µs. However, the average deviation is increased by about 30 ns per hop. This might be caused by the fixed compensation time for the SFD delay as described in Section 6.2.2. The accuracy of this compensation is limited by the timestamp resolution of $1/32 \text{MHz} = 31 \text{ ns}$. When simulating the synchronization with $\Delta_{SFD} = 111.5$, the mean synchronization error is kept stable over multiple hops as shown in Figure 51b. Although not yet implemented in the HaLOEWEn system, this error compensation with sub-timestamp precision can be realized by a pulse-wide modulated correction.
term, i.e., toggling between $\Delta_{SFD} = 111$ and $\Delta_{SFD} = 112$ with each synchronization period.

As shown in Figure 46, the synchronization period should not exceed 10 s to achieve a synchronization accuracy of 1 $\mu$s. At a timestamp resolution of 1/32 MHz, the differences between successive timestamps inserted into the RLRCT (Algorithm 1) can thus be represented with 29 bit. Furthermore, regression tables with more than 8 entries do not improve the synchronization accuracy. More precisely, a regression table size of 4 is advisable as a trade-off between the achievable accuracy during stable environmental conditions (Figure 48) and the outlier reduction during dynamic temperature variations (Figure 50).

The arithmetic operators required for the various LR implementations described in Section 5.3 were optimized based on these assumptions (i.e., 8 entries per regression table, 29 bit sufficient to represent the difference between successive timestamps) and implemented on the HaLOEWEn3. The three different hardware accelerator architectures were synthesized for the Microsemi IGLOO M1AGL1000 FPGA with Synplify Pro (ME I-2014.03M-SP1). The resulting resource requirements are summarized in Table 11. The Full Parallel architecture does not fit on the target device, so its us-

<table>
<thead>
<tr>
<th>Implementation</th>
<th>Fully Parallel</th>
<th>Single MAC</th>
<th>$\mu$Architecture</th>
</tr>
</thead>
<tbody>
<tr>
<td>Core Cells [%]</td>
<td>141</td>
<td>63</td>
<td>50</td>
</tr>
<tr>
<td>BRAM [%]</td>
<td>12</td>
<td>6</td>
<td>15</td>
</tr>
<tr>
<td>$f_{\text{max}}$ [MHz]</td>
<td>8.7</td>
<td>7.1</td>
<td>8.5</td>
</tr>
<tr>
<td>Regression [cycles ($\mu$s)]</td>
<td>7 (0.8)</td>
<td>21 (3.0)</td>
<td>29 (3.4)</td>
</tr>
<tr>
<td>Conversion [cycles ($\mu$s)]</td>
<td>27 (3.1)</td>
<td>29 (4.1)</td>
<td>32 (3.8)</td>
</tr>
</tbody>
</table>

Table 11: Ressource requirements of hardware accelerator architectures on M1AGL1000

Network Communication Primitives

87
age is restricted to larger FPGAs. Both other architectures occupy slightly more than half of the FPGA logic resources, mainly due to the usage of combinatorial multipliers. Compared to the linear regression (executed once per synchronization period), the timestamp conversion is performed more often (typically once per sampling period). As the \( \mu \text{Architecture} \) implements this critical operation faster than the Single MAC implementation, the former is used for the hardware-accelerated RLRCT in the following comparisons with the software solutions.

Figure 52 shows the overall execution time for the different regression implementations measured with an oscilloscope. All \( \mathcal{O}(1) \) implementations (i.e., RLR, RLRCT, HW, OPT) outperform the \( \mathcal{O}(n) \) implementation (i.e., LR) even for \( n = 2 \). Furthermore, the proposed RLRCT is 22\% faster than the RLR implementation. For \( n > 2 \), the hardware accelerator outperforms the best software implementation by 66\%. If a regression table of size 2 satisfies the application accuracy requirements, the optimized software implementation (referred to as OPT in Figures 52 to 54) actually computes the fastest regression.

Note that 95\% of the time required by the hardware accelerator is spent transferring the 80 bit timestamp pair from the MCU to the FPGA. Replacing the combinatorial by a sequential multiplier in ALU of the \( \mu \text{Architecture} \) implementation (Figure 35) can thus be used to reduce the logic resources required by the hardware accelerator without significantly increasing its overall execution time.

All software implementations of the time conversion require 118 \( \mu s \) for the offset and drift compensation. The hardware accelerator requires only 72 \( \mu s \), including the interprocessor communication. The sensor node thus benefits from the hardware accelerator even for the smallest regression tables.

Beyond execution time, memory is another limited resource on embedded systems. As shown in Figure 53, the proposed RLRCT clearly outperforms the RLR in terms of required RAM, as its regression table requires only \( 2n \cdot 29 \) bit, while the RLR

![Figure 52](image_url)

**Figure 52**: Overall execution time for the LR implementations (LR = Equation 15, RLR = Algorithm 2, RLRCT = Algorithm 1, HW = Accelerated \( \mu \text{Architecture} \), OPT = Equation 18 optimized for \( n = 2 \))
buffers $2n \cdot (40 + 80)$ bit. For regression tables larger than 12 entries, the RLRCT also requires less RAM than the LR implementation and is thus favorable in memory constrained systems. Although such large tables are not required for the wireless time synchronization, other LR applications like RSSI-based node localization [Vanheel2011] rely on larger tables and might thus benefit from the RLRCT approach.

As the assembler-based implementation of the arithmetic operations are optimized for execution speed, the instruction memory required by the different LR algorithms is relatively large. As shown in Figure 54, RLRCT requires 15% less ROM for storing instructions than RLR, although the RLRCT has to perform more arithmetic operations.

Figure 53: MCU-RAM requirement of the LR implementations (LR = Equation 15, RLR = Algorithm 2, RLRCT = Algorithm 1, HW = Accelerated μArchitecture, OPT = Equation 18 optimized for $n = 2$)

Figure 54: MCU-ROM requirement of the LR implementations (LR = Equation 15, RLR = Algorithm 2, RLRCT = Algorithm 1, HW = Accelerated μArchitecture, OPT = Equation 18 optimized for $n = 2$)
CHAPTER 7
Targeted Applications

After describing the HaLoMote architecture and some application-independent building blocks for computation and communication, three specific monitoring applications are presented in this chapter. The first two of the presented applications benefit from the energy-efficient lossless data compression capabilities of the HaLoMote. Both applications restrict the affordable sample block size either by critical delay or memory constraints. The third application requires a specific hardware accelerator for the decentralized feature extraction. The third application will thus be discussed more in detail, and evaluated at system level in the next chapter.

7.1 Monitoring Neural Activities of Primates

At the German Primate Center in Göttingen, the neural activities of primates solving different tasks are measured by a micro-electrode inside the probands brains [Rafflenbeul2012]. As the apes have to move freely over a wide testing area, wired instrumentation is impractical, and thus the sensor data sampled with 16 bit resolution at a frequency of 24.414 kHz has to be transmitted wirelessly to a control station. Figure 55 shows an sample block extracted from the neural data. The data rate of 391 kbit/s required for the transmission of the raw data stream exceeds the capability of the low-power IEEE [802.15.4] transceiver specification. The captured data stream thus has to be compressed by about 50%, before it can actually be transmitted by an IEEE [802.15.4] transceiver.

![Sample Block of Neural Activity Data](image)

Figure 55: 8192 samples of neural activity data (min = -155, max = 169, mean = 6.3, stddev = 45.5)
Based on the received neural data, the variable penetration depth of the micro-electrode is controlled remotely by an operator at the control station. In order to allow timely interactive manipulation of the probe depth, the maximum end-to-end latency is restricted by the human response time of about 100 ms. Thus, the maximum block size used by a two pass encoder may not exceed 2400 samples at 24.414 kHz.

In this section, the several lossless compression schemes are applied to the neural data. Based on a trade-off between data reduction and computational effort, a specific compression scheme is selected and the energy efficiency of its hardware-accelerated implementation on the HaLOEWEEn mote is evaluated. This analysis has already been published in [Engel2014a].

7.1.1 Compressibility of the Monitored Data

As a baseline analysis, various compression algorithms have to be applied to the neural data to figure out a well suited candidate for an embedded implementation. The raw data sample stream was split into blocks of different size and fed into the compression algorithms listed in Table 12 with their specific run-time options. Static overhead (i.e., file headers not related to compression parameters) generated by these tools was disregarded when calculating compression ratios. In addition to using these off-the-shelf dictionary-based and predictive encoders, a software implementation of the ADPCM compression scheme with downstream Rice encoding was used, as presented in Section 5.2. This allows for a fine-grained trade-off between encoder complexity and compression ratio. Audio encoders also relying on Rice encoding (e.g., MP4ALS, ALAC, or FLAC), optimize the Rice parameter ($W$ in Equation 10) for each sample block. In contrast, the ADPCM implementation employed in this section uses a fixed Rice parameter ($W = 4$) found by static optimization on the entire sensor channel.

The compression ratios (compressed data size / uncompressed data size) achieved by the different compression algorithms are shown in Figure 56. Compression ratios larger than 100 % are clipped. As expected, the compression ratios of all encoders improve with increasing block sizes. With the exception of MP4ALS, the predictive audio encoders clearly outperform the dictionary-based compression schemes. Given the audio-like characteristics of the neural activity data, this in itself is not surprising. However, the additional encoder complexity necessary for adapting the higher order linear predictor coefficients in the algorithms used in MP4ALS, ALAC, or FLAC does not improve the compression ratios significantly compared to the static first-order DPCM predictor. For instance, at a block size of 2048 samples, the FLAC encoder achieves

<table>
<thead>
<tr>
<th>Codec</th>
<th>Options</th>
<th>Version</th>
<th>Algorithm</th>
<th>Overhead</th>
</tr>
</thead>
<tbody>
<tr>
<td>BZIP2</td>
<td>-9</td>
<td>1.0.6</td>
<td>RLE + BWT + MTF + Huffman</td>
<td>24 B</td>
</tr>
<tr>
<td>RAR</td>
<td>-m5 -en</td>
<td>5.00</td>
<td>proprietary</td>
<td>55 B</td>
</tr>
<tr>
<td>ZP</td>
<td>c3</td>
<td>1.00</td>
<td>context modeling + arith. coding</td>
<td>221 B</td>
</tr>
<tr>
<td>MP4ALS</td>
<td>-7e</td>
<td>RM23</td>
<td>adapt. linear prediction + Rice</td>
<td>34 B</td>
</tr>
<tr>
<td>FLAC</td>
<td>-8</td>
<td>1.3.0</td>
<td>adapt. linear prediction + Rice</td>
<td>8292 B</td>
</tr>
<tr>
<td>ALAC</td>
<td>ffmpeg</td>
<td>0.8.7-6</td>
<td>adapt. linear prediction + Rice</td>
<td>0 B</td>
</tr>
</tbody>
</table>

Table 12: Compression algorithms and options used in further evaluation
only a 3\% improvement in compression ratios at the cost of double the execution time when compared to the DPCM encoder running on the same platform.

7.1.2 Energy Efficiency of Hardware-Accelerated Data Compression

The DPCM compression is sufficient to reach the target compression ratio of 50\% even for small block sized, i.e., minimal end-to-end delay of the signal processing chain from the micro-electrodes to the human operator controlling their penetration depth. The ADPCM hardware accelerator presented in Section 5.2 configured for first-order prediction (either static or adaptive) is thus used for further evaluation of the overall energy efficiency of the in-sensor compression. Furthermore, the maximum block size of 2048 samples is used, derived from the delay-restrictions of the application. The test setup is detailed in Section 5.2.3.

Table 13 summarizes the results for processing the neural data. Both DPCM and ADPCM compression reduce the data size down to 1577 B (38.5\%), making the simpler DPCM algorithm a better choice for this application. The DPCM execution on the MCU takes more than double the 61.2 ms required for transmitting the compressed data stream. The MCU thus has to be kept active beyond the pure transmission time

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>None</td>
<td></td>
<td></td>
<td></td>
<td>4096</td>
<td>158.8 41.5 19.77</td>
<td>100.0</td>
</tr>
<tr>
<td>DPCM</td>
<td>MCU</td>
<td>132.2</td>
<td>11.8</td>
<td>4680</td>
<td>1577 184.6 21.6 11.96 60.5</td>
<td></td>
</tr>
<tr>
<td>ADPCM</td>
<td>MCU</td>
<td>249.4</td>
<td>11.8</td>
<td>8829</td>
<td>1577 310.4 17.5 16.30 82.4</td>
<td></td>
</tr>
<tr>
<td>DPCM</td>
<td>RCU</td>
<td>1.4</td>
<td>10.7</td>
<td>45</td>
<td>1577 62.6 41.9 7.87 39.8</td>
<td></td>
</tr>
<tr>
<td>ADPCM</td>
<td>RCU</td>
<td>2.6</td>
<td>10.7</td>
<td>83</td>
<td>1577 63.8 41.3 7.90 40.0</td>
<td></td>
</tr>
</tbody>
</table>
of the prior packet in order to perform the compression of the current packet. As the current drawn by the system during compression exceeds the current drawn in low-power mode by about 120 times, the longer duty cycles lead to deteriorated overall energy efficiency. Furthermore, the MCU is not even able to support the sampling frequency required by the application. At the targeted 24.414 kHz, the processing of the 2048 samples must not take longer than 84 ms. This is achieved by the hardware-accelerated compression modules. In addition, the RCU to MCU data movement occurs in parallel to the ongoing data transmission, and the very short execution time of the compression stretches the duty cycle before going back to sleep only marginally. This leads to an almost complete translation of compression ratio (38.5%) into system level energy savings (39.8%) for the hardware-accelerated DPCM.

### 7.2 Condition Monitoring of Heavy Industrial Machinery

The second application used for evaluating the hardware-accelerated data compression deals with detecting damage or fatigue of the rotating parts of large industrial machines. This condition monitoring is used to schedule inspection and maintenance intervals before unanticipated major damage leads to high repair costs and downtime of the machinery. Since the monitoring algorithms observe long-term trends in the acquired data, this application is latency insensitive. However, multiple sensor channels have to be handled in parallel for this application, thus increasing the demand for processing power and memory for this application. Note that the sensor nodes are located on heavily vibrating parts of the machine, which would quickly wear out fixed cable connections, thus making low-power wireless communication preferable.

The raw data streams are gathered from a three channel MEMS sensor sampled at 1 kHz with a resolution of 16 bit per channel. Table 14 shows the relevant statistical characteristics of the captured signals. The actual waveforms cannot be shown here for confidentiality reasons. A Rice parameter of $W = 11$ was found to be most appropriate for all three sensor channels. The analysis described in this section has already been published in [Engel2014a].

<table>
<thead>
<tr>
<th></th>
<th>Channel 1</th>
<th>Channel 2</th>
<th>Channel 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Min</td>
<td>-3872</td>
<td>-4156</td>
<td>-22466</td>
</tr>
<tr>
<td>Max</td>
<td>5203</td>
<td>13104</td>
<td>16307</td>
</tr>
<tr>
<td>Mean</td>
<td>110</td>
<td>2919</td>
<td>-3339</td>
</tr>
<tr>
<td>Stddev</td>
<td>1213</td>
<td>2374</td>
<td>13041</td>
</tr>
</tbody>
</table>

Table 14: Statistical characteristics of 8192 samples of machinery condition monitoring data

### 7.2.1 Compressibility of the Monitored Data

Just as for the neural data monitoring application in Section 7.1, a PC-based baseline analysis of the compressibility of the condition monitoring data is carried out. The resulting overall compression ratios are shown in Figure 57. Again, the predictive audio codecs are most appropriate. At a block size of 2048 samples, FLAC produces results 6% smaller than DPCM. This improvement comes however at the cost of a doubled execution time.
Figure 57: Reduction of condition monitoring data achieved by various compression algorithms

The ADPCM scheme is examined more in detail for this application. Figure 58 quantifies the impact of the prediction order. For small blocks, the size of the additional prediction parameters, which have to be encoded in the output stream, exceeds the benefit of improved compression ratios due to the reduced prediction error variance. For blocks of 2048 samples, the compression ratio strictly decreases with the prediction order, but only marginal improvements (about 1.2%) can be achieved by increasing the prediction order above 4. The difference of the compression ratio between first and fourth order prediction amounts to 6% for the largest block size. Thus, adaptive
prediction should be used for condition monitoring if the target platform supports storing a sufficient amount of samples (recall that there are three data channels, and double-buffering may be necessary for parallel sampling and encoding).

In conclusion, the adaptive schemes reached the best compression ratios for larger sample blocks, while static DPCM proved superior on small blocks. Thus, for delay-sensitive applications or memory-constrained platforms, the simple DPCM scheme should be applied. In all other cases, FLAC can improve the compression ratio of DPCM by about 5%. Note that FLAC is just a combination of ADPCM with an extensive search for the optimal prediction order. For data sources with known characteristics, a static selection of the prediction order may be sufficient. Compared to the static DPCM, the adaptation of the prediction coefficients in ADPCM only becomes worthwhile if the additional energy required for the second compression pass is smaller than the energy savings resulting from the 5% data size reduction. Therefore, energy efficiency of a hardware implementation of the ADPCM algorithm is also examined in the next section.

7.2.2 Energy Efficiency of Hardware-Accelerated Data Compression

The ADPCM hardware accelerator presented in Section 5.2 configured for first-order prediction (either static or adaptive) is used for further evaluation of the overall energy efficiency of the in-sensor compression. Although the energy efficiency of higher order prediction would be worth examining, this feature is not yet supported by the HaLOEWE3 implementation presented in Section 5.2. Furthermore, the maximum block size of 2048 samples is used, derived from the restricted on-chip memory of the HaLOEWE3 (see Table 8). The test setup is detailed in Section 5.2.3.

Processing of the condition monitoring data is evaluated in Table 15. Here, the adaptive predictor improves the compression ratio over DPCM such that a complete radio packet is saved. On the MCU, however, compressing the three data channels takes so much time, that the energy required for compression cannot be amortized over the transmission savings, instead leading to an efficiency deterioration. The hardware-accelerated encoders fare much better: Since all channels can be compressed in parallel, the total execution time only slightly increases over that of the single channel encoder used in Table 13. As before, the encoders using the RCU convert nearly all of the data volume savings into actual energy savings, i.e., only 81.3% of the baseline energy.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>None</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>12288</td>
<td>474.4</td>
<td>41.6</td>
<td>59.21</td>
<td>100.0</td>
</tr>
<tr>
<td>DPCM MCU</td>
<td>531.0</td>
<td>11.9</td>
<td>18957</td>
<td>9836</td>
<td>1281.0</td>
<td>912.0</td>
<td>24.3</td>
<td>66.48</td>
<td>112.3</td>
</tr>
<tr>
<td>ADPCM MCU</td>
<td>906.0</td>
<td>11.9</td>
<td>32344</td>
<td>9732</td>
<td>1281.0</td>
<td>1281.0</td>
<td>20.6</td>
<td>79.17</td>
<td>133.7</td>
</tr>
<tr>
<td>DPCM RCU</td>
<td>1.6</td>
<td>10.7</td>
<td>51</td>
<td>9836</td>
<td>381.6</td>
<td>381.6</td>
<td>42.5</td>
<td>48.62</td>
<td>82.1</td>
</tr>
<tr>
<td>ADPCM RCU</td>
<td>2.9</td>
<td>10.7</td>
<td>93</td>
<td>9732</td>
<td>378.9</td>
<td>378.9</td>
<td>42.4</td>
<td>48.15</td>
<td>81.3</td>
</tr>
</tbody>
</table>

Table 15: Energy required to compress/transmit 3 × 2048 MEMS samples
7.3 Distributed Structural Health Monitoring

Civil infrastructures such as bridges are prone to fatigue and other load-induced damage. With increasing age, inspection intervals have to be scheduled more frequently to assure the secure operation of the infrastructure. As manual inspections are costly, they have to be complemented by automated Structural Health Monitoring (SHM) systems. By periodically observing modal properties such as the eigenfrequencies, mode shapes, or the damping of the structures, damage or fatigue can be identified by significant deviations of these properties from reference measurements.

As shown on the left hand side of Figure 59, modal parameters of an object are typically derived by observing the structural response to an artificial excitation. The Experimental Modal Analysis (EMA) requires the knowledge about the excitation and the response to derive the modal parameters. While large structures can be excited by appropriate equipment, a significant amount of energy is required to drive such large shakers and impact hammers. Furthermore, the excited structures often have to be taken out of service to assure safety and measurement accuracy. For continuous automated SHM, the identification of the modal properties thus has to be based on the

![Figure 59: SHM dataflow based on experimental and operational modal analysis](image-url)

Targeted Applications 97
natural ambient vibration of the structure caused by wind or traffic, as shown on the right hand side of Figure 59. The major challenge of this Operational Modal Analysis (OMA) is the extraction of the features within the structural response, which are only related to the structure but independent from the random excitation. The Random Decrement Technique (RDT) was proposed for this purpose [Cole1973]. Finally, by comparing the derived modal parameters with reference parameters identified for a well known (healthy) state of the structure, damage can be identified of even localized.

The modal analysis and damage detection requires information gathered from all over the structure to capture the global properties of the observed object. These algorithms are therefore typically executed on a centralized controller. On the other side, the sensing of the structural response has to be decentralized, as the sensors are distributed all over the structure. As will be shown in Section 7.3.2, the RDT requires only little information from other neighboring sensors while strongly aggregating the local sensor data. The RDT should therefore also be decentralized. Nevertheless, RDT-based SHM installations using conventional WSNs with small MCUs often avoid the decentralization of the RDT due to its computational complexity.

The remaining part of this section therefore shows, how the HaLoMote can be used to efficiently decentralize the RDT. The SHM application and the hardware-accelerated RDT implementation on the HaLOEWEn3 mote have already been published in [Engel2012a; Engel2014c; Engel2015c].

7.3.1 Random Decrement Technique

For the RDT, a set of sensors $S \subseteq \mathbb{N}$ is distributed all over the structure to acquire its vibrations in terms of acceleration or deflection as time series $(x_s : T \mapsto V)_{s \in S}$. For a finite sampling rate and measurement duration, the time domain is also finite and discrete, i.e., $T = \{0, \ldots, n_t - 1\} \subseteq \mathbb{N}$. For simplicity, $V \subset \mathbb{R}$ can be assumed by abstracting from the finite measurement accuracy.

As the OMA aims to discover the dynamic characteristics of the observed structure, the static components of the acquired signals (gravity or prestress) have to be eliminated by a high-pass filter, e.g., by applying an FIR filter of order $n_f \in \mathbb{N}$ with $n_f + 1$ appropriate coefficients $c_k \in \mathbb{R}$. A reformulation of Equation 5 for the time series input yields

$$\hat{x}_s(t) := \min(n_f, t) \sum_{k=0}^{\min(n_f, t)} c_k \cdot x_s(t - k) \quad \forall (s, t) \in S \times T. \quad (75)$$

To eliminate the random signal components, the RDT selects a subset of the sensors as references $R \subseteq S$ and a trigger level $l_r \in V$ for each reference $r \in R$. In practice, sensors exposed to the largest vibration amplitudes are typically selected as references, to observe the structure’s activity even under shallow excitation. Furthermore, the trigger levels have to be chosen above the noise floor of the acquired sensor signal. The points in time $t \in T$, at which a reference signal $\hat{x}_r$ crosses $l_r$, are referred to as trigger events

$$E_r := \{t \in T : (\hat{x}_r(t) \geq l_r \land \hat{x}_r(t - 1) < l_r) \lor$$
$$\quad (\hat{x}_r(t) < l_r \land \hat{x}_r(t - 1) \geq l_r)\} \quad \forall r \in R. \quad (76)$$
A signal window \( (\hat{x}_r(t + k))_{k=0}^{n_w-1} \) of fixed length \( n_w \in \mathbb{N} \) starting at a trigger event is composed of the structure’s response to its initial displacement \( \hat{x}_r(t) = l_r \), its response to the initial velocity and the random ambient excitation, as shown in Figure 60. Assuming a zero-mean excitation, the random components are extinguished when accumulating a sufficient number of these triggered windows. The velocity response will also be eliminated, as each rising signal edge (with positive initial velocity) is followed by a falling signal edge (with negative initial velocity). Thus, the accumulated signal windows converge against the displacement response, which describes the structure’s free decay and can thus be used to derive its modal properties.

To estimate the mode shapes of the structure, spatial correlations between different sensor positions are required. Therefore, signal windows from all sensors are accumulated for each trigger event, resulting in \( |S| \cdot |R| \) so-called RD sequences

\[
D_{s,r}(k) := \sum_{t \in E_r} \hat{x}_s(t + k) \quad \forall (s, r) \in S \times R, \ 0 \leq k < n_w.
\]  

### 7.3.2 Distributed Random Decrement Technique

The accumulation of the RD sequence \( D_{s,r} \) requires an information transfer about the occurrence of the trigger events \( t \in E_r \) from signal \( r \) to signal \( s \). In a distributed WSN scenario, all sensors signals are captured by a dedicated sensor node. Although each sensor node may capture multiple nearby sensor signals, information about the trigger events have to be transmitted wirelessly from the reference nodes (capturing a reference
signal) to all other nodes in the network. When monitoring structures larger than the transmission range of the wireless transceivers (i.e., suspension bridges), the transport of the trigger information has to be organized by a multi-hop routing protocol.

In Section 6.1, a multi-source flooding mechanism was proposed for this purpose. The complete dissemination of the trigger events, i.e., the schedule length determined by the ILP or the corresponding heuristic, depends on the network size, its density, and the number of source nodes (i.e., $|R|$). To simplify the DPM of the network nodes, the sampling cycles are used as scheduling cycles, to avoid additional transitions between active and sleep states. Using this static scheduling approach, the maximum number of sampling cycles required to transfer trigger events from each reference node to all other nodes can be determined at the time of deployment. Let $\Delta$ denote this maximum communication delay.

In the multi-source flooding scheme, each trigger event is represented by the ID of its originating reference node and the age of the event. The latter starts at zero during the trigger event generation, and is incremented in every sampling period at all nodes that have received and stored the event. At the receiver nodes, the age information is used to properly calculate the RD sequences, as described in the next section. The reference ID as well as the age information can be represented by a few bits ($\leq 10$ bit for both in realistic scenarios), thus fulfilling the prerequisites of the multi-source flooding mechanism stated in Section 6.1.

The trigger event flooding is only required as long as the RD sequences are accumulated. Once the number of processed trigger events fulfills the accuracy requirements of the application, the RD sequences have to be transmitted to a base station in addition to some statistical measures of the reference signals (i.e., channel variance and number of generated trigger events). The RDT thus aggregates the $|S| \cdot |T|$ raw sensor samples down to $|S| \cdot |R| \cdot n_w$ RD samples. The aggregation factor $\frac{|T|}{|R| \cdot n_w}$ increases with the measurement duration. $n_w$ is typically chosen such that the RD sequences show the free decay of the structure, which may take several seconds for large bridges. After a measurement duration of several hours, which is often required to collect a sufficient number of trigger events, the aggregation factor typically exceeds two or more decades. For example, the laboratory-scale testbed presented in Chapter 8 uses $|R| = 2$ references to trigger signal windows, each consisting of $n_w = 256$ samples. After a 10 min measurement at 400 Hz sampling rate (i.e., $|T| = 2.4 \times 10^5$) the RD sequences that have to be transmitted to the subsequent OMA are $469 \times$ smaller than the original raw sensor signals. In addition to eliminating the random parts of the sampled signals, this strong aggregation capability is the major benefit of the RDT for the distributed WSN implementation of SHM applications.

However, the RDT increases the demand for in-sensor preprocessing. The computational complexity of the RDT is dominated by the FIR filter and the memory accesses required for the accumulation of the RD sequences. The latter becomes particularly complex if these sequences cannot be stored in the few kilobytes of RAM provided by most WSN processing units, thus requiring access to external memory. RDT preprocessing scales linearly with the number of sensor channels to be processed by each sensor node. Each sensor typically provides three channels to capture the multidimensional movement of the structure, and multiple nearby sensors may be connected to a single sensor node to simplify the deployment. Thus, assuming three to twelve sensor channels per mote is not unrealistic. As the sensor channels can be processed indepen-
dently of each other, most of the RDT preprocessing can be parallelized. Section 7.3.3
details the RDT implementation on the HaLoMote RCU.

7.3.3 Hardware-Accelerated Random Decrement Technique

In this section, an RDT hardware kernel for the HaLoMote architecture is presented. The
data signals and modules presented in Figure 61 to 65 are associated with
Equations 75 to 80. More precisely, the labels refer to the channel index \( s \) of the
captured sensor input and the current time slot (i.e., sampling cycle) \( t \), in which \( \hat{x}_s(t) \)
is generated by the high-pass filter.

Figure 61 shows the FIR implementation of this high-pass filter. As described at the
end of this section (Figure 66), the FIR is executed in parallel to the sensor sampling.
At the end of time slot \( t \), the unfiltered sensor sample \( x_s(t+1) \) for the next time slot is thus already present at the FIR input, while the filtered value \( \hat{x}_s(t) \) is present at the
FIR output. The filter itself is executed sequentially, thus requiring only one MACC
unit to combine the constant coefficients \( c_k \) with the FIR taps buffered in BRAM.

The filtered value \( \hat{x}_s(t) \) is passed through an additional FIFO buffer, which is in-
tegrated into the same BRAM as the FIR taps. Delaying the filtered samples by \( \Delta \) time slots is required to compensate the maximum time required for the trigger event
flooding over the entire network, as described in the previous section.

Figure 62 shows the computational logic required for each sensor channel. A sensor
specific control module requests the samples over the digital sensor interface. Although
most of the control lines of the sensor interfaces (e.g., SPI or I\(^2\)C) could be shared
between multiple sensors, each sensor channel is controlled by a dedicated interface to
allow for parallel independent sensor sampling.

![Figure 61: FIR and FIFO in single BRAM](image)

![Figure 62: Sensor interface, trigger event detection and precomputation required for calculating the standard deviation](image)
The sensor samples are filtered and delayed by the module shown in Figure 61, as described above. The filtered samples are used to detect trigger events by comparing the current $\hat{x}_s(t)$ and the last $\hat{x}_s(t-1)$ value with the trigger level $l_s$ as described by Equation 76. Furthermore, the filtered values and their squares are accumulated as running sums. As will be shown in Equation 80, these sums are required to derive the standard deviation of the sensor channel for a normalization step of the final OMA. Both, the trigger event detection and the standard deviation calculation are only required for sensor channels configured as RDT references (i.e., if $s \in R$). To simplify the network configuration, the reference-specific hardware is provided for each sensor channel and the software processor decides which of the results to use for further processing.

Figure 63 shows the module used for accumulating the delayed samples to an RD sequence stored in BRAM. External trigger event-specific logic (see Figure 64) decides whether and which memory position to modify. Additional clear logic is required to initialize the buffer at the start of each measurement.

Figure 64 shows the handling of trigger events for the reference channel $r$. This logic is required at all sensor nodes, not only at the sensor node sampling the reference signal. A trigger event $e_{r,k} \in E_r$ defined by Equation 76 is characterized by its current age $t - e_{r,k}$, i.e., the number of sampling cycles since it was generated by the logic shown in Figure 62. New events can be inserted into the list, and the age of each event in the list is incremented in every sampling cycle. As soon as $t - e_{r,k} = \Delta$, the sample at the output of the delay buffer $\hat{x}_s(t-\Delta) = \hat{x}_s(e_{r,k})$ corresponds to the filtered sample of sensor $s$ from the time slot $e_{r,k}$, in which the trigger was generated at sensor $r$. Therefore, trigger events at least as old as $\Delta$ cause an accumulation of the delayed sample to the appropriate RD sequence, i.e., $D_{s,r}(k) = \hat{x}_s(t - e_{r,k} - \Delta)$. Two RCU clock cycles are required for reading the old and writing the new value. However, due to the dual port BRAM, subsequent accumulations can be interleaved, thus resulting in a
throughput of one accumulation per sampling cycle. Trigger events older than \( n_w + \Delta \) are removed from the event list.

The main difficulty of the trigger event management is that the sequence of event insertions does not necessarily have to match the sequence of event removals. For example, a trigger event generated at the local node will be inserted immediately with an age of 0. In the next sampling cycle, another trigger event generated at a remote node may arrive that already traveled for 10 sampling cycles on a multi-hop path to the local node. This second trigger event will thus be inserted after the first event, but it will be removed before the first event. To avoid the fragmentation of the data structure storing the trigger events, a shift register based queue is used. In each sampling cycle, each trigger event is dequeued and enqueued again after its incrementation unless it is old enough to be removed. New events are enqueued afterwards. The actual length of the queue is managed by dedicated logic and corresponds to the number of currently active (i.e., overlapping) signal windows triggered by the reference channel \( r \).

**Figure 65**: RDT kernel for four sensor channels and two reference signals
Figure 65 shows the combination of all these modules to realize an RDT kernel for \(|S| = 4\) sensor channels and \(|R| = 2\) reference signals. The SPI ports of this module are connected to the discrete digital sensors. All other ports are controlled by the HaLoMote MCU using the communication infrastructure described Section 4.3. The BRAM addressing for the RD sequences are overruled for clearing and reading of individual RD sequences. As all RD sequences of a single reference channel share the same addressing signals, BRAM can be shared among those modules.

Figure 66 details the RDT kernel execution sequence. At the start of each sampling cycle, the MCU wakes up the RCU, which starts requesting the next sensor samples. The number of RCU cycles required for this operation depends on the number of bits to read, which typically do not exceed \(3 \cdot 16\) bit. The high-pass filter operates on the sample from the previous sampling cycle and can thus be executed in parallel to the sensor sampling. The filtering takes \(n_f + 1\) RCU cycles. The trigger event detection, the computation for the standard deviation, and the handling of the delay FIFO require one additional RCU cycle after the filtering operation. The accumulation of the delayed sample value from the last sampling cycle to the RD sequences is performed in parallel to the filtering. It requires one RCU cycle per registered trigger event. If available, new trigger events are inserted just before the RCU is shutdown again. As long as the number of registered trigger events does not exceed the order of the FIR filter, the overall execution time of the RDT kernel is fixed.

A small example shown in Figure 67 illustrates the distributed processing principle and the delayed trigger handling of the RDT implementation described in this section. To keep it simple, only two sensor signals \(S = \{a, b\}\) are considered and Figures 67a and 67b show the already low-pass filtered sensor samples. Signal \(a\) is chosen as the sole reference signal \(R = \{a\}\) with a trigger level of \(l_a = 7.5\). The first trigger level overshoot (undershoot) of \(\hat{x}_a\) at Time 1 (3) triggers the first (second) signal window to be extracted from the response signal, as shown in Figure 67b. If the signals \(a\) and \(b\) are sampled at different nodes within the WSN, the trigger events \(E_a = \{1, 3\}\) have to be forwarded from the reference to the response node. This is achieved by combining the reference index \(a\) with the age \(t - e_{a,i}\) into a wireless packet flooded through the network, as described in Section 7.3.2. The red (blue) arrow in Table 67c illustrates this wireless event distribution assuming that the first (second) trigger event requires five (two) sampling cycles to reach response node. Although the flooding mechanism described in Section 6.1 assures a fixed delay between the reference and response node,
the robustness of this RDT implementation against variable network delays can be demonstrated with these example delays.

For an assumed FIR length of $n_f = 8$ and an even smaller delay of the serial sensor sampling, the RCU executes ten clock cycles in every sampling period, as described by Figure 66. Out of these ten clock cycles, Table 67c shows only the cycles required to demonstrate the trigger event handling and RD sequence calculation. While the FIR output changes at the end of cycle 8 and thus becomes visible in cycle 9, the FIFO output changes at the end of cycle 9 and thus becomes visible in the first cycle of the next sampling period. After receiving the first (second) trigger event in the sampling period $t = 5$ (6) and inserting the event into the active trigger queue $E_a$, the active trigger events are handled in the first two or three RCU cycles of the sampling periods six to twelve. As described in Figure 64, this handling includes the derivation of the read and write signals for the RD sequence accumulator (i.e., wen, raddr, and waddr).
by comparing the queue head (i.e., \( t - e_{a,1} \)) with the length \( \Delta \) of the delay FIFO. As \( \Delta = 6 \) is assumed for this example, the first (blue) trigger event activates the accumulator in the sampling periods seven to ten, while the second (red) trigger event activates the accumulator in the sampling periods nine to twelve. The \( n_w = 4 \) element wide RD sequence \( D_{b,a} \) is thus generated by accumulating the output \( \hat{x}_b(t - \Delta) \) of the delay FIFO to the appropriate memory locations. Trigger events that would reach an age of \( \Delta + n_w = 10 \) are removed from the event queue in the sampling cycles ten and twelve. The accumulated RD sequence \( D_{b,a} \) is the RDT result and required for the subsequent modal analysis, as described in the next section.

7.3.4 Operational Modal Analysis and Damage Detection Principles

[Asmussen1997] proved that a connection between the RD sequence \( D_{s,r} \) and the correlation sequence \( R_{s,r} \) can be established by normalizing with the number of detected trigger events \( |E_r| \), the trigger level \( l_r \) and the standard deviation of the reference signals:

\[
R_{s,r}(k) := D_{s,r}(k) \cdot \frac{\sqrt{\sigma_r^2}}{l_r \cdot |E_r|} \quad \forall (s, r) \in S \times R \tag{78}
\]

\[
\sigma_r^2 := \frac{1}{|T|} \sum_{t \in T} \left( \hat{x}_r(t) - \left( \frac{1}{|T|} \sum_{t \in T} \hat{x}_r(t) \right) \right)^2 \quad r \in R \tag{79}
\]

\[
= \frac{1}{|T|} \sum_{t \in T} \hat{x}_r(t)^2 - \left( \frac{1}{|T|} \sum_{t \in T} \hat{x}_r(t) \right)^2 \quad r \in R \tag{80}
\]

The second formulation for the variance of the reference signal \( r \) results from applying the binomial formula. To calculate the variance, only the running sums of \( \hat{x}_r \) and \( \hat{x}_r^2 \) thus have to be collected by the sensor nodes.

In the WSN testbed described in Chapter 8, the OMA and the subsequent damage detection is executed at a central base station running [MATLAB] and not at a dedicated WSN mote. The individual OMA steps are therefore just summarized briefly in this section. More details can be found in [Engel2014c].

Having received all RD sequences from the WSN, the base station derives the correlation sequences according to Equation 78. Afterwards, a single block Discrete Fourier Transformation (DFT), followed by a Frequency Domain Decomposition (FDD) for each frequency bin of the observed spectrum is executed [Brincker2000]. In particular, a Singular Value Decomposition (SVD) is followed by a peak-detection on the resulting singular values to find the eigenfrequencies of the observed structure.

The FDD also yields the mode shapes of the structure for each of the detected eigenfrequencies, i.e., the maximum deflection \( \varphi_s \) of the structure at the position of the sensor \( s \). After normalizing the mode shape to

\[
F_s := \frac{\varphi_s^2}{\sum_{k \in S} \varphi_k^2} \quad \forall s \in S, \tag{81}
\]
it can be compared with a mode shape \((\bar{F}_s)_{s \in S}\) captured during a reference measurement. A damage index for each sensor position is derived as

\[
DI_s := \frac{F_s + 1}{\bar{F}_s + 1} - 1 \quad \forall s \in S \tag{82}
\]

Based on the Modal Strain Energy (MSE) method [Stubbs1995], an increased \(DI_s\) is expected, if the structure’s stability has changed between the reference and the current measurement near the sensor \(s\).
CHAPTER 8
System Level Evaluation: Distributed Monitoring of a Railroad Bridge Model

The SHM demonstrator used to evaluate the capabilities of the HaLoMote architecture in a complete WSN application consists of a bridge model equipped with a network of HaLOEWEn motes. These motes are performing the RDT as described in Section 7.3, to act as a decentralized Data Acquisition (DAQ) system. In addition, an embedded PC running [MATLAB] is used for the centralized OMA and damage detection algorithms. The bridge can be excited by an impact hammer, or a train model. A wire-bound DAQ system can be installed to capture baseline (“ground truth”) measurements. This demonstrator was developed in cooperation with the Smart Structures Research Division of the Fraunhofer Institute for Structural Durability and System Reliability (LBF) and already described in [Engel2014c; Engel2015c]. It was presented at the Hannover Fair in 2014 and now is a permanent exhibit at the Transfer Center for Adaptronics at the Fraunhofer LBF, as shown in Figure 68. It was also covered by some local media reports (e.g., [Yannick2015]).

Figure 68: SHM demonstrator exhibit located at the Transfer Center for Adaptronics (Fraunhofer LBF)

8.1 Demonstrator Setup

A warren truss railroad bridge was modeled by connecting 54 metal rods with 24 metal joints resulting in 51 kg overall weight and a span width of 246 cm. The test structure can be excited by an impulse hammer or a 2.3 kg G-scale railway model crossing the bridge.

The four outer joints (marked green in Figure 69) are connected to fixed pedestals. They can rotate in either directions without changing their position. Thus, to assess the dynamic movement of the structure, only the 20 inner joints (marked red and labeled as $S_1$ to $S_{20}$ in Figure 69) have to be monitored. This is achieved by connecting a 3-axis [ADXL362] MEMS acceleration sensor to each of the inner joints. This sensor
was chosen as a trade-off between power consumption (i.e., 6 µW at 400 Hz sampling rate) and measurement accuracy (i.e., about 1 mg resolution).

The COTS sensor breakout boards are sealed in a housing shown in Figure 70b. This housing was designed and additively manufactured at the Fraunhofer LBF. The housings are screwed to the joints facing outside the structure, as shown in Figure 70c. The resulting sensor channel orientation is shown in Figure 69 and differs for the front ($S_{1}$ to $S_{10}$) and the rear side ($S_{11}$ to $S_{20}$) of the bridge. However, as the railway model excites the bridge mainly in vertical direction (i.e., orthogonal to the bridge deck), only the x-channel of each sensor is sampled for both sides of the bridge.

For the wireless DAQ system, five HaLOEWEn3 motes were mounted on the upper rods (marked blue and labeled as $N_{0}$ to $N_{4}$ in Figure 69). Special carriers were designed and additively manufactured at the Fraunhofer LBF. As shown in Figure 70a, these
carriers are clamped on the rods and comprise a battery compartment to power the HaLOEWEn screwed on top.

Finally, a connector board was designed and manufactured as PCB to connect the four sensors located in the same x-z-plane as the mote to its RCU. Four Hirose [ST60-10P] connectors (labeled as C$_0$ to C$_3$ in Figure 70a) are used to flexibly connect the sensors to the expansion headers.

Due to the relatively large stiffness of the small bridge model, the relevant structural modes to be observed are located between 50 Hz and 100 Hz. Thus, to safely meet the Shannon-Nyquist lower limit, a sampling rate of 400 Hz was chosen. A synchronization accuracy of a few microseconds between the HaLOEWEn motes, as it can be achieved by the approach described in Section 6.2, is sufficient for this sampling rate. The WSN-based DAQ system does not capture the actual excitation of the structure, as this would also not be possible for real-world SHM deployments. Thus, an OMA is required as described in Section 7.3. A $n_f = 64$ tap high pass filter with a cut-off frequency of 20 Hz is applied to each sensor channel. This configuration was chosen as a trade-off between computational complexity and the filter quality. Static acceleration is damped by 60 dB, while all frequencies above 40 Hz are damped by less than $4 \times 10^{-3}$ dB. After the high-pass filtering, the RDT with up to three reference channels and runtime configurable trigger levels is applied. A fixed window length of $n_w = 256$ is used to capture the free decay of the structure within the first 640 ms after each trigger event.

A second wire-bound DAQ system was installed in parallel to the WSN for reference measurements, as shown in Figure 71a. It consists of 12 PCB [356A16] integrated circuit piezoelectric (ICP) accelerometers controlled by the LMS Test.Lab 14A (LMS) and attached to the joints of the bridge by an adhesive (wax), as shown in Figure 71b. Due to the limited number of input channels available at the [SCADAS] sensor front-end, only the front side of the bridge (i.e., $S_1$ to $S_{10}$) can be completely observed by the reference system. In principle, both sides of the bridge could be analyzed.

Figure 71: Wire-bound reference DAQ systems (LMS Test.Lab 14A with 12 ICP accelerometers)
independently, but moving the sensors from one side to another is not practical for long term observations as required by SHM. Instead, only two additional ICP sensors are installed on the rear side (i.e., $S_{13}$ and $S_{18}$) to assess the symmetry of the observed mode shapes. All sensors are sampled at 512 Hz with a resolution of 0.1 mg. The LMS system also captures the excitation provided by an impulse hammer (see Figure 71c), so an EMA can be performed thus providing more accurate results than the OMA-based WSN. Compared to the wireless data acquisition system, the cabling required for the LMS system becomes rather complex (Figure 71a) even though only 60% of the structure is covered.

### 8.2 Accuracy of the Wireless Data Acquisition System

Choosing $S_3$ and $S_{13}$ as reference channels provided the best results, as the center of the structure is excited the most. The trigger level of $l_3 = l_{13} = 200 \text{ mg}$ was determined experimentally and corresponds to the peak excitation caused by the model train.

The manual excitation of the structure with the impact hammer for 10 s resulted in 60 trigger events registered at node 3 and 40 trigger events registered at node 13. The 40 resulting RD sequences $D_{1,3}, \ldots, D_{20,3}, D_{1,13}, \ldots, D_{20,13}$ were transmitted to a base station for the subsequent modal analysis. As shown in Figure 72 for $D_{3,13}$, these RD sequences characterize the free decay of the structure.

For the wired LMS measurement, five strokes on $S_{17}$ were injected in vertical direction with the impulse hammer at intervals of about 3 s to excite the vertical bending modes of the bridge model. The EMA results are averaged over those five individual measurements. The resulting frequency response function at $S_3$ is shown in Figure 73. Below 40 Hz and above 110 Hz, the structure’s characteristics cannot be captured properly by the WSN-based system. However, the two dominating modes are located outside of these inaccurate frequency bands and can be captured with an accuracy of at least 1% as summarized in Table 16.

In addition to the eigenfrequencies, the actual mode shapes are of special interest for an SHM system, as minor damage will be reflected in the deformation of the mode shapes.
shapes before significant changes in the eigenfrequencies can be detected. Figure 74a shows the asymmetric vertical bending mode captured by both monitoring systems. More specifically, the unexcited bridge is shown in gray, while the vertical movement of the horizontal bars is shown in red (for the wire-bound LMS DAQ system) and blue (for the WSN DAQ system). As only the position of the four outer bearing joints is fixed, the four lateral bars are also bending inwards. Remember that the wire-bound LMS system has a limited view on the rear side of the structure due to input channel restrictions. However, the two nodes on the rear side are sufficient to detect the asymmetric character of the mode, i.e., the rear side is bending up while the front side is bending down. But only the WSN system provides a detailed view on both

<table>
<thead>
<tr>
<th>Mode shape</th>
<th>by LMS</th>
<th>by WSN</th>
<th>Relative deviation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Symmetric vertical bending</td>
<td>47.0 Hz</td>
<td>46.9 Hz</td>
<td>0.2 %</td>
</tr>
<tr>
<td>Asymmetric vertical bending</td>
<td>75.7 Hz</td>
<td>75.0 Hz</td>
<td>0.9 %</td>
</tr>
</tbody>
</table>

Table 16: Detected dominant mode shapes

Figure 74: Mode shapes captured with both DAQ systems
sides of the structure, which is essential for the subsequent SHM analysis. This also holds true for the symmetric vertical bending mode shown in Figure 74b. Although the reduced accuracy of the WSN system is clearly visible in the mode shapes, the principal behavior of the structure can still be observed without the need for extensive cabling and well-known controlled excitation.

After showing that the WSN system can identify structural properties such as eigenfrequencies and mode shapes, these properties are used to evaluate the damage detection capabilities of the HaLOEWEn network. For the selected configuration of the reference signals, each bridge crossing of the train generates 12 to 20 trigger events at each reference signal. The reference mode shape of the bridge was derived from a 600 s measurement, which generated 1103 trigger events at $S_3$ and 873 trigger events at $S_{13}$.

To demonstrate the reproducibility of the mode shape determined for the undamaged structure, measurements of reduced duration have been carried out. The resulting mode shapes of the first symmetric vertical bending mode and the corresponding damage indexes at the sensors along the horizontal axis of the structure (i.e., $S_1$ to $S_5$) are calculated by Equation 82 and shown in Figure 75. The reference mode shape is shown in red, while the actually determined mode shape is shown in blue. The number of accumulations performed for each RD sequence decreases with the measurement duration and the remaining noise in the RD sequences results into larger $DI$ values. The possibility of false positive damage detection thus increases. For the experimentally determined damage detection threshold of 0.05, a 60 s measurement interval provides sufficiently low $DI$ values.

To simulate a damaged structure, single screwed connections near specific sensors were loosened. Figure 76 shows the mode shapes and the damage indexes determined from the damaged structure for 3 different damage locations. In all three cases, the damage has been detected, i.e., the damage detection threshold of 0.05, determined during the reference measurements, is exceeded. The damage location, however, is not reflected by the $DI$ diagram.

Figure 75: Mode-shapes and $DI_s$ of undamaged structure obtained from 300 s (left), 60 s (center), and 30 s (right) measurements
8.3 Performance and Energy Evaluation

Finally, the runtime and energy required by the hardware accelerator of the HaLoMote architecture for the RDT computation is compared against software processors typically used in mobile and WSN applications. Namely, the TI [CC2530] and the Atmel [ATmega256RFR2] were chosen as representative 8 bit MCUs, as these RF-System-on-Chips (SoCs) have also been integrated into the HaLoMote. Furthermore, the TI [CC430] RF-SoCs was chosen as representative 16 bit MCU, as the MSP430 architecture is widely used in many WSN motes. Most recent software processors for mobile applications are based on the ARM Cortex-M architecture, so the [STM32F407] and the TI [CC2650] were chosen as particularly powerful and energy-efficient state-of-the-art references.

For a fair comparison, all systems were configured with the same RDT settings as described in Section 8.2, i.e., 400 Hz sampling frequency, four sensor channels, two reference channels, 64-tap FIR filter and 256 samples per RD sequence. As the actual workload heavily depends on the number of detected trigger events, all systems were fed with the same pre-recorded sensor samples stored in the processors code memories and the trigger levels were chosen such that the actual number of triggered events is 60 and 40, respectively, as observed in Section 8.2. The firmware for all systems was built with the most recent compilers, configured to optimize for execution speed. The average runtime per sampling cycle measured with peripheral timers was combined with data-sheet information about the power consumption and the wake-up time from sleep mode as shown in Table 17. The deepest sleep mode with memory retention and enabled real-time clock was chosen for each system respectively. To derive the overall energy spent per sampling cycle ($E_{\text{overall}}$), a power-consumption of $P_{\text{active}}$ was assumed during wake-up, as the capacitors of the internal switching regulators have to be charged during the ramp-up. To better illustrate the consumed energy per sample by the different processor architectures, the corresponding system endurance ($t_{\text{alive}}$) achievable when supplied by a 1 Wh energy buffer (e.g., a typical NiMH AAA cell) was derived. Note that this estimation takes only the processor into account,
## Table 17: Resources required for executing the RDT on various processing units

<table>
<thead>
<tr>
<th>Processing Unit</th>
<th>TI CC2530</th>
<th>ATmega256RFR2</th>
<th>TI CC430</th>
<th>STM32F407</th>
<th>TI CC2650</th>
<th>AGL1000</th>
</tr>
</thead>
<tbody>
<tr>
<td>Architecture</td>
<td>8 bit</td>
<td>8 bit</td>
<td>16 bit</td>
<td>32 bit</td>
<td>32 bit</td>
<td></td>
</tr>
<tr>
<td></td>
<td>8051</td>
<td>AVR</td>
<td>MSP430</td>
<td>Cortex-M4F</td>
<td>Cortex-M3</td>
<td>IGLOO</td>
</tr>
<tr>
<td>Compiler</td>
<td>SDCC 3.4</td>
<td>AVR GCC 4.3.1</td>
<td>CL430 4.4.3</td>
<td>ARMGCC 4.9.3</td>
<td>ARMGCC 4.9.3</td>
<td>Synplify 2014</td>
</tr>
<tr>
<td>Main Clock [MHz]</td>
<td>32</td>
<td>16</td>
<td>20</td>
<td>128</td>
<td>48</td>
<td>8</td>
</tr>
<tr>
<td>VCC [V]</td>
<td>2.0</td>
<td>1.8</td>
<td>2.4</td>
<td>1.8</td>
<td>1.8</td>
<td>1.2</td>
</tr>
<tr>
<td>I_{active} [mA]</td>
<td>6.5</td>
<td>3.7</td>
<td>4.6</td>
<td>40</td>
<td>2.9</td>
<td>25</td>
</tr>
<tr>
<td>P_{active} [mW]</td>
<td>13</td>
<td>6.7</td>
<td>11</td>
<td>72</td>
<td>5.2</td>
<td>30</td>
</tr>
<tr>
<td>sleep mode</td>
<td>LPM2</td>
<td>POWER-SAVE</td>
<td>LPM3</td>
<td>STOP</td>
<td>STANDBY</td>
<td>Flash*Freeze</td>
</tr>
<tr>
<td>I_{idle} [μA]</td>
<td>1</td>
<td>1.5</td>
<td>5.3</td>
<td>280</td>
<td>1</td>
<td>44</td>
</tr>
<tr>
<td>P_{idle} [μW]</td>
<td>2</td>
<td>2.7</td>
<td>12.7</td>
<td>156</td>
<td>1.8</td>
<td>53</td>
</tr>
<tr>
<td>t_{active} [cycles/sample]</td>
<td>257,795</td>
<td>43,387</td>
<td>24,985</td>
<td>5,463</td>
<td>7,150</td>
<td>68</td>
</tr>
<tr>
<td>t_{active} [μs/sample]</td>
<td>8,056</td>
<td>2,712</td>
<td>1,249</td>
<td>43</td>
<td>149</td>
<td>9</td>
</tr>
<tr>
<td>t_{wakeup} [μs/sample]</td>
<td>100</td>
<td>34</td>
<td>150</td>
<td>110</td>
<td>151</td>
<td>1</td>
</tr>
<tr>
<td>t_{idle} [μs/sample]</td>
<td>-</td>
<td>-</td>
<td>1,101</td>
<td>2,347</td>
<td>2,200</td>
<td>2,490</td>
</tr>
<tr>
<td>E_{active} [nJ/sample]</td>
<td>104,728</td>
<td>18,170</td>
<td>13,739</td>
<td>3,096</td>
<td>775</td>
<td>270</td>
</tr>
<tr>
<td>E_{wakeup} [nJ/sample]</td>
<td>1,300</td>
<td>228</td>
<td>1,650</td>
<td>7,920</td>
<td>785</td>
<td>30</td>
</tr>
<tr>
<td>E_{idle} [nJ/sample]</td>
<td>-</td>
<td>-</td>
<td>14</td>
<td>366</td>
<td>4</td>
<td>132</td>
</tr>
<tr>
<td>E_{overall} [nJ/sample]</td>
<td>-</td>
<td>-</td>
<td>15,403</td>
<td>11,382</td>
<td>1,564</td>
<td>432</td>
</tr>
<tr>
<td>t_{alive} [d]</td>
<td>-</td>
<td>-</td>
<td>7</td>
<td>9</td>
<td>67</td>
<td>241</td>
</tr>
</tbody>
</table>
disregarding the sensors and the radio transceiver (which are assumed to be identical across the platforms examined).

As shown in Table 17, the 8 bit MCUs do not achieve the required sampling period of 2500 µs. The [CC430] requires about 50% of the sampling period for the RDT computations while consuming 15.4 µJ per sampling cycle. The powerful Cortex-M4 device is nearly 30 times faster than the [CC430], but its comparatively large power draw in idle mode still results in 11.4 µJ consumed per sampling cycle. The TI [CC2650] proves to be the most energy-efficient software-processor under consideration as it requires only 1.6 µJ per sampling cycle. However, the FPGA on the HaLOEWEn4 requires only 28% of the energy of the most efficient software processor. Note that the HaLOEWEn4 MCU causes an additional overhead, mainly due to its energy required during wake-up (228 nJ) and idle (6 nJ) periods. The combination of the hardware accelerator and the Atmel MCU, as used by the HaLOEWEn4 implementation, thus consumes only 44% of the energy of the most efficient software processor.

The energy efficiency of the HaLoMote easily exceeds that of the other platforms when actively performing computations, but it suffers when the node has to remain powered, but stays idle. In that case, its idle power consumption is 53 µW, which is nearly 30x the power drawn by the CC2650 MCU in idle mode. As sensor motes spend most of their lifetime in idle mode, special care must be taken on the HaLoMote to address this issue. In the SHM application, this can be achieved by having a supervisory power manager suspend the sensor sampling when no traffic is present on the structure. For the specific use case of railway bridges, these times may be more than 95% of the overall operating time. For these quiet periods, the non-volatile FRAM (see Section 4.4.2) can be employed to store the internal state of the hardware kernels, which has to be preserved once the supervisory manager completely powers down the hardware accelerator. Note that the FPGA configuration data is not affected by such a shutdown, as it is held on-chip in non-volatile Flash memory.

For the concrete SHM configuration discussed in this section, about 48 kbit of runtime state has to be preserved across shutdowns (i.e., 41 kbit for the RD sequences, 3 kbit for the FIR taps, 3 kbit for the delay FIFO, and 1 kbit for the trigger events). According to Equation 2 and 4, live variable swapping to the external SRAM pays off after approximately 2s shutdown, while swapping to FRAM becomes attractive only after about 16s. Beyond the railway bridge use case, such short idle-times occur even in many less frequently traveled automotive bridges. Thus, the capability of quickly and power-efficiently preserving the system state, while completely shutting down the accelerator, is attractive for a variety of applications.

### 8.4 Discussion of Results

In this section, the obtained results for the HaLOEWEn-based SHM system are compared with the related work described in Section 3.3.2.

The achieved measurement accuracy, i.e., the relative error of the detected eigenfrequencies compared to ground truth measurements of the wire-bound system (≤ 0.9%) is comparable with the outcome of [Bocca2011] (≤ 1.035%). All other wireless SHM systems missed the exact eigenfrequencies by up to 18%. One possible cause of this observation might be that [Bocca2011] is the only of the considered reference projects also relying on a laboratory scaled testbed with well-controlled boundary conditions, such as the artificial excitation of the structure. However, while [Bocca2011] used a white-
noise generating shaker, the model train set used in this thesis might be considered more realistic.

Another comparison worth looking at is the achievable data-aggregation factor. [Kim2007] reported a 20× down-sampling, but this can not be understood as data aggregation, as the authors just increased the sampling rate to obtain a better signal quality (i.e., reduced noise floor). [Battista2013] reported a 96% data aggregation (i.e., 25×) for 10 min measurements every 30 min. As stated in Section 7.3.2, the distributed RDT used by the HaLOEWEn system reduces the raw data stream by 469× for the same measurement duration.

Finally, the overall energy efficiency of the data acquisition systems must be compared against each other. It is hard to directly compare the energy or the power consumed by different systems against one another, without taking the quality and the information content of the gathered data into account (e.g., higher sampling rates result in higher power draw, but also in more accurate results). However, all systems considered in Section 3.3.2 were used to identify the modal properties of the observed structures. The smallest average power draw reported by the reference systems was 53 mW for a 60 s active measurement period [Bocca2011]. As analyzed in the previous section, the computational units of the HaLOEWEn4 consume 666 nJ per sample. At 400 Hz sampling rate, this equates to an average power draw of 266 µW just for processing. After the measurement period, $4 \cdot |R| = 8$ RD sequences per sensor node have to be transmitted to the base station, each consisting of $n_w \cdot 4 = 1024$ Byte. Even when assuming that only 10% of the available IEEE [802.15.4] data rate is available for actual payload transfer (e.g., due to packet loss and addressing overhead), and the maximum transmission power of the HaLOEWEn4 RF-SoC is required for a specific transmission range (i.e., 36 mW power consumption during transmission), the overall average power of the HaLOEWEn4 is increased by

$$\frac{8 \cdot 1024 \cdot 8 \text{ bit}}{250 \text{ kbit/s} \cdot 10\%} \cdot 36 \text{ mW} \cdot \frac{1}{60 \text{ sec}} = 1.57 \text{ mW}$$

The resulting average power drawn by the HaLOEWEn4 during a 60 s measurement is thus smaller than 2 mW or 29× times less the power required by [Bocca2011]. Note that in practice, the measurement duration will be much longer, thus even increasing the HaLOEWEn energy efficiency.

Finally, while all reference systems considered in Section 3.3.2 focus on system identification, the HaLOEWEn system also provides basic damage detection capabilities.
CHAPTER 9

Summary and Future Work

In this thesis, the Hardware-Accelerated Low Power Mote (HaLoMote) was proposed as a heterogeneous Wireless Sensor Network (WSN) architecture for hardware-accelerated energy-efficient distributed data aggregation. The Hardware-Accelerated Low Energy Wireless Embedded Sensor Node (HaLOEWEn) implementation of this architecture was detailed in conjunction with the platforms inter-processor communication and power management strategies. In addition to application-independent hardware accelerators for digital filters, lossless data compression, and linear regression, network communication primitives for flooding and time synchronization have been developed. These building blocks were evaluated in the context for three monitoring applications. One of these applications, i.e., a distributed Structural Health Monitoring (SHM) was analyzed more in detail, as an application-specific feature-extraction algorithm, namely the Random Decrement Technique (RDT), was hardware accelerated. A laboratory-scale SHM testbed was conducted for the system-level evaluation of the HaLoMote architecture. It could be shown that a HaLOEWEn-based wireless Data Acquisition (DAQ) system can identify the the dynamic characteristics relevant for the SHM application with an acceptable deterioration of the measurement accuracy compared to a wire-bound laboratory measurement system. At the same time, the energy efficiency of the HaLoMote platform was shown to outperform the energy efficiency of recent low-power microcontrollers by more than $2\times$.

9.1 Lessons Learned

Most of the research problems addressed in this thesis (see Section 1.2) can be answered clearly.

Which specific RCU is suited best to be integrated into a WSN mote?

FPGAs with a truly non-volatile configuration storage provide a low static power consumption with fast transitions between active and idle states. Even nowadays, the Microsemi [IGLOO] family provides the best trade-off between available logic resources and static power consumption.

How to integrate the RCU into the architecture of a WSN mote? To be more specific, how should the sensors, memories, processing and communication modules be arranged, and how should they communicate?

Using the FPGA as hardware accelerator next to a small RF-SoC is required, as a standalone FPGA cannot properly apply DPM-mechanisms during periods with reduced compute demands. However, two fundamental approaches for interconnecting the devices with each other and with the sensors were identified. The HaLoMote architecture connects all sensors to the FPGA and provides only a narrow communication link between the FPGA and the RF-SoC. The FPGA thus has to be woken up in every sampling cycle to query the next sensor samples. Although the DPM mechanism of the HaLoMote were designed to support
this fast state transitions, an alternative approach can be imagined. When connecting all sensors to the RF-SoC and interconnecting FPGA and RF-SoC with a shared memory, the hardware accelerator can be kept low until a sufficient amount of data is captured for processing. As this second approach was not evaluated throughout this thesis, this research question can not be clearly answered.

How does the RCU affect the DPM strategy of the WSN mote?

The clock supply and the external shutdown request required by the FPGA were identified as a major difficulty for the DPM of the entire platform. To decouple the sleep scheduling or FPGA and RF-SoC, two mechanisms were proposed, namely a hybrid clock supply and the self-controlled shutdown of the FPGA. To even save the power drawn by the FPGA in Flash*Freeze mode, the feasibility of live variable swapping to an external SRAM of FRAM was analyzed. For typical monitoring applications, this mechanism is, however, only useful as long as no active measurement is carried out.

How does the RCU affect the trade-off between in-sensor preprocessing and wireless data transmission? For which kinds of application do the improved data aggregation capabilities pay off in terms of overall energy consumption?

Due to the energy efficiency of the hardware accelerator, data aggregation pays off even if only a few percentage of the wireless communication bandwidth can be saved. The more relevant question thus is, which computational load justifies the usage of a hardware accelerator instead of a low-power MCU. For the SHM target application sampling four channels in parallel at 400 Hz, a recent ARM-based MCU was outperformed by the HaLOEWEn, however.

Once the RCU is integrated as application-specific accelerator, which generic WSN services can also benefit from the improved processing capabilities?

In this thesis, the hardware acceleration of a generic lossless data compression module and a linear regression solver used for improved time synchronization was proposed. Although not analyzed in this thesis, many other services like encryption or forward error correction are known to also benefit from hardware acceleration.

9.2 Remaining Research and Engineering Challenges

Although many algorithms and methods have been analyzed, implemented and tested for the HaLoMote platform, a lot of additional and complementary research and engineering challenges can be identified.

First of all, the following improvements can be achieved without major modifications of the HaLoMote design: Over-the-Air-Programming of the FPGA is useful to simplify in-field reconfiguration or the entire mote. For larger deployments and testbeds, this feature is almost mandatory. The HaLOEWEn4 MCU already has access to the JTAG port of the FPGA and Microsemi provides platform-independent firmware for the MCU-based JTAG-programming of their devices (namely DirectC and the Stable-Player). However, some additional effort should be spent on the bitstream transport throughout the network to efficiently support incremental updates or the delivery of only slightly differing bitstreams to different, but nearby network nodes.
Furthermore, the support for higher-order predictive compression can be implemented by just integrating higher-order linear equation solvers to the ADPCM kernel described in Section 5.2.2. The multi-channel conditional monitoring application (Section 7.2) was shown to benefit from a fourth order predictor in terms of compression ratio, but the energy required for this more complex encoder could not be analyzed in this thesis.

Two improvements for the multi-source flooding mechanism can be envisioned. Instead of assuming the knowledge of the entire topology (i.e., the connectivity and interference graph), some sophisticated protocols should be developed for an automated topology discovery. While the connectivity graph can be easily generated by well known neighbor-discovery algorithms [Dheap2003], capturing the interference graph is rather difficult. In addition, the flooding scheduler could exploit channel switching for congestion resolution and integrate energy-balancing to equally distribute the transceiver activities to all network nodes, instead to building hot spots.

Only minor changes to the HaLOEWEn design are required to support the live variable swapping described in Section 4.4.2. So far the core voltage supply of the FPGA can not be turned off completely, so an additional power FET controlled by the MCU is required.

If a major redesign of the HaLOEWEn is envisioned, than a new Flash-based FPGA device family is hopefully available. The [IGLOO] family is built with a quite outdated 130 nm CMOS process. The advantage of the hardware accelerator against the most recent MCUs has been reduced down to a factor of two. A break-even between the old FPGA device and new software processors can thus be expected in the near future.

Furthermore, the alternative hardware accelerator architecture based on shared memory between RCU and RF-SoC (Figure 13b) must be evaluated. It can be expected that this architecture is better suited for the higher sampling rates required by most condition monitoring applications, as the FPGA does not have to be power-cycled in every sampling cycle with this approach.

Most important, however, the HaLOEWEn network must be deployed in a real world scenario (e.g., bridge monitoring) to evaluate its long term robustness and energy efficiency under realistic conditions such as increased packet error rates caused by foggy weather.

Summary and Future Work
Bibliography

Scientific Publications


Floyd1962 Robert W. Floyd. “Algorithm 97: Shortest Path”. In: Communications of the ACM 5.6 (1962), pp. 345–.


Hardware Datasheets


ATmega128 Atmel. 8-bit Microcontroller with 128K Bytes In-System Programmable Flash. URL: http://www.atmel.com (visited on 2016-11-14).

ATmega256RFR2 Atmel. 8-bit AVR Microcontroller with Low Power 2.4 GHz Transceiver for ZigBee and IEEE 802.15.4. 8393B-MCU. 2013. URL: http://www.atmel.com/devices/atreseventytwo.aspx (visited on 2016-11-14).


CC2530 Texas Instruments. CC253x System-on-Chip Solution for 2.4-GHz IEEE 802.15.4 and ZigBee Applications. URL: http://www.ti.com/product/cc2530 (visited on 2016-11-14).


iCE65 Silicon Blue. iCE65 Ultra Low-Power mobileFPGA Family. 2010.


MB85RS1MT Fujitsu. Memory FRAM 1M (128 K x 8) Bit SPI. 2013.


Software Tools


Specifications and Standards


Others


<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>ACLK</td>
<td>auxiliary clock</td>
</tr>
<tr>
<td>ADC</td>
<td>Analog-to-Digital Converter</td>
</tr>
<tr>
<td>ADPCM</td>
<td>Adaptive Differential Pulse Code Modulation</td>
</tr>
<tr>
<td>ALAC</td>
<td>Apple Lossless Audio Codec</td>
</tr>
<tr>
<td>ALM</td>
<td>Adaptive Logic Module</td>
</tr>
<tr>
<td>ALU</td>
<td>Arithmetic Logic Unit</td>
</tr>
<tr>
<td>API</td>
<td>Application Programming Interface</td>
</tr>
<tr>
<td>ASIC</td>
<td>Application-Specific Integrated Circuit</td>
</tr>
<tr>
<td>ASM</td>
<td>Auto-Sequencing Memory</td>
</tr>
<tr>
<td>BA</td>
<td>Back Annotation</td>
</tr>
<tr>
<td>BDI</td>
<td>Bridge Diagnostics Inc</td>
</tr>
<tr>
<td>BLE</td>
<td>Bluetooth Low Energy</td>
</tr>
<tr>
<td>BRAM</td>
<td>Block RAM</td>
</tr>
<tr>
<td>CGRA</td>
<td>Coarse-Grained Reconfigurable Architecture</td>
</tr>
<tr>
<td>CLB</td>
<td>Configurable Logic Block</td>
</tr>
<tr>
<td>CMem</td>
<td>Configuration Memory</td>
</tr>
<tr>
<td>COTS</td>
<td>commercial off-the-shelf</td>
</tr>
<tr>
<td>CPLD</td>
<td>Complex Programmable Logic Device</td>
</tr>
<tr>
<td>CPS</td>
<td>Cyber Physical System</td>
</tr>
<tr>
<td>CPU</td>
<td>Central Processing Unit</td>
</tr>
<tr>
<td>CSMA</td>
<td>Carrier Sense Multiple Access</td>
</tr>
<tr>
<td>DARPA</td>
<td>United States Defense Advanced Research Projects Agency</td>
</tr>
<tr>
<td>DAQ</td>
<td>Data Acquisition</td>
</tr>
<tr>
<td>DC/DC</td>
<td>Voltage Level Converter</td>
</tr>
<tr>
<td>DFT</td>
<td>Discrete Fourier Transformation</td>
</tr>
<tr>
<td>DMA</td>
<td>Direct Memory Access</td>
</tr>
<tr>
<td>DPCM</td>
<td>Differential Pulse Code Modulation</td>
</tr>
<tr>
<td>DPM</td>
<td>Dynamic Power Management</td>
</tr>
<tr>
<td>DSN</td>
<td>Distributed Sensor Networks</td>
</tr>
<tr>
<td>DSP</td>
<td>Digital Signal Processor</td>
</tr>
<tr>
<td>DVFS</td>
<td>Dynamic Voltage and Frequency Scaling</td>
</tr>
<tr>
<td>Abbreviation</td>
<td>Description</td>
</tr>
<tr>
<td>--------------</td>
<td>-------------</td>
</tr>
<tr>
<td>EMA</td>
<td>Experimental Modal Analysis</td>
</tr>
<tr>
<td>FB</td>
<td>Functional Block</td>
</tr>
<tr>
<td>FDD</td>
<td>Frequency Domain Decomposition</td>
</tr>
<tr>
<td>FET</td>
<td>Field-Effect Transistor</td>
</tr>
<tr>
<td>FF</td>
<td>Flip-Flop</td>
</tr>
<tr>
<td>FFT</td>
<td>Fast Fourier Transformation</td>
</tr>
<tr>
<td>FIFO</td>
<td>First In, First Out</td>
</tr>
<tr>
<td>FIR</td>
<td>Finite Impulse Response</td>
</tr>
<tr>
<td>FLAC</td>
<td>Free Lossless Audio Codec</td>
</tr>
<tr>
<td>FPGA</td>
<td>Field Programmable Gate Array</td>
</tr>
<tr>
<td>FRAM</td>
<td>Ferroelectric RAM</td>
</tr>
<tr>
<td>FTSP</td>
<td>Flooding Time Synchronization Protocol</td>
</tr>
<tr>
<td>GAG</td>
<td>Generic Address Generator</td>
</tr>
<tr>
<td>GPIO</td>
<td>General Purpose Input/Output</td>
</tr>
<tr>
<td>GPU</td>
<td>Graphics Processing Unit</td>
</tr>
<tr>
<td>HaLOEWEEn</td>
<td>Hardware-Accelerated Low Energy Wireless Embedded Sensor Node</td>
</tr>
<tr>
<td>HaLoMote</td>
<td>Hardware-Accelerated Low Power Mote</td>
</tr>
<tr>
<td>HDL</td>
<td>Hardware Description Language</td>
</tr>
<tr>
<td>HLS</td>
<td>High-Level Synthesis</td>
</tr>
<tr>
<td>HMI</td>
<td>Human-Machine Interface</td>
</tr>
<tr>
<td>HPC</td>
<td>High-Performance Computing</td>
</tr>
<tr>
<td>HW</td>
<td>hardware</td>
</tr>
<tr>
<td>IC</td>
<td>Integrated Circuit</td>
</tr>
<tr>
<td>ICP</td>
<td>integrated circuit piezoelectric</td>
</tr>
<tr>
<td>ICT</td>
<td>Information and Communication Technology</td>
</tr>
<tr>
<td>IDE</td>
<td>Integrated Development Environment</td>
</tr>
<tr>
<td>IEEE</td>
<td>Institute of Electrical and Electronics Engineers</td>
</tr>
<tr>
<td>FC</td>
<td>Inter-Integrated Circuit</td>
</tr>
<tr>
<td>ILP</td>
<td>Integer Linear Program</td>
</tr>
<tr>
<td>IO</td>
<td>Input/Output</td>
</tr>
<tr>
<td>IOB</td>
<td>Input/Output Block</td>
</tr>
<tr>
<td>IoT</td>
<td>Internet of Things</td>
</tr>
<tr>
<td>Abbreviation</td>
<td>Description</td>
</tr>
<tr>
<td>--------------</td>
<td>-------------</td>
</tr>
<tr>
<td>IP</td>
<td>Intellectual Property</td>
</tr>
<tr>
<td>ISA</td>
<td>International Society of Automation</td>
</tr>
<tr>
<td>ISM</td>
<td>Industrial, Scientific and Medical</td>
</tr>
<tr>
<td>JTAG</td>
<td>Joint Test Action Group</td>
</tr>
<tr>
<td>LBF</td>
<td>Institute for Structural Durability and System Reliability</td>
</tr>
<tr>
<td>LC</td>
<td>Logic Cell</td>
</tr>
<tr>
<td>LE</td>
<td>Logic Element</td>
</tr>
<tr>
<td>LED</td>
<td>Light-Emitting Diode</td>
</tr>
<tr>
<td>LMS</td>
<td>LMS Test.Lab 14A</td>
</tr>
<tr>
<td>LR</td>
<td>Linear Regression</td>
</tr>
<tr>
<td>LR-WPAN</td>
<td>Low Rate WPAN</td>
</tr>
<tr>
<td>LUT</td>
<td>Lookup Table</td>
</tr>
<tr>
<td>LZW</td>
<td>Lempel-Ziv-Welch</td>
</tr>
<tr>
<td>MAC</td>
<td>Medium Access Control</td>
</tr>
<tr>
<td>MACC</td>
<td>Multiply-Accumulate</td>
</tr>
<tr>
<td>MCLK</td>
<td>main clock</td>
</tr>
<tr>
<td>MCU</td>
<td>microcontroller</td>
</tr>
<tr>
<td>MEMS</td>
<td>micro-electro-mechanical</td>
</tr>
<tr>
<td>MMIO</td>
<td>memory-mapped IO</td>
</tr>
<tr>
<td>MSE</td>
<td>Modal Strain Energy</td>
</tr>
<tr>
<td>NFC</td>
<td>Near Field Communication</td>
</tr>
<tr>
<td>NiCd</td>
<td>Nickel-Cadmium</td>
</tr>
<tr>
<td>NiMH</td>
<td>Nickel-Metal Hydride</td>
</tr>
<tr>
<td>NRE</td>
<td>non-recurring engineering costs</td>
</tr>
<tr>
<td>NVM</td>
<td>non-volatile memory</td>
</tr>
<tr>
<td>OMA</td>
<td>Operational Modal Analysis</td>
</tr>
<tr>
<td>OSC</td>
<td>Oscillator</td>
</tr>
<tr>
<td>OSI</td>
<td>Open Systems Interconnection</td>
</tr>
<tr>
<td>PAL</td>
<td>Programmable Array Logic</td>
</tr>
<tr>
<td>PAR</td>
<td>Place and Route</td>
</tr>
<tr>
<td>PC</td>
<td>Personal Computer</td>
</tr>
<tr>
<td>PCB</td>
<td>Printed Circuit Board</td>
</tr>
<tr>
<td>Abbreviation</td>
<td>Definition</td>
</tr>
<tr>
<td>--------------</td>
<td>----------------------------------</td>
</tr>
<tr>
<td>PE</td>
<td>Processing Element</td>
</tr>
<tr>
<td>PEG</td>
<td>Piezoelectric Generator</td>
</tr>
<tr>
<td>PLA</td>
<td>Programmable Logic Array</td>
</tr>
<tr>
<td>PLD</td>
<td>Programmable Logic Device</td>
</tr>
<tr>
<td>PLL</td>
<td>Phase Locked Loop</td>
</tr>
<tr>
<td>ppm</td>
<td>parts per million</td>
</tr>
<tr>
<td>PROM</td>
<td>Programmable Read-Only Memory</td>
</tr>
<tr>
<td>RAM</td>
<td>Random Access Memory</td>
</tr>
<tr>
<td>RCU</td>
<td>Reconfigurable Compute Unit</td>
</tr>
<tr>
<td>RD</td>
<td>Random Decrement</td>
</tr>
<tr>
<td>RDT</td>
<td>Random Decrement Technique</td>
</tr>
<tr>
<td>RFID</td>
<td>Radio-frequency Identification</td>
</tr>
<tr>
<td>RF-SoC</td>
<td>Radio Frequency System-on-Chip</td>
</tr>
<tr>
<td>RLR</td>
<td>Rolling Linear Regression</td>
</tr>
<tr>
<td>RLRCT</td>
<td>Rolling Linear Regression with Coordinate Transformation</td>
</tr>
<tr>
<td>RMS</td>
<td>Root Mean Square</td>
</tr>
<tr>
<td>ROM</td>
<td>Read-Only Memory</td>
</tr>
<tr>
<td>RSSI</td>
<td>Received Signal Strength Indication</td>
</tr>
<tr>
<td>RTL</td>
<td>Register Transfer Level</td>
</tr>
<tr>
<td>RTOS</td>
<td>Real-Time Operating System</td>
</tr>
<tr>
<td>RTSP</td>
<td>Recursive Time Synchronization Protocol</td>
</tr>
<tr>
<td>SDF</td>
<td>Standard Delay Format</td>
</tr>
<tr>
<td>SFD</td>
<td>start of frame delimiter</td>
</tr>
<tr>
<td>SHM</td>
<td>Structural Health Monitoring</td>
</tr>
<tr>
<td>SIMD</td>
<td>Single Instruction Multiple Data</td>
</tr>
<tr>
<td>SM</td>
<td>Switch Matrix</td>
</tr>
<tr>
<td>SoC</td>
<td>System-on-Chip</td>
</tr>
<tr>
<td>SPI</td>
<td>Serial Peripheral Interface</td>
</tr>
<tr>
<td>SPLD</td>
<td>Simple Programmable Logic Device</td>
</tr>
<tr>
<td>SRAM</td>
<td>Static RAM</td>
</tr>
<tr>
<td>SRD</td>
<td>Short Range Device</td>
</tr>
<tr>
<td>SVD</td>
<td>Singular Value Decomposition</td>
</tr>
<tr>
<td>Abbreviation</td>
<td>Description</td>
</tr>
<tr>
<td>--------------</td>
<td>-------------</td>
</tr>
<tr>
<td>TEG</td>
<td>Thermoelectric Generator</td>
</tr>
<tr>
<td>TI</td>
<td>Texas Instruments</td>
</tr>
<tr>
<td>UC</td>
<td>University of California</td>
</tr>
<tr>
<td>USART</td>
<td>Universal Synchronous and Asynchronous Serial Receiver/Transmitter</td>
</tr>
<tr>
<td>USB</td>
<td>Universal Serial Bus</td>
</tr>
<tr>
<td>VHDL</td>
<td>Very High Speed Integrated Circuit Hardware Description Language</td>
</tr>
<tr>
<td>VLIW</td>
<td>Very Long Instruction Word</td>
</tr>
<tr>
<td>WAN</td>
<td>Wide Area Network</td>
</tr>
<tr>
<td>WBAN</td>
<td>Wireless Body Area Network</td>
</tr>
<tr>
<td>WINS</td>
<td>Wireless Integrated Network Sensors</td>
</tr>
<tr>
<td>WISP</td>
<td>Wireless Identification and Sensing Platform</td>
</tr>
<tr>
<td>WLAN</td>
<td>Wireless Local Area Network</td>
</tr>
<tr>
<td>WSN</td>
<td>Wireless Sensor Network</td>
</tr>
<tr>
<td>WPAN</td>
<td>Wireless Personal Area Network</td>
</tr>
</tbody>
</table>
## List of Figures

1. Examples for wirelessly communicating consumer electronics, industrial products, and medical devices ........................................... 2
2. Typical monitoring application realized by a WSN, which forwards the captured data from the leaf nodes over intermediates routers and a dedicated gateway to a remote base station ........................................... 9
3. Generic architecture of a WSN mote ........................................... 10
4. OSI layers used in a WSN communication protocol stack ............... 10
5. Classification of WSN topologies consisting of leaf nodes, routers, and gateways ................................................................. 11
7. Two basic principles of programmable processor architectures .......... 13
8. Simple Programmable Logic Devices realizing boolean functions ......... 15
9. Configurable building blocks of an FPGA ....................................... 16
10. Generic architecture of an FPGA ................................................ 16
11. FPGA tool flow transforming resources between different abstraction levels ................................................................. 17
12. Design options for the computation and communication architecture .... 31
13. Design options for attaching sensors (S) and memories (M) to the compute units ................................................................. 32
14. HaLOEWE\textsuperscript{n} version 3 (2010): Basic components ........... 33
15. HaLOEWE\textsuperscript{n} version 3 (2010): 100 mm × 62 mm PCB .......... 35
16. Per-component breakdown of the HaLOEWE\textsuperscript{n}3 power consumption ................................................................. 35
17. HaLOEWE\textsuperscript{n} version 4 (2014): Basic components ............... 36
18. HaLOEWE\textsuperscript{n} version 4 (2014): 46 mm × 30 mm PCB .......... 37
19. Hardware kernels ................................................................. 38
20. Scheduling of DPM and inter-processor communication for a single output-only HW kernel ................................................................. 39
21. Startup timing of the [LTC6930] external oscillator ......................... 42
22. Clocking the RCU by an external oscillator controlled by the MCU .... 43
23. Clocking the RCU by an external oscillator controlled by the RCU .......... 43
24. Clocking the RCU by the MCU ................................................ 44
25. Clocking the RCU by two sources ........................................... 44
26. Implementation of an RCU-internal oscillator, consisting of a configurable number of AO14 cells used as delay elements as described in [AC332] ................................................................. 45
27. Clocking the RCU by an internal oscillator ................................... 46
28. Load-Dependent power draw for the considered clocking schemes .......... 46
29. Frequency stability of on-chip oscillator with 400 delay cells for variable supply voltage and environmental temperature ........................................... 47
30. Effect of delay cell number and distance on the oscillator cycle time .......... 48
31. Structure of the generic ADPCM hardware kernel. The gray modules are dedicated to the online coefficient adaption ........................................... 54
32. Test setup with multiple ADPCM hardware kernels ......................... 55
33. Measurement setup ................................................................. 56
34. Movement of sliding window and coordinate system by RLRCT update ................................................................. 58
35. HW Kernel (μArchitecture) for RLRCT ........................................... 63
36. Example network with $N_0 = \{0, 1\}$, $N = N_0 \cup \{2\}$, $E_C = \{(0, 2), (2, 0), (1, 2), (2, 1)\}$, $E_I = E_C \cup \{(0, 1), (1, 0)\}$ and its ILP solution (variables set to 1) ................................................................. 72
37. Averaged scheduling costs relative to blind flooding for 20 nodes in an 150 m × 150 m deployment area ................................................................. 76
38. Averaged scheduling costs relative to blind flooding for 20 nodes in an 100 m × 100 m deployment area ................................................................. 77
39. Heuristic scheduling costs relative to optimal scheduling costs .......... 77
List of Figures

40 Averaged runtime of heuristic relative to runtime of ILP-solver (both executed on the same compute server) .............................................. 78
41 Averaged schedule length relative to blind flooding for 20 nodes in an $150\, \text{m} \times 150\, \text{m}$ deployment area ........................................... 78
42 Averaged schedule length relative to blind flooding for 20 nodes in an $100\, \text{m} \times 100\, \text{m}$ deployment area ........................................... 79
43 Network setup for timestamp capturing ......................................... 82
44 Observed clock drift relative to gateway under temperature variation (0 min to 55 min: $21^\circ \text{C}$, 55 min to 80 min: $7^\circ \text{C}$ at N4, 55 min to 65 min: $52^\circ \text{C}$ at N3, 65 min to 75 min: $40^\circ \text{C}$ at N3) ........................................... 83
45 Impact of the synchronization period on estimated clock drift at N1 (regression table size = 2) ...................................................... 84
46 Impact of the synchronization period on the synchronization accuracy at N1 (regression table size = 2) ...................................................... 84
47 Impact of the regression table size on estimated clock drift at N1 (synchronization period = 10 s) ...................................................... 85
48 Impact of the regression table size on the synchronization accuracy at N1 (synchronization period = 10 s) ...................................................... 85
49 Impact of the regression table size on estimated clock drift at N3 (heated up to 40 $^\circ \text{C}$, synchronization period = 10 s) ...................................................... 86
50 Impact of the regression table size on the synchronization accuracy at N3 (heated up to 40 $^\circ \text{C}$, synchronization period = 10 s) ...................................................... 86
51 Impact of the hop distance to the reference node on synchronization accuracy (synchronization period = 10 s, table size = 2) ...................................................... 87
52 Overall execution time for the LR implementations ................................ 88
53 MCU-RAM requirement of the LR implementations ................................ 89
54 MCU-ROM requirement of the LR implementations ................................ 89
55 8192 samples of neural activity data (min = -155, max = 169, mean = 6.3, stddev = 45.5) ...................................................... 91
56 Reduction of neural activity data achieved by various compression schemes...................................................... 93
57 Reduction of condition monitoring data achieved by various compression algorithms ...................................................... 95
58 Impact of prediction order on ADPCM compression ratio ................................ 95
59 SHM dataflow based on experimental and operational modal analysis ...................................................... 97
60 Accumulation of triggered signal windows for the Random Decrement Technique. Each window represents the qualitative behavior of the sensor signal (i.e., acceleration or deflection) over time ...................................................... 99
61 FIR and FIFO in single BRAM ...................................................... 101
62 Sensor interface, trigger event detection and precomputation required for calculating the standard deviation ...................................................... 101
63 Accumulation of RD sequence (Equation 77) ...................................................... 102
64 Handling of trigger events $E_r = \{e_{r,1}, \ldots, e_{r,m}\}$ for reference $r \in R$ ...................................................... 102
65 RDT kernel for four sensor channels and two reference signals ...................................................... 103
66 Scheduling of parallel computations ...................................................... 104
67 Example for the main processing sequence of the distributed RDT ...................................................... 105
68 SHM demonstrator exhibit located at the Transfer Center for Adaptronics (Fraunhofer LBF) ...................................................... 109
69 Sensor position and orientation on the truss bridge model ...................................................... 110
70 Mounting of sensor mote and sensors ...................................................... 110
71 Wire-bound reference DAQ systems (LMS Test.Lab 14A with 12 ICP accelerometers) ...................................................... 111
72 RD sequence $D_{3,13}$ captured by the WSN ...................................................... 112
Frequency Response Function at $S_3$ captured with both DAQ systems 113
Mode shapes captured with both DAQ systems 113
Mode-shapes and $DI_s$ of undamaged structure obtained from 300 s (left), 60 s (center), and 30 s (right) measurements 114
Mode-shapes and $DI_s$ of damaged structure obtained from a 300 s measurement after loosening a screw at $S_1$ (left), $S_3$ (center) or $S_5$ (right) by a $5/6$ rotation 115

List of Tables

<p>| | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Market share [TMR2015-FPGA] of leading FPGAs vendors and logic resources provided by some of their devices</td>
</tr>
<tr>
<td>2</td>
<td>Examples of WSN motes with focus on FPGA-based architectures</td>
</tr>
<tr>
<td>3</td>
<td>FPGA devices with non-volatile configuration storage ($P_{idle}$ specifies lowest possible power consumption with RAM retention)</td>
</tr>
<tr>
<td>4</td>
<td>Atomic operations for MCU $\leftrightarrow$ RCU communication</td>
</tr>
<tr>
<td>5</td>
<td>Power consumption and streaming throughput of different 1 Mbit memories operated at 1.8 V</td>
</tr>
<tr>
<td>6</td>
<td>Power-Efficiency of on-chip oscillator architectures</td>
</tr>
<tr>
<td>7</td>
<td>Example for DPCM and Rice coder: $M = 3, a_1 = 0.5, a_2 = 0.3, a_3 = 0.2, W = 4$</td>
</tr>
<tr>
<td>8</td>
<td>Synthesis results for the Microsemi IGLOO AGL1000V2 device</td>
</tr>
<tr>
<td>9</td>
<td>26 bit $\mu$Instruction controlling the RLRCT $\mu$Architecture</td>
</tr>
<tr>
<td>10</td>
<td>$\mu$Instruction sequence implementing Algorithm 1 on the $\mu$Architecture</td>
</tr>
<tr>
<td>11</td>
<td>Ressource requirements of hardware accelerator architectures on M1AGL1000</td>
</tr>
<tr>
<td>12</td>
<td>Compression algorithms and options used in further evaluation</td>
</tr>
<tr>
<td>13</td>
<td>Energy required to compress/transmit 2048 samples of neural activity data</td>
</tr>
<tr>
<td>14</td>
<td>Statistical characteristics of 8192 samples of machinery condition monitoring data</td>
</tr>
<tr>
<td>15</td>
<td>Energy required to compress/transmit 3 $\times$ 2048 MEMS samples</td>
</tr>
<tr>
<td>16</td>
<td>Detected dominant mode shapes</td>
</tr>
<tr>
<td>17</td>
<td>Resources required for executing the RDT on various processing units</td>
</tr>
</tbody>
</table>

List of Algorithms

<p>| | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Update procedure for Rolling Linear Regression with Coordinate Transformation</td>
</tr>
<tr>
<td>2</td>
<td>Update procedure for Rolling Linear Regression (without Coordinate Transformation)</td>
</tr>
<tr>
<td>3</td>
<td>Heuristic scheduling for multi-source flooding</td>
</tr>
<tr>
<td>4</td>
<td>Conversion from local to global time</td>
</tr>
<tr>
<td>5</td>
<td>Conversion from global to local time</td>
</tr>
</tbody>
</table>

List of Code Listings

<p>| | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>8051 assembler for a sample 24 bit $\times$ 16 bit multiplication $x \cdot y = z$ in the directly addressable memory</td>
</tr>
</tbody>
</table>