2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Attention-Based Transformers for Instance Segmentation of Cells in Microstructures

Tim Prangemeier, Christoph Reich, Heinz Koeppl‡
Centre for Synthetic Biology, Department of Electrical Engineering and Information Technology, Department of Biology, Technische Universität Darmstadt
‡heinz.koeppl@bcs.tu-darmstadt.de

Abstract—Detecting and segmenting object instances is a common task in biomedical applications. Examples range from detecting lesions on functional magnetic resonance images, to the detection of tumours in histopathological images and extracting quantitative single-cell information from microscopy imagery, where cell segmentation is a major bottleneck. Attention-based transformers are state-of-the-art in a range of deep learning fields. They have recently been proposed for segmentation tasks, where they are beginning to outperform other methods. We present a novel attention-based cell detection transformer (Cell-DETR) for direct end-to-end instance segmentation. While the segmentation performance is on par with a state-of-the-art instance segmentation method, Cell-DETR is simpler and faster. We showcase the method's contribution in a typical use case of segmenting yeast in microstructured environments, commonly employed in systems or synthetic biology. For the specific use case, the proposed method surpasses the state-of-the-art tools for semantic segmentation and additionally predicts the individual object instances. The fast and accurate instance segmentation performance increases the experimental information yield for a posteriori data processing and makes online monitoring of experiments and closed-loop optimal experimental design feasible. Code and data samples are available at https://git.rwth-aachen.de/bcs/projects/cell-detr.git.

Index Terms—attention, instance segmentation, transformers, single-cell analysis, synthetic biology, microfluidics, deep learning

I. INTRODUCTION

Instance segmentation is a common task in biomedical applications. It comprises both detecting individual object instances and segmenting them [1], [2]. Prevalent examples in healthcare and life sciences include the detection of individual tumour or cell entities and the segmentation of their shape. Recent advances in automated single-cell image processing, such as instance segmentation, have contributed to early tumour detection, personalised medicine, biological signal transduction and insight into the mechanisms behind cell population heterogeneity, amongst others [3]–[6]. An example of instance segmentation is shown in Fig. 1, with four separate cell instances and two trap microstructures detected and segmented individually.

Object detection and panoptic segmentation are closely related to instance segmentation [2], [7]. Carion et al. recently proposed a novel attention-based detection transformer, DETR, for panoptic segmentation [8]. DETR achieves state-of-the-art panoptic segmentation performance, while exhibiting a comparatively simple architecture that is easier to implement and is computationally more efficient than previous approaches [8]. Its simplicity promises to be beneficial for its adoption in real-world applications.

Time-lapse fluorescence microscopy (TLFM) is a powerful technique for studying cellular processes in living cells [4], [9]–[11]. The vast amount of quantitative data TLFM yields promises to constitute the backbone of the rational design of de novo biomolecular functionality [10], [11]. Ideally in synthetic biology, well characterised parts are combined in silico in a quantitatively predictive, bottom-up approach [11]–[13], for example, to detect and kill cancer cells [14], [15]. Quantitative TLFM with high-throughput microfluidics is an essential technique for concurrently studying the heterogeneity and dynamics of synthetic circuitry on the single-cell level [4], [9], [11]. A typical TLFM experiment yields thousands of specimen images (Fig. 1) requiring automated segmentation; examples include [5], [16], [17]. Segmenting each individual cell enables its pertinent information to be extracted quantitatively. For example, the abundance of a fluorescent reporter can be measured, giving insight into the cell's inner workings.

Fig. 1. Schematic of Cell-DETR direct instance segmentation discerning individual cell (colour) and trap microstructure (grey) object instances.

Instance segmentation is a major bottleneck in quantifying single-cell microscopy data and manual analysis is prohibitively labour intensive [9], [11], [16], [18], [19]. The vast majority of single-cell segmentation methods are designed for a posteriori data processing and often require post-processing for instance detection or manual input [9]. This not only limits the number of experiments that can be performed, but also the type of experiments [18], [20]. For example, harnessing the potential of advanced closed-loop optimal experimental design techniques [12], [21], [22] requires online monitoring with fast instance segmentation capabilities. Attention-based methods, such as the recently proposed detection transformer DETR [8], are increasingly outperforming other methods [8], [23]. For the yeast-trap configuration (Fig. 1), direct instance segmentation has yet to be employed and attention-based transformers have yet to be applied for segmentation in the biomedical fields in general.

In this study, we present Cell-DETR, a novel attention-based detection transformer for instance segmentation of biomedical samples based on DETR [8]. We address the automated cell instance segmentation bottleneck for yeast cells in microstructured environments (Fig. 1) and showcase Cell-DETR on this application. Section II introduces the previous segmentation approaches and the microstructured environment. Our experimental setup for fluorescence microscopy, the tested architectures and our approach to training and evaluation are presented in Section III. We analyse the proposed method's performance in Section IV and compare it to the application-specific state-of-the-art, as well as to a general instance segmentation baseline.
After interpreting the results and highlighting the method's future potential in Section V, we summarise and conclude the study in Section VI. Our model surpasses the previous application baseline and is on par with a general state-of-the-art instance segmentation method. The relatively short inference runtimes enable higher-throughput a posteriori data processing and make online monitoring of experiments with approximately 1000 cell traps feasible.

II. BACKGROUND

An extensive body of research into the automated processing of microscopy imagery dates back to the middle of the 20th century. Recent studies demonstrate the utility of deep learning segmentation approaches, for example [6], [9], [19], [24], [25]. Comprehensive reviews of the many methods to segment yeast on microscopy imagery are available elsewhere [3], [20]. Here we focus on dedicated tools for segmenting cells in trap microstructures. U-Net convolutional neural networks (CNNs) with an encoder-decoder architecture bridged by skip connections have been shown to perform semantic segmentation well for E. coli mother machines [9], [19] and yeast in microstructured environments [6]. In the case of trapped yeast, the previous state-of-the-art tool DISCO [16] was based on conventional methods (template matching, support vector machine, active contours), until recently being superseded by U-Nets [6]. The current baseline for semantic segmentation of yeast in microstructured environments, as measured by the cell class intersection-over-union, is 0.82 [6]. Additional post-processing of the segmentation maps is required to attain each individual cell instance [6].

For instance segmentation in general, recent state-of-the-art methods are available, for example Mask R-CNN [1]. It is a proposal-based instance segmentation model, which combines a CNN backbone, region proposals with non-maximum suppression, region-of-interest (ROI) pooling, and multiple prediction heads [1]. Attention-based methods are increasingly outperforming convolutional methods and are currently state-of-the-art in natural language processing [23]. Beyond natural language processing, attention-based approaches, such as axial-attention modules [26], have demonstrated promising results in computer vision applications [8]. Recently, the first transformer-based method (DETR [8]) for object detection and panoptic segmentation was reported. DETR achieves state-of-the-art results on par with Faster R-CNN and constitutes a promising approach for further improvements in automated object detection and segmentation performance.

Fig. 2. Single-cell fluorescence measurement setup. Microfluidic chip on the microscope table (top right), microscope imagery and design of the yeast trap microstructures. The trap chamber (green rectangle) contains an array of approximately 1000 traps. Single specimen images show a pair of microstructures and fluorescent cells; violet contours indicate segmentation of two separate cell instances with corresponding fluorescence measurements F1 and F2; black scale bar 1 mm, white scale bar 10 µm.

The microfluidic trap microstructures we consider here are designed for long-term culture of yeast cells (Saccharomyces cerevisiae) within the focal plane of a microscope [17]. The on-chip environment is tightly controlled and conducive to yeast growth. Examples of its routine use include Fig. 2 and [4]–[6], [11], [16].
A constant flow of yeast growth media hydrodynamically traps the cells in the microstructures and allows the introduction of chemical perturbations. An automated microscope records an entire trap chamber of up to 1000 traps by imaging both the brightfield and fluorescent channels at approximately 20 neighbouring positions. Typical experiments each produce hundreds of GB of image data. Time-lapse recordings allow individual cells to be tracked through time. Robust instance segmentation facilitates tracking [9], [20], which itself can be a limiting factor with regard to the data yield of an experiment [4], [9], [19], [20].

III. METHODOLOGY

A. Live-cell microscopy dataset and annotations

The individual specimen images each contain a single microfluidic trap and some yeast cells, as depicted in Fig. 1. These are extracted from larger microscope recordings, whereby each exposure contains up to 50 traps (Fig. 2 middle). Ideally, a single mother cell persists in each trap, with subsequent daughter cells being removed by the constant flow of yeast growth media. In practice, multiple cells accumulate around some traps, while other traps remain empty (Fig. 3).

We distinguish between three classes on the specimen image annotations, as depicted in Fig. 3. The yeast cells in violet are the most important class for biological applications. To counteract traps being segmented as cells, we employ a distinct class for them (dark grey). The background (light grey) is annotated for semantic segmentation training, for example of U-Nets. For instance segmentation training we introduce a no-object class ∅ in place of the background class.

Fig. 3. Example of class and instance annotations for a specimen image; brightfield image (left), background label in light grey, instances of the trap class in shades of dark grey and instances of the cell class in shades of violet (left to right respectively); scale bar 10 µm.

Each instance of cells or trap structures is annotated individually with a bounding box, class specification and separate pixel-wise segmentation map. Here we omit the bounding boxes to enable an unobscured view of the contours. Instead, the distinct cell instances and their individual segmentation maps are indicated by different shades of violet in Fig. 3.

The annotated set of 419 specimen images from various experiments was randomly assigned for network training, validation and testing (76 %, 12 % and 12 % respectively). Examples are shown in Fig. 4. Images include a balance of the common yeast-trap configurations: 1) empty traps, 2) single cells (with daughter) and 3) multiple cells. Slight variations in trap fabrication, debris, contamination, focal shift, illumination levels and yeast morphology were included. Further scenarios or strong variations, such as trap design geometries, model organisms and significant focal shift, were omitted.

B. The Cell-DETR instance segmentation architecture

The proposed Cell-DETR models A and B are based on the DETR panoptic segmentation architecture [8]. We adapted the architecture for non-overlapping instance segmentation and reduced it in size for faster inference. The main differences between DETR and our variants Cell-DETR A and B are summarised in Table I. The Cell-DETR variants have approximately one order of magnitude fewer parameters than the original (∼40 × 10⁶ reduced to ∼5 × 10⁶ parameters). The main building blocks of the Cell-DETR model are detailed in Fig. 5.
They are the backbone CNN encoder, the transformer encoder-decoder, the bounding box and class prediction heads, and the segmentation head.

The CNN encoder (left in Fig. 5) extracts image features of the brightfield specimen image input. It is based on four ResNet-like [27] blocks with 64, 128, 256 and 256 convolutional filters. After each block a 2 × 2 average pooling layer is utilised to downsample the intermediate feature maps. The Cell-DETR variants employ different activations and convolutions, as detailed in Table I.

Fig. 4. Characteristic selection of specimen images and corresponding annotations, including empty or single trap structures, trapped single cells (with single daughter adjacent) and multiple trapped cells; trap instances in shades of dark grey, cell instances in shades of violet and transparent background; scale bar 10 µm.

The transformer encoder determines the attention between image features. The transformer decoder predicts the attention regions for each of the N = 20 object queries. They are both based on the DETR architecture [8]. We reduced the number of transformer encoder blocks to three and decoder blocks to two, each with 512 hidden features in the feed-forward neural network (FFNN). The 128 backbone features are flattened before being fed into the transformer. In contrast to the original DETR, we employed learned positional encodings. While Cell-DETR A employs leaky ReLU [28] activations, Padé activation units [29] are utilised for Cell-DETR B.

TABLE I
OVERVIEW OF DIFFERENCES BETWEEN DETR, CELL-DETR A AND B.

Model     | Activation functions | Convolutions         | Feature fusion         | Param. ×10⁶
DETR [8]  | ReLU                 | standard             | spatial addition       | ∼40
C-DETR A  | leaky ReLU [28]      | standard             | spatial addition       | 4.3
C-DETR B  | Padé [29]            | deformable (v2) [30] | pix.-adapt. conv. [31] | 5.0

The prediction heads for the bounding box and classification are each an FFNN. They map the transformer encoder-decoder output to the bounding box and classification prediction. These FFNNs process each query in parallel and share parameters over all queries. In addition to the cell and trap classes, the classification head can also predict the no-object class ∅.

Fig. 5. Architecture of the end-to-end instance segmentation network, with brightfield specimen image input and an instance segmentation prediction as output. The backbone CNN encoder extracts image features that then feed into both the transformer encoder-decoder for class and bounding box prediction, as well as into the CNN decoder for segmentation. The transformer encoded features, as well as the transformer decoded features, are fed into a multi-head-attention module and, together with the image features from the CNN backbone, feed into the CNN decoder for segmentation. Skip connections additionally bridge between the backbone CNN encoder and the CNN decoder. Input and output resolution is 128 × 128 pixels.

The segmentation head is composed of a multi-head attention mechanism and a CNN decoder to predict the segmentation maps for each object instance. We employ the original DETR [8] two-dimensional multi-head attention mechanism between the transformer encoder and decoder features. The resulting attention maps are concatenated channel-wise onto the image features and fed into the CNN decoder.
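The wiring of these building blocks can be summarised in a minimal PyTorch sketch. This is an illustrative skeleton rather than the released implementation (available at the repository linked above): the single-channel input, the number of attention heads, the 1 × 1 projection onto the transformer width and all module names are assumptions, and the CNN decoder segmentation head described next is omitted for brevity.

```python
import torch
import torch.nn as nn

class CellDETRSketch(nn.Module):
    """Minimal sketch of the Cell-DETR wiring (compare Fig. 5); not the released code."""

    def __init__(self, num_queries=20, num_classes=3, d_model=128):
        super().__init__()
        # Backbone: four ResNet-like blocks (64, 128, 256, 256 filters), each followed
        # by 2x2 average pooling; plain conv blocks stand in for the residual blocks here.
        channels = [1, 64, 128, 256, 256]
        self.backbone = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(channels[i], channels[i + 1], 3, padding=1),
                          nn.LeakyReLU(), nn.AvgPool2d(2))
            for i in range(4)
        ])
        self.proj = nn.Conv2d(256, d_model, 1)  # project onto the 128 transformer features
        # Transformer: 3 encoder / 2 decoder blocks, 512 hidden FFNN features.
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=3, num_decoder_layers=2,
                                          dim_feedforward=512, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)       # N = 20 object queries
        self.pos_embed = nn.Parameter(torch.randn(8 * 8, d_model))  # learned positional encoding
        # Prediction heads shared over all queries: class (incl. no-object) and bounding box.
        self.class_head = nn.Linear(d_model, num_classes)
        self.bbox_head = nn.Linear(d_model, 4)

    def forward(self, image):                                  # image: (B, 1, 128, 128)
        feat = self.proj(self.backbone(image))                 # (B, d_model, 8, 8)
        tokens = feat.flatten(2).transpose(1, 2)               # flatten to (B, 64, d_model)
        queries = self.query_embed.weight.unsqueeze(0).expand(image.size(0), -1, -1)
        decoded = self.transformer(tokens + self.pos_embed, queries)  # (B, N, d_model)
        # The segmentation head (multi-head attention + CNN decoder with skip
        # connections and a softmax over queries) is omitted in this sketch.
        return self.class_head(decoded), self.bbox_head(decoded).sigmoid()
```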
The three ResNet-like decoder blocks decrease the feature channel size while increasing the spatial dimensions. Long skip connections bridge between the CNN encoder and CNN decoder blocks' respective outputs. The features are fused by element-wise addition in Cell-DETR A and by pixel-adaptive convolutions in Cell-DETR B. A fourth convolutional block incorporates the queries in the feature dimension and returns the original input's spatial dimension for each query. Non-overlapping segmentation is ensured by a softmax over all queries.

C. Training Cell-DETR

We employ a combined loss function and a direct set prediction to train our Cell-DETR networks end-to-end. The set prediction $\hat{y} = \{\hat{y}_i = \{\hat{p}_i, \hat{b}_i, \hat{s}_i\}\}_{i=1}^{N=20}$ comprises the respective predictions for class probability $\hat{p}_i \in \mathbb{R}^K$ (here $K = 3$ classes: no-object, trap, cell), bounding box $\hat{b}_i \in \mathbb{R}^4$ and segmentation $\hat{s}_i \in \mathbb{R}^{128 \times 128}$ for each of the $N$ queries. We assigned each instance set label $y_{\sigma(i)}$ to the corresponding query set prediction $\hat{y}_i$ with the Hungarian algorithm [8], [32]. The indices $\sigma(i)$ denote the best matching permutation of labels. The combined loss $\mathcal{L}$ comprises a classification loss $\mathcal{L}_p$, a bounding box loss $\mathcal{L}_b$, and a segmentation loss $\mathcal{L}_s$

$$\mathcal{L} = \sum_{i=1}^{N} \left( \mathcal{L}_p + \mathbb{1}_{\{p_i \neq \varnothing\}} \mathcal{L}_b + \mathbb{1}_{\{p_i \neq \varnothing\}} \mathcal{L}_s \right),$$

with $N = 20$ object instance queries in this case. We employ class-wise weighted cross entropy for the classification loss

$$\mathcal{L}_p\left(p_{\sigma(i)}, \hat{p}_i\right) = -\sum_{k=1}^{K} \beta_k \, p_{\sigma(i),k} \log(\hat{p}_{i,k}),$$

with weights $\beta = [0.5, 0.5, 1.5]$ for the $K = 3$ classes, the no-object, trap and cell classes respectively. The bounding box loss is itself composed of two weighted loss terms. These are a generalised intersection-over-union $\mathcal{L}_J$ [33] and an $L_1$ loss, with respective weights $\lambda_J = 0.4$ and $\lambda_{L1} = 0.6$

$$\mathcal{L}_b\left(b_{\sigma(i)}, \hat{b}_i\right) = \lambda_J \, \mathcal{L}_J\left(b_{\sigma(i)}, \hat{b}_i\right) + \lambda_{L1} \left\lVert b_{\sigma(i)} - \hat{b}_i \right\rVert_1 .$$

The segmentation loss $\mathcal{L}_s$ is a weighted sum of the focal loss $\mathcal{L}_F$ [34] and the Sørensen-Dice loss $\mathcal{L}_D$ [6], [8]

$$\mathcal{L}_s\left(s_{\sigma(i)}, \hat{s}_i\right) = \lambda_F \, \mathcal{L}_F\left(s_{\sigma(i)}, \hat{s}_i; \gamma\right) + \lambda_D \, \mathcal{L}_D\left(s_{\sigma(i)}, \hat{s}_i; \varepsilon\right).$$

The respective weights are $\lambda_F = 0.05$ and $\lambda_D = 1$, with focusing parameter $\gamma = 2$ and $\varepsilon = 1$ for numerical stability.

D. Evaluation and implementation

We employ a number of metrics to quantitatively analyse the performance of the trained networks with regard to classification, bounding box and segmentation performance. Given the ground truth $Y$ and the prediction $\hat{Y}$ (in the corresponding instance-matched permutation), we evaluate the segmentation performance with variants of the Jaccard index $J$ and the Sørensen-Dice coefficient $D$ [6], [8], omitting the background

$$D(Y, \hat{Y}) = \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|}\,; \qquad J_k(Y_k, \hat{Y}_k) = \frac{|Y_k \cap \hat{Y}_k|}{|Y_k \cup \hat{Y}_k|}, \qquad (1)$$

with $J_k$ intuitively the intersection-over-union for each class $k$. With respect to the metrological application in image cytometry, the cell class is of most importance; therefore, we consider the Jaccard index for the cell class alone ($J_c$). Similarly, in the case of instance segmentation, we compute $J_i$ for each instance $i$ and average over all $I$ object instances to compute the mean instance Jaccard index $\bar{J}_I = \frac{1}{I}\sum_{i=1}^{I} J_i$.

We utilise the accuracy, the proportion of correct predictions, for classification. The bounding boxes are evaluated with the Jaccard index $J_b$. It is defined analogously to the object instance Jaccard index (compare Eqn. 1), yet computed with the bounding box coordinates.
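For concreteness, the combined training loss of Section III-C can be written as a short PyTorch sketch. It assumes the labels have already been permuted into the Hungarian-matched order; the tensor shapes, the binary focal-loss formulation, the (x1, y1, x2, y2) box format for the GIoU term, the simplified mean reductions and all function names are assumptions for illustration, not the released training code.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

# Loss weights as stated in Section III-C.
BETA = torch.tensor([0.5, 0.5, 1.5])   # class weights: no-object, trap, cell
LAMBDA_J, LAMBDA_L1 = 0.4, 0.6         # bounding box terms
LAMBDA_F, LAMBDA_D = 0.05, 1.0         # segmentation terms
GAMMA, EPS = 2.0, 1.0                  # focal focusing parameter / Dice stability

def dice_loss(pred, target, eps=EPS):
    """Soft Sørensen-Dice loss for (M, H, W) segmentation maps in [0, 1]."""
    inter = (pred * target).sum(dim=(-2, -1))
    union = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def focal_loss(pred, target, gamma=GAMMA):
    """Binary focal loss on per-pixel probabilities in [0, 1]."""
    pt = torch.where(target > 0.5, pred, 1.0 - pred).clamp(1e-6, 1.0)
    return (-(1.0 - pt) ** gamma * pt.log()).mean()

def cell_detr_loss(class_logits, boxes, segs, gt_classes, gt_boxes, gt_segs):
    """Combined loss for one image; predictions and labels already Hungarian-matched.

    class_logits: (N, K); boxes/gt_boxes: (N, 4) in (x1, y1, x2, y2);
    segs/gt_segs: (N, H, W) in [0, 1]; gt_classes: (N,) with 0 = no-object.
    """
    # Class-wise weighted cross entropy over all N = 20 queries.
    loss_p = F.cross_entropy(class_logits, gt_classes, weight=BETA.to(class_logits.device))
    # Box and segmentation terms only for queries matched to a real object.
    keep = gt_classes != 0
    if keep.any():
        giou = torch.diag(generalized_box_iou(boxes[keep], gt_boxes[keep]))
        loss_b = LAMBDA_J * (1.0 - giou).mean() + LAMBDA_L1 * F.l1_loss(boxes[keep], gt_boxes[keep])
        loss_s = LAMBDA_F * focal_loss(segs[keep], gt_segs[keep]) \
                 + LAMBDA_D * dice_loss(segs[keep], gt_segs[keep])
    else:
        loss_b = loss_s = boxes.sum() * 0.0  # keep the graph intact for empty images
    return loss_p + loss_b + loss_s
```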
We compare the proposed method with our own implementations of both the state-of-the-art for the trapped yeast application (U-Net [6]) and a general state-of-the-art instance segmentation meta-algorithm (Mask R-CNN [1]). The multiclass U-Net for semantic segmentation was implemented in PyTorch, with the architecture, pre- and post-processing described in [6]. We implemented a Mask R-CNN [1] with Torchvision (PyTorch) and a ResNet-18 [27] backbone, which was pre-trained for image classification.

We implemented the proposed Cell-DETR A and B architectures with PyTorch. We used the MMDetection toolkit [35] for deformable convolutions and the PyTorch/CUDA implementation for the Padé activation units [29]. We trained the models using AdamW [36] for optimisation with a weight decay of 10⁻⁶. The initial learning rate was 10⁻⁵ for the backbone and 10⁻⁴ for the rest of the model. The learning rates were decreased by an order of magnitude after 50 and again after 100 epochs of the total 200 epochs. The additional first and second-order momentum moving average factors were 0.9 and 0.999 respectively. We selected the best performing model based on the cell class Jaccard index Jc, typically after 80 to 140 epochs with mini batch size 8. The training data was randomly augmented by elastic deformation [6], [24], horizontal flipping or by the addition of noise with a probability of 0.6. Inference runtimes for one forward pass were averaged over 1000 runs on an Nvidia RTX 2080 Ti for all three methods (U-Net, Mask R-CNN and Cell-DETR).

E. Data acquisition setup

Yeast cells were cultured in a tightly controlled microfluidic environment. A temperature of 30 °C and the flow of yeast growth media enable yeast to grow for prolonged periods and over multiple cell cycles. The microfluidic chips confined the cells to the focal plane of the microscope. Continuous media flow hydrodynamically traps the living cells in the microstructures. The polydimethylsiloxane (PDMS) microstructures constrain the cells in XY, while axial constraints in Z are provided by the cover slip and the PDMS ceiling. The space between the cover slip and the PDMS ceiling is on the order of a cell diameter to facilitate continuously uniform focus of the cells.

We recorded time-lapse brightfield (transmitted light) and fluorescent channel imagery of the budding yeast cells every 10 min with a computer-controlled microscope (Nikon Eclipse Ti with XYZ stage; µManager; 60× objective). A CoolLED pE-100 and a Lumencor SpectraX light engine illuminated the respective channels, which were captured with an ORCA Flash 4.0 (Hamamatsu) camera. Multiple lateral and axial positions were recorded sequentially at each timestep (Fig. 2).

IV. RESULTS

A. Cell-DETR variant results

A sample of segmentation results for the two Cell-DETR variants is shown in Fig. 6. The cell and trap instances are all detected and classified correctly, with slight variations in segmentation contours. Separate instances of cells and traps are indicated by the shades of violet and grey respectively. Variant B demonstrates slightly better segmentation performance. A qualitative example of this is shown in Fig. 6, where Cell-DETR A, in contrast to B, excludes a small section of one cell.

Fig. 6. Qualitative comparison of Cell-DETR A and B segmentation examples for a selected test image (left) and label (right); trap instances in shades of dark grey, cell instances in shades of violet; scale bar 10 µm.
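The optimiser and learning-rate schedule described in Section III-D can be captured in a few lines; the parameter-group split and the backbone attribute name are assumptions in this sketch rather than the released training script.

```python
import torch

def configure_optimiser(model, total_epochs=200):
    """AdamW with a lower backbone learning rate, as described in Section III-D."""
    param_groups = [
        {"params": model.backbone.parameters(), "lr": 1e-5},  # lower LR for the CNN backbone
        {"params": [p for n, p in model.named_parameters()
                    if not n.startswith("backbone")], "lr": 1e-4},
    ]
    optimiser = torch.optim.AdamW(param_groups, betas=(0.9, 0.999), weight_decay=1e-6)
    # Learning rates drop by an order of magnitude after 50 and again after 100 epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimiser, milestones=[50, 100], gamma=0.1)
    return optimiser, scheduler
```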
The quantitative comparison of the segmentation performance of the Cell-DETR variants is summarised in Table II. We modified model B for better performance on our application, as described in Section III-B. The mean Jaccard index over all object instances increased from J̄I = 0.84 for model A to J̄I = 0.85 for model B, while the cell class Jaccard index increased by a similar margin from Jc = 0.83 to Jc = 0.84. Taking the background into account, a segmentation accuracy of 0.96 is achieved. Both Cell-DETR variants surpass the segmentation performance (Jc) of the previous state-of-the-art methods for the trapped yeast application [6], [16], in addition to directly attaining the instances.

TABLE II
SEGMENTATION PERFORMANCE OF CELL-DETR A AND B.

Model     | Sørensen-Dice D | Mean instance J̄I | Cell class Jc | Seg. accuracy
C-DETR A  | 0.92            | 0.84             | 0.83          | 0.96
C-DETR B  | 0.92            | 0.85             | 0.84          | 0.96

The bounding box and classification performance is summarised in Table III. Again, both models perform similarly well. They correctly classify the object instances (classification accuracy of 1.0) and detect the correct number of instances for each class. They also perform similarly well at localising the instances, achieving a bounding box intersection-over-union of Jb = 0.81 for the standard formulation as well as for the generalised form employed for training.

TABLE III
BOUNDING BOX AND CLASSIFICATION PERFORMANCE METRICS FOR CELL-DETR A AND B.

Model     | Bounding box Jaccard Jb | Classification accuracy
C-DETR A  | 0.81                    | 1.0
C-DETR B  | 0.81                    | 1.0

The slight increase in segmentation performance that model B yields is a trade-off with increased computational cost. The number of parameters is increased from approximately 4 × 10⁶ to over 5 × 10⁶ (Table I). This leads to an increase in runtime from 9.0 ms for model A to 21.2 ms for model B. These times are orders of magnitude faster than the previous state-of-the-art method DISCO [16] and on the same order of magnitude as the currently fastest reported network for this application [6]. Runtimes on this order of magnitude suffice for in-the-loop experimental techniques.

We select model B for further analysis, based on the improved performance and sufficiently fast runtimes. A selection of segmentation predictions for the three most typical scenarios in the test dataset is given in Fig. 7. The detection of cell and trap instances, without any overlap between instances, is successful for single cells (middle row) and multiple cells (bottom row), and empty traps are correctly identified (top row). The introduction of multiple classes (traps, cells), as well as individual object instances, facilitated individually segmenting each cell entity and discerning these from both the traps and other cells.

Fig. 7. Example of different scenarios from the test dataset segmented with Cell-DETR B: an empty trap (top row), a single trapped cell (middle row) and multiple cells (bottom row); columns are brightfield, an overlay of the prediction, the prediction mask and the ground truth label (left to right respectively). Colours indicate traps in shades of grey and cell instances in shades of violet; scale bar 10 µm.

The intended application of our method is to deliver segmentation masks for each cell instance for subsequent single-cell fluorescence measurements. We trialled this application on unlabelled and unseen data as depicted in Fig. 8.
The cell instances are detected based on the brightfield image (left) and the resulting object segmentation predictions are used as masks to measure the individual cell fluorescence on the fluorescent channel (right). An overlay of the brightfield and fluorescent images with the segmentation contours is depicted in the middle, along with the green fluorescent protein (GFP) channel. The individual cell area (A1 and A2) is measured as the number of pixels in the instance segmentation mask and indicated on the GFP channel. The cell instance fluorescence (F1 and F2) is summed over the mask area and indicated on the right for each individual cell in arbitrary fluorescence units.

Fig. 8. Example of the individual cell fluorescence measurement application with a segmentation mask contour for each individual cell (violet contours) based on the brightfield image (left); measured values: A1 = 929 pix., A2 = 241 pix., F1 = 6.1 × 10⁶ a.u., F2 = 2.5 × 10⁶ a.u.; scale bar 10 µm.
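The per-cell read-out illustrated in Fig. 8 amounts to masking the fluorescence channel with each predicted instance map and summing the signal; a minimal sketch follows, in which the array names and the binarisation threshold are assumptions.

```python
import numpy as np

def per_cell_readout(instance_masks, fluorescence, threshold=0.5):
    """Area (pixels) and summed fluorescence per predicted cell instance.

    instance_masks: (N, H, W) soft segmentation maps from the instance segmentation,
    fluorescence:   (H, W) GFP-channel image registered to the brightfield frame.
    """
    measurements = []
    for mask in instance_masks:
        binary = mask > threshold                   # binarise the instance map
        area = int(binary.sum())                    # e.g. A1 = 929 pix. in Fig. 8
        signal = float(fluorescence[binary].sum())  # summed intensity in a.u., e.g. F1
        measurements.append({"area_pix": area, "fluorescence_au": signal})
    return measurements
```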
B. Comparison with state-of-the-art methods

We compare our proposed method with the state-of-the-art for the trapped yeast application (DISCO [16], U-Net [6]), as well as with a general state-of-the-art method for instance segmentation (Mask R-CNN [1]). We implemented both the U-Net and Mask R-CNN methods in this study (Section III-D). A characteristic qualitative example of the results is given in Fig. 9, with the ground truth on the left, followed by the Cell-DETR B, Mask R-CNN and U-Net segmentation results. All three methods segment two trap microstructures and all four cells in separate classes, without any overlap or touching cells. Cell-DETR B and Mask R-CNN additionally segment each cell or trap object as an individual instance. The contours are slightly smaller for the U-Net, which is deemed a result of the emphasis on avoiding touching cells and the associated difficulty of discerning these in subsequent post-processing.

Fig. 9. Example segmentation for our implementations of Cell-DETR B, Mask R-CNN and U-Net. Trap instances in shades of grey and cell instances in shades of violet (no instance detection for U-Net); scale bar 10 µm.

Accurate segmentation of the cells is particularly important for the measurement of cell morphology or fluorescence. We compare the cell class Jaccard index Jc of our proposed methods Cell-DETR A and B with the application state-of-the-art methods DISCO and U-Net, as well as with Mask R-CNN. The comparison is summarised in Table IV. U-Net recently superseded DISCO [16] (Jc ∼ 0.7) as the state-of-the-art trapped yeast segmentation method, achieving Jc = 0.82. Our Cell-DETR variants both further improve on this result, with model B achieving Jc = 0.84, on par with our Mask R-CNN implementation. Cell-DETR and Mask R-CNN additionally provide each cell object instance.

We measured the average runtime of a forward pass of each method on a single specimen image (Table IV). For DISCO [16] we consider the reported values, which include some pre- and post-processing steps to detect cells individually. The deep methods are significantly faster than DISCO, making online monitoring of live experiments feasible. The U-Net is the fastest, taking 1.8 ms for a forward pass, in contrast to 29.8 ms for the Mask R-CNN [1]. However, the U-Net requires further post-processing steps to detect the object instances and has been reported to take approximately 20 ms in conjunction with watershed post-processing [6]. The Cell-DETR variants take the middle ground with 9.0 ms and 21.2 ms.

TABLE IV
COMPARISON OF CELL-DETR PERFORMANCE WITH THE STATE-OF-THE-ART METHODS FOR THE TRAPPED YEAST APPLICATION (DISCO, U-NET) AND INSTANCE SEGMENTATION (MASK R-CNN).

Model        | Cell class Jc | Inference runtime¹ | Instances
DISCO [16]²  | ∼ 0.70        | ∼ 1300 ms          | no
U-Net        | 0.82          | 1.8 ms             | no
Mask R-CNN   | 0.84          | 29.8 ms            | yes
Cell-DETR A  | 0.83          | 9.0 ms             | yes
Cell-DETR B  | 0.84          | 21.2 ms            | yes

¹ Runtimes for U-Net, Mask R-CNN and Cell-DETR averaged over 1000 runs (∼300 different images) on an Nvidia RTX 2080 Ti; measurement uncertainty is below ±5 %.
² Reported literature values [16].

V. DISCUSSION

A. Analysis of the instance segmentation performance

Cell-DETR has some benefits in comparison to state-of-the-art methods, such as Mask R-CNN. The Cell-DETR architecture is comparatively simple and avoids common hand-designed components of Mask R-CNNs, such as non-maximum suppression and ROI pooling. This reduces Cell-DETR's reliance on hyperparameters and facilitates end-to-end training with a single combined loss function. In contrast, Mask R-CNNs require additional supervision to train the region proposal network. As a result of these differences, Cell-DETR is easier to implement, has fewer parameters and is faster than Mask R-CNN for the same segmentation performance.

While Cell-DETR does not rely on explicit region proposals, it does utilise attention maps that highlight the pertinent features in the latent space. The mapping of these is learnt during the end-to-end training. The loss curves of the individual prediction tasks are shown in Fig. 10. The classification loss Lp (blue) converges first, indicating that the network first learns how many objects are present in an image and to which class they belong. The bounding box loss Lb (red) converges next, with the network learning the approximate location of each object. Finally, the model learns to refine the pixel-wise segmentation maps, with the segmentation loss Ls (green) converging last.

Fig. 10. Classification, bounding box and segmentation loss curves for Cell-DETR B; thick lines are running averages (window size 30).

With respect to the specific single-cell measurement application, Cell-DETR offers robust and repeatable instance segmentation of yeast cells in microstructures. The key cell class segmentation performance surpasses the previous state-of-the-art semantic segmentation methods [6], [16] with a cell class Jaccard index of 0.84. Additionally, the proposed technique directly detects individual object instances and classifies the objects robustly (near 100 % accuracy). The robust instance segmentation performance promises to facilitate cell tracking, increase the experimental information yield and enable Cell-DETR to be employed without human intervention.

B. Limitations, outlook and future potential

The presented models are trained for a specific microfluidic configuration and trap geometry. While they are relatively robust and fulfil their intended purpose, their utility could be broadened by expanding the dataset to include more classes, for example different trap geometries. More generally, as an instance segmentation method, Cell-DETR offers a platform for incorporating future advances in attention mechanisms as they are increasingly outperforming convolutional approaches.
For example, replacing the convolutional elements in the backbone and segmentation head with axial-attention [26] may lead to further improved performance. Currently, Cell-DETR achieves state-of-the-art performance and, as an instance segmentation method, is generally suitable for and readily adaptable to a wide range of biomedical imaging applications.

The presented Cell-DETR methods can be harnessed for high-content quantitative single-cell TLFM. Cell-DETR, Mask R-CNN and U-Net achieve runtimes orders of magnitude faster than the previous state-of-the-art trapped yeast method (DISCO [16]). These runtimes, coupled with Cell-DETR's robust instance segmentation, make both online monitoring and closed-loop optimal experimental design of typical experiments with approximately 1000 traps feasible. Harnessing this potential promises to provide increased experimental information yields and greater biological insights in the future.

VI. CONCLUSION

In summary, we present Cell-DETR, an attention-based transformer method for direct instance segmentation, and showcase it on a typical application. To the best of our knowledge, this is the first application of detection transformers on biomedical data. The proposed method has fewer parameters and is 30 % faster while matching the segmentation performance of a state-of-the-art Mask R-CNN. A simpler Cell-DETR variant exhibits slightly lesser segmentation performance (Jc = 0.83 instead of 0.84) while requiring a third of a Mask R-CNN's runtime. As a general instance segmentation model, Cell-DETR achieves state-of-the-art performance and is deemed suitable and readily adaptable for a range of biomedical imaging applications.

Showcased on a typical systems or synthetic biology application, the proposed Cell-DETR robustly detects each cell instance and directly provides instance-wise segmentation maps suitable for cell morphology and fluorescence measurements. In comparison to the previous semantic segmentation trapped yeast baselines, Cell-DETR provides better segmentation performance with a cell class Jaccard index Jc = 0.84, while additionally detecting each individual cell instance and maintaining comparable runtimes. This promises to reduce measurement uncertainty, facilitate cell tracking and increase the experimental data yield in future applications. The resulting runtimes and accurate instance segmentation make future online monitoring feasible, for example for closed-loop optimal experimental control.

ACKNOWLEDGEMENTS

We thank Christian Wildner for insightful discussions, André O. Françani and Jan Basrawi for contributing to labelling, and Markus Baier for aid with the computational setup. This work was supported by the Landesoffensive für wissenschaftliche Exzellenz as part of the LOEWE Schwerpunkt CompuGene. H.K. acknowledges support from the European Research Council (ERC) with the consolidator grant CONSYN (nr. 773196).

REFERENCES

[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in IEEE ICCV, 2017, pp. 2961–2969.
[2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in IEEE/CVF CVPR, 2016.
[3] J. Sun, A. Tárnok, and X. Su, "Deep Learning-Based Single-Cell Optical Image Studies," Cytom. Part A, vol. 97, no. 3, pp. 226–240, 2020.
[4] M. Leygeber, D. Lindemann, C. C. Sachs, E. Kaganovitch, W. Wiechert, K. Nöh, and D. Kohlheyer, "Analyzing Microbial Population Heterogeneity - Expanding the Toolbox of Microfluidic Single-Cell Cultivations," J. Mol. Biol., 2019.
[5] A. Hofmann, J. Falk, T. Prangemeier, D. Happel, A. Köber, A. Christmann, H. Koeppl, and H. Kolmar, "A tightly regulated and adjustable CRISPR-dCas9 based AND gate in yeast," Nucleic Acids Res., vol. 47, no. 1, pp. 509–520, 2019.
[6] T. Prangemeier, C. Wildner, A. O. Françani, C. Reich, and H. Koeppl, "Multiclass yeast segmentation in microstructured environments with deep learning," IEEE CIBCB, 2020.
[7] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, "Panoptic segmentation," in IEEE/CVF CVPR, 2019, pp. 9404–9413.
[8] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," arXiv:2005.12872, 2020.
[9] J.-B. Lugagne, H. Lin, and M. J. Dunlop, "DeLTA: Automated cell segmentation, tracking, and lineage reconstruction using deep learning," PLoS Comput Biol, vol. 16, no. 4, 2020.
[10] R. Pepperkok and J. Ellenberg, "High-throughput fluorescence microscopy for systems biology," Nat. Rev. Mol. Cell Biol., p. 690, 2006.
[11] T. Prangemeier, F. X. Lehr, R. M. Schoeman, and H. Koeppl, "Microfluidic platforms for the dynamic characterisation of synthetic circuitry," Curr. Opin. Biotechnol., vol. 63, pp. 167–176, 2020.
[12] D. G. Cabeza, L. Bandiera, E. Balsa-Canto, and F. Menolascina, "Information content analysis reveals desirable aspects of in vivo experiments of a synthetic circuit," in IEEE CIBCB, 2019, pp. 1–8.
[13] F.-X. Lehr, M. Hanst, M. Vogel, J. Kremer, H. U. Göringer, B. Suess, and H. Koeppl, "Cell-free prototyping of and-logic gates based on heterogeneous rna activators," ACS Synth. Biol., p. 2163, 2019.
[14] Z. Xie, L. Wroblewska, L. Prochazka, R. Weiss, and Y. Benenson, "Multi-input RNAi-based logic circuit for identification of specific cancer cells," Science, vol. 333, pp. 1307–1312, 2011.
[15] W. Si, C. Li, and P. Wei, "Synthetic immunology: T-cell engineering and adoptive immunotherapy," Synth. Syst. Biotechnol., vol. 3, no. 3, pp. 179–185, 2018.
[16] E. Bakker, P. S. Swain, and M. M. Crane, "Morphologically constrained and data informed cell segmentation of budding yeast," Bioinformatics, vol. 34, no. 1, pp. 88–96, 2018.
[17] M. M. Crane, I. B. N. Clark, E. Bakker, S. Smith, and P. S. Swain, "A Microfluidic System for Studying Ageing and Dynamic Single-Cell Responses in Budding Yeast," PLoS One, vol. 9, p. e100042, 2014.
[18] D. A. Van Valen, T. Kudo, K. M. Lane, D. N. Macklin, N. T. Quach, M. M. DeFelice, I. Maayan, Y. Tanouchi, E. A. Ashley, and M. W. Covert, "Deep Learning Automates the Quantitative Analysis of Individual Cells in Live-Cell Imaging Experiments," PLoS Comput Biol, vol. 12, no. 11, pp. 1–24, 2016.
[19] J. Sauls, J. Schroeder, S. Brown, G. Treut, F. Si, D. Li, J. Wang, and S. Jun, "Mother machine image analysis with MM3," bioRxiv, 2019.
[20] E. Moen, D. Bannon, T. Kudo, W. Graf, M. Covert, and D. Van Valen, "Deep learning for cellular image analysis," Nat. Methods, vol. 16, no. 12, p. 1233, 2019.
[21] T. Prangemeier, C. Wildner, M. Hanst, and H. Koeppl, "Maximizing information gain for the characterization of biomolecular circuits," in Proc. 5th ACM/IEEE NanoCom, 2018, pp. 1–6.
[22] L. Bandiera, D. Gomez-Cabeza, J. Gilman, E. Balsa-Canto, and F. Menolascina, "Optimally Designed Model Selection for Synthetic Biology," ACS Synth. Biol., 2020.
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017, pp. 5998–6008.
[24] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in MICCAI, 2015, p. 234.
[25] N. Dietler, M. Minder, V. Gligorovski, A. M. Economou, D. A. H. L. Joly, A. Sadeghi, C. H. M. Chan, M. Koziński, M. Weigert, A.-F. Bitbol, and S. J. Rahi, "A convolutional neural network segments yeast microscopy images with high accuracy," Nat. Commun., p. 5723, 2020.
[26] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen, "Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation," arXiv:2003.07853, 2020.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE/CVF CVPR, 2016, pp. 770–778.
[28] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in ICML, 2013, p. 3.
[29] A. Molina, P. Schramowski, and K. Kersting, "Padé activation units: End-to-end learning of flexible activation functions in deep networks," in ICLR, 2019.
[30] X. Zhu, H. Hu, S. Lin, and J. Dai, "Deformable ConvNets v2: More deformable, better results," in IEEE/CVF CVPR, 2019, pp. 9308–9316.
[31] H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz, "Pixel-adaptive convolutional neural networks," in IEEE/CVF CVPR, 2019, pp. 11166–11175.
[32] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[33] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over union: A metric and a loss for bounding box regression," in IEEE/CVF CVPR, 2019, pp. 658–666.
[34] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in IEEE ICCV, 2017, pp. 2980–2988.
[35] K. Chen, J. Wang, J. Pang et al., "MMDetection: Open MMLab detection toolbox and benchmark," arXiv:1906.07155, 2019.
[36] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in ICLR, 2019.