Prediction of Cytotoxicity Related PubChem Assays Using High-Content-Imaging Descriptors derived from Cell-Painting

Vollmers, Luis (2023)
Prediction of Cytotoxicity Related PubChem Assays Using High-Content-Imaging Descriptors derived from Cell-Painting.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00020236
Master Thesis, Primary publication, Publisher's Version

Copyright Information: CC BY 4.0 International - Creative Commons, Attribution.

Item Type:

Master Thesis

Type of entry:

Primary publication

Title:

Prediction of Cytotoxicity Related PubChem Assays Using High-Content-Imaging Descriptors derived from Cell-Painting

Language:

English

Referees:

Schmitz, Prof. Dr. Katja ; Bender, Dr. Andreas

Date:

2023

Place of Publication:

Darmstadt

Collation:

79 pages

DOI:

10.26083/tuprints-00020236

Abstract:

The pharmaceutical industry is centred around small molecules and their effects. Apart from the curative effect, the absence of adverse or toxicological effects is cardinal. However, toxicity is at least as elusive as it is important. A simple definition is: ‘toxicology is the science of adverse effects of chemicals on living organisms’ [1]. However, this definition leaves several questions open. What is the organism? Where do therapeutic and adverse effects start and end? Even for toxicity at the level of the simplest organism, the single cell (cytotoxicity), the mechanisms are manifold and difficult to unravel. Hence, it remains obscure which characteristics a compound has to combine to be labelled as toxic. One attempt to illuminate these characteristics is the novel cell-painting (CP) assay. In a CP assay, cells are perturbed by libraries of small compounds, which might affect the cellular morphology, before images are taken via automated fluorescence microscopy. Five fluorescent channels are used for imaging, and these channels correspond to certain cell organelles [2]. CP data therefore contain information about the variations in cell structure caused by each compound. Which parts of these morphological fingerprints are actually informative remains unclear. A significant part of the project presented here is therefore dedicated to exploring the CP data and comparing their predictive capabilities against other descriptors for a variety of bioassays. The CP data used in this project cover roughly 30 000 compounds and 1800 features [3]. In chemistry, the structure determines the properties of a compound or substance. Therefore, apart from CP, structural fingerprints are used as a benchmark descriptor set for comparison. In this project, extended-connectivity fingerprints (ECFPs) were used to encode the compounds’ structures as numerical features. This work is concerned with morphological changes that correspond to toxicity.
Thus, the CP data were combined with toxicological endpoints from specific assays selected from the PubChem database. The selection process required a minimum number of active compounds, a size criterion and the occurrence of toxicologically relevant targets. After the selected assays were combined with each of their descriptor sets, machine learning models were trained, and their predictive power was evaluated with specific metrics. The predictions can be divided into four cycles. The first cycle used the CP data as descriptors, the second cycle used the structural fingerprints, and the third cycle used a subset of both, selected by a rigorous feature engineering process. The last cycle skipped the feature engineering and combined all CP and ECFP descriptors into one large set of inputs. The evaluation of the prediction metrics illuminates the strengths and shortcomings of the morphological fingerprints compared to the structural fingerprints. It turned out that there are two groups of assays: those PubChem assays that are generally better predicted with CP features and those that have higher predictive potential when using ECFPs. Additionally, models based on ECFPs showed higher specificity, whereas models based on CP data showed higher sensitivity. A high sensitivity means that the prediction rarely mislabels a positive sample (e.g. a toxic compound) as negative (e.g. non-toxic) relative to the number of correctly labelled positive samples. Based on these results, CP is better suited for toxicity prediction and drug safety evaluations, since a positive compound mislabelled as negative can lead to expenses or even damage to health. Furthermore, based on the data from the fluorescent channels, an enrichment measure was introduced and calculated for the aforementioned two groups of PubChem assays. This enrichment connects predictive performance with cell organelle activity.
The hypothesis was that PubChem assays that are reliably predictable from CP data should exhibit increased enrichment, which was the case for four out of five fluorescence microscopy channels. As a next step, phenotypic terms were manually generated to categorize the different PubChem assays. These terms corresponded to cellular mechanisms or morphological processes and were generated in an unbiased manner. Nevertheless, they are subject to human error. The phenotypic annotations found to be enriched for successful modelling approaches might guide the preselection of bioassays in future projects. The enrichment analysis of phenotypic annotations showed that PubChem assays that could be well predicted via CP data are related to immune response, genotoxicity and genome regulation, and cell death. Finally, the assays were assigned gene ontology (GO) terms obtained from the GO database. These terms form a controlled, structured vocabulary that explicitly describes the molecular function and biological processes of a given gene product. For PubChem assays associated with a protein target, the GO terms were collected. If an assay is particularly well predicted via CP descriptors, the associated GO terms can relate this finding to cellular function. Even though the analysis with GO terms suffers from a small sample size, it was found that CP-related assays usually correspond to processes concerning deoxyribonucleic acid (DNA) and other macromolecules. This finding is in good agreement with the analysis of the channel enrichment as well as the phenotypic enrichment.
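
The abstract contrasts the sensitivity of CP-based models with the specificity of ECFP-based models. As a minimal sketch of how these two metrics are commonly computed for a binary toxic/non-toxic classifier, the following Python example may help. The names `ConfusionCounts`, `count_confusion`, `sensitivity` and `specificity` and the toy labels are assumptions made for illustration; they are not taken from the thesis, and the numbers are not results from it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary classifier (1 = toxic/active, 0 = non-toxic/inactive)."""
    tp: int  # toxic compounds predicted as toxic
    fn: int  # toxic compounds predicted as non-toxic
    tn: int  # non-toxic compounds predicted as non-toxic
    fp: int  # non-toxic compounds predicted as toxic


def count_confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Tally the four confusion-matrix cells from true and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    return ConfusionCounts(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
    )


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of truly positive samples that were predicted positive."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else float("nan")


def specificity(c: ConfusionCounts) -> float:
    """Fraction of truly negative samples that were predicted negative."""
    negatives = c.tn + c.fp
    return c.tn / negatives if negatives else float("nan")


# Hypothetical toy labels, purely illustrative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

counts = count_confusion(y_true, y_pred)
print(f"sensitivity = {sensitivity(counts):.2f}")
print(f"specificity = {specificity(counts):.2f}")
```

In this toy case one of the four toxic compounds is missed (a false negative), which is the kind of error the abstract describes as the more costly one in drug safety evaluations.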

Alternative Abstract:

Alternative Abstract

Language

This work is concerned with cellular morphological changes associated with toxicity. CP data were combined with toxicological endpoints from specific assays selected from the PubChem database. The selection procedure applied a minimum number of active compounds, a size criterion and the occurrence of toxicologically relevant endpoints. After the selected assays had been combined with their descriptors, machine learning (ML) models were trained and their predictive power was assessed using specific metrics. The predictions can be divided into four cycles. In the first cycle, the CP data were used as descriptors; in the second cycle, structural features were used; and in the third cycle, a subset of both was used. An extensive feature engineering process selected the subsets. In the last cycle, the feature engineering was skipped and all CP and ECFP descriptors were combined into one large data set. The evaluation of the prediction metrics shows which strengths and shortcomings the morphological fingerprints have compared to the structural features. It turned out that there are two groups of assays: those PubChem assays that can generally be predicted better with CP data, and those that have a higher predictive potential when ECFPs are used. In addition, it was shown that ECFPs have a higher specificity than CP data, which in turn show a higher sensitivity. For a prediction, high sensitivity means that a sample is rarely falsely predicted as negative (e.g. non-toxic) relative to the number of correctly labelled positive samples (e.g. toxic compounds).
Based on these results, CP data are better suited for toxicity prediction and drug safety assessment, since a wrongly labelled positive compound can lead to costs or even damage to health. Furthermore, an enrichment measure was introduced based on the data from the fluorescence microscopy channels and calculated for the two groups of PubChem assays mentioned above. This enrichment measure links predictive performance with the activity of the cell organelles. The hypothesis was that PubChem assays that can be reliably predicted from CP data should show an increased enrichment measure, which was the case for four out of five fluorescence microscopy channels. As a next step, phenotypic keywords were generated manually to categorize the different PubChem assays. These terms corresponded to cellular mechanisms or morphological processes and were generated in an unbiased manner. Nevertheless, they are subject to human error. The phenotypic annotations that are enriched for successful ML models could simplify the preselection of bioassays in future projects. The enrichment analysis of phenotypic annotations showed that PubChem assays that could be predicted well from CP data are related to immune responses, genotoxicity and genome regulation, and cell death. Finally, the assays are assigned GO terms taken from the GO database. These terms comprise a controlled, structured vocabulary that explicitly describes the molecular function and the biological processes of a given gene product. For PubChem assays that are assigned to a protein target, the GO terms were collected. If an assay is predicted particularly well from CP descriptors, the associated GO terms can relate this finding to cellular function.
Although the analysis with GO terms is limited by a small sample size, it was found that CP-related assays usually correspond to processes involving DNA and other macromolecules. This finding agrees well with the enrichment of the fluorescence microscopy channels and with the phenotypic annotations.

German

Status:

Publisher's Version

URN:

urn:nbn:de:tuda-tuprints-202360

Classification DDC:

000 Generalities, computers, information > 004 Computer science
500 Science and mathematics > 540 Chemistry
500 Science and mathematics > 570 Life sciences, biology

Divisions:

07 Department of Chemistry > Clemens-Schöpf-Institut > Fachgebiet Biochemie > Biologische Chemie

Date Deposited:

02 Jun 2023 12:06

Last Modified: