Statistische Modelle und Inferenz der strukturellen Biophysik

Schmidt, Michael (2022)
Statistische Modelle und Inferenz der strukturellen Biophysik.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00021148
Ph.D. Thesis, Primary publication, Publisher's Version

Text (some figures and tables are excluded from the CC license)
Dissertation_Michael_Schmidt_v2.pdf
Copyright Information: CC BY-SA 4.0 International - Creative Commons, Attribution ShareAlike.

Item Type:

Ph.D. Thesis

Type of entry:

Primary publication

Title:

Statistische Modelle und Inferenz der strukturellen Biophysik

Language:

German

Referees:

Hamacher, Prof. Dr. Kay ; Drossel, Prof. Dr. Barbara

Date:

2022

Place of Publication:

Darmstadt

Collation:

xiii, 121 pages

Date of oral examination:

3 November 2021

DOI:

10.26083/tuprints-00021148

Abstract:

Mathematische Modelle sind essentielle Werkzeuge für die Strukturanalyse von Biomolekülen und ergänzen Experimente. Dank enorm steigender Datenmengen sind probabilistische Ansätze aus den Bereichen der statistischen Inferenz und des maschinellen Lernens prominenter denn je. In dieser Arbeit betrachten wir drei verwandte biophysikalische Fragestellungen und bearbeiten diese mit der Entwicklung von effizienten Modellen auf Basis der statistischen Mechanik.

Der erste Teil betrachtet die sequenzbasierte Vorhersage von Proteinstrukturen. Schnell wachsende Sequenzdatenbanken machten dies seit dem letzten Jahrzehnt zu einer vielversprechenden Alternative im Vergleich zu teuren und oft limitierten experimentellen Methoden. Wir untersuchen die sogenannte Direct-Coupling-Analysis (DCA), welche Kontaktinformationen aus einem multiplen Sequenzalignment (MSA) extrahiert. Dies entspricht einem inversen Potts-Modell aus der statistischen Physik, bei dem Korrelationen in Form von empirischen relativen Häufigkeiten gegeben sind und Parameter des Hamiltonians bestimmt werden müssen. Hierbei werden die Spin-Zustände durch die q verschiedenen Aminosäuretypen repräsentiert. Die exponentielle Zunahme der Terme in der Zustandssumme erfordert geeignete Approximationsmethoden wie beispielsweise die Mean-Field-Inversion. Wir fügen die folgenden Erweiterungen ein, um eine erhöhte Vorhersagegenauigkeit zu erhalten.

1. Die Vorhersagekraft der DCA ist durch die ausschließliche Berücksichtigung von lokalen Feldern und Zweierkopplungen begrenzt, während Wechselwirkungen höherer Ordnung in Proteinen bekanntlich auftreten. Wir erweitern den Hamiltonian um einen Dreierkopplungsterm und leiten analytische Gleichungen innerhalb der Mean-Field-Approximation her. Eine anschließende Auswertung mit einem Benchmark-Datensatz übertrifft ein reines Zweikörper-DCA-Modell. Unsere Implementierung ist hochgradig parallel, was zu schnellen Laufzeiten auf modernen Computern führt.

2. Die DCA-Scores für die Kontaktvorhersage ergeben sich aus den erhaltenen Zweierkopplungen. Dies wird durch eine Transformation einer q × q-Matrix auf einen skalaren Wert erreicht, wobei jedoch potenziell wichtige Informationen verloren gehen. Wir entwickeln ein Schema zur Nutzung aller verfügbaren Kopplungsinformationen. Es beruht auf der Inferenz eines sekundären Potts-Modells mithilfe eines MSAs, das aus den Feldern und Kopplungen der ersten DCA besteht. Ein Benchmark zeigt erneut eine verbesserte Genauigkeit.

Der zweite Teil befasst sich mit dem Vergleich von biomolekularen Strukturen. Wir entwickeln den probabilistischen Subgraphisomorphismus SICOR und wenden ihn auf RNA-Sekundärstrukturgraphen an. Die Graphen stammen aus einem sogenannten Systematic-Evolution-of-Ligands-by-Exponential-Enrichment (SELEX)-Experiment, bei dem die Auswahl von RNA-Aptameren auf struktureller Diversität beruht. Wir sind in der Lage, angereicherte SELEX-Iterationen zu identifizieren und übertreffen bestehende State-of-the-Art-Methoden. Darüber hinaus erlaubt SICORs allgemeines Design den Vergleich beliebiger Graphen und garantiert somit eine breite Anwendbarkeit sowohl in verwandten Bereichen wie der Chemoinformatik als auch in angrenzenden Gebieten wie der Analyse von sozialen Netzwerken.

Das Verständnis der funktionellen Eigenschaften einer Proteinstruktur ist von fundamentaler Bedeutung für medizinische Bereiche wie die Medikamentenentwicklung. Im dritten Teil analysieren wir die Proteindynamik in einem informationstheoretischen Kontext und stellen eine Methode zur Identifikation von funktionalen Einheiten vor. Sie beruht auf der Kullback-Leibler-Divergenz DKL zwischen den Boltzmann-Verteilungen von zwei anisotropen Netzwerkmodellen (ANM). Hierbei definieren wir zunächst ein Mapping zwischen einem Ziel-ANM und einem dimensionsreduzierten Modell-ANM und minimieren die DKL in den Modellparametern. Durch Hinzufügen einer zweiten Optimierungsebene sind wir in der Lage, das optimale Mapping und die entsprechenden funktionellen Residuen zu identifizieren. Wir evaluieren die Aussagekraft unserer Methode durch einen Benchmark an einem Satz gut untersuchter Ionenkanalporen.

Alternative Abstract:

Mathematical models are essential tools for the structural analysis of biomolecules and complement experiments. Thanks to vastly increasing amounts of data, probabilistic approaches from areas such as statistical inference and machine learning are more prominent than ever. In this thesis we address three related biophysical problems and develop efficient models based on statistical mechanics to tackle them.

The first part considers sequence-based protein structure prediction. Over the last decade, rapidly growing sequence databases have made this a promising alternative to expensive and often limited experimental methods. We investigate direct-coupling analysis (DCA), which extracts contact information from a multiple sequence alignment (MSA). It corresponds to an inverse Potts model from statistical physics, where correlations are given in the form of empirical relative frequencies and the parameters of the Hamiltonian have to be inferred. Here, the q different amino acid types represent the spin states. The exponential growth of the number of terms in the partition function requires suitable approximation methods, such as mean-field inversion. We incorporate the following extensions to increase prediction accuracy.

1. DCA’s predictive power is limited because it considers only local fields and two-body couplings, although higher-order interactions are known to occur in proteins. We extend the Hamiltonian by a three-body coupling term and derive analytic equations within the mean-field approximation. In a subsequent evaluation on a benchmark data set, the extended model outperforms a plain two-body DCA model. Our implementation is highly parallel, resulting in fast runtimes on modern computers.

2. DCA’s contact prediction scores follow from the inferred two-body couplings. Each score is obtained by transforming a q × q coupling matrix into a single scalar, a step in which potentially important information is lost. We develop a scheme that uses all available coupling information. It is based on the inference of a secondary Potts model from an MSA consisting of the first DCA’s fields and couplings. A benchmark again shows improved accuracy.
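The plain two-body pipeline described above (empirical frequencies, mean-field inversion, reduction of each q × q coupling block to a scalar) can be sketched as follows. This is a minimal illustration and not the thesis implementation: the function name `mean_field_dca`, the pseudocount value, the omission of sequence reweighting and the use of the Frobenius norm as the scalar score are all assumptions made for the example, following common practice in the DCA literature.

```python
import numpy as np

def mean_field_dca(msa, q, pseudocount=0.5):
    """Toy mean-field DCA (hypothetical helper, not the thesis code).

    msa: integer array of shape (M, L) with states 0..q-1.
    Returns an (L, L) matrix of contact scores.
    """
    M, L = msa.shape
    # One-hot encode and drop the last state (gauge fixing), so that the
    # covariance matrix below is invertible.
    x = np.eye(q)[msa][:, :, : q - 1].reshape(M, L * (q - 1))

    # Empirical single-site and pair frequencies, mixed with a uniform
    # pseudocount (assumed value; no sequence reweighting is applied).
    fi = (1 - pseudocount) * x.mean(axis=0) + pseudocount / q
    fij = (1 - pseudocount) * (x.T @ x) / M + pseudocount / q**2
    for i in range(L):
        s = slice(i * (q - 1), (i + 1) * (q - 1))
        fij[s, s] = np.diag(fi[s])  # same-site blocks are diagonal

    # Mean-field inversion: couplings are minus the inverse covariance.
    cov = fij - np.outer(fi, fi)
    J = -np.linalg.inv(cov)
    J = J.reshape(L, q - 1, L, q - 1).transpose(0, 2, 1, 3)

    # Collapse each coupling block to one scalar (Frobenius norm, assumed).
    scores = np.linalg.norm(J, axis=(2, 3))
    scores[np.arange(L), np.arange(L)] = 0.0
    return scores
```

The last step is exactly the lossy transformation mentioned in item 2: each (q-1) × (q-1) block of `J` is reduced to one number, so the pattern of amino acid pairs that drives a coupling is discarded.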

The second part investigates the comparison of biomolecular structures. We develop the probabilistic subgraph isomorphism SICOR and apply it to RNA secondary structure graphs. The graphs come from a Systematic Evolution of Ligands by Exponential Enrichment (SELEX) experiment, where RNA aptamer selection builds on structural diversity. We are able to identify enriched SELEX iterations and outperform existing state-of-the-art methods. Furthermore, SICOR’s general design allows the comparison of arbitrary graphs, which makes it broadly applicable both in related fields such as cheminformatics and in unrelated tasks such as the analysis of social networks.

Understanding the functional properties of a protein structure is of fundamental importance for medical fields such as drug development. In the third part of this thesis we analyze protein dynamics in an information-theoretic context and present a method for the identification of functional units. It is based on the Kullback-Leibler divergence D_KL between the Boltzmann distributions of two anisotropic network models (ANMs). Here, we first define a mapping between a target ANM and a dimension-reduced model ANM and minimize D_KL with respect to the model parameters. By adding a second optimization level we are able to identify the optimal mapping and the corresponding functional residues. We assess the significance of our method with a benchmark on a set of well-studied ion channel pores.
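Because an elastic network model has a harmonic energy, its Boltzmann distribution is a zero-mean Gaussian, and the Kullback-Leibler divergence between two such distributions has a closed form. The sketch below evaluates that closed form under simplifying assumptions that are not taken from the thesis: both Hessians have the same dimension, both are positive definite (rigid-body zero modes already removed), and both models are at the same temperature. The function name `kl_harmonic` is hypothetical, and the mapping to a dimension-reduced model is not shown.

```python
import numpy as np

def kl_harmonic(hess_p, hess_q):
    """D_KL(p || q) for p, q proportional to exp(-x^T H x / (2 kT)).

    Assumes symmetric positive-definite Hessians of equal size. At equal
    temperature kT cancels, so it does not appear as a parameter.
    """
    d = hess_p.shape[0]
    # tr(H_q H_p^{-1}), computed via a linear solve instead of an inverse.
    trace_term = np.trace(np.linalg.solve(hess_p, hess_q))
    _, logdet_p = np.linalg.slogdet(hess_p)
    _, logdet_q = np.linalg.slogdet(hess_q)
    return 0.5 * (trace_term - d + logdet_p - logdet_q)
```

In a setting like the one described above, the entries of `hess_q` would depend on the model parameters, and this quantity would serve as the objective to minimize.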

Language: English

Status:

Publisher's Version

URN:

urn:nbn:de:tuda-tuprints-211486

Classification DDC:

000 Generalities, computers, information > 004 Computer science
500 Science and mathematics > 510 Mathematics
500 Science and mathematics > 530 Physics
500 Science and mathematics > 570 Life sciences, biology

Divisions:

05 Department of Physics > Institute for Condensed Matter Physics

TU-Projects:

DFG|GRK1657|GRK 1657

Date Deposited: