Conquering the Molecules
Computational quantum chemistry is due for an explosion.
Could computational quantum chemistry soon have a research explosion? I think this is likely. The field, which is the first-principles modelling of molecules at the quantum level, could see a surge in importance as its relationship with, and usefulness to, other scientific fields develops. Within ten years, I think its importance could be comparable to that of the current artificial intelligence boom. In this post, I’ll list several fairly non-obvious reasons why. I’ll explain its involvement in the life sciences, deep learning and spectroscopy. I’ll also give my take on the engineering problems whose solution could help the field develop more quickly.
The first-principles modelling of many molecules is hard - we can’t solve the underlying equations by hand because, in the forms we actually work with, they are complicated, coupled non-linear differential equations. So we have to resort to using computers, which opens up a whole new avenue of complexity. By the end of this article, I hope I can convince you to invest some time into learning about the problems and techniques in the area. If you’re obsessed with niche graph theory, or crazy about theoretical physics, stick with that. But if you’re on the fence - maybe I can persuade you that research here might be worthwhile.
The Life Sciences
Quantum chemistry concerns the modelling of many-body systems at the quantum level. There are two ingredients for a research explosion in the field. The first is the presence of societally relevant open problems. The second is a catalyst.
As for the first ingredient, I think that there are plenty of open problems in the life sciences which hinge on a better understanding of the physics of many-electron systems. As for the second, I think that progress will be catalysed by better compute power, as our chip manufacturing capabilities grow in line with Moore’s Law. The following sections outline some of the open problems.
The obvious one is protein folding. If we can predict how proteins fold, we can get an insight into their shape, and therefore their function. So we can figure out how they’re involved in disease, and build better drugs to bind to them. Even with the invention of DeepMind’s AlphaFold, we still don’t understand how to model protein folding from a first-principles, quantum mechanical perspective. AlphaFold right now is still essentially a very large learned interpolator over known structures, with no insight into the mechanics of folding itself.
We still don’t know how to make enzymes as good as the ones that have evolved in nature. There is still a wide research effort, driven in large part by the work of Judith Klinman, to understand possible quantum tunnelling effects that might be speeding up reaction rates in enzymes. Jim Al-Khalili has written a great exposition on the state of quantum mechanics in biology in his book ‘Life on the Edge’.
We also still don’t know how to replicate photosynthesis as efficiently as plants do in nature, and we don’t fully understand the mechanism by which light drives the conversion of CO2 and the splitting of water. Imagine if we could genetically engineer ourselves to photosynthesise - that means we wouldn’t need to eat anymore.
Many diseases arise from the unwanted actions of proteins. For example, the GSK3 protein acts by adding phosphate groups to two types of amino acid (serine and threonine), in a process important to cell signalling - uninhibited GSK3 is thought to be linked to bipolar disorder. So suppose you wanted to make a medicine. One strategy you could use is to modify the action of a problematic protein in your body. To do this, you could design a molecule that covalently bonds to the protein in a specific place, inhibiting its action. But to design such a molecule, you would need to find out, by experiment or with a computer, what its properties would be. You would need to know things like its electric charge distribution, to understand if it could feasibly stick to the protein, or the energies it can have, to gauge its reactivity.
To do all of this effectively, we need to work out how to model molecules. But modelling the physical properties of molecules without computational techniques is insanely hard. Sure, you could cook up loads of compounds and test to see what works. But that’s expensive, time consuming, and dangerous.
So this is where the ‘computational’ in computational chemistry comes in. Ideally, we’d want to model everything on a computer beforehand. Because molecules are so small, we need to model them with quantum physics. And yet Schrödinger’s equation is extremely hard to solve by hand and on paper alone - it took me three years to learn how to solve for the spectrum of hydrogen alone. Later on in this essay, I’ll give an explicit example of why this is the case.
As a Tool to Train Machine Learning
Electronic structure calculations, which compute the quantum states of a system, are slow. The equation that governs quantum systems is the Schrödinger equation, and in practice it is solved as an eigenvalue problem: we ‘diagonalise’ a very large matrix that contains all of the information about the system. This is hard from a numerical standpoint, because matrix diagonalisation scales poorly: for a dense matrix the cost grows roughly as the cube of the matrix size, and the matrices themselves grow quickly with the number of electrons. To do computational chemistry, most university departments have a cluster of high-performance computers that run for days to solve a single problem.
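To make the ‘eigenvalue problem’ concrete, here is a minimal sketch in Python with NumPy. It is a toy of my own choosing and not a real electronic structure calculation: a single particle in a one-dimensional box, discretised on a grid. The function names, the grid sizes and the use of atomic units are all assumptions made for illustration.

```python
import numpy as np

def box_hamiltonian(n_points, length=1.0):
    """Finite-difference Hamiltonian for one particle in a 1D box.

    Toy assumptions: atomic units (hbar = m = 1), zero potential inside
    the box, and a wavefunction that vanishes at both walls.
    """
    dx = length / (n_points + 1)                 # spacing between interior grid points
    main = np.full(n_points, 1.0 / dx**2)        # diagonal of -1/2 d^2/dx^2
    off = np.full(n_points - 1, -0.5 / dx**2)    # coupling between neighbouring points
    return np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

def ground_state_energy(n_points, length=1.0):
    # eigvalsh diagonalises a symmetric matrix; eigenvalues come back in ascending order
    return np.linalg.eigvalsh(box_hamiltonian(n_points, length))[0]

if __name__ == "__main__":
    exact = np.pi**2 / 2.0                       # analytic ground state for length = 1
    for n in (10, 100, 400):
        print(n, ground_state_energy(n), exact)
```

A finer grid gets closer to the exact answer, but the matrix grows with it. Since dense diagonalisation costs roughly the cube of the matrix size, doubling the number of grid points means about eight times the work, and that is for a single particle in one dimension.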
But what if there was a better way - could we figure out a way to use the developments in deep learning to get to our final states quicker? Take, for example, AlphaFold, which was trained on protein structures that were already painstakingly solved with physical techniques. In principle, we could train learning algorithms on structures that were solved with a computer. If this were the case, then we could use machine learning to quickly get results that were first-order approximations to the time-intensive first-principles calculations. I know for a fact that there is a flurry of researchers already working on this problem for small molecules.
To do this, we would need a comprehensive dataset covering an insanely large number of molecules, along with their electronic structure calculations. This is why expertise in first-principles calculations is going to be valuable in the future. Datasets today, especially for nucleic acids, are sparse. There also isn’t really a centralised, clean database where researchers can share datasets and train models on them. Personally, I’ve found this frustrating. When I contact researchers, they tend not to share datasets, or don’t reply. I think that in principle, there’s nothing stopping us from doing this - platforms like Kaggle or Hugging Face have been very successful in the machine learning world.
This argument is predicated on computer chips getting better, and compute becoming more widely available. I think that this trend will continue. There are already startups whose product is cloud-based compute for computational chemistry research groups.
Seeing Biology
There’s also an interesting case to be made for the observation and measurement of biological processes. Imagine if we could ‘see’ protein shapes and interactions. Right now, it’s still extremely hard, since we don’t have the right tools to look at stuff that small. But to make the right tools, we would need to model how proteins interact with light.
One of the ways to model the interaction, again, is through computational chemistry and electronic structure calculations. Understanding absorption spectra relies on being able to calculate the dipole moments of, and the energy gaps between, the different excitations a molecule can have. There are still open problems in predicting the spectra of proteins. This is something that I am personally invested in as a research topic: being able to model the spectra that you’d see in circular dichroism.
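As a rough sketch of how a list of excitations becomes a spectrum, the snippet below places one Gaussian at each energy gap, scaled by that excitation’s strength, and sums them. The excitation energies, strengths and line width are invented numbers, and the function name is my own; in a real calculation they would come out of an electronic structure code.

```python
import math

def broadened_spectrum(energies_ev, strengths, grid_ev, width_ev=0.2):
    """Sum one Gaussian per excitation to turn 'sticks' into a smooth spectrum.

    energies_ev: excitation energies (the energy gaps), in eV
    strengths:   intensity of each excitation (related to its dipole)
    grid_ev:     energies at which to evaluate the spectrum
    width_ev:    Gaussian standard deviation; an empirical choice
    """
    norm = 1.0 / (width_ev * math.sqrt(2.0 * math.pi))
    spectrum = []
    for e in grid_ev:
        total = 0.0
        for e0, f in zip(energies_ev, strengths):
            total += f * norm * math.exp(-0.5 * ((e - e0) / width_ev) ** 2)
        spectrum.append(total)
    return spectrum

if __name__ == "__main__":
    # Made-up excitations, purely for illustration (not real data)
    energies = [4.0, 5.5]
    strengths = [0.1, 0.6]
    grid = [3.0 + 0.01 * i for i in range(400)]   # roughly 3.00 to 6.99 eV
    spec = broadened_spectrum(energies, strengths, grid)
    peak = grid[spec.index(max(spec))]
    print(f"strongest absorption near {peak:.2f} eV")
```

For circular dichroism the same idea applies, except that the strengths can be negative as well as positive, so the resulting curve can dip below zero.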
In theory, if you could compute the spectra of proteins via electronic structure calculations ahead of time, without having to do an experiment, you could then back-solve the shape of a protein from its measured spectrum, by matching it against similar structures that you have precomputed. But this would be hard, and to the best of my knowledge, we don’t have much ability to compute a circular dichroism spectrum for anything larger than 50 molecules. Which is still really small.
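The matching step could, in its simplest form, be a nearest-neighbour lookup. The sketch below compares a measured spectrum against a small library of precomputed ones using cosine similarity. The library entries, their labels and all of the numbers are hypothetical, and cosine similarity is just one reasonable choice of score, not an established method for this problem.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two spectra sampled on the same grid (1.0 means same shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(measured, library):
    """Return the name of the library entry whose spectrum is most similar."""
    return max(library, key=lambda name: cosine_similarity(measured, library[name]))

if __name__ == "__main__":
    # Hypothetical precomputed spectra on a shared 5-point grid (invented numbers)
    library = {
        "helix-like": [0.1, 0.9, -0.8, -0.6, 0.0],
        "sheet-like": [0.0, 0.5, 0.2, -0.7, -0.1],
        "coil-like": [-0.6, -0.2, 0.1, 0.0, 0.0],
    }
    measured = [0.12, 0.85, -0.75, -0.55, 0.02]   # a noisy stand-in for an experiment
    print(best_match(measured, library))
```

The lookup itself is cheap. The expensive part is filling the library with spectra that are accurate enough to be worth matching against.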
Pedagogy and Tooling
I think one of the ingredients of a research explosion is ‘underdevelopedness’. By this, I mean gaps in the field which are seemingly low-hanging fruit, since they are pedagogical or ‘tooling’ based issues.
In terms of pedagogy, some part of me thinks that the field isn’t developed yet because the learning curve is still too steep. There could be a substantial contribution in simplifying and organising the literature. The most basic concepts already require a decent understanding of numerical methods, quantum physics and chemistry, and so there is a high barrier to entry. But that doesn’t mean we can’t simplify and streamline the literature to create texts that unify the three areas.
I personally spent hours trying to understand probably the most foundational of the techniques, something called Hartree-Fock. To the best of my knowledge, there is pretty much only one book that gives an explicit and clear exposition of the numerical methods in the field - a book by Szabo and Ostlund.
In terms of tooling, as a fairly new researcher, I think that the tools used in computational chemistry feel disjointed and clunky. If you’ve followed my blog, you’ll know I’ve written about this before. Most researchers still use software written in Fortran, which then outputs a bunch of text files with analysis of the molecules that they are studying. Wouldn’t it be great if there was a culture of putting everything in Python, so that researchers wouldn’t have to spend time parsing the results of outdated software?
As I mentioned earlier - the organisation of datasets also feels very clunky. The centralised repositories are hard to use and don’t really have APIs. It’s hard to replicate experiments, and even the simplest experiments don’t have much documentation to guide people who are starting out. It’s no wonder that, anecdotally, most people in the field take five to six years to finish a PhD in the US. To compare it to the progress in AI, why isn’t there a Kaggle or Hugging Face type platform where quantum structure calculations can be easily shared?
Conclusion
I wrote this piece as a way to get my thoughts clear on why I do research in this area. I by no means think that my arguments are definitive yet, since I’m still learning. But hopefully, I’ve put down some points that are food for thought.
Technical - Modelling Molecules is Really Complex
I’m going to start with the simplest molecule you can think of - two hydrogen atoms bonded together to make H2 - and illustrate the complexity of even a simple construction. As chemists, we are interested in getting the ground state, or smallest value, of the energy that a system can have. The energy can be split up into two types: kinetic and potential.
To see what this actually looks like in math terms, remember that atoms are made of nuclei and electrons. In this case, we have two nuclei and two electrons. Modelling all four things is already too hard. For our case, we will make an approximation and pretend that the two nuclei are fixed in space, and study the kinetic and potential energy contributed by the two electrons, where we label their positions in space by r_1 and r_2. You’ve probably heard of the physicist Robert Oppenheimer. This approximation bears his name: it is the Born-Oppenheimer approximation.
Both electrons have some kinetic energy, which gives the first two terms in the equation below. Both electrons are also attracted to the positively charged nuclei, which gives the next four terms - the first electron, at r_1, interacts with the nuclei at a_1 and a_2, and similarly for the second electron at r_2. The last term is interesting, because the two electrons also interact with each other - it’s the potential energy we get from the two electrons repelling each other.
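Written out term by term, with the nuclei fixed at a_1 and a_2, the energy expression being described takes the following standard form. I am assuming atomic units here, so that constants such as the electron mass and charge drop out. I have also left out the constant repulsion between the two fixed nuclei.

```latex
\hat{H} =
  -\tfrac{1}{2}\nabla_{1}^{2} - \tfrac{1}{2}\nabla_{2}^{2}
  - \frac{1}{|r_1 - a_1|} - \frac{1}{|r_1 - a_2|}
  - \frac{1}{|r_2 - a_1|} - \frac{1}{|r_2 - a_2|}
  + \frac{1}{|r_1 - r_2|}
```

The two Laplacian terms are the kinetic energies of the electrons, the four negative terms are the attractions between each electron and each nucleus, and the final positive term is the electron-electron repulsion.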
What we want to do is find the explicit form of the states of each electron that make this expression as small as possible, and get a value for that number.
The problem, though, is that the equations you have to solve are really hard to solve with just a pen and paper. Because each electron feels the repulsion of the other, the equation for one electron’s state depends on the state of the other electron, which makes the problem non-linear: solutions don’t add together nicely to give new solutions. To the best of our knowledge, there is no general way to solve these kinds of problems with a formula. In my opinion, this difficulty makes molecular physics a lot more interesting. We need methods that can be carried out on computers.
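The idea of searching for the state that makes the energy as small as possible can be sketched on a much simpler system than H2. The example below is a swapped-in toy, not the H2 calculation: a single hydrogen atom with a one-parameter Gaussian guess for the electron’s state, exp(-alpha * r^2). The energy formula for that guess is the standard textbook result in atomic units, and the function names and search interval are my own choices.

```python
import math

def gaussian_trial_energy(alpha):
    """Energy (in Hartree) of a hydrogen atom for the trial state exp(-alpha * r^2).

    Kinetic part: 3*alpha/2. Potential part: -2*sqrt(2*alpha/pi).
    """
    return 1.5 * alpha - 2.0 * math.sqrt(2.0 * alpha / math.pi)

def minimise(f, lo, hi, tol=1e-10):
    """Golden-section search for the minimum of a single-valley function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            hi, d = d, c                      # minimum lies in [lo, d]
            c = hi - inv_phi * (hi - lo)
        else:
            lo, c = c, d                      # minimum lies in [c, hi]
            d = lo + inv_phi * (hi - lo)
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    best_alpha = minimise(gaussian_trial_energy, 0.01, 2.0)
    print("best alpha:", best_alpha)
    print("variational energy:", gaussian_trial_energy(best_alpha))
    print("exact ground state: -0.5")
```

The best Gaussian guess lands above the exact value of -0.5 Hartree, as any trial state must. Real calculations improve on this by combining many such functions, and that is where the large matrices and the computers come in.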
