Generating Crystal Structures with Large Language Models

Sep 11, 2023   •   Luis M. Antunes

The success of the deep neural network-based Large Language Model (LLM) in Natural Language Processing (NLP), perhaps most famously embodied in the ChatGPT tool, has stimulated the imaginations of computational scientists, and challenged existing beliefs about the origins of intelligence and knowledge. Many argue that these models learn more than simply spurious correlations and surface statistics, and in fact develop internal world models—intricate causal models of the world they are shown through a stream of symbols.

Although the best-known examples of the application of LLMs involve natural human language and general domains, researchers have recently begun to develop LLMs tailored specifically for solving problems in the domain of Chemistry. (See, for example, the work of the Andrew White lab, and also the SMILES-BERT and ChemBERTa models.) Most work has involved molecular property prediction tasks, and less attention has been paid to solid-state inorganic materials.

Crystal structure information is often stored in the Crystallographic Information File (CIF) format. This format consists of structured, human-readable text. A CIF file contains various important pieces of crystallographic information, such as the crystal's chemical composition, space group, number of formula units in the unit cell, the unit cell's physical dimensions, and the fractional coordinates of each unique site in the unit cell. An intriguing question is whether LLMs can learn the chemistry inherent in solid-state structures by seeing many examples of CIF files. To answer this question, we undertook the development of a model we've called CrystaLLM, an LLM of the CIF file format.
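
To give a feel for what such a file looks like, here is a minimal sketch using the pymatgen library (an assumption of ours for illustration; nothing in our pipeline prescribes it) that builds a simple rock-salt structure and serializes it to CIF text:

```python
# A minimal sketch, assuming pymatgen, showing what the CIF text
# representation of a crystal structure looks like.
from pymatgen.core import Lattice, Structure

# Build a rock-salt NaCl cell; the 5.64 Å lattice parameter is illustrative.
structure = Structure.from_spacegroup(
    "Fm-3m",
    Lattice.cubic(5.64),
    ["Na", "Cl"],
    [[0, 0, 0], [0.5, 0.5, 0.5]],
)

# The printed text contains the fields described above: cell lengths and
# angles, the space group, and the fractional coordinates of each site.
print(structure.to(fmt="cif"))
```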

The Language of Crystals

A language model is trained on a corpus of text to predict the probability of a word occurring after a given sequence of preceding words. Normally, language models are trained on corpora of natural language. But they needn't be, and any kind of language will do. (In fact, the language needn't even be textual, and can be a sequence of pixels, for example.) The CIF file format is a language which provides a means of representing crystal structures in text, and it can also be the subject of language modelling.
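
More formally, an autoregressive language model factorizes the probability of a token sequence into a product of next-token conditionals:

$$ p(w_1, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1}) $$

and training amounts to maximizing this likelihood over the corpus, whether the tokens represent English words or the contents of a CIF file.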

The first step in training a language model on a corpus of CIF files involves tokenization. Tokenization is a procedure which converts the text of the corpus into a list of symbols, where each symbol may represent a word, a punctuation mark, or whatever abstractions we decide are most useful to work with. In the case of CIF files, we'll define tokens that represent CIF tags, such as _cell_length_a, as well as numeric digits, punctuation marks, and atomic symbols.
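
As an illustration, the following sketch tokenizes a CIF fragment with a regular expression. The token inventory here is simplified and hypothetical, not the exact vocabulary used by CrystaLLM:

```python
import re

# A simplified, hypothetical token inventory; the real vocabulary is larger,
# but the idea is the same. Earlier alternatives are matched first.
CIF_TOKEN_PATTERN = re.compile(
    r"(_[a-zA-Z_]+)"        # CIF tags, e.g. _cell_length_a
    r"|([A-Z][a-z]?)"       # atomic symbols, e.g. Na, Cl, O
    r"|(\d)"                # individual numeric digits
    r"|([().'\-/])"         # punctuation
    r"|(\n)"                # newlines are meaningful in CIF
)

def tokenize(cif_text: str) -> list[str]:
    """Convert CIF text into a flat list of string tokens."""
    return [m.group(0) for m in CIF_TOKEN_PATTERN.finditer(cif_text)]

print(tokenize("_cell_length_a 4.08\n"))
# ['_cell_length_a', '4', '.', '0', '8', '\n']
```

Tokenizing numbers digit by digit keeps the vocabulary small while still letting the model compose arbitrary numeric values.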

Once the corpus is tokenized, a language model (of the LLM variety) is trained in an unsupervised manner: a sequence of tokens is sampled from the corpus, and the model is tasked with predicting the token which follows each given token. More formally, for each given token, the model predicts a probability distribution over the vocabulary, and the cross-entropy loss between the predicted distribution and the target distribution (with all its mass on the target token) is minimized over the course of training (see Figure 1).

Figure 1: A depiction of the central concepts in training a Large Language Model of CIF files. A CIF file (left) is tokenized into a list of symbols. The list is processed by the model, which produces a list of probability distributions over the vocabulary, for each corresponding symbol in the input. The resulting predicted probability distributions are compared to the target distributions (each with all mass on the expected token), by computing the cross-entropy loss. The target tokens are the input tokens shifted one spot to the left (as the model must predict the next token given a sequence of preceding tokens). The tokens include CIF tags (blue), atoms (green), numeric digits (gold), and punctuation (red). Underlined tokens represent poor predictions.
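
The essence of this training step can be captured in a few lines of PyTorch. The toy model below is a stand-in used only to show the shifted-target, cross-entropy setup; it is not the actual CrystaLLM architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, seq_len, batch_size = 100, 64, 32, 8

# A toy stand-in for the real model: any network mapping a token sequence
# to per-position logits over the vocabulary fits this training scheme.
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)

# A random stand-in for a token window sampled from the tokenized corpus.
tokens = torch.randint(0, vocab_size, (batch_size, seq_len + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets: inputs shifted left by one

logits = model(inputs)  # (batch, seq_len, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # gradients for an optimizer step
```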

A Crystal Corpus

To create an effective and versatile language model, more data is better. We obtained CIF files from three different sources: the Materials Project, the OQMD, and NOMAD. In total, roughly 3.5 million CIF files were obtained, representing over 800,000 unique reduced compositions. However, a reduced composition can represent many structures, as there can be polymorphs with different multiples of atoms in the unit cell, as well as different space groups. Taking into account these considerations, there are roughly 2.3 million entries in the dataset that represent unique combinations of cell composition and space group. Most of the entries represent ternary or quaternary compounds, and most come from the OQMD or NOMAD (see Figure 2). The dataset consists of over 700 million tokens when tokenized.

Figure 2: A visual description of the training dataset. The bar chart illustrates the distribution of reduced compositions in terms of the number of constituent elements they contain. The Venn diagram illustrates the numbers of unique reduced compositions obtained from each of the publicly accessible materials databases used to create the training dataset.
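
As a sketch of the kind of deduplication involved (the helper below is illustrative, assuming pymatgen; it is not our exact procedure), entries can be grouped by a key combining the cell composition and the space group:

```python
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def dedup_key(structure: Structure) -> tuple[str, str]:
    """Key an entry by its cell composition and derived space group symbol."""
    space_group = SpacegroupAnalyzer(structure).get_space_group_symbol()
    return (structure.composition.formula, space_group)

# Keep one entry per unique (cell composition, space group) combination.
structures: list[Structure] = []  # in practice, parsed from the ~3.5 million CIF files
unique: dict[tuple[str, str], Structure] = {}
for s in structures:
    unique.setdefault(dedup_key(s), s)
```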

With the dataset in hand, we trained three versions of the model: a small model with 25 million parameters, a medium model with 85 million parameters, and a large model with 200 million parameters. (At the time of this writing, an extra-large model is in preparation, with 600 million parameters.)

Generating Crystal Structures

A language model can be used by sampling from the distributions it learned in training. In such a generative task, the model is given some number of tokens to start from (also known as a prompt), and then samples the next token given the existing context. The sampled token is concatenated to the existing sequence, and the process repeats until some terminating condition is reached (see Figure 3).

Figure 3: A depiction of the CIF file generation process. The first line of a CIF file (which includes the cell composition) is tokenized and processed by the model. A token is then sampled from the predicted distribution of the next token in the sequence. The sampled token is added to the list, and the process is repeated until a terminating condition is reached (e.g. a terminating token is generated).
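
A minimal version of this sampling loop, in PyTorch (again with a stand-in model; the temperature and the terminating condition are illustrative choices):

```python
import torch
import torch.nn.functional as F

def generate(model, prompt_tokens: list[int], end_token: int,
             max_tokens: int = 1000, temperature: float = 1.0) -> list[int]:
    """Autoregressively extend a prompt until an end token (or a length cap)."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        context = torch.tensor([tokens])   # (1, current_length)
        logits = model(context)[0, -1]     # logits for the next token
        probs = F.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_token)
        if next_token == end_token:        # terminating condition reached
            break
    return tokens
```

Lower temperatures make the sampling more deterministic, while a value near 1 preserves the learned distribution.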

In such a way, a complete and valid CIF file can be generated. To assess the quality of the model, we prompted the model with the first line of each CIF file in a held-out test set of about 10,000 structures. The models were able to generate CIFs which were consistent in terms of the printed space group more than 98% of the time, and which had reasonable bond lengths more than 76% of the time.
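
A sketch of the kind of consistency check involved, assuming pymatgen (the exact criteria we used may differ): parse the generated CIF, re-derive the space group from the generated coordinates, and compare it with the symbol printed in the file:

```python
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def space_group_is_consistent(cif_text: str, printed_symbol: str) -> bool:
    """Check that the symmetry implied by the generated coordinates matches
    the space group symbol printed in the generated CIF."""
    structure = Structure.from_str(cif_text, fmt="cif")
    derived_symbol = SpacegroupAnalyzer(structure).get_space_group_symbol()
    return derived_symbol == printed_symbol
```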

Experimenting with the generative capabilities of the model revealed that it was able to reliably generate well-known classes of inorganic compounds. It demonstrated the ability to generate rutiles, spinels, and elpasolites, for example (see Figure 4).

Figure 4: The generated structures of various classes of inorganic compounds. (a) AuO2. Color scheme: Au = yellow, O = red. Cell parameters: a, b: 4.838 Å, c: 3.429 Å; α, β, γ: 90.0°. (b) Sm2BO4. Color scheme: Sm = light green, B = green, O = red. Cell parameters: a, b, c: 8.918 Å; α, β, γ: 90.0°. (c) Sm2BS4. Color scheme: Sm = light green, B = green, S = yellow. Cell parameters: a, b, c: 10.884 Å; α, β, γ: 90.0°. (d) K2AgMoI6. Color scheme: K = brown, Ag = white, Mo = blue, I = purple. Cell parameters: a, b, c: 11.638 Å; α, β, γ: 90.0°. (e) KRb2TiF6. Color scheme: K = white, Rb = purple, Ti = brown, F = green. Cell parameters: a, b, c: 8.688 Å; α, β, γ: 90.0°.

The model also demonstrated the ability to generate pyrochlores. To test its ability to generalize beyond the training set, we constructed a space of pyrochlore compositions that the model had not seen in training, and prompted the model with those compositions. We then performed DFT relaxation calculations on the generated structures and compared them to the structures obtained from this ab initio method. Good agreement was observed between the generated and DFT-computed cell parameters (see Figure 5).

Figure 5: The generated vs. DFT-derived value of the cell parameter a for selected pyrochlores not in the dataset. The error bars represent ± one standard deviation of the a cell parameter across the three generation attempts (all of which resulted in the pyrochlore structure), while the y-coordinate of each point represents the mean value across the three attempts. Inset: The generated structure of Pr2Mn2O7, with cell parameters: a, b, c: 10.343 Å; α, β, γ: 90.0°. Color scheme: Pr = yellow, Mn = purple, O = red.

Searching for Stable Structures

It must be emphasized that such a generative model of CIF files has no inherent understanding of formation energy or structural stability. If the model has seen many polymorphs of a particular cell composition in training, there is no guarantee that it will generate the most stable one.

But the situation is analogous in NLP: a pre-trained model such as the one underlying GPT-4 is not, by itself, a very good chatbot, in that it does not produce the kinds of responses a human would expect for a given prompt, even if those responses are syntactically correct and well-formed. To make such a model useful for chatbot applications, it must be fine-tuned: its parameters must be adjusted so that it produces the responses we'd like. This is commonly done with a technique called Reinforcement Learning from Human Feedback (RLHF). We envision that a model such as CrystaLLM can also be fine-tuned, using an analogous technique we call Reinforcement Learning from Thermodynamic Feedback (RLTF). The result of this kind of fine-tuning should be a model which can generate stable structures, and this is an avenue of investigation we are actively pursuing.
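
As a loose sketch of what RLTF might look like (this is speculative: `generate_with_log_probs` and `predict_formation_energy` are hypothetical helpers, and practical RLHF-style pipelines use more sophisticated objectives such as PPO), a REINFORCE-style update would reward the model for generating low-energy structures:

```python
def rltf_step(model, optimizer, prompt_tokens,
              generate_with_log_probs, predict_formation_energy):
    """One REINFORCE-style update using predicted energy as (negative) reward."""
    # Hypothetical helper: samples a CIF from the model and returns its tokens
    # plus the summed log-probabilities of the sampled tokens (kept in the graph).
    tokens, log_prob_sum = generate_with_log_probs(model, prompt_tokens)
    # Hypothetical stand-in for a fast learned evaluator such as ALIGNN;
    # lower predicted energy means higher reward.
    reward = -predict_formation_energy(tokens)
    # Policy gradient (baseline subtraction omitted for brevity): increase the
    # probability of generations that received high reward.
    loss = -reward * log_prob_sum
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```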

To make a technique such as RLTF practical, we need a fast evaluator of a structure's energy. DFT would be much too slow for this purpose. Neural networks that predict a structure's formation energy have been actively developed over the past several years, and they have the advantage of providing fast predictions. One such model which can effectively predict a structure's energy is ALIGNN.

To evaluate the feasibility of using a model like ALIGNN as an evaluator of the formation energy per atom of generated structures, we took recently characterized structures from the inorganic chemical literature, and repeatedly prompted CrystaLLM with their compositions. We then evaluated the energy of the structures using ALIGNN. We found ALIGNN's predictions to be effective at discerning the stability of different polymorphs. An example is Tb3TeBO9, which was reported recently. CrystaLLM was able to generate the reported structure after 164 attempts, despite never having seen the compound in training, and we were guided to the relevant generation using ALIGNN's energy prediction (see Figure 6).
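
The procedure amounts to a generate-and-rank loop. Here is a sketch, with `generate_cif` and `predict_formation_energy` as hypothetical stand-ins for the generative model and the ALIGNN evaluator:

```python
def best_of_n(composition: str, n_attempts: int,
              generate_cif, predict_formation_energy):
    """Generate n candidate structures for a composition and keep the one
    with the lowest predicted formation energy per atom."""
    best_cif, best_energy = None, float("inf")
    for _ in range(n_attempts):
        cif_text = generate_cif(composition)         # prompt the model
        if cif_text is None:                         # skip invalid generations
            continue
        energy = predict_formation_energy(cif_text)  # fast learned evaluator
        if energy < best_energy:
            best_cif, best_energy = cif_text, energy
    return best_cif, best_energy
```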

Figure 6: A comparison of the generated structures and ALIGNN formation energies of Tb3TeBO9 and the structure deposited at the CCDC. All structures belong to space group P63 (Z=2). Cell parameters: CCDC structure: a, b: 8.633 Å, c: 5.406 Å; α, β: 90.0°, γ: 120.0°. Generation attempt #1: a, b: 8.4878 Å, c: 5.8336 Å; α, β: 90.0°, γ: 120.0°. Generation attempt #164: a, b: 8.4681 Å, c: 5.8530 Å; α, β: 90.0°, γ: 120.0°. Color scheme: Tb = purple, Te = bronze, B = green, O = red.

While these results are noteworthy, random search is not a very effective way to look for stable structures. We are therefore currently investigating various heuristic search algorithms which incorporate ALIGNN's predictions. These algorithms should improve over brute-force random search, both by reducing the number of generations required and by improving the quality of the structures found.

Synthesizing New Chemical Knowledge

Computational systems which are capable of discovery and the synthesis of new knowledge, guided by some form of reasoning, are perhaps the ultimate goal of Artificial Intelligence. A recent demonstration of such a system is DeepMind's AlphaGo Zero, which shocked the world by overturning hundreds of years of thought on how the game of Go is best played. We wondered if an LLM trained on a large corpus of chemical structure information could analogously synthesize new chemical knowledge.

The new material LiTa2NiSe5 was recently investigated. The authors reported a layered structure, with Li atoms intercalated between slabs of Ta, Ni, and Se atoms. We prompted CrystaLLM with the composition, and on the 75th attempt, it produced a structure which resembled the reported structure (and which also had the lowest ALIGNN energy), despite never having seen this structure in training. Curious about the origin of this structure, we searched the training set for compounds resembling the composition in question, and found two: Ta2NiSe5 and NaSn2CuSe5 (see Figure 7). These two compounds share structural similarities with LiTa2NiSe5, and although we can't be certain, it appears as though the model arrived at the generated structure by some form of analogy.

Figure 7: (a) The generated structure of LiTa2NiSe5 (a: 3.517 Å, b: 13.362 Å, c: 15.156 Å, Z=4), which resembles the recently reported structure of P. A. Hyde et al. 2023. (b) The structure of Ta2NiSe5, seen in training. (c) The structure of NaSn2CuSe5, a representative of NaM2CuSe5 (M=Sn, Os, Ni, Nb, Mn, Pr, Hf), which were seen in training. Color scheme: Na = gold, Sn = purple, Cu = blue, Se = green.

Online Demo and API

We've created several means by which anyone can experiment with the models. First, an online demo app provides a way for users to prompt the model with a composition (and optionally Z and space group). Generated structures are displayed in a browser-based structure viewer provided by the Crystal Toolkit framework. Second, an API is available for programmatic access and high-throughput use. The API is implemented with a standard REST interface, and clients simply communicate over HTTP. The predicted formation energy per atom can optionally be included with the generated structure in the API response.
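
For example, a request to the API might look something like the following. The endpoint URL, field names, and parameters here are hypothetical, for illustration only; consult the API documentation for the real interface:

```python
import requests

# Hypothetical endpoint and payload: the real API may differ.
response = requests.post(
    "https://api.example.com/generate",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "composition": "NaCl",   # composition to prompt the model with
        "spacegroup": "Fm-3m",   # optional space group
        "z": 4,                  # optional number of formula units
        "include_energy": True,  # also return predicted formation energy per atom
    },
    timeout=60,
)
response.raise_for_status()
result = response.json()
print(result["cif"], result["formation_energy_per_atom"])
```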

An API key is required to use the API. If you are interested in obtaining an API key, or have any other questions or feedback, please contact us.


Notes