Distributed Representations of Atoms

Mar 18, 2022   •   Luis M. Antunes

An important aspect of applying Machine Learning (ML) to a problem involves choosing a suitable representation for the objects under consideration. For example, to classify a flower that belongs to one of a number of different species, one might choose to represent the flower as a list of scalar and categorical features, such as petal length, petal width, and color. The ability of an ML model to properly perform a task depends strongly on the quality of the representation used. In fact, this is so important in determining the success of an ML project that feature engineering is often the activity that receives the most attention.

For many types of problems that ML is applied to, the advent of Deep Learning changed things dramatically. Instead of requiring the manual engineering of features, Deep Learning is often capable of automatically building suitable representations from more basic, and often higher-dimensional, representations of the objects under consideration. In Computer Vision and image processing, this basic representation typically consists of the pixels of the image. In Natural Language Processing (NLP), the basic representation is often a one-hot vector (when the objects are individual words) or a bag-of-words vector (when the objects are sentences). In board games, such as Go or Chess, the representation can be layers of binary matrices that represent player and opponent pieces and their relative positions (amongst other elements of the game). In all of these examples, the basic representations are high-dimensional, and require relatively little effort from the ML practitioner to devise. Classic ML algorithms, such as the Support Vector Machine or the Random Forest, would likely have trouble learning from such high-dimensional representations. A Deep Learning model, on the other hand, is often able to automatically extract the representation best suited for the task at hand starting from such basic representations, since internal representations are learned end-to-end by taking advantage of the error signal provided by backpropagation.

Local vs. Distributed Representations

The representations a Deep Learning model learns, however, may not be readily interpretable by a human, and they usually stand in contrast to classical, manually constructed representations. Such manually curated feature vectors can be called local representations, since each element in the list of features may have little to do with the other elements, and typically represents something concrete and intelligible to a human. Representations in which the elements are coupled in some way, and work together to represent the object under consideration, can be called distributed representations. A Deep Learning model usually learns distributed representations.

Figure 1: Illustration of one-hot and distributed representations. In the diagram, there are n kinds of objects represented, and d is the adjustable number of dimensions of the distributed representation.
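
In code, the two layouts in Figure 1 might look something like the following sketch; the vocabulary, the dimensionality d, and the random embedding values are purely illustrative.

    import numpy as np

    # A toy vocabulary of n objects (a few atom symbols, chosen only for illustration).
    vocab = ["H", "O", "Si", "Fe"]
    n = len(vocab)
    d = 3  # the adjustable dimensionality of the distributed representation

    def one_hot(symbol):
        # Local, one-hot representation: an n-dimensional vector with a single 1.
        v = np.zeros(n)
        v[vocab.index(symbol)] = 1.0
        return v

    # Distributed representation: each object maps to a dense d-dimensional vector.
    # Here the embedding matrix is random; in practice it would be learned.
    embeddings = np.random.default_rng(0).normal(size=(n, d))

    print(one_hot("O"))                   # [0. 1. 0. 0.]
    print(embeddings[vocab.index("O")])   # a dense 3-dimensional vector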

But what do these distributed representations mean? How do they represent the object under consideration? Essentially, they convert the object into a point in a multi-dimensional (and usually lower-dimensional) space, in such a way that its relationships to other objects are preserved. This usually means that the Euclidean distance between two such objects/points reflects their similarity. These representations thus provide a more principled structure to the input data, and are usually lower-dimensional, which should allow an ML model to learn a task more quickly and effectively.

Word2vec

This is nothing terribly new in the world of ML. In fact, the power of distributed representations was perhaps first demonstrated in the field of NLP. In NLP tasks, individual words are commonly represented by vectors. As mentioned above, these can be one-hot vectors, but they are often pre-trained distributed representations. Pre-training is a procedure that typically involves applying an unsupervised learning algorithm to a dataset, and it has been one of the main ways that distributed representations of objects, such as words, are created. Representations created this way can be re-used in downstream tasks, which has the effect of accelerating and even improving learning. One such unsupervised learning algorithm for developing pre-trained distributed representations of words is Word2vec. The result of the algorithm is that each word in the dataset's vocabulary is assigned a unique vector in a learned semantic space, extracted automatically from the structure of the data. And since the words inhabit a structured semantic space, they can be combined using arithmetic operations (such as addition and subtraction) to produce new vectors which also inhabit the same space.
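
As a toy illustration of what combining such vectors looks like, here is a sketch with invented vectors; the arithmetic and nearest-neighbour lookup mirror the familiar "king - man + woman ≈ queen" example.

    import numpy as np

    # Hypothetical pre-trained word vectors (the values are invented for illustration).
    vectors = {
        "king":  np.array([0.8, 0.6, 0.1]),
        "queen": np.array([0.8, 0.6, 0.9]),
        "man":   np.array([0.1, 0.2, 0.1]),
        "woman": np.array([0.1, 0.2, 0.9]),
        "store": np.array([0.5, 0.1, 0.4]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def nearest(query, exclude):
        # Return the word whose vector is most similar to the query vector.
        candidates = {w: v for w, v in vectors.items() if w not in exclude}
        return max(candidates, key=lambda w: cosine(candidates[w], query))

    # king - man + woman lands nearest to queen in this toy semantic space.
    print(nearest(vectors["king"] - vectors["man"] + vectors["woman"],
                  exclude={"king", "man", "woman"}))   # queen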

How does the Word2vec algorithm work? Consider the adage "You shall know a word by the company it keeps." (attributed to J. R. Firth). The algorithm takes this concept and formalizes it as a learning problem. Specifically, the objective is to produce a model that assigns maximal probability to a word when it occurs in the context of certain other words. In practice, this usually means the model must learn to predict the neighbouring words that most often appear with it. The idea is that a word gets its meaning from the context in which it is used. Consider the statement "A person walks into a store". How probable is it that the word "person" appears in the context of the words "walks" and "store"? It's surely more probable than the word "car" appearing in that context, for example. As we'll see shortly, the model used to predict neighbouring words is a feed-forward network with a single hidden layer, and the initial parameter matrix that transforms the input into the hidden layer is where we'll find the distributed representations for each word.
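
To make the setup concrete, here is a minimal sketch of how training examples are typically generated in the skip-gram variant of Word2vec; the window size and the example sentence are arbitrary choices.

    # Generate (target, context) training pairs, skip-gram style, from a sentence.
    def skipgram_pairs(tokens, window=2):
        pairs = []
        for i, target in enumerate(tokens):
            # The context is every token within `window` positions of the target.
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs.append((target, tokens[j]))
        return pairs

    sentence = "a person walks into a store".split()
    for target, context in skipgram_pairs(sentence, window=2):
        print(target, "->", context)
    # e.g. person -> a, person -> walks, person -> into, ...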

What does any of this have to do with atoms, or Materials Science? Consider that ML and Deep Learning are playing increasingly prominent roles in Computational Materials Science. It's not surprising, since many of the tasks encountered in Computational Materials Science involve the prediction of the properties of a material. The number of datasets related to materials properties is also increasing. It therefore seems natural that the advances made in ML should find application in Computational Materials Science. As such, a natural question to ask is: how are materials represented in the context of an ML task? The approaches being used are rapidly changing, but traditionally, materials and atoms are represented using local representations. For example, it is common to see atoms represented as vectors of features such as electronegativity, atomic radius, ionic radius, etc. If a specific class of material is being considered, such as the perovskites, with formula ABX3, then a material can, for example, be represented by concatenating the A, B and X atom vectors.
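
For instance, a hand-crafted, local representation of a perovskite such as CaTiO3 might be assembled like this; the feature values below are approximate and included only to illustrate the layout.

    import numpy as np

    # Hand-crafted ("local") atom features: electronegativity, atomic radius (pm)
    # and a typical ionic radius (pm). Values are approximate, for illustration only.
    features = {
        "Ca": [1.00, 180.0, 100.0],
        "Ti": [1.54, 147.0,  60.5],
        "O":  [3.44,  60.0, 140.0],
    }

    def perovskite_vector(a, b, x):
        # Represent an ABX3 perovskite by concatenating the A, B and X atom vectors.
        return np.concatenate([features[a], features[b], features[x]])

    print(perovskite_vector("Ca", "Ti", "O"))   # a 9-dimensional local representation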

SkipAtom

In recent years, pre-trained distributed representations have begun making their way into Computational Materials Science as well. Two of the more widely known atomic representations of this kind are Atom2Vec and Mat2Vec. We won't go into how these algorithms work here, but in brief, Atom2Vec takes the compositions found in materials databases, constructs a large co-occurrence matrix of atoms and their atomic environments, and performs singular value decomposition to obtain atom vectors. Mat2Vec, on the other hand, applies the Word2vec algorithm to a large number of abstracts from the materials literature, resulting in representations being learned for atoms (in addition to other kinds of entities). In a paper published today in npj Computational Materials, we introduce another algorithm for building distributed representations for atoms, which we call SkipAtom.
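
This is not the actual Atom2Vec implementation, but a toy sketch of the SVD step it relies on: given an atoms-by-environments co-occurrence matrix, a truncated SVD yields low-dimensional atom vectors. The atoms and counts below are invented.

    import numpy as np

    # A toy co-occurrence matrix: rows are atoms, columns are "atomic environments".
    atoms = ["Na", "K", "Cl", "O"]
    counts = np.array([
        [5., 0., 2., 1.],
        [4., 1., 2., 0.],
        [0., 6., 1., 3.],
        [1., 5., 0., 4.],
    ])

    # Truncated SVD: keep the top-d singular directions as d-dimensional atom vectors.
    d = 2
    U, S, Vt = np.linalg.svd(counts, full_matrices=False)
    atom_vectors = U[:, :d] * S[:d]

    for atom, vec in zip(atoms, atom_vectors):
        print(atom, vec)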

We introduced SkipAtom because we believe it has some advantages over existing approaches (which we discuss in the paper). The SkipAtom approach makes an analogy between words and atoms, and between sentences and chemical compositions. It attempts to formalize the idea that an atom shall be known by "the company it keeps": we expect to learn the "chemical semantics" of an atom by observing the chemo-structural context in which it occurs. This means that, analogously to Word2vec, we build a model in an unsupervised fashion that assigns maximal probability to an atom when it occurs within the context of certain other atoms. In practice, the model learns to predict which atoms typically surround any given atom. We won't go into the details here, but we do need to be precise about what we mean by "surrounds": in brief, existing algorithms that operate on a crystal structure are applied to create a graph representing which atoms are connected to which, and the immediate neighbours of a given atom in that graph are taken as its context.

Figure 2: Crystal structures in a materials database are converted into co-occurring atom pairs for training of SkipAtom vectors.
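
One possible way to realize this pair-generation step is sketched below, using pymatgen's CrystalNN to build the bonding graph; treat this as an illustration rather than the exact procedure used for SkipAtom.

    from pymatgen.core import Lattice, Structure
    from pymatgen.analysis.local_env import CrystalNN

    # A small example structure (rock-salt NaCl); in practice, each structure
    # would come from a database such as the Materials Project.
    structure = Structure.from_spacegroup(
        "Fm-3m", Lattice.cubic(5.64), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]
    )

    # CrystalNN is one of several neighbour-finding algorithms that turn a crystal
    # structure into a bonding graph.
    cnn = CrystalNN()

    pairs = []
    for i, site in enumerate(structure):
        for info in cnn.get_nn_info(structure, i):
            pairs.append((site.species_string, info["site"].species_string))

    print(pairs[:6])   # e.g. [('Na', 'Cl'), ('Na', 'Cl'), ...]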

A crystal structure database is used for learning the SkipAtom vectors. We use the Materials Project database, from which we obtain over 126,000 compounds and their crystal structures. The first step requires the creation of training pairs, as depicted in Figure 2. This results in over 15 million pairs being generated. These pairs of co-occurring atoms are then used in a prediction task: given the first atom in the pair, the model must predict what the other atom is. In practice, this means minimizing the cross-entropy loss between the actual paired atom and the predicted one, with atoms represented as one-hot vectors at this stage. The atom vectors are found in the embedding matrix, We (see Figure 3).

Figure 3: A depiction of how SkipAtom vectors are derived through training. An atom, represented as a one-hot vector, x, is multiplied with the matrix We to yield the intermediate vector h. Then h is multiplied with Ws, and a softmax operation is applied to the result to obtain the output ŷ. During training, the cross-entropy loss between the context atom, represented by y, and the predicted atom, ŷ, is minimized. The columns of the matrix We will be the learned atom vectors, after training is complete.
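
A minimal numpy sketch of the forward pass and loss depicted in Figure 3; the number of atom types n is assumed here, and with the row-vector layout used below the atom vectors end up as the rows of We rather than its columns.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 86    # number of atom types, i.e. the one-hot vector size (value assumed)
    d = 200   # dimensionality of the SkipAtom vectors

    # The two parameter matrices of Figure 3, randomly initialized and learned by
    # backpropagation. In this row-vector layout, the atom vectors are the rows of W_e.
    W_e = rng.normal(scale=0.1, size=(n, d))
    W_s = rng.normal(scale=0.1, size=(d, n))

    def forward(x):
        # One-hot x -> intermediate vector h -> softmax output y_hat.
        h = x @ W_e                      # equivalent to selecting one row of W_e
        logits = h @ W_s
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    def cross_entropy(y, y_hat):
        # y is the one-hot vector of the observed context atom.
        return -np.sum(y * np.log(y_hat + 1e-12))

    # A single (target, context) training pair; training minimizes this loss,
    # averaged over all of the generated pairs.
    x = np.eye(n)[3]    # the target atom, as a one-hot vector
    y = np.eye(n)[17]   # the context atom, as a one-hot vector
    print(cross_entropy(y, forward(x)))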

Results

A common way of visualizing the results of learning such representations is to apply a dimensionality reduction technique, such as PCA or t-SNE. The vectors can be reduced to 2 or 3 dimensions, and then plotted. We applied t-SNE to the learned SkipAtom atom vectors, reducing them to 2 dimensions from their original 200 dimensions, and plotted them. See Figure 4. There appears to be logical structure to the data. For example, the alkali metals are clustered together, as are the light non-metals. It is important to note that while the locations of the atoms in the plot may look somewhat arbitrary, they in fact reflect chemo-structural nuances gleaned from the dataset.

Figure 4: Dimensionally reduced SkipAtom atom vectors with an original size of 200 dimensions. The vectors were reduced to 2 dimensions using t-SNE.
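
A sketch of this kind of projection, using scikit-learn's t-SNE; the array below is a random stand-in for the learned SkipAtom vectors, and the number of atom types is assumed.

    import numpy as np
    from sklearn.manifold import TSNE

    # Stand-in for the learned SkipAtom vectors: one 200-dimensional vector per
    # atom type (random values here).
    atom_vectors = np.random.default_rng(0).normal(size=(86, 200))

    # Reduce to 2 dimensions for plotting.
    coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(atom_vectors)
    print(coords.shape)   # (86, 2)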

Visualizing the representations helps us to understand something about the structure of the learned space; however, the best way to assess their quality is to use them in a task and examine the resulting performance. For benchmarking and comparison purposes, there exists a collection of datasets and tasks for Computational Materials Science known as the Matbench test suite. We took a number of tasks from this suite, and compared the performance of an ElemNet architecture using various kinds of atom vector representations. The results are depicted in Figure 5.

Figure 5: A comparison of results on benchmark tasks. TBG refers to the Theoretical Band Gap task (MAE in eV), BM to the Bulk Modulus task (MAE in log(GPa)), SM to the Shear Modulus task (MAE in log(GPa)), RI to the Refractive Index task (MAE in n), and TM to the Theoretical Metallicity task (ROC-AUC). These tasks make use of structure information. EBG refers to the Experimental Band Gap task (MAE in eV), BMGF to the Bulk Metallic Glass Formation task (ROC-AUC), and EM to the Experimental Metallicity task (ROC-AUC). These tasks make use of composition only. The results outlined in bold represent the best score for that task.
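
For the composition-only tasks, the atom vectors must first be combined into a single fixed-length input for the network. One plausible scheme, shown here purely for illustration (the paper describes the pooling actually used), is to average the atom vectors weighted by their stoichiometric fractions.

    import numpy as np

    d = 200
    # Placeholder atom vectors; in practice these would be the trained SkipAtom vectors.
    atom_vectors = {s: np.random.default_rng(i).normal(size=d)
                    for i, s in enumerate(["Ca", "Ti", "O"])}

    def pooled_composition_vector(composition):
        # composition: element -> stoichiometric amount, e.g. {"Ca": 1, "Ti": 1, "O": 3}.
        total = sum(composition.values())
        return sum((amount / total) * atom_vectors[el]
                   for el, amount in composition.items())

    x = pooled_composition_vector({"Ca": 1, "Ti": 1, "O": 3})
    print(x.shape)   # (200,) - a fixed-length input, regardless of formula size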

From the results above, we conclude that SkipAtom is about as effective as Mat2Vec, and better on some tasks. It also clearly performs better than Atom2Vec (not shown here). Its real advantages, however, are its conceptual simplicity and its accessibility to researchers. While Mat2Vec requires the curation of millions of scientific abstracts and a number of hand-crafted processing rules, SkipAtom requires access to a dataset of crystal structure information, which can be of any size. These datasets are growing, and becoming more user-friendly, making them accessible to anyone. SkipAtom vectors can also be extended by taking into consideration other aspects of chemistry, such as the oxidation states of atoms in a material (imagine learning representations for atoms in their various oxidation states). We envision researchers developing pre-trained SkipAtom vectors for their own custom datasets. While the results aren't shown here, it appears that using pre-trained distributed representations of atoms helps most on tasks with smaller datasets. As new materials are sought to address pressing economic and environmental demands, smaller exploratory datasets of materials properties will likely be common. We expect SkipAtom vectors to play a role in devising effective models based on these smaller, targeted datasets.

We've made the source code freely available, distributed under the MIT license. There are trained SkipAtom atom vectors located in this file, for easy incorporation into any codebase. We hope that this research will be of use to the Computational Materials Science community, and aid in the quest to discover new materials that address social, economic and environmental needs.


Notes