
Algorithmic framework for X-ray nanocrystallographic reconstruction in the presence of the indexing ambiguity

Jeffrey J. Donatelli^{a,b} and James A. Sethian^{a,b,1}

^{a}Department of Mathematics and ^{b}Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720

Contributed by James A. Sethian, November 21, 2013 (sent for review September 23, 2013)

APPLIED MATHEMATICS

X-ray nanocrystallography allows the structure of a macromolecule to be determined from a large ensemble of nanocrystals. However, several parameters, including crystal sizes, orientations, and incident photon flux densities, are initially unknown and images are highly corrupted with noise. Autoindexing techniques, commonly used in conventional crystallography, can determine orientations using Bragg peak patterns, but only up to crystal lattice symmetry. This limitation results in an ambiguity in the orientations, known as the indexing ambiguity, when the diffraction pattern displays less symmetry than the lattice and leads to data that appear twinned if left unresolved. Furthermore, missing phase information must be recovered to determine the imaged object's structure. We present an algorithmic framework to determine crystal size, incident photon flux density, and orientation in the presence of the indexing ambiguity. We show that phase information can be computed from nanocrystallographic diffraction using an iterative phasing algorithm, without extra experimental requirements, atomicity assumptions, or knowledge of similar structures required by current phasing methods. The feasibility of this approach is tested on simulated data with parameters and noise levels common in current experiments.

Significance

X-ray nanocrystallography is a powerful imaging technique which is able to determine the atomic structure of a macromolecule from a large ensemble of nanocrystals. Determining structure from this ensemble is challenging because the images are noisy, and individual crystal sizes, orientations, and incident photon flux densities are unknown. Additionally, lattice symmetries may lead to orientation ambiguities. Here, we show how to determine crystal size, incident photon flux density, and crystal orientation from noisy data. We also demonstrate that these data can be used to perform reconstruction without extra experimental requirements, atomicity assumptions, or knowledge of similar structures.

Author contributions: J.J.D. and J.A.S. designed research, performed research, analyzed data, and wrote the paper.
The authors declare no conflict of interest.
^{1}To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1321790111/-/DCSupplemental.

Although conventional X-ray crystallography has been used extensively to determine atomic structure, it is limited to objects that can be formed into large crystal samples (>10 μm). An appealing alternative, made possible by recent advances in light source technology, is X-ray nanocrystallography, which is able to image structures resistant to large crystallization, such as membrane proteins, by substituting a large ensemble of easier-to-build nanocrystals, typically <1 μm, often delivered to the beam via a liquid jet (1–6) (Fig. 1). However, the beam power required to retrieve sufficient information destroys the crystal, hence ultrafast pulses (≤70 fs) are required to collect data before damage effects alter the signal. Using nanocrystals introduces several challenges. Due to the small crystal size, Bragg peaks are smeared out, and there is noticeable signal between peaks. Typically, only partial peak reflections are measured, resulting in reduced intensities. Variations in crystal size and incident photon flux density, unknown orientations, shot noise, and background signal from the liquid and detector add additional uncertainty to the data.

If crystal orientations were known, noise and variation in the peak measurements could be averaged out, and the data could be inverted to retrieve the object's electron density. Although autoindexing techniques can be used to determine crystal orientation up to lattice symmetry from the location of a sufficient number of Bragg peaks, they typically face difficulties in the presence of partial and non-Bragg reflections common in nanocrystal diffraction images. Furthermore, these techniques only narrow down orientation to a list of possibilities when the diffraction pattern has less symmetry than the lattice, leading to an ambiguity in the image orientation, known as the indexing ambiguity. Current methods of processing the diffraction data are largely based on averaging out the data variance over several images (1–8). However, if the data are processed without resolving the indexing ambiguity then they will appear to be perfectly twinned, i.e., averaged over multiple orientations. Although there has been some success in determining structure from perfectly twinned data, reconstruction is often infeasible without a good initial atomic model of the structure.

We present an algorithmic framework for X-ray nanocrystallographic reconstruction which is based on directly reducing data variance and resolving the indexing ambiguity. First, we design an autoindexing technique that uses both Bragg and non-Bragg data to compute precise orientations, up to lattice symmetry. Crystal sizes are then determined by performing a Fourier analysis around Bragg peak neighborhoods from a finely sampled low-angle image, such as from a rear detector (Fig. 1). Next, we model structure factor magnitudes for each reciprocal lattice point with a multimodal Gaussian distribution, using a multistage expectation maximization algorithm which simultaneously scales and models the data. These multimodal models are used to build a weighted graph which models the structure factor magnitude concurrency. We formulate the solution to the indexing ambiguity problem as finding the maximum edge weight clique in this graph, which can be solved efficiently via a greedy approach. Finally, we demonstrate the feasibility of solving the phase problem using iterative phase retrieval. Whereas several of the presented methods rely on the use of nanocrystals, we note that the scaling–multimodal analysis and indexing ambiguity resolution steps can also be applied to larger crystals (1–10 μm).

Formulation

In X-ray crystallography, diffraction patterns are collected from a periodic crystal made up of the target object. The 3D crystal lattice structure may be described by its Bravais lattice characteristic vectors $(h_1, h_2, h_3)$, $h_j \in \mathbb{R}^3$, and its associated infinite lattice $L = \{\sum_{j=1}^{3} n_j h_j : n_1, n_2, n_3 \in \mathbb{Z}\}$. We define the lattice rotational symmetry group to be the set of rotation operators which preserve the lattice structure.

www.pnas.org/cgi/doi/10.1073/pnas.1321790111 PNAS | January 14, 2014 | vol. 111 | no. 2 | 593–598

Fig. 2. Example of a low-angle image. The profile of the shape transform around the Bragg peaks (circled) can be used to determine the crystal sizes.

Each lattice has both a Dirac comb $\Delta_L(x) = \sum_{y \in L} \delta(x - y)$, which is a sum of Dirac delta functions supported on the lattice points, and a dual $\hat{L}$, known as the reciprocal lattice, given by the support of the Dirac comb's Fourier transform* $\hat{\Delta}_L$. The Bravais vectors of the reciprocal lattice are given by $[\hat{h}_1, \hat{h}_2, \hat{h}_3] = [h_1, h_2, h_3]^{-T}$.†

In practice, a crystal lattice consists of only a finite part of its associated infinite lattice. In this case, the associated Dirac comb's Fourier transform, known as the shape transform $S : \mathbb{R}^3 \to \mathbb{C}$, is no longer a sum of delta functions, but is instead a smeared-out version of $\hat{\Delta}_L$. To simplify the discussion, we will be assuming that the finite crystal lattice, denoted $L_C$, can be described as a box with $N_j$ unit cells in the direction of $h_j$. Here, the squared norm of its associated shape transform is given by

$$|S(q)|^2 = \prod_{j=1}^{3} \frac{\sin^2(\pi N_j h_j \cdot q)}{\sin^2(\pi h_j \cdot q)}. \tag{1}$$

We note that the methods presented in the following sections can be extended to more general crystal shapes, but possibly at the cost of losing a simple closed-form expression for the shape transform.

The electron density of a crystal with periodic units on the lattice points of $L_C$ can be expressed in terms of the electron density $\rho$ of one of its periodic unit cells. The space group of a crystal introduces another form of symmetry on its diffraction pattern, described by the Laue rotational symmetry group.

The image of diffraction intensities $I : \mathbb{R}^2 \to \mathbb{R}$ due to elastic scattering from a crystal with unit cell electron density $\rho : \mathbb{R}^3 \to \mathbb{R}$ and orientation $R \in SO(3)$, using a fully coherent X-ray beam with wavelength $\lambda$ and incident photon flux density $J$, at a detector with pixel size $dx$ at distance $D$ from the interaction point and normal to the beam, is described by

$$I(x, y) = J\, r_e^2\, P(x, y)\, \Delta\Omega(x, y)\, |\hat{\rho}(Rq(x, y))|^2\, |S(Rq(x, y))|^2, \tag{2}$$

where $r_e^2$ is the electron cross-section, $P : \mathbb{R}^2 \to \mathbb{R}$ is a polarization factor, $L = \sqrt{x^2 + y^2 + D^2}$, $\Delta\Omega(x, y) = D\, dx^2 L^{-3/2}$ is the solid angle subtended by a pixel, and $q(x, y) = \lambda^{-1}(xL^{-1}, yL^{-1}, DL^{-1} - 1)$. For elastic scattering, the values of $\hat{\rho}$ are often called the structure factors.

For large crystals, the shape transform approaches the Dirac comb of the reciprocal lattice, up to a constant factor, and the diffraction images consist of a series of bright spots, known as Bragg peaks, concentrated at reciprocal lattice points. However, for nanocrystallography, small crystal sizes spread the shape transform, and measurements close to, but not directly at, a Bragg peak, known as partial reflections, have a decreased measured intensity. Additionally, the signal at pixels corresponding to lines in between adjacent reciprocal lattice points is often noticeable in nanocrystal diffraction images.

The goal of X-ray nanocrystallography is to determine the unit cell electron density $\rho$ from a large ensemble of diffraction images, which vary in orientation, incident photon flux density, and crystal size and are corrupted with noise. Here we will focus on the case in which the diffraction pattern has less symmetry than the lattice, which leads to the indexing ambiguity.

Fig. 1. Liquid jet (blue) delivers nanocrystal samples to the X-ray beam (red). Wide- and small-angle diffraction data are collected using front and rear detectors.

Autoindexing

Commonly used autoindexing methods, e.g., refs. 9, 10, can accurately determine a lattice's unit cell, given by the Bravais vectors in some reference configuration, using a large ensemble of images. However, the orientation information that these methods compute might not be accurate enough to be used in the evaluation of the shape transform, especially in the presence of non-Bragg spots and low Bragg peak counts. Starting from this unit cell information, we devise an algorithm which uses both Bragg and non-Bragg data to generate precise orientations, up to lattice symmetry.

Bravais Characteristic Vector Calculation. In nanocrystallography images, the strongest signal will occur at reciprocal lattice points $\xi$, which satisfy $\cos(2\pi h \cdot \xi) = 1$ if $h$ is a Bravais vector. However, signal along lattice edges, i.e., lines connecting adjacent reciprocal lattice points, is also commonly noticeable. If $\xi$ is a reciprocal lattice edge point then $\cos(2\pi h_j \cdot \xi) = 1$ for two of the Bravais vectors and can be anything in $[-1, 1)$ for the remaining vector. The set $B$ of both types of points appears as bright spots in the images, and can be located through thresholding. For a given image, a direction $h$ is then likely to be a Bravais vector for the rotated lattice if $\cos(2\pi h \cdot \xi)$ is close to 1 for most $\xi \in B$. In particular, when searching for the $j$th Bravais vector $h_j = L_j d_j$, where $|d_j| = 1$, we attempt to filter out the non-Bragg spots which disagree with $d_j$ by looking for a direction $d$ which solves

$$\max_{|d| = 1} \sum_{\xi \in B_p(d)} \cos(2\pi L_j d \cdot \xi), \tag{3}$$

where $B_p(d)$ is the set of $p|B|$‡ points in $B$ with the largest value of $\cos(2\pi L_j d \cdot \xi)$. If multiple Bravais vectors have the same length, we search for multiple solutions approximately separated by the known relative angles between these vectors. We generate search directions with an approximately uniform sampling of the half unit sphere (11). After locating the initial directions, we repeat the process on a finer set of sample directions in a restricted angular range around the previously computed directions. If a direction is not found, we use the known relative angles between the other two directions to narrow the search for, or directly deduce, the missing direction.

Lattice Orientation Calculation. Once Bravais directions $D = (d_1, d_2, d_3)$ for the rotated lattice associated with an image are located, with the sign convention that yields the correct angles between vectors and $\det(D) > 0$, we use the known reference configuration $B = (b_1, b_2, b_3)$ to compute an approximation $\tilde{R}_a = BD^{-1}$ to the orientation matrix. If $\det(\tilde{R}_a)$ is far from unity then the autoindexing procedure failed to compute an accurate orientation and we thus reject $\tilde{R}_a$. We then find the closest rotation matrix by computing the singular value decomposition $\tilde{R}_a = U\Sigma V^T$ and setting $\tilde{R} = UV^T$, which is used as the approximation to the image orientation up to lattice symmetry, i.e., $\tilde{R} = QR$ where $R$ is the full orientation and $Q$ is an element of the lattice rotational symmetry group.

Crystal Size Determination

To compute accurate structure factor magnitudes from measured intensities, the squared magnitude of the shape transform must be divided out of the intensity measurements in Eq. 2. Near a Bragg peak, the shape transform grows quadratically with the crystal size, which often varies up to an order of magnitude in each dimension over the nanocrystal ensemble. We determine these crystal sizes by analyzing intensities around Bragg peaks in low-angle images (Fig. 2) sampled at a rate of at least twice the Nyquist rate for the crystal, i.e., the pixel spacing is at most $\frac{1}{2W}$, where $W$ is the width of the crystal. A Fourier analysis of these intensities reveals the crystal sizes.

Fourier Analysis of the Shape Transform. For an image $I$ with orientation $R$,§ consider its restriction $I_r$ to a small neighborhood centered at a low-angle Bragg peak with detector coordinates $x_o \in \mathbb{R}^2$ corresponding to the reciprocal lattice point $\xi$, where $|\xi|$ is small. In this neighborhood, by taking a linear approximation of $q$ and using the translation invariance of $|S|^2$ on the reciprocal lattice, the intensities are approximated, up to a constant $C$, by

$$I_r(x) \approx C\, |S(KR\tilde{x})|^2, \tag{4}$$

where $\tilde{x} = (x, 0)$ and $K = (\lambda D)^{-1}$. If we denote $G(x) = C\,|S(Kx)|^2$, then Eq. 4 becomes the restriction of $G$ to the rotated plane, $I_r \approx G|_{R(\mathbb{R}^2)}$, whose Fourier transform,¶ from the Fourier projection-slice theorem and the Wiener–Khinchin theorem, is approximately the X-ray projected autocorrelation‖ of the crystal:

[5]

Note that the support of this projected autocorrelation is given by the Minkowski sum of the rotated projected crystal.

If we approximate** the crystal lattice by a box, where $N = (N_1, N_2, N_3)$ are the crystal sizes for each Bravais direction, then the convex hull $H$ of the support of the projected autocorrelation can be expressed as

$$H = \mathrm{Conv}\left\{ (x, y) : (x, y, z) = R^T \sum_{j=1}^{3} \pm (N_j - 1)\, h_j \right\}, \tag{6}$$

where $\mathrm{Conv}(X)$ is the convex hull of the set $X$.

By computing $\hat{I}_r$, we can deduce the crystal sizes by analyzing $H$ as long as none of the rotated Bravais vectors $R^T h_j$ is orthogonal to the detector. In general, the boundary of $H$ consists of a series of line segments, with three normals $(n_1, n_2, n_3)$ along with their three antiparallel directions, which can be found by computing the convex hull of the rotated projected autocorrelated unit cell, i.e., the right-hand side of Eq. 6 with $N_j = 2$ for each $j$. In the direction $n_i$, the extent of the convex hull $b_i = \max\{|n_i \cdot x| : x \in H\}$ must be equal to that of the unprojected crystal, i.e., $b_i = \sum_{j=1}^{3} n_i^T R^T h_j (N_j - 1)$. Therefore, by defining the matrix $A_{ij} = n_i^T R^T h_j$, the crystal sizes can be retrieved by solving $A(N - \mathbf{1}) = b$, where $\mathbf{1} = (1, 1, 1)$. If the image does not directly pass through a reciprocal lattice point, this analysis is still valid; however, the Fourier-transformed images may contain oscillations, which grow with distance to the lattice point.

Fig. 3. The shape transform (Left) around a low-angle peak is Fourier transformed to reveal the projected autocorrelated crystal (Middle), which is then segmented (Right).

Image Segmentation. To retrieve the crystal sizes via the methods of the previous section, we require an estimate for the support of the projected autocorrelated crystal from the Fourier transformed images, i.e., we need to segment the support from the noisy background (Fig. 3). We begin the segmentation by initializing a set $H$ of pixels whose $\hat{I}_r$ value is greater than a fixed percentage of the largest value.†† Then, we traverse a sorted list of the remaining values, adding pixels to $H$ until one reaches a point more than some threshold, typically a few pixels, away from all of the pixels currently in $H$, suggesting that one has reached the end of the support and has begun to see the oscillations from the background.‡‡

Structure Factor Magnitude Modeling

Once the lattice orientations $\tilde{R}_m$ and crystal sizes $N_m$ are known, we can use Eq. 2 to compute an approximation $\tilde{F}_m$ to the structure factor square magnitudes from the image $I_m$, but only up to a constant factor because they are scaled by the unknown incident photon flux density $J_m$, which varies between images. Furthermore, due to the indexing ambiguity, one only knows the corresponding reciprocal space coordinates associated with the values of $\tilde{F}_m$ up to the crystal lattice symmetry, i.e., the possible structure factor magnitudes for each point take the form of a multimodal distribution. Moreover, these two problems are strongly coupled together: We cannot perform the scaling correction unless we know what modes to scale to and the modes are indistinguishable in the unscaled data set. Hence, we must simultaneously determine both the scaling and the multimodal parameters.

Processing the Data. We approximate the structure factor squared magnitudes at the $i$th reciprocal lattice point $\xi_i$§§ from the $m$th image by computing the average given in Eq. 7 over the neighboring ball $B(\xi_i, r)$ with radius $r$.

* We will use $\hat{f}$ to denote either the continuous or discrete Fourier transform of $f$, depending on context.
† $A^{-T}$ refers to the transpose of the inverse of $A$.
‡ $|A|$ designates the number of elements in $A$.
§ Because the shape transform is symmetric with respect to rotation by elements of the lattice rotational symmetry group, the use of orientations from autoindexing is sufficient here.
¶ $\hat{I}_r$ may be slightly smeared out due to grid alignment effects.
‖ $P_{R(3)}$ is the X-ray projection operator through the detector plane normal and $A$ is the autocorrelation operator.
** We note that alternative models of the finite lattice may be used here instead.
†† We exclude the origin which picks up all of the noise within the image.
‡‡ $H$ gives unitless coordinates which should be scaled by $\lambda D / (N_p\, dx)$ for a restricted image with $N_p \times N_p$ pixels of size $dx \times dx$. Crystal sizes are rejected if they are outside of a set range.
§§ To keep our notation compact, we are representing the reciprocal lattice points with a single index in place of the traditional Miller indices.
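As a small numerical illustration of Eq. 1, the sketch below evaluates the squared shape transform of a box-shaped crystal and handles the removable 0/0 singularity at points where $h_j \cdot q$ is an integer. The function name `shape_transform_sq`, the row-wise storage of the Bravais vectors, and the tolerance `tol` are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def shape_transform_sq(q, H, N, tol=1e-9):
    """Evaluate |S(q)|^2 from Eq. 1 for a box-shaped crystal.

    q : (3,) reciprocal-space point
    H : (3, 3) array whose rows are the Bravais vectors h_j (assumed layout)
    N : (3,) number of unit cells along each h_j
    """
    q = np.asarray(q, dtype=float)
    H = np.asarray(H, dtype=float)
    N = np.asarray(N, dtype=float)

    t = H @ q                                  # t_j = h_j . q
    num = np.sin(np.pi * N * t) ** 2
    den = np.sin(np.pi * t) ** 2

    # Where h_j . q is (numerically) an integer the ratio is 0/0;
    # its limit is N_j^2, which is the value used there.
    near = den < tol
    ratio = np.where(near, N ** 2, num / np.where(near, 1.0, den))
    return float(np.prod(ratio))
```

With a hypothetical cubic cell `H = 10 * np.eye(3)` and `N = (5, 5, 6)`, a reciprocal lattice point gives $5^2 \cdot 5^2 \cdot 6^2 = 22{,}500$, whereas the midpoint of a lattice edge keeps the factor $N_j^2$ only in the two directions for which $h_j \cdot q$ is an integer. This matches the quadratic growth with crystal size noted above.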

$$v_{i,m} = \frac{\sum_{\tilde{R}_m q(x) \in B(\xi_i, r)} I_m(x)}{\sum_{\tilde{R}_m q(x) \in B(\xi_i, r)} r_e^2\, P(x)\, \Delta\Omega(x)\, |S_m(\tilde{R}_m q(x))|^2}. \tag{7}$$

We discard any intensities below some fixed threshold from the above sum to prevent large errors from the division. Note that because we only know the orientation up to lattice symmetry, we also set $v_{t,m} = v_{i,m}$¶¶ for every $t$ such that $\xi_t = R\xi_i$ for some $R$ in the lattice rotational symmetry group. To simplify notation, we will assume that unmeasured values and corresponding indices are removed from the remaining sets and summations. To reduce the dependence of the standard deviation on the size of the intensities, we use two applications of variance stabilization (12), and instead work with $w_{i,m} = v_{i,m}^{1/4}$.

Multimodal Analysis. Assume for now that the structure factor magnitudes are already properly scaled. Due to the indexing ambiguity, at this point in the procedure we only know the orientations up to the lattice symmetry. Thus, for each reciprocal lattice point $\xi_i$, values of $w_{i,m}$ could correspond to $K$ different structure factor magnitudes. A histogram of $\{w_{i,m}\}$ for $\xi_i$ reveals $K$ different peaks, smeared out as noise and parameter uncertainty are increased. Our goal is to detect these peaks and model the associated multimodal distribution.

To retrieve the set of possible structure factor magnitudes, we will model the computed values $\{w_{i,m}\}$ from each reciprocal lattice point $\xi_i$ with a multimodal Gaussian distribution. Specifically, the associated probability density functions can be expressed in terms of multiple Gaussian distributions with means $\mu_i = (\mu_{i,1}, \ldots, \mu_{i,K})$, in monotonically increasing order, and standard deviations $\sigma_i = (\sigma_{i,1}, \ldots, \sigma_{i,K})$.

Given $\{w_{i,m}\}_{m=1}^{M}$, we determine its multimodal model through an expectation maximization algorithm. In particular, given an initial guess for model parameters $\mu_{i,j}^{(0)}$ and $\sigma_{i,j}^{(0)}$, we perform several iterations of the following:

[8]

Here, $T_{i,j,m}$ represents the probability that $w_{i,m}$ is drawn from the $j$th Gaussian mode. To initialize, we separate the data into $K$ equal bins, define $\mu_{i,j}^{(0)}$ as the location in the $j$th bin with the greatest density, and set $\sigma_{i,j}^{(0)}$ to be less than a typical bin size. Also, we perform outlier rejection by removing any $w_{i,m}$ in which $p(w_{i,m}; \mu_i, \sigma_i)$ is below a given threshold.

Scaling Correction. In practice, variance in incident photon flux density, noise, and errors in autoindexing and crystal size determination smear out the peaks in the histogram, making them difficult to locate via expectation maximization (Fig. 4): Data must be scaled to properly model the structure factor magnitudes. To do so, we seek scaling factors which minimize the variance in the histograms and alternate this procedure with the expectation maximization step in Eq. 8.

Fig. 4. (Left) Histogram of the possible unscaled variance stabilized structure factor magnitudes for a reciprocal lattice point corresponding to a fourfold indexing ambiguity. (Right) Histogram of the scaled data with multimodal Gaussian model (red).

We seek the scaling factor $c_m$ for the $m$th image by solving

$$\min_{c_m} \sum_{i,j} \left(c_m w_{i,m} - \mu_{i,j}\right)^2 T_{i,j,m}, \tag{9}$$

whose solution is given by

$$c_m = \frac{\sum_{i,j} w_{i,m}\, \mu_{i,j}\, T_{i,j,m}}{\sum_{i,j} w_{i,m}^2\, T_{i,j,m}}. \tag{10}$$

Once the $c_m$ are computed, they are normalized so that $\sum_m c_m = \text{constant}$ and then used to scale the images by replacing $w_{i,m}$ with $c_m w_{i,m}$ for every $i$. Scaling is alternated with expectation maximization until convergence.

Resolving the Indexing Ambiguity

After the structure factor magnitude modeling, we know up to $K$ possible structure factor magnitudes at each reciprocal lattice point. Resolving the indexing ambiguity amounts to correctly assigning one of these $K$ values to each point. There are $K$ equally valid solutions, related to each other by globally applying a rotation from the lattice rotational symmetry group. We first use the set of multimodal model parameters to construct a graph theoretic model of the structure factor magnitude concurrency, i.e., the probability that two given structure factor magnitudes occur within the same image. Then, we resolve the indexing ambiguity by finding the maximum edge weight clique of this graph with a greedy approach.

Graph Theoretic Modeling of Structure Factor Magnitude Concurrency. Given the scaled variance stabilized structure factor magnitudes $\{w_{i,m}\}$, means $\{\mu_{i,j}\}$, and standard deviations $\{\sigma_{i,j}\}$ for the $m$th image and $j$th mode at the $i$th reciprocal lattice point, we construct a graph‖‖ $G = (V, E)$ with vertices $V = \{(i, j)\}$ and edges $E$. The edge set is defined so that only one $j$ can be selected at each reciprocal lattice point $\xi_i$ (in particular, $i_1 \neq i_2$ for every edge $((i_1, j_1), (i_2, j_2)) \in E$) and each $j$ can only be selected once among its twin-related coordinates $\xi_t = R\xi_i$. Consequently, choosing a consistent set of structure factor magnitudes, where each possible value appears exactly once, is equivalent to finding a maximal clique*** in this graph.

We define a directed weight $W : E \to \mathbb{R}$ on $G$ as the ratio of the sum of concurrence probabilities over the sum of the occurrence probabilities, where we sum over the sets $I_{i_1, i_2}$, consisting of all of the images which potentially measure intensities simultaneously at $\xi_{i_1}$ and $\xi_{i_2}$:

$$W((i_1, j_1), (i_2, j_2)) = \frac{\sum_{m \in I_{i_1, i_2}} T_{i_1, j_1, m}\, T_{i_2, j_2, m}}{\sum_{m \in I_{i_1, i_2}} T_{i_1, j_1, m}}. \tag{11}$$

$W$ gives a measure of how likely the values of $\mu_{i_1, j_1}$ and $\mu_{i_2, j_2}$ are to occur together in one of the solutions. With no noise or error, if $\mu_{i_2, 1}, \ldots, \mu_{i_2, K}$ are distinct then $W$ is exactly 1 when $\mu_{i_1, j_1}$ and $\mu_{i_2, j_2}$ are simultaneously part of one of the solutions and is 0 otherwise. However, if there are $B$ values of $k$ such that $\mu_{i_2, k} = \mu_{i_2, j_2}$ then $W$ is $1/B$. With noise present, the asymmetry of the weight function favors structure factor magnitudes with strong signal and whose histograms are highly multimodal, providing the most orientation information.

We now formulate the solution to the indexing ambiguity problem. We seek the maximal clique in $G$ with maximum edge weight:

$$\max_{Cl \in C_m(G)} \sum_{v_1, v_2 \in Cl} W(v_1, v_2), \tag{12}$$

where $C_m(G)$ is the set of all maximal cliques in $G$. If a sufficient number of images are used, then, in the absence of noise and error, the maximizer of Eq. 12 retrieves one of the exact solutions to the indexing ambiguity problem, i.e., the maximum edge weight clique $Cl$ assigns the correct structure factor magnitudes, up to a global rotation. For imperfect data, this maximizer will choose a solution which is most consistent with the observed structure factor magnitude concurrency.

Greedy Approach to the Maximum Edge Weight Clique Problem. Even though the maximum edge weight clique problem is, in general, nondeterministic polynomial-time hard (13), when constructed from the indexing ambiguity problem via Eq. 12, we can solve it in quadratic time with a greedy approach. We initialize the clique $Cl$ with some starting vertex‡‡‡ $v_s \in V$ and progressively add vertices that maximize the weight sum of the current clique. In practice, we remove any points whose associated multimodal distributions contain less than the maximum number of modes. For convenience, we use a single index for the vertices $V = \{v_i\}_{i=1}^{N}$ and we set $W(v_1, v_2) = -\infty$ if $(v_1, v_2) \notin E$.§§§

Algorithm 1.

The elements of the set $Cl$ returned by Algorithm 1 are pairs of the form $(i, j)$, corresponding to choosing the $j$th modeled variance stabilized structure factor magnitude $\mu_{i,j}$ at the reciprocal lattice point $\xi_i$. This induces the map $\tilde{w} : \hat{L} \to \mathbb{R}$ where $\tilde{w}(\xi_i) = \mu_{i,j}$. With a sufficient number of images and no noise or error, Algorithm 1 retrieves an exact solution to the indexing ambiguity problem, always preferring a vertex $v_n$ with nonzero weighted edges connecting to all elements of the current clique, i.e., $v_n$ is only chosen if it corresponds to the same solution as the rest of the clique. This approach remains robust with imperfect data, as it considers several pairs of measured intensities over all images to choose the structure factor magnitudes at any single point.

Orientation Calculation. Although we can achieve a robust approximation of the complete structure factor magnitudes, accuracy can be improved by using this information to directly orient each image, and then averaging structure factor magnitudes for each corresponding reciprocal lattice point. For every image $I_m$, we compute its full orientation $R_m = Q_m \tilde{R}_m$, where $Q_m$ solves

[13]

Here, $A$ is the set of indices $i$ where $w_{i,m}$ is measured by the image at the orientation $\tilde{R}_m$, and $Q_m$ ranges over a set of class representatives of the quotient group, i.e., the identity and twinning operators. If there are at least two orientations close to the minimum value, then we reject the computed orientation. Once the orientations are known, we obtain the structure factor magnitudes by averaging the corresponding magnitudes computed from each scaled oriented image.

Phase Recovery

Once complete structure factor magnitudes are computed, one can determine missing phases and, thus, determine the electron density of a periodic unit with any applicable phasing method, e.g., refs. 14–17. Although these methods have been used extensively to determine structure in X-ray crystallography, each has limitations or introduces extra difficulties into the experimental setup. An appealing alternative is to directly deduce phases from Fourier magnitude data using an iterative phase retrieval technique, which only requires that Fourier magnitudes be sampled at a sufficient rate. Although infeasible in conventional crystallography, the signal from nanocrystals contains significant information between Bragg peaks, which may allow sampling at the required rates. In general, such iterative phasing is possible if one samples the Fourier magnitudes at points of the form $\xi = \sum_{i=1}^{3} \frac{n_i}{2} \hat{h}_i$, where $n_i \in \mathbb{Z}$ (18, 19). In ref. 20 the feasibility of such an approach was demonstrated assuming that adequate signal was collected at each of the required points. However, the square magnitude of the shape transform at $\xi$ grows quadratically in the crystal size for each dimension in which $\xi \cdot h_i \in \mathbb{Z}$. For nanocrystallography, this typically only results in noticeable signal at reciprocal lattice vertices and edges.¶¶¶,‖‖‖ While theory is lacking, this sampling density is sufficient in certain 3D cases (21). Alternatively, a recent approach uses Fourier magnitude information along with its gradient, assuming it can be accurately calculated, only at reciprocal lattice points (22). Here, we test the feasibility of iterative phasing using the Fourier magnitudes computed from our framework at reciprocal lattice vertices and edges.

Table 1. Autoindexing performance

                        Error*
Jo        Accepted    <0.001    0.001–0.004    0.004–0.016    >0.016
21,800    32,678      7,912     21,485         3,196          85
2,180     28,251      6,583     17,993         3,496          179
218       22,701      4,733     13,588         3,947          433
21.8      12,859      1,786     6,224          4,180          669
2.18      3,926       39        798            2,290          799

*Error in the Frobenius norm modulo the lattice rotational symmetry group.

Fig. 5. Electron density contours of the exact solution (Left) and the computed solution with $J_o = 218$ (Center) and overlay with the atomic model (Right).

¶¶ $v_{i,m}$ is multivalued for an image which measures intensities at both $\xi$ and $R\xi$.
‖‖ For known symmetry in the structure factor magnitudes (e.g., Friedel or Laue symmetry), we simplify $G$'s structure by merging corresponding symmetric vertices.
*** $Cl \subset V$ is a maximal clique if for all $v_1, v_2 \in Cl$, $(v_1, v_2) \in E$ with no proper superset $\tilde{Cl} \supsetneq Cl$ satisfying the same property.
††† $A \setminus B = \{x \in A : x \notin B\}$ is the set difference of $A$ and $B$.
‡‡‡ $v_s$ is typically chosen from a point with a highly multimodal distribution and strong signal.
§§§ In Algorithm 1, $\leftarrow$ denotes the assignment operator.
¶¶¶ The lattice edge structure factor magnitudes may be computed via Eq. 7.
‖‖‖ One may also need to resolve the indexing ambiguity on the used non-Bragg data if it has less symmetry than the Laue group.
**** This is obtained by linearly mapping the reciprocal lattice onto a uniform grid.
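The steps of Algorithm 1 are not listed in this text, so the following is only a minimal sketch of one way the greedy search described under "Greedy Approach to the Maximum Edge Weight Clique Problem" could be realized. It assumes a dense weight matrix with $-\infty$ marking non-edges, and it scores a candidate vertex by the sum of the directed weights in both directions between it and the current clique; that symmetric score is an assumption of this sketch, not a detail confirmed by the paper. The names `greedy_clique` and `toy_weights` are hypothetical, and `toy_weights` builds an artificial graph in which equal mode labels at different lattice points are taken to belong to the same solution.

```python
import numpy as np

def greedy_clique(W, start):
    """Greedily grow a clique from `start`.

    W[a, b] is the directed weight of edge (a, b), with -inf where
    (a, b) is not an edge (including the diagonal).
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    sym = W + W.T                     # weight of a pair, both directions (assumed score)
    np.fill_diagonal(sym, -np.inf)

    in_clique = np.zeros(n, dtype=bool)
    in_clique[start] = True
    clique = [start]
    gain = sym[start].copy()          # gain[v] = total weight between v and the clique

    while True:
        gain[in_clique] = -np.inf     # never re-select a clique member
        v = int(np.argmax(gain))
        if not np.isfinite(gain[v]):
            break                     # no vertex is adjacent to every clique member
        in_clique[v] = True
        clique.append(v)
        gain += sym[v]                # O(n) update, so O(n^2) overall
    return clique

def toy_weights(n_points, K, same=0.9, diff=0.1):
    """Artificial weights: vertex a = i*K + j stands for mode j at point i."""
    n = n_points * K
    W = np.full((n, n), -np.inf)
    for a in range(n):
        for b in range(n):
            ia, ja = divmod(a, K)
            ib, jb = divmod(b, K)
            if ia != ib:              # no edge between modes of the same point
                W[a, b] = same if ja == jb else diff
    return W
```

Because a vertex joined to any clique member by a non-edge accumulates a score of $-\infty$, only vertices adjacent to every current member can be added, and the loop stops once the clique is maximal. On the toy graph with three lattice points and two modes, starting from vertex 0 selects mode 0 at every point.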

Donatelli and Sethian PNAS | January 14, 2014 | vol. 111 | no. 2 | 597

Table 2. Crystal size determination performance

                          Error*
J_o      Accepted   <0.1     0.1–0.2   0.2–0.3   0.3–0.4   >0.4
21,800   30,506     19,652   9,667     903       98        186
2,180    26,260     16,080   8,996     992       71        121
218      20,195     10,941   7,681     1,166     126       281
21.8     10,639     4,071    4,626     1,388     184       370
2.18     1,851      458      458       387       112       436

*Relative error in the geometric average of the crystal sizes.

Table 3. Orientation determination performance

J_o           21,800   2,180    218      21.8    2.18
Accepted      22,645   17,844   11,663   3,775   17
Correct, %*   99.9     99.9     99.8     99.3    58.8

*Given the possible solution sets $\{R_{1,i}\}$ and $\{R_{2,i}\}$, the number of correct orientations is $\max_{j=1,2} |S_j|$, where $S_j$ consists of all computed orientations $R_i$ closer to the $R_{j,i}$ solution in the Frobenius norm modulo .

Iterative Phase Retrieval. Given a domain**** $\mathbb{Z}_\alpha = \mathbb{Z}_{\alpha_1} \times \mathbb{Z}_{\alpha_2} \times \mathbb{Z}_{\alpha_3}$, Fourier magnitude values $a : \mathbb{Z}_\alpha \to \mathbb{C}$, and some support $S \subseteq \mathbb{Z}_\alpha$, iterative phase retrieval algorithms seek a function $\rho \in \ell^2(\mathbb{Z}_\alpha)$ such that $|\hat{\rho}| = a$ and $\rho(x) = 0$ for all $x \notin S$. Given a set $\Omega$ of points where $a$ has a recorded value, such algorithms typically make use of the projection operators $P_S$, where $P_S\rho(x) = \rho(x)$ if $x \in S$ and $P_S\rho(x) = 0$ otherwise, and $P_{M,\Omega}$, where

\[
\widehat{P_{M,\Omega}\rho}(k) =
\begin{cases}
a(k)\,\dfrac{\hat{\rho}(k)}{|\hat{\rho}(k)|}, & \text{if } \hat{\rho}(k) \neq 0 \text{ and } k \in \Omega, \\
a(k), & \text{if } \hat{\rho}(k) = 0 \text{ and } k \in \Omega, \\
\hat{\rho}(k), & \text{if } k \notin \Omega.
\end{cases}
\tag{14}
\]

We alternate between several iterations of the error reducing algorithm (23): $\rho^{(n+1)} = P_S P_{M,\Omega}\,\rho^{(n)}$, and the hybrid input-output algorithm (24): $\rho^{(n+1)} = \left(P_S P_{M,\Omega} + P_{S^c}(I - \beta P_{M,\Omega})\right)\rho^{(n)}$, $\beta \in (0, 1]$, to seek out the solution $\rho$. Furthermore, we couple these iterations with the Shrinkwrap method (25), which, starting with an initial guess such as the unit cell, updates an estimate of the true support $T$ by convolving the current iterate with a Gaussian and then thresholding.

Results

We demonstrate our methodology by determining the structure of PuuE Allantoinase from simulated diffraction data using the atomic coordinates and crystal symmetry recorded in refs. 26, 27 with several different peak incident photon flux densities. Each data set consists of 33,856 diffraction images. We assume knowledge of the Bravais vector lengths and the space group, which, in practice, may be deduced from autoindexing information and reflection conditions. The crystal displays P4 space-group symmetry, thus diffraction data are symmetric with respect to 90° rotation about the z axis and inversion, and has a twinning operator given by 180° rotation about the x axis, or, equivalently, the y axis.

Image orientations are generated by randomly sampling quaternions. Sizes were randomly generated with an average crystal width of $\mu_C = 2{,}948.97$ Å and standard deviation $\sigma_C = 982.99$ Å. For each image, we generate random incident photon flux densities $J$, measured in photons per square Angstrom per pulse, from a peak density $J_o$, via $x \sim U(-1, 1)$, $J = J_o \exp(-8x^2)$. We use experimental parameters†††† similar to ref. 3. Intensity values are computed via Eq. 2 along with shot and background noise, modeled with a Poisson distribution and a normal distribution, with a standard deviation of 1.3 photons per pixel (Figs. S1–S5).

Here we present statistics for our framework (Tables 1–3)‡‡‡‡ along with a reconstruction from the processed simulated data (Fig. 5). For further details, see ref. 28.

†††† $\lambda = 6.9$ Å, $dx = 75$ μm, $D = 68/141$ mm for the front/rear detectors with 1,024 × 1,024 pixels, horizontal polarization, and $J_o$ between 0.01 and 100 times current experimental levels.
‡‡‡‡ Here we use the distance in Frobenius norm modulo .

ACKNOWLEDGMENTS. We thank Stefano Marchesini for many valuable conversations. This research was supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, US Department of Energy (DOE) under Contract DE-AC02-05CH11231 and by the Division of Mathematical Sciences of the National Science Foundation, and used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US DOE under Contract DE-AC02-05CH11231. J.A.S. was also supported by an Einstein Visiting Fellowship of the Einstein Foundation, Berlin. J.J.D. was also supported by a DOE Computational Science Graduate Fellowship.
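As a minimal sketch of the projections in Eq. 14 and of the error reducing and hybrid input-output updates described under Iterative Phase Retrieval, one could write the following. It assumes a one-dimensional periodic domain and a naive discrete Fourier transform in place of the 3D grid used in the framework, and it omits the Shrinkwrap support update. The function names (`dft`, `idft`, `project_support`, `project_magnitude`, `er_step`, `hio_step`) and the default `beta` are illustrative assumptions, not part of the method as published.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence (O(n^2), for small n)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def idft(x_hat):
    """Inverse of dft."""
    n = len(x_hat)
    return [sum(x_hat[k] * cmath.exp(2j * cmath.pi * m * k / n) for k in range(n)) / n
            for m in range(n)]


def project_support(rho, support):
    """P_S: keep rho on the support S, set it to zero elsewhere."""
    return [v if i in support else 0.0 for i, v in enumerate(rho)]


def project_magnitude(rho, a, omega):
    """P_{M,Omega} as in Eq. 14: impose magnitudes a(k) on the recorded set Omega."""
    out = []
    for k, v in enumerate(dft(rho)):
        if k not in omega:
            out.append(v)                    # no recorded value: leave unchanged
        elif v != 0:
            out.append(a[k] * v / abs(v))    # keep the phase, replace the magnitude
        else:
            out.append(a[k])                 # zero coefficient: phase set to zero
    return idft(out)


def er_step(rho, a, omega, support):
    """Error reducing update: rho <- P_S P_{M,Omega} rho."""
    return project_support(project_magnitude(rho, a, omega), support)


def hio_step(rho, a, omega, support, beta=0.9):
    """Hybrid input-output update: P_M rho on S, rho - beta * P_M rho off S."""
    pm = project_magnitude(rho, a, omega)
    return [pm[i] if i in support else rho[i] - beta * pm[i]
            for i in range(len(rho))]
```

A function that satisfies both the support and magnitude constraints is a fixed point of both updates, which gives a quick sanity check for an implementation of this kind before coupling it to a support-refinement step.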

1. Aquila A, et al. (2012) Time-resolved protein nanocrystallography using an X-ray free-electron laser. Opt Express 20(3):2706–2716.
2. Boutet S, et al. (2012) High-resolution protein structure determination by serial femtosecond crystallography. Science 337(6092):362–364.
3. Chapman HN, et al. (2011) Femtosecond X-ray protein nanocrystallography. Nature 470(7332):73–77.
4. Johansson LC, et al. (2012) Lipidic phase membrane protein serial femtosecond crystallography. Nat Methods 9(3):263–265.
5. Kern J, et al. (2013) Simultaneous femtosecond X-ray spectroscopy and diffraction of photosystem II at room temperature. Science 340(6131):491–495.
6. Koopmann R, et al. (2012) In vivo protein crystallization opens new routes in structural biology. Nat Methods 9(3):259–262.
7. Spence JCH, Weierstall U, Chapman HN (2012) X-ray lasers for structural and dynamic biology. Rep Prog Phys 75(10):102601.
8. White TA, et al. (2012) Crystfel: A software suite for snapshot serial crystallography. J Appl Cryst 45(2):335–341.
9. Duisenberg AJM (1992) Indexing in single-crystal diffractometry with an obstinate list of reflections. J Appl Cryst 25:92–96.
10. Steller I, Bolotovsky R, Rossmann MG (1997) An algorithm for automatic indexing of oscillation images using Fourier analysis. J Appl Cryst 30:1036–1040.
11. Lovisolo L, da Silva EAB (2001) Uniform distribution of points on a hyper-sphere with applications to vector bit-plane encoding. IEEE Proc Vision Image Signal Process 148(3):187–193.
12. Guan Y (2009) Variance stabilizing transformations of Poisson, binomial and negative binomial distributions. Stat Probab Lett 14:1621–1629.
13. Alidaee B, Glover F, Kochenberger G, Wang H (2007) Solving the maximum edge weight clique problem via unconstrained quadratic programming. Eur J Oper Res 181(2):592–597.
14. Green DW, Ingram VM, Perutz MF (1954) The structure of haemoglobin. IV. Sign determination by the isomorphous replacement method. Proc Roy Soc A 225(1162):287–307.
15. Hauptman H, Karle J (1953) Solution of the Phase Problem I. The Centrosymmetric Crystal, No. 3 (American Crystallographic Association, New York).
16. Karle J (1980) Some developments in anomalous dispersion for the structural investigation of macromolecular systems in biology. Int J Quantum Chem Quantum Biol Symp 18(S7):357–367.
17. Rossmann MG, Blow DM (1962) The detection of sub-units within the crystallographic asymmetric unit. Acta Crystallogr 15(1):24–31.
18. Hayes MH (1982) The reconstruction of a multidimensional sequence from the phase or magnitude of its Fourier transform. IEEE Trans Acoust Speech Signal Process 30(2):140–154.
19. Rosenblatt J (1984) Phase retrieval. Commun Math Phys 95:317–343.
20. Spence JCH, et al. (2011) Phasing of coherent femtosecond X-ray diffraction from size-varying nanocrystals. Opt Express 19(4):2866–2873.
21. Millane RP (1996) Multidimensional phase problems. J Opt Soc Am A Opt Image Sci Vis 13(4):725–734.
22. Elser V (2013) Direct phasing of nanocrystal diffraction. Acta Crystallogr A 69:559–569.
23. Gerchberg RW, Saxton WO (1972) A practical algorithm for the determination of the phase from image and diffraction plane pictures. Optik (Stuttg) 35:237–246.
24. Fienup JR (1978) Reconstruction of an object from the modulus of its Fourier transform. Opt Lett 3(1):27–29.
25. Marchesini S, et al. (2003) X-ray image reconstruction from a diffraction pattern alone. Phys Rev B 68(4):140101.
26. Bernstein FC, et al. (1977) The Protein Data Bank: A computer-based archival file for macromolecular structures. J Mol Biol 112(3):535–542.
27. Ramazzina I, et al. (2008) Logical identification of an allantoinase analog (puuE) recruited from polysaccharide deacetylases. J Biol Chem 283(34):23295–23304.
28. Donatelli JJ (2013) Reconstruction algorithms for x-ray nanocrystallography via solution of the twinning problem, PhD Thesis (University of California, Berkeley).

www.pnas.org/cgi/doi/10.1073/pnas.1321790111