MACROMOLECULES AND OTHER STRUCTURES WITH A POOR DATA/PARAMETER RATIO

Macromolecules often contain regions of disordered solvent and do not usually diffract to as high a resolution as small molecules. On the other hand they often contain repeated chemical units which we can exploit by means of similarity restraints to improve the effective data to parameter ratio and hence the precision of the structure. These provide an effective way of incorporating 'non-crystallographic symmetry' into structure refinement. To simplify the application of restraints etc. SHELXL-93 allows a structure to be subdivided into residues, each of which is defined by a residue number and (optionally) a residue class (up to 4 characters). Different residues of the same chemical type may be assigned to the same class and also use identical atom names, but must have different residue numbers. Thus for example the beta carbon atoms in all phenylalanine residues (class PHE) in a polypeptide may all be called 'CB'. Only one instruction would then be needed to add the appropriate idealized hydrogens to all of them and refine them with a 'riding model':

 HFIX_phe 23 CB

To apply 'similarity' distance restraints to all phenylalanines, all that is required is one SAME instruction, which should be inserted before the first atom of the residue with the best geometry (so that its connectivity array may be used to define the 1,2- and 1,3-distances):

 RESI 23 phe
 SAME_phe N > CZ              [Note: there is of course no restriction on the
 N 3 ..... ..... .....        order of the atoms in a residue, but it must be
 ...                          the same for all residues of the same class]
 CZ 1 ..... ..... .....

It would also be sensible to apply a planarity restraint to these side chains:

 FLAT_phe CB > CZ

The code '_*' is used to refer to all residues. For example it would be possible to use FLAT in this way to ensure that all peptide carbonyl carbons have planar coordination, but it is easier to do this by restraining their chiral volumes to zero (because the three bonded atoms do not then need to be named explicitly):

 CHIV_* C 0

assuming that these are the only atoms named 'C'; since the default chiral volume is zero it could be left out. In some cases it is necessary to refer to specific residues, in which case residue numbers should be used. For example the following instruction calculates the torsion angle of a disulfide bridge linking Cys_56 and Cys_124:

 CONF CB_56 SG_56 SG_124 CB_124
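
The quantity tabulated by such a CONF instruction is an ordinary four-atom torsion angle. As an illustration only (this is not the program's own code), the following Python sketch computes it from orthogonal coordinates. The function name is hypothetical, the fractional coordinates are assumed to have already been converted to Cartesian Angstroms, and the sign is assumed to follow the usual IUPAC convention:

```python
import math

def torsion_angle(p1, p2, p3, p4):
    """Return the torsion angle p1-p2-p3-p4 in degrees.

    Each point is an (x, y, z) tuple of orthogonal (Cartesian) coordinates
    in Angstroms, e.g. CB_56, SG_56, SG_124, CB_124 for a disulfide bridge.
    """
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals to the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

A cis arrangement gives 0 degrees and a trans arrangement gives 180 degrees; the sign of intermediate values should be checked against the program output before relying on it.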

Protein crystallographers will have noticed that SHELXL-93 is fully compatible with the usual protein atom naming conventions, except that all atom names MUST begin with a letter, so the PDB convention of starting some hydrogen atom names with a digit is not allowed; similarly residue classes must begin with a letter and residue numbers must be pure numbers. The auxiliary program PDBINS is provided to generate a SHELXL-93 '.ins' file from a PDB file, incorporating restraints etc. taken from the dictionary file SHELXL.DIC.

The general approach to the refinement of large structures with limited reflection data is to proceed GRADUALLY, using all appropriate restraints (and possibly rigid group constraints) in the early stages of refinement, and relaxing them as far as possible only when the refinement has more or less converged. Although full-matrix refinement is normally recommended for small-molecule refinements, it is more efficient in terms of computer resources to use the Konnert-Hendrickson conjugate gradient approach (CGLS) for macromolecular refinement, with judicious insertion of large full-matrix blocks to help to resolve problem areas (e.g. solvent disorder). A final refinement with overlapping full-matrix blocks, possibly restricted to the x, y and z coordinates only, would then be required to obtain the esds in e.g. torsion angles. For a very small protein or polynucleotide with fewer than 500 non-hydrogen atoms (excluding solvent) a single final xyz-block would suffice. The CGLS refinement is usually very stable; erratic behavior can usually be tracked down to one or more atoms with unreasonably large isotropic or anisotropic displacement parameters, or to refinement of more parameters than the data and restraints can support.

If the second number on the L.S. or CGLS instruction is negative (-N) then every Nth reflection is ignored in the least-squares refinement, but is used instead for the calculation of independent R-values when the final structure factor cycle is performed. This enables 'R(free)' to be used to calibrate the sigmas for the various restraints and to check on possible 'over-refinement' (e.g. the refinement of noise peaks from a difference electron density map as solvent atoms). For details see A.T. Brunger, Nature 355 (1992) 472-475. Note the use of the DEFS instruction to change the default sigmas globally! A particularly effective application of R(free) is the decision as to whether the data justify (restrained) anisotropic refinement rather than isotropic. After the structure has more or less reached convergence after isotropic refinement in the usual way, two jobs are run with (for example) CGLS 20 -10 so that every 10th reflection is ignored in the refinement but is used instead for calculating R(free). One of the jobs should also contain ANIS (before the first atom), DELU and SIMU (without atom names), and ISOR (for the solvent water, e.g. ISOR O1 > LAST). Only if R(free) is significantly lower for the ANIS job is further anisotropic refinement justified. This is more likely to be the case if the data have been collected to higher resolution (i.e. the data to parameter ratio is higher), but the quality of the data is also important. In general the effective resolution should be better than (very roughly) 1.5 Angstroms for proteins and polynucleotides before anisotropic refinement is justified. It is sensible to apply this R(free) test and - if justified - initiate anisotropic refinement BEFORE attempting to resolve discrete side-chain disorder unless the components of the disorder are well separated spatially, because anisotropic motion can be regarded as an alternative to isotropic motion with discrete disorder for small separations. 
On the other hand it is a good idea to try to locate as many solvent atoms as possible before applying the test (see below).
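
The bookkeeping behind the R(free) test can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program's algorithm: the function names are hypothetical, reflections are taken as (|Fo|, |Fc|) pairs, and "every Nth reflection" is assumed to mean counting from 1 in list order (the text above does not say which reflections the program actually selects).

```python
def split_free(reflections, n):
    """Split a reflection list into (work, free) sets.

    Every n-th reflection (counting from 1; an assumption) is set aside
    for R(free); the rest are used in the refinement.
    """
    work, free = [], []
    for i, refl in enumerate(reflections, start=1):
        (free if i % n == 0 else work).append(refl)
    return work, free


def r_factor(reflections):
    """Conventional R = sum(| |Fo| - |Fc| |) / sum(|Fo|) over (Fo, Fc) pairs."""
    num = sum(abs(abs(fo) - abs(fc)) for fo, fc in reflections)
    den = sum(abs(fo) for fo, _ in reflections)
    return num / den
```

With CGLS 20 -10 about one tenth of the data would fall in the free set; the isotropic and ANIS jobs are then compared on r_factor(free), never on r_factor(work).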

The similarity restraints on the geometry are unbiased in the sense that no arbitrary numbers in the form of standard bond lengths and angles are required. Thus it should never be necessary to repeat a refinement because more precise values of these quantities are available. If R(free) is used to establish optimal esd's for the restraints, the weights may also be regarded as objective. The only assumption being made is that chemically equivalent bond lengths and angles (i.e. 1,3-distances) are equal in a statistical sense. Similarly the planarity restraints and the restraints on isotropic and anisotropic displacement parameters do not require the use of preconceived (and possibly erroneous) numbers (except zero!). This approach should be used whenever the type of problem (e.g. the extent of the non-crystallographic similarity) and the extent of the data permit.

The geometrical similarity approach works very well for 'small-molecule' structures which have become large because there are several chemically identical molecules in the crystallographic asymmetric unit, and well for polynucleotides which may also contain several examples of each repeating unit (especially when divided up into base, furanose and phosphate units). A further advantage of the similarity approach for polynucleotides is that the state of protonation of the bases may be uncertain, making it difficult to know which standard bond lengths etc. to use as target values or in fitting rigid groups; it is safer to assume that the equivalent bases have the same (partial) protonation states, i.e. the 1,2- and 1,3-distances are 'similar' but unknown.

On the other hand in proteins some amino-acids may be present many more times (and so will be better refined) than others, and geometric similarity does not help for an amino-acid which is only present once. Thus the recommended approach for proteins and large polypeptides is to use DFIX instructions to restrain 1,2- and 1,3-distances to standard values, with SAME/SADI (and small sigmas) to restrain the components of disordered residues to be similar. FLAT restraints are useful for aromatic residues and (with larger sigmas) for the five atoms involved in each main-chain peptide linkage. It is also very convenient to impose planarity on carbonyl and carboxyl carbons using CHIV (with a chiral volume of zero). All these restraints are set up automatically when the program PDBINS (Appendix B) is used to convert a PDB file for a protein into SHELXL-93 '.ins' format; the restraints are taken from the dictionary file SHELXL.DIC which users are encouraged to extend and adapt to local circumstances. Alternatively a text editor may be used to incorporate the appropriate parts of SHELXL.DIC into the .ins file.

Standard (restraint) bond lengths based on the CSD have been tabulated by F.H.Allen, O. Kennard, D.G. Watson, L. Brammer, A.G. Orpen and R. Taylor in Sections 9.5 and 9.6 of Volume C of International Tables for Crystallography (1992), Ed. A.J.C. Wilson, Kluwer Academic Publishers, Dordrecht, pp. 685-791. Suitable parameters for proteins have been given by R.A. Engh and R. Huber, Acta Cryst., A47 (1991) 392-400. For nucleic acids the necessary parameters may be taken from R. Taylor and O. Kennard, J. Mol. Struct., 78 (1982) 1-28 (bases and phosphates) and S. Arnott and D.W.L. Hukins, Biochem. J., 130 (1972) 453-465 (furanose rings). Taylor and Kennard found no evidence that the bases are non-planar, so FLAT can safely be used. With poor resolution data it might be better to fit the bases to the orthogonal coordinates given by R. Taylor and O. Kennard, J. Am. Chem. Soc., 104 (1982) 3209-3212, and then refine them as rigid groups (FRAG...FEND - possibly in an 'include file' - followed by AFIX 176 etc.).

It appears that the optimal restraint esds are very nearly independent of the type of structure and the resolution of the data, so normally the default values may be used. These have been established by R(free) and other tests on a variety of structures. The default values may if necessary be reset globally by a DEFS instruction before the individual restraints. The default esds are: all SAME and SADI distances, and DFIX with positive d: 0.03 A (first DEFS parameter); FLAT and CHIV: 0.2 A³ (second DEFS parameter); DELU: 0.01 A² (third DEFS parameter), SIMU: 0.05 (fourth DEFS parameter) if neither atom terminal, otherwise 0.1 (or twice the fourth DEFS parameter); ISOR: 0.1 if atom not bonded to exactly one other atom, otherwise 0.2; DFIX -d (anti-bumping restraints) 0.1 A. The ISOR and DFIX -d defaults are not set by DEFS.
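
The dependence of the default esds on the four DEFS parameters can be summarized in a small Python sketch. The function and the parameter names (sd, sf, su, ss) are hypothetical labels chosen for this illustration; the numbers are those quoted in the paragraph above.

```python
def default_esds(sd=0.03, sf=0.2, su=0.01, ss=0.05):
    """Return the default restraint esds implied by four DEFS parameters.

    sd, sf, su, ss are illustrative names for the first to fourth DEFS
    parameters; the ISOR and anti-bumping values are not set by DEFS.
    """
    return {
        "SAME/SADI/DFIX": sd,      # distances, DFIX with positive d
        "FLAT/CHIV": sf,
        "DELU": su,
        "SIMU": ss,                # neither atom terminal
        "SIMU_terminal": 2 * ss,   # twice the fourth DEFS parameter
        "ISOR": 0.1,               # atom not bonded to exactly one other atom
        "ISOR_terminal": 0.2,
        "DFIX_antibump": 0.1,      # DFIX with negative d
    }
```

For example default_esds(ss=0.10) doubles both SIMU values, while the ISOR and anti-bumping entries are unchanged.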

Although the above default restraint esds give good results for small molecules and proteins which diffract to 1.2 Angstroms or better, there may be discrepancies involving the rigid bond restraints, indicating that the harmonic model is not such a good approximation, i.e. an ensemble (molecular dynamics) approach may be a better description. In such cases DELU and SIMU can be relaxed to about 0.03 and 0.10 respectively for anisotropic refinement, and this model may well give the lowest value for the free R-factor. Some care is needed, because if the restraints are relaxed too far the refinement may become unstable.

The refinement may also become unstable (e.g. oscillate rather than converge) if one or more solvent atoms have unreasonably high displacement parameters, in which case they can be deleted. Otherwise either DAMP 100 (with L.S.) or SLIM .3 .1 (with CGLS) should be tried to damp the refinement (which will then require more cycles for convergence).

A further facility primarily intended for macromolecules but also useful for smaller structures is the production of tables using RTAB. When used in conjunction with residues, RTAB provides a convenient way of tabulating standard torsion angles, chiral volumes, and distances and angles involved in (for example) hydrogen bonds. Examples of the latter involving symmetry generated atoms were included in the second test structure (sigi) discussed above. The following instructions would produce sorted tables of the standard protein torsion angles and chiral volumes for the alpha-carbon atoms, assuming that the residues are numbered consecutively (CA_- means the atom CA with the residue number decreased by one):

 RTAB_* Omeg CA_- C_- N CA
 RTAB_* Phi C_- N CA C
 RTAB_* Psi N CA C N_+
 RTAB_* Chi N CA CB CG
 RTAB_* Cvol CA

If RTAB_* is not appropriate for a particular residue, e.g. some torsion angles involving the terminal residues, or chi and chiral volume for glycine, the residues in question are simply left out of the tables. The _+ and _- notation may also be used for cyclic peptides by assigning an 'alias' to the first and last residues; for example the residues in a cyclic pentapeptide could be numbered 2 to 6 inclusive, with alias 7 assigned to residue 2 and alias 1 to residue 6, so that all the torsion angles would be tabulated using the above RTAB instructions.
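
For orientation, a chiral volume of the kind tabulated by RTAB Cvol and restrained by CHIV can be computed as a triple scalar product of the three bond vectors. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, the normalization (parallelepiped volume, not divided by six) and the order of the three neighbors that fixes the sign are assumed here and should be checked against the program output. What does not depend on those assumptions is that a planar atom has a chiral volume of zero.

```python
def chiral_volume(center, a, b, c):
    """Triple scalar product of the bonds center->a, center->b, center->c.

    Coordinates are orthogonal (x, y, z) tuples in Angstroms; the result is
    in cubic Angstroms and changes sign if two neighbors are interchanged.
    """
    def sub(p, q):
        return (p[0] - q[0], p[1] - q[1], p[2] - q[2])

    u, v, w = sub(a, center), sub(b, center), sub(c, center)
    # u . (v x w), expanded along the components of u
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))
```

Three coplanar bonds (e.g. an ideal carbonyl carbon) give zero, which is why CHIV with a target of zero acts as a planarity restraint.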

The SWAT option introduces one variable and one fixed parameter which enable diffuse solvent to be modeled by Babinet's principle (R. Langridge, D.A. Marvin, W.E. Seeds, H.R. Wilson, C.W. Hooper, M.H.F. Wilkins and L.D. Hamilton, J. Mol. Biol. 2 (1960) 38-64; H. Driessen, M.I.J. Haneef, G.W. Harris, B. Howlin, G. Khan and D.S. Moss, J. Appl. Cryst. 22 (1989) 510-516). This usually produces a significant but not dramatic improvement for the very low order data in macromolecular refinements.
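
A Babinet-type diffuse solvent correction is commonly written as a resolution-dependent factor applied to Fc², of the form 1 - g.exp[-8(pi)²U(sin(theta)/lambda)²]. The Python sketch below evaluates this factor; whether SHELXL-93 uses exactly this expression, and which of g and U is the refined parameter, are assumptions not confirmed by the text above, and the function name is hypothetical.

```python
import math

def babinet_factor(d_spacing, g, u):
    """Scale factor for Fc**2 from a Babinet-type diffuse solvent model.

    d_spacing is the resolution of the reflection in Angstroms, so that
    sin(theta)/lambda = 1 / (2 d). g and u are the two solvent parameters.
    """
    s = 1.0 / (2.0 * d_spacing)
    return 1.0 - g * math.exp(-8.0 * math.pi ** 2 * u * s * s)
```

The factor approaches 1 - g for the lowest order reflections and 1 at high resolution, consistent with the correction mattering mainly for the very low order data.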

One of the most difficult and potentially time-consuming aspects of macromolecular structure refinement is the treatment of solvent water. The relatively diffuse solvent atoms contribute primarily to the lower order reflections and so often constitute a local region in the least-squares parameter space in which there are more parameters than data, i.e. there may be many plausible sets of parameters which fit the data equally well. Thus anisotropic refinement of fully occupied atoms or isotropic refinement of a larger number of water molecules with fractional occupation factors may well fit the data equally well and involve about the same number of parameters in total. The advantage of the former approach is that chemically sensible restraints can be applied to the distances between the waters (and between the solvent and protein atoms). Even when the data only permit an isotropic refinement, it is recommended that the water be refined with full occupancies and 'anti-bumping' restraints until no more waters can be found, and then if necessary (e.g. when there are strong difference Fourier peaks closer than say 2.3 Angstrom to waters with relatively high U values) partial occupancies can be assigned.

SHELXL-93 enables anti-bumping restraints to be input by hand (DFIX -d) but they will usually be generated automatically by the program (by using the BUMP instruction and flagging the (water) atoms on which it is to operate by CONN 0). The anti-bumping restraints are generated between all water atoms, and between all water and all other atoms, including all possible symmetry equivalents and taking atom types into account (thus potential hydrogen bonds are allowed to be shorter than O..C distances etc.).
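
The one-sided character of an anti-bumping restraint (DFIX -d) can be illustrated as follows: the restraint contributes only when the two atoms are closer than the target distance. This Python sketch uses a hypothetical function name and the default esd of 0.1 A quoted earlier; how the program chooses the target distance for each pair of atom types is not reproduced here.

```python
def antibump_residual(distance, d_min, sigma=0.1):
    """Weighted residual of an anti-bumping restraint.

    Returns 0.0 when the atoms are at least d_min apart; otherwise the
    shortfall divided by the restraint esd (all values in Angstroms).
    """
    if distance >= d_min:
        return 0.0
    return (d_min - distance) / sigma
```

Unlike an ordinary DFIX restraint, distances longer than the target are not penalized, so waters remain free to move apart.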

The following iterative procedure proves effective in practice at building up a network of fully-occupied water molecules, with an acceptable pattern of hydrogen-bonded distances, that is also consistent with the diffraction data. The SWAT and BUMP instructions should be included throughout, with CONN 0 to flag the water molecules and inhibit the generation of accidental bonds (which can for example upset the reidealization of hydrogen atoms each refinement cycle). If the waters are anisotropic ISOR 0.1 O1 > LAST is advisable. After each refinement job, water molecules with (an)isotropic displacement parameters which are too high (e.g. all three principal components greater than 1.2 or 1.4 A²) should be deleted, and (FMAP 2 / PLAN 200 2.3) difference peaks which make sensible hydrogen bonding distances to water molecules or to other electronegative atoms added; these will not necessarily be the highest peaks. The final table of distances between peaks should be checked to ensure that there are no short distances between the chosen peaks (PLAN 200 2.3 does this automatically). The list of 'disagreeable restraints' after the final refinement cycle in each job should also be checked for short contacts and if necessary one of the offending waters removed. At lower resolution it would be necessary to use a graphical display of the F_o-F_c or 2F_o-F_c electron density to locate the new trial water molecules. This procedure converges after a few jobs when no further water molecules can be eliminated or added. At this point the remaining difference electron density peaks should be inspected carefully to see if it is necessary to add partially occupied discrete solvent atoms in the vicinity of disordered side-chains (if any). An advantage of the full occupancy / antibumping approach is that it prevents water molecules from diffusing into protein regions and thus facilitates remodeling of disordered side-chains etc.
In summary, for modeling the solvent the following instructions would be typical:

CGLS 10
SWAT 2 2            ! will be updated by the program in the .res file
BUMP                ! automatic antibumping restraints generated
CONN 0 O1 > LAST    ! flag water for antibumping and exclude from connectivity
ISOR 0.1 O1 > LAST  ! for anisotropic waters (ignored for isotropic atoms)
FMAP 2              ! F_o-F_c map
PLAN 200 2.3        ! difference peaks only written to .res for potential waters

and after each job waters would be deleted on editing .res to .ins if bad contacts remain (see final restraints summary) or if U or U_eq have risen to too high a value; selected (or perhaps all) potential waters in the peak-list are then renamed and moved to before the HKLF instruction. It is also possible to monitor progress using the free R factor (CGLS 10 -10). Even if anisotropic refinement is planned, it is a good idea (and it usually makes the eventual R-free test for the anisotropic refinement more favorable) to optimize the water structure in this way first. If this extension of the water is continued after going anisotropic, then an ANIS instruction is needed before the first new water (oxygen) atom.
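
The editing step between jobs amounts to two filters, which the following Python sketch makes explicit. Everything about the data layout is hypothetical (dictionaries with 'name', 'u' and 'nearest' keys, as might be extracted by hand or by a small script from the .res and .lst files); the lower distance limit of 2.3 A follows the PLAN 200 2.3 setting, while the upper limit d_max is an illustrative assumption and u_max must be chosen by the user.

```python
def update_waters(waters, peaks, u_max, d_min=2.3, d_max=3.2):
    """Select waters to keep and difference peaks to promote to waters.

    waters: list of dicts with 'name' and 'u' (U or U_eq in square Angstroms).
    peaks:  list of dicts with 'name' and 'nearest' (distance in Angstroms
            to the closest water or other electronegative atom).
    """
    kept = [w for w in waters if w["u"] <= u_max]
    promoted = [p for p in peaks if d_min <= p["nearest"] <= d_max]
    return kept, promoted
```

The promoted peaks would then be renamed as oxygen atoms and moved to before the HKLF instruction; bad contacts reported in the final restraints summary still need to be checked separately.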

Other useful features for macromolecules include an 'omit map' (OMIT atomnames followed by FMAP), the SHEL instruction for ignoring high and low resolution data, the use of 'include files' for accessing standard fragments or restraint libraries, and provision for synchrotron data at various wavelengths (DISP) as well as Laue data (LAUE plus HKLF 2).

The amount of '.lst' file output produced may be reduced substantially by putting MORE 0 before the first atom in the '.ins' file, but this facility should only be used when one is sure that the '.ins' file is correct; it might be better to edit (or write a little program to extract information from) the full '.lst' file instead, so that diagnostic information is still available if required. The UNIX 'more' command is useful for browsing through '.lst' files.

In contrast to standard macromolecular refinement programs, SHELXL-93 is able to provide reliable estimates of the standard deviations of all refined parameters and of all derived quantities, subject of course to any assumptions implied by the restraints employed (in keeping with the Bayesian philosophy). For example tight geometrical 'similarity restraints' effectively determine mean bond lengths and angles and their esd's, but leave the torsion angles free to refine independently; thus the torsion angles - and their esds - retain their diagnostic value.

In summary, a typical refinement of a small protein would take the following course. First the auxiliary program PDBINS would be used to convert the atom coordinates into SHELXL-93 '.ins' format and to extract the necessary restraints from a residue dictionary file (based on 'shelxl.dic' which is provided as a model). This is especially convenient if XPLOR has been used for the structure solution by molecular replacement and/or the initial refinement. Some editing of the '.ins' file may be needed if disorder or non-standard residues are present. Different components of disordered groups should be assigned different PART numbers, and the occupation factors of two components may be refined as p and (1-p) by the use of a free variable (i.e. set to e.g. 21 and -21 in which case a starting value for free variable number 2 should be given as the second parameter on the FVAR instruction). The first SHELXL-93 runs serve to build up a consistent network of fully occupied solvent molecules as explained above. At this point the hydrogen atoms are inserted by removing REM which precedes the HFIX instructions from PDBINS and the dictionary file. Attachment of hydrogens to more than one component of each disordered group is best performed in a subsequent job by inserting the appropriate AFIX instructions. If the resolution is very good (ca. 1.5 A or better) the R(free) test should now be performed to see whether anisotropic refinement is justified (i.e. two CGLS 20 -10 jobs should be run, differing only in that one contains an ANIS instruction). It is a mistake to model discrete disorder (unless the components are very clearly separated), or to include partially occupied solvent, until this test is applied, because anisotropic refinement may well provide an alternative way of modeling these effects. 
Subsequent anisotropic refinement (if justified) may be combined with improvement of the solvent model and possible modeling of discrete disorder; very often the better phase estimates resulting from the restrained anisotropic refinement give a much clearer difference electron density. Towards the end of this procedure partially occupied solvent may be introduced; where possible the occupation factors should be coupled (using free variables) to those of neighboring disordered side-chains, or an atom may be split into two components with occupancies fixed at 0.5 (i.e. set as 10.5), either as recommended by the program (see the list of principal displacement components) or as deduced from an F_o-F_c Fourier. This maintains the anti-bumping restraints with other solvent and side-chain atoms, but not between disordered components for which the occupancies add up to less than 1.1 (slightly greater than one to allow for hydrogen atom contributions etc.). At various stages in the refinement one of the LIST options can be used to write a phased reflection list to the .fcf file, for input into another macromolecular FFT program that generates maps for a graphics system. When the refinement has converged, it may be desired to run an xyz-only refinement with overlapping blocks (L.S./BLOC) to obtain esds on the torsion angles and hydrogen bonding distances (the antibumping list may be used to set up tables using RTAB and EQIV - see the sigi test example). Torsion angles and hydrogen-bonding distances are not usually restrained in the refinement, and so their esds have some meaning. Finally ACTA 2 and/or WPDB may be used to archive the results.
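
The occupancy codes mentioned above (21, -21, 10.5) follow the usual SHELX convention in which a parameter is given as 10m + p. The Python sketch below decodes such a value; the function name is hypothetical, and the code -10 < value < -5 range (m = -1) is deliberately rejected because its meaning is not covered by the text above.

```python
def decode_parameter(code, free_vars):
    """Decode a SHELX-style parameter code 10*m + p into its actual value.

    free_vars maps free variable numbers to their current values.
    m = 0: p itself (refined); m = 1: fixed at p;
    m > 1: p * fv(m); m < -1: p * (fv(-m) - 1).
    """
    sign = -1 if code < 0 else 1
    m = sign * int((abs(code) + 5.0) // 10)
    p = code - 10 * m
    if m in (0, 1):
        return p
    if m > 1:
        return p * free_vars[m]
    if m < -1:
        return p * (free_vars[-m] - 1.0)
    raise ValueError("code with m = -1 is not handled in this sketch")
```

With free variable 2 equal to 0.7, the codes 21 and -21 decode to 0.7 and 0.3, so the two components of a disordered group always sum to full occupancy; 10.5 decodes to a fixed 0.5.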
