Welcome to the Optibrium Community






FAQs



How do you assess chemical similarity?

Tuesday, 29 September 2009 19:54
administrator

The chemical similarity of compounds can be based on molecular structure, on any measured or predicted properties of the molecules, or on a combination of the two. The most common approach is to use the molecular structure to assess the chemical similarity within a set of compounds. This approach defines similarity in terms of the patterns of atoms present in the structures. The patterns of atoms along ‘paths’ through the 2D chemical structure are encoded in a binary ‘fingerprint’, and the similarity of two compounds is defined in terms of the Tanimoto index. The advantage of a path-based fingerprint approach to similarity and diversity is that it provides a ‘generic’ method of comparing compounds. No assumption is made regarding the characteristics of molecules that will correlate most strongly with the biological activities of interest. Also, similarity assessed in this way usually corresponds well to a chemist’s view; compounds from within a chemical series will typically have a high Tanimoto index.
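As a rough illustration of the Tanimoto index, the sketch below treats each fingerprint as the set of bit positions that are switched on. The fingerprints and the function name are made up for the example, and this is not StarDrop's own fingerprinting code; only the formula (shared bits divided by the total number of distinct bits set in either fingerprint) is the standard definition.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto index of two binary fingerprints given as sets of 'on' bit positions."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints: each integer stands for a bit set by some path pattern.
fp_1 = {1, 4, 7, 9, 12}
fp_2 = {1, 4, 7, 10}

similarity = tanimoto(fp_1, fp_2)  # 3 shared bits out of 6 distinct bits
print(similarity)
```

A value of 1 means the two fingerprints are identical and 0 means they share no bits, which is why compounds from the same series tend to score highly.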


What are the axes of the ‘chemical structure’ space plot?

Tuesday, 29 September 2009 19:55
Ed Champness

It is not possible to quantify the axes of the structural chemical space plot as they do not correspond to a particular property or descriptor. The structural chemical space plot is a two-dimensional approximation of a multi-dimensional space, in which the similarity to each molecule in the data set represents a single dimension. Using either 'Visual Clustering' (t-SNE, t-Distributed Stochastic Neighbour Embedding) or Principal Component Analysis (PCA), this is reduced to a two-dimensional space that maximizes the visible variation. As a result, the two dimensions are effectively a function of the molecules’ similarities and do not have an explicit meaning. The dimensions are best considered as distance measures such that the closer molecules are in the plot, the greater their similarity. Bear in mind that the more molecules there are in the plot, the more approximate the visible distance will be for any two individual points, although the overall trend covers the diversity of the entire set.


How is the ‘property space’ plot created?

Tuesday, 29 September 2009 19:55
Ed Champness

Property chemical space plots can use either 'Visual Clustering' (t-SNE, t-Distributed Stochastic Neighbour Embedding) or Principal Component Analysis (PCA) to enable visualization of the distribution of compounds with respect to their properties. Groups of compounds with a similar profile of properties will cluster together in these types of space. Additionally, if specific descriptors are expected to correlate with an important property, these may be imported into StarDrop and used to define a chemical space. Thus, the diversity of selections with respect to these descriptors may be visualized to aid in library design.


Can you define chemical space using your own descriptors?

Tuesday, 29 September 2009 19:55
Ed Champness

Yes, StarDrop can build chemical space projections using structures or any combination of continuous and categorical data, including imported data.


How many compounds can I use in a chemical space plot?

Tuesday, 29 September 2009 19:57
Ed Champness

The chemical space display has no hard limits. However, generating the underlying chemical space plot will take a prohibitively long time for more than ~10,000 molecules. For sets bigger than this, it is worth using StarDrop's selection tool to perform a random selection of up to 10,000 molecules and use this selection to generate the chemical space. It is then possible to project any number of molecules into this space. The chemical space plot has been designed to work most efficiently for displaying sets of up to about 100,000 molecules on typical desktop hardware.


Can you run Chemical Space without a server?

Tuesday, 29 September 2009 19:57
Ed Champness

Yes, this runs on the StarDrop client application on your machine.


What are the requirements to run a ‘biased’ selection?

Tuesday, 29 September 2009 19:57
Ed Champness

A biased selection enables you to select a diverse set of compounds that is also influenced by a particular property. You can define diversity in terms of the chemical structure and/or properties, and choose from a variety of metrics for determining the optimal way to pick diverse compounds.


How is chemical diversity assessed?

Tuesday, 29 September 2009 19:58
Ed Champness

Chemical diversity can be defined in terms of the chemical structure and/or compound properties. For chemical structure, it is defined in terms of the patterns of atoms present in the compounds' structures. The patterns of atoms along ‘paths’ through the 2D chemical structure of a compound are encoded in a binary ‘fingerprint’, and the similarity of two compounds is defined by a Tanimoto coefficient. The advantage of a path-based fingerprint approach to similarity and diversity is that it provides a ‘generic’ method of comparing compounds. For properties, it is defined in terms of the Euclidean distance between two points in property space: the smaller the distance, the greater the similarity. A diverse set of compounds is determined by one of three metrics:

  • Maximin - the minimum distance between any two chosen compounds is maximised
  • S-optimal - the harmonic mean distance between pairs of chosen compounds is optimised so that they are spread out
  • Max Average - the average distance between all chosen compounds is maximised
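The three metrics above can be sketched as scoring functions over the pairwise Euclidean distances of a candidate selection. This is only an illustration under stated assumptions: the function names and the example points are hypothetical, and the exact formulas StarDrop uses (in particular for S-optimal) may differ from the plain harmonic mean shown here.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def maximin_score(points):
    """Maximin: the smallest distance between any two chosen compounds."""
    return min(pairwise_distances(points))


def harmonic_mean_score(points):
    """S-optimal (as assumed here): harmonic mean of the pairwise distances."""
    distances = pairwise_distances(points)
    if any(d == 0 for d in distances):
        return 0.0  # a duplicated point drives the harmonic mean to zero
    return len(distances) / sum(1.0 / d for d in distances)


def max_average_score(points):
    """Max Average: arithmetic mean of the pairwise distances."""
    distances = pairwise_distances(points)
    return sum(distances) / len(distances)


# Hypothetical selection of three compounds in a two-property space.
selection = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # pairwise distances: 5, 10, 5
print(maximin_score(selection), harmonic_mean_score(selection), max_average_score(selection))
```

In each case a higher score means a more spread-out selection. The harmonic mean is pulled down strongly by any close pair, so it penalises near-duplicates more than the plain average does.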


Why is the compound selection based on a genetic algorithm?

Tuesday, 29 September 2009 19:58
Ed Champness

The number of possible selections grows combinatorially with the size of a virtual library, e.g. there are 2.6 × 10^23 ways of choosing 10 compounds from a library of 1,000. Therefore, when considering diversity, it rapidly becomes impossible to perform an exhaustive search for the optimal selection for a given set of criteria. Instead, a ‘stochastic’ approach must be taken, which cannot guarantee to identify the optimal solution but will find the optimal or a near-optimal selection with high probability. Genetic algorithms are a well-known and robust approach commonly used in this context.
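The figure quoted above is the binomial coefficient "1,000 choose 10", which can be checked with a couple of lines of Python. The variable name is just for illustration.

```python
import math

# Number of distinct ways to choose 10 compounds from a library of 1,000.
n_selections = math.comb(1000, 10)
print(f"{n_selections:.1e}")
```

Even at a billion evaluations per second, scoring that many selections would take millions of years, which is why an exhaustive search is ruled out.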


Should we wait for the selection algorithm to reach 1?

Tuesday, 29 September 2009 19:58
Ed Champness

The maximum value that can be achieved for the optimal selection depends on the chemical structures, the properties and the diversity metric used, along with the characteristics of the compound set from which the selection is being made. Commonly, this optimal value will be less than 1, unless your data set is small. Even in cases where a value of 1 is possible, it could take a long time to achieve, so we recommend that you wait until the plot reaches a plateau.
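One way to make "reaches a plateau" concrete is to check whether the best score has stopped improving over the most recent generations. The sketch below is a hypothetical helper, not part of StarDrop; the `window` and `tolerance` parameters are illustrative choices.

```python
def has_plateaued(scores, window=20, tolerance=1e-4):
    """Return True when the best score improved by less than `tolerance`
    over the last `window` entries of `scores` (one best score per generation)."""
    if len(scores) <= window:
        return False  # not enough history to judge yet
    best_overall = max(scores)
    best_before_window = max(scores[:-window])
    return best_overall - best_before_window < tolerance


# Hypothetical score traces.
still_improving = [0.1, 0.2, 0.3, 0.4, 0.5]
levelled_off = [0.1, 0.3, 0.5, 0.6, 0.62, 0.62, 0.62, 0.62]

print(has_plateaued(still_improving, window=2))  # False
print(has_plateaued(levelled_off, window=3))     # True
```

Note that the check never compares the score against 1; it only asks whether further waiting is still paying off.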


What is the appropriate balance of quality and diversity?

Tuesday, 29 September 2009 19:59
Ed Champness

It is usually beneficial to explore the sensitivity of a selection to the degree of bias chosen before making a final decision. Often, a significant degree of added diversity can be explored for a small decrease in the overall quality of the compounds selected. In this case, it is advisable to spread the risk across diverse compounds, provided synthetic resources permit. Conversely, in some cases, the selection of compounds will remain the same until a large bias toward diversity is selected. In this case, the selection of a diverse set may require an unacceptable decrease in the overall quality of the compounds. As a general rule, at the earlier stages of a project where little is known about the SAR of the target, it is advisable to bias a selection in favour of diversity. Typically a diversity:value ratio of 80:20 will sample across the extremes of chemical diversity, whilst still ensuring that top scoring compounds are represented within the selection. As the project moves towards the candidate stage, it will become more important to bias the selection towards ‘good’ compounds. In this case a diversity:value ratio of 20:80 may be more appropriate. Note that a diversity:value ratio of 100:0 will select molecules on the basis of their diversity and the newly selected set will mirror the diversity of the original set (assuming a reasonable sample size of molecules has been selected). A diversity:value ratio of 0:100 is equivalent to sorting by the value and selecting the top compounds.


Can I do a random selection?

Tuesday, 29 September 2009 19:59
Ed Champness

Yes, there is an option for random selection. Unlike the biased options, the random selection option does not need either chemical structure information or property values in order to run.


Can I save the selection?

Tuesday, 29 September 2009 19:59
Ed Champness

Yes, the selection tool will select a set of rows in the data set. You can then easily create a new data set from this selection or tag the selected rows to keep track of them using the tools on the StarDrop toolbar.


What is the maximum number of compounds that can be selected?

Tuesday, 29 September 2009 19:59
administrator

Selection of compounds with a consideration of diversity can be computationally demanding. The computational cost scales as the square of the number of compounds being selected. In practice, a reasonable limit is approximately 1,000 compounds.
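A plausible reason for the quadratic scaling, offered here as an assumption rather than a description of StarDrop's internals, is that a diversity score needs the distance between every pair of selected compounds. The number of pairs can be counted directly; the function name is hypothetical.

```python
def n_pairs(n_selected: int) -> int:
    """Number of unordered pairs among `n_selected` compounds."""
    return n_selected * (n_selected - 1) // 2


for n in (100, 1000, 2000):
    print(n, n_pairs(n))
```

Doubling the selection from 1,000 to 2,000 compounds roughly quadruples the number of pairs, from 499,500 to 1,999,000.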





