Similarity Downselection: A Python implementation of a heuristic search algorithm for finding the set of the n most dissimilar items with an application in conformer sampling
| Main Authors | , , , |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 06.05.2021 |
| Subjects | |
| DOI | 10.48550/arxiv.2105.02991 |
| Summary: | Finding the set of the n items most dissimilar from each other out of a larger population becomes increasingly difficult and computationally expensive as either n or the population size grows. Finding this set differs from simply sorting an array of numbers because a pairwise relationship exists between each item and all other items in the population. For instance, one or more items in the most dissimilar set of size n=4 might not appear in the most dissimilar set of size n=5. An exact solution would have to search all possible combinations of size n in the population exhaustively. We present open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the n most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that, for finding the set of n most dissimilar conformers, SDS is not only orders of magnitude faster but also more accurate than running the Monte Carlo method for 1,000,000 iterations, each searching for set sizes n=3-7 out of a population of 50,000. We also benchmark SDS against the exact solution for small example populations, showing that SDS produces a solution close to the exact solution in these instances. |
|---|---|
| DOI: | 10.48550/arxiv.2105.02991 |
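
The two baselines named in the summary, exhaustive search and Monte Carlo sampling, can be sketched in a few lines of Python. This is only an illustration and is not the SDS code: the function names, the toy dissimilarity matrix, and the choice of objective (the sum of pairwise dissimilarities within the set) are assumptions made for the example. The toy matrix is built so that the best set of size 2 shares no item with the best set of size 3, which is the non-nesting behavior the summary describes.

```python
import math
import random
from itertools import combinations


def total_dissimilarity(subset, dist):
    """Sum of pairwise dissimilarities within a subset (assumed objective)."""
    return sum(dist[i][j] for i, j in combinations(subset, 2))


def exact_search(dist, n):
    """Score every size-n combination; feasible only for tiny populations."""
    candidates = combinations(range(len(dist)), n)
    return max(candidates, key=lambda s: total_dissimilarity(s, dist))


def monte_carlo_search(dist, n, iterations, seed=0):
    """Keep the best of `iterations` randomly drawn size-n subsets."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        candidate = tuple(sorted(rng.sample(range(len(dist)), n)))
        score = total_dissimilarity(candidate, dist)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical symmetric dissimilarity matrix for five items.
# Items 0 and 1 form the single most dissimilar pair, while
# items 2, 3, 4 are mutually far apart from one another.
dist = [
    [0, 10, 1, 1, 1],
    [10, 0, 1, 1, 1],
    [1, 1, 0, 9, 9],
    [1, 1, 9, 0, 9],
    [1, 1, 9, 9, 0],
]

best_2 = exact_search(dist, 2)  # (0, 1), score 10
best_3 = exact_search(dist, 3)  # (2, 3, 4), score 27
sampled_3 = monte_carlo_search(dist, 3, iterations=200)

# Number of size-7 subsets an exhaustive search over 50,000 items would score.
n_subsets = math.comb(50_000, 7)
```

Here the best pair is not a subset of the best triple, so a solution for size n cannot in general be obtained by extending the solution for size n-1. The value of `n_subsets` shows why the exhaustive approach is limited to small populations.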