本文介绍了查找比给定最大距离更近的所有点对的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想(有效地)找到比某个距离max_d更近的所有点对.我当前使用cdist的方法是:

I want to find (efficiently) all pairs of points that are closer than some distance max_d. My current method, using cdist, is:

import numpy as np
from scipy.spatial.distance import cdist

def close_pairs(X,max_d):
    d = cdist(X,X)

    I,J = (d<max_d).nonzero()
    IJ  = np.sort(np.vstack((I,J)), axis=0)

    # remove diagonal element
    IJ  = IJ[:,np.diff(IJ,axis=0).ravel()<>0]

    # remove duplicate
    dt = np.dtype([('i',int),('j',int)])
    pairs = np.unique(IJ.T.view(dtype=dt)).view(int).reshape(-1,2)

    return pairs

def test():
    X = np.random.rand(100,2)*20
    p = close_pairs(X,2)

    from matplotlib import pyplot as plt
    plt.clf()
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[p,0].T,X[p,1].T,'-b')

但是我认为这太过分了(不是很容易理解),因为大部分工作只是为了消除与自己的距离和重复项.

But I think this is overkill (and not very readable), because most of the work is done only to remove distance-to-self and duplicates.

我的主要问题是:有更好的方法吗?

(注意:此时输出的类型(arrayset,...)并不重要)

(Note: the type of outputs (array, set, ...) is not important at this point)

我目前的想法是使用pdist,它返回一个精简的距离数组,其中仅包含正确的对.但是,一旦我从压缩距离数组中找到合适的坐标k,我该如何计算它等同于哪个i,j对?

My current thinking is on using pdist which returns a condensed distance array which contains only the right pairs. However, once I found the suitable coordinates k's from the condensed distance array, how do I compute which i,j pairs it is equivalent to?

所以另一个问题是:是否有一种简单的方法来获取相对于pdist输出项的坐标对列表:

So the alternative question is: is there an easy way to get the list of coordinate pairs relative to the entries of pdist outputs:

  • 一个功能f(k)->i,j
  • 这样cdist(X,X)[i,j] = pdist(X)[k]
  • a function f(k)->i,j
  • such that cdist(X,X)[i,j] = pdist(X)[k]

推荐答案

以我的经验,有两种最快的方法来查找3D邻居列表.一种是使用以C ++或Cython(在我的情况下,两者都是)编写的最幼稚的双重循环"代码.它以N ^ 2运行,但是对于小型系统来说非常快.另一种方法是使用线性时间算法. Scipy ckdtree是一个不错的选择,但有其局限性.分子动力学软件的邻居列表查找器功能最强大,但包装起来非常困难,并且初始化时间可能很慢.

In my experience, there are two fastest ways to find neighbor lists in 3D. One is to use a most naive double-for-loop code written in C++ or Cython (in my case, both). It runs in N^2, but is very fast for small systems. The other way is to use a linear time algorithm. Scipy ckdtree is a good choice, but has limitations. Neighbor list finders from molecular dynamics software are most powerful, but are very hard to wrap, and likely have slow initialization time.

下面我比较四种方法:

  • 纯朴的cython代码
  • OpenMM 周围的包装(很难安装,请参见下文)
  • Scipy.spatial.ckdtree
  • scipy.spatial.distance.pdist
  • Naive cython code
  • Wrapper around OpenMM (is very hard to install, see below)
  • Scipy.spatial.ckdtree
  • scipy.spatial.distance.pdist

测试设置:n点散布在一个矩形框中,体积密度为0.2.系统大小从10到1000000(百万)个粒子不等.接触半径取自0.5, 1, 2, 4, 7, 10.请注意,由于密度为0.2,所以在接触半径为0.5时,每个粒子平均大约有0.1个接触,即1 = 0.8、2 = 6.4和10-约800!对于小型系统,重复进行几次接触查找,对于> 30k粒子的系统,进行一次.如果每个呼叫的时间超过5秒,则运行中止.

Test setup: n points scattered in a rectangular box at volume density 0.2. System size ranging from 10 to a 1000000 (a million) particles. Contact radius is taken from 0.5, 1, 2, 4, 7, 10. Note that because density is 0.2, at contact radius 0.5 we'll have on average about 0.1 contacts per particle, at 1 = 0.8, at 2 = 6.4, and at 10 - about 800! Contact finding was repeated several times for small systems, done once for systems >30k particles. If time per call exceeded 5 seconds, the run was aborted.

设置:双至强2687Wv3、128GB RAM,Ubuntu 14.04,python 2.7.11,scipy 0.16.0,numpy 1.10.1.没有任何代码使用并行优化(OpenMM除外,尽管并行部分运行得如此之快,以至于在CPU图形上它甚至都没有被注意到,大部分时间都花在了从OpenMM到管道的数据传输上).

Setup: dual xeon 2687Wv3, 128GB RAM, Ubuntu 14.04, python 2.7.11, scipy 0.16.0, numpy 1.10.1. None of the code was using parallel optimizations (except for OpenMM, though parallel part went so quick that it was not even noticeable on a CPU graph, most of the time was spend piping data to-from OpenMM).

结果:请注意,下面的图是对数刻度,分布在6个数量级上.甚至很小的视觉差异实际上可能是10倍.对于少于1000个粒子的系统,Cython代码总是更快.但是,经过1000个粒子后,结果取决于接触半径. pdist的实现总是比cython慢​​,并且占用更多的内存,因为它显式创建了一个距离矩阵,该距离矩阵由于sqrt而很慢.

Results: Note that plots below are logscale, and spread over 6 orders of magnitude. Even small visual difference may be actually 10-fold.For systems less than 1000 particles, Cython code was always faster. However, after 1000 particles results are dependent on the contact radius. pdist implementation was always slower than cython, and takes much more memory, because it explicitly creates a distance matrix, which is slow because of sqrt.

  • 在小的接触半径(每个颗粒< 1个接触)的情况下,ckdtree是所有系统尺寸的不错选择.
  • 在中等接触半径(每个粒子5-50个接触)的情况下,幼稚的cython实现最适合10000个粒子,然后OpenMM开始获胜大约几个数量级,但ckdtree的性能仅差3-10倍
  • 在高接触半径(每个粒子大于200个接触)的情况下,幼稚的方法最多可以处理100k或1M粒子,那么OpenMM可能会获胜.
  • At small contact radius (<1 contact per particle), ckdtree is a good choice for all system sizes.
  • At medium contact radius, (5-50 contacts per particle) naive cython implementation is the best up to 10000 particles, then OpenMM starts to win by about several orders of magnitude, but ckdtree performs just 3-10 times worse
  • At high contact radius (>200 contacts per particle) naive methods work up to 100k or 1M particles, then OpenMM may win.

安装OpenMM非常棘手;您可以在 http://bitbucket.org/mirnylab/openmm-polymer 文件"contactmaps .py"或自述文件中.但是,以下结果表明,对于N> 100k的粒子,每个粒子仅5-50次接触是有利的.

Installing OpenMM is very tricky; you can read more in http://bitbucket.org/mirnylab/openmm-polymer file "contactmaps.py" or in the readme. However, the results below show that it is only advantageous for 5-50 contacts per particle, for N>100k particles.

下面的Cython代码:

Cython code below:

import numpy as np
cimport numpy as np
cimport cython

cdef extern from "<vector>" namespace "std":
    cdef cppclass vector[T]:
        cppclass iterator:
            T operator*()
            iterator operator++()
            bint operator==(iterator)
            bint operator!=(iterator)
        vector()
        void push_back(T&)
        T& operator[](int)
        T& at(int)
        iterator begin()
        iterator end()

np.import_array() # initialize C API to call PyArray_SimpleNewFromData
cdef public api tonumpyarray(int* data, long long size) with gil:
    if not (data and size >= 0): raise ValueError
    cdef np.npy_intp dims = size
    #NOTE: it doesn't take ownership of `data`. You must free `data` yourself
    return np.PyArray_SimpleNewFromData(1, &dims, np.NPY_INT, <void*>data)

@cython.boundscheck(False)
@cython.wraparound(False)
def contactsCython(inArray, cutoff):
    inArray = np.asarray(inArray, dtype = np.float64, order = "C")
    cdef int N = len(inArray)
    cdef np.ndarray[np.double_t, ndim = 2] data = inArray
    cdef int j,i
    cdef double curdist
    cdef double cutoff2 = cutoff * cutoff  # IMPORTANT to avoid slow sqrt calculation
    cdef vector[int] contacts1
    cdef vector[int] contacts2
    for i in range(N):
        for j in range(i+1, N):
            curdist = (data[i,0] - data[j,0]) **2 +(data[i,1] - data[j,1]) **2 + (data[i,2] - data[j,2]) **2
            if curdist < cutoff2:
                contacts1.push_back(i)
                contacts2.push_back(j)
    cdef int M = len(contacts1)

    cdef np.ndarray[np.int32_t, ndim = 2] contacts = np.zeros((M,2), dtype = np.int32)
    for i in range(M):
        contacts[i,0] = contacts1[i]
        contacts[i,1] = contacts2[i]
    return contacts

Cython代码的编译(或makefile):

Compilation (or makefile) for Cython code:

    cython --cplus fastContacts.pyx
    g++  -g -march=native -Ofast -fpic -c   fastContacts.cpp -o fastContacts.o `python-config --includes`
    g++  -g -march=native -Ofast -shared  -o fastContacts.so  fastContacts.o `python-config --libs`

测试代码:

from __future__ import print_function, division

import signal
import time
from contextlib import contextmanager

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ckdtree
from scipy.spatial.distance import pdist

from contactmaps import giveContactsOpenMM  # remove this unless you have OpenMM and openmm-polymer libraries installed
from fastContacts import contactsCython


class TimeoutException(Exception): pass


@contextmanager
def time_limit(seconds):
    def signal_handler(signum, frame):
        raise TimeoutException("Timed out!")

    signal.signal(signal.SIGALRM, signal_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)


matplotlib.rcParams.update({'font.size': 8})


def close_pairs_ckdtree(X, max_d):
    tree = ckdtree.cKDTree(X)
    pairs = tree.query_pairs(max_d)
    return np.array(list(pairs))


def condensed_to_pair_indices(n, k):
    x = n - (4. * n ** 2 - 4 * n - 8 * k + 1) ** .5 / 2 - .5
    i = x.astype(int)
    j = k + i * (i + 3 - 2 * n) / 2 + 1
    return np.array([i, j]).T


def close_pairs_pdist(X, max_d):
    d = pdist(X)
    k = (d < max_d).nonzero()[0]
    return condensed_to_pair_indices(X.shape[0], k)


a = np.random.random((100, 3)) * 3  # test set
methods = {"cython": contactsCython, "ckdtree": close_pairs_ckdtree, "OpenMM": giveContactsOpenMM,
           "pdist": close_pairs_pdist}

# checking that each method gives the same value
allUniqueInds = []
for ind, method in methods.items():
    contacts = method(a, 1)
    uniqueInds = contacts[:, 0] + 100 * contacts[:, 1]  # unique index of each contacts
    allUniqueInds.append(np.sort(uniqueInds))  # adding sorted unique conatcts
for j in allUniqueInds:
    assert np.allclose(j, allUniqueInds[0])

# now actually doing testing
repeats = [30,30,30, 30, 30, 20,  20,   10,   5,   3,     2 ,       1,     1,      1]
sizes =    [10,30,100, 200, 300,  500, 1000, 2000, 3000, 10000, 30000, 100000, 300000, 1000000]
systems = [[np.random.random((n, 3)) * ((n / 0.2) ** 0.333333) for k in range(repeat)] for n, repeat in
           zip(sizes, repeats)]

for j, radius in enumerate([0.5, 1, 2, 4, 7, 10]):
    plt.subplot(2, 3, j + 1)
    plt.title("Radius = {0}; {1:.2f} cont per particle".format(radius, 0.2 * (4 / 3 * np.pi * radius ** 3)))

    times = {i: [] for i in methods}

    for name, method in methods.items():
        for n, system, repeat in zip(sizes, systems, repeats):
            if name == "pdist" and n > 30000:
                break  # memory issues
            st = time.time()
            try:
                with time_limit(5 * repeat):
                    for ind in range(repeat):
                        k = len(method(system[ind], radius))
            except:
                print("Run aborted")
                break
            end = time.time()
            mytime = (end - st) / repeat
            times[name].append((n, mytime))
            print("{0} radius={1} n={2} time={3} repeat={4} contPerParticle={5}".format(name, radius, n, mytime,repeat, 2 * k / n))

    for name in sorted(times.keys()):
        plt.plot(*zip(*times[name]), label=name)
    plt.xscale("log")
    plt.yscale("log")
    plt.xlabel("System size")
    plt.ylabel("Time (seconds)")
    plt.legend(loc=0)

plt.show()

这篇关于查找比给定最大距离更近的所有点对的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-06 06:10