问题描述
我有一个数据大约一百万条记录,每条记录有6个浮点数。我想查找具有相同六个值的记录集合,理想情况下我想在Fortran中执行此操作,因为剩下的处理过程是在Fortran中完成的。对此的推荐方法是什么?最后,我想从原始索引映射到新索引,这是这些数据集的精简版本,而不会有重复。每条记录都有其他属性,我有兴趣根据这六个属性来聚合这些组。我试图通过将输出导出为csv来查找这些集合,并将其导入到MS Access,那么发现这些集合的查询需要10秒左右才能运行。我写了一个代码 this( 线性搜索?),但百万记录在10分钟左右没有完成,我刚刚放弃了这种方法。
我现在想的方法是调整排名/排序例程或,我认为这比我的原始代码更好。但我想知道这样的事情是否已经完成,我可以下载,或者如果有更好的方法。
编辑: 从原始数据记录映射到这个缩减集。我想要像imap(1:n)这样的映射数组,其中imap(1),imap(4),imap(5)具有相同的值,如果这些浮点数是pt。原始记录1,4和5中的值是相同的。希望这不会太偏离我原先所说的......
这就是我最终做的事情。 。我从取得了代码 mrgrnk
,并适合我的目的。下面的子程序 findmap
似乎正在做我想做的事。
module fndmap
use m_mrgrnk,only:mrgrnk
implicit none
包含
子程序findmap(stkprm,stkmap)
!给出二维真实数组stkprm,找到如下所示的映射:
!
! (相同的记录分配有相同的索引)
! stkmap(i)== stkmap(j)iff stkprm(:,i)== stkprm(:,j)
! (订单保存)
!如果我< j和stkmap(i)/ = stkmap(j),则stkmap(i)< stkmap(j)
! (新索引是连续的)
! set(stkmap)== {1,2,..,maxval(stkmap)}
!
real,dimension(:,:),intent(in):: stkprm
整数,维(:),intent(out):: stkmap
整数,维(size(stkprm, 2)):: irngt
integer,dimension(size(stkprm,2)):: iwork
integer :: nrec,i,j
nrec = size(stkprm,2)
!查找每条记录的等级,重复记录保存
调用ar_mrgrnk(stkprm,irngt)
!构建iwork数组,其中包含原始数组的索引,其中
!记录是相同的,并且索引是youguest
i = 1
do(i do j = i + 1,nrec
if(any(stkprm ,irngt(i))/ = stkprm(:,irngt(j))))exit
enddo
iwork(irngt(i:j-1))= minval(irngt(i:j-1 ))
i = j
enddo
!现在构建地图,其中stkmap(i)显示新数组
的索引!如果(i == iwork(i))然后
j = j + 1 $ b $保留原始订单
j = 0
do i = 1,nrec
b stkmap(i)= j
else
stkmap(i)= stkmap(iwork(i))
endif
enddo
结束子程序
递归子程序ar_mrgrnk(xdont,irngt)
!就像ORDERPACK的mrgrnk一样,除了数组是2-b
!每行按第一个字段排序,然后第二个等等
real,dimension(:, :),intent(in):: xdont
整数,维度(:),intent(out),目标:: irngt
integer,dimension(size(xdont,2)):: iwork
整数:: nfld,nrec
整数:: i,j
整数,dimension(:),pointer :: ipt
nfld = size(xdont,1)
nrec = size(xdont,2)
!排名第一场
打电话mrgrnk(xdont(1,:),irngt)
!如果只有一个字段,如果(nfld == 1)返回
,则完成
!检查级别以查看多个记录是否具有相同的
! (1,irngt(i))的第一场
i = 1
的值为(i do j = i + 1,nrec
/ = xdont(1,irngt(j)))退出
enddo
!如果一对一,如果(j-1> i)然后
,则不做任何
!如果多对一,
!收集那些许多,然后排列他们
调用ar_mrgrnk(xdont(2 :, irngt(i:j-1)),iwork)
!根据这些字段重新排列我的排名到右边
ipt => irngt(i:j-1)
ipt = ipt(iwork(1:ji))
endif
i = j
enddo
if(associated(ipt)) nullify(ipt)
结束子程序
结束模块
I have a data approximately a million record, each record have 6 floating point number. I want to find sets of records who share identical six values, and ideally I want to do it in Fortran since the rest of processing is done in Fortran. What would be the recommended approach for this? At the end i want to have mapping from original index to new index which is condensed version of these dataset without duplicate. Each record has other attributes and i am interested in aggregating those for groups based on the six attributes.
I tried to find those sets by exporting output as csv, import it into MS Access, then a query that finds those sets took 10 seconds or so to run. I wrote a code which does http://rosettacode.org/wiki/Remove_duplicate_elements#Fortran this ("linear search"?), but with million record it didnt complete after 10 min or so, i just abandoned this approach.
Approach I am thinking now is adapting ranking/sorting routine from slatec or orderpack which i assume do better than my crude code. But I am wondering if such things are already done and i can download, or if there is better approach for this.
EDIT:
I said "finding duplicate", but i actually need mapping from original data records to this reduced sets. I want to have mapping array like imap(1:n), where imap(1), imap(4), imap(5) has same values if those 6 float pt. values in original record 1, 4 and 5 are the same. Hope this is not too much a deviation from what I said originally...
This is what I ended up doing... I took code mrgrnk
from ORDERPACK , and adapted for my purpose. The subroutine findmap
below appears to be doing what I wanted it to do.
module fndmap
use m_mrgrnk, only:mrgrnk
implicit none
contains
subroutine findmap(stkprm, stkmap )
! given 2-d real array stkprm, find a mapping described below:
!
! (identical records are assigned with same index)
! stkmap(i) == stkmap(j) iff stkprm(:,i) == stkprm(:,j)
! (order conserved)
! if i < j and stkmap(i) /= stkmap(j), then stkmap(i) < stkmap(j)
! (new index are contiguous)
! set(stkmap) == {1,2,..,maxval(stkmap)}
!
real,dimension(:,:),intent(in) :: stkprm
integer,dimension(:), intent(out) :: stkmap
integer, dimension(size(stkprm,2)) :: irngt
integer, dimension(size(stkprm,2)) :: iwork
integer :: nrec, i, j
nrec = size(stkprm,2)
! find rank of each record, duplicate records kept
call ar_mrgrnk(stkprm, irngt)
! construct iwork array, which has index of original array where the
! record are identical, and the index is youguest
i = 1
do while(i<=nrec)
do j=i+1,nrec
if (any(stkprm(:,irngt(i))/=stkprm(:,irngt(j)))) exit
enddo
iwork(irngt(i:j-1)) = minval(irngt(i:j-1))
i = j
enddo
! now construct the map, where stkmap(i) shows index of new array
! with duplicated record eliminated, original order kept
j = 0
do i=1,nrec
if (i==iwork(i)) then
j = j+1
stkmap(i) = j
else
stkmap(i) = stkmap(iwork(i))
endif
enddo
end subroutine
recursive subroutine ar_mrgrnk(xdont, irngt)
! behaves like mrgrnk of ORDERPACK, except that array is 2-d
! each row are ranked by first field, then second and so on
real, dimension(:,:), intent(in) :: xdont
integer, dimension(:), intent(out), target :: irngt
integer, dimension(size(xdont,2)) :: iwork
integer :: nfld,nrec
integer :: i, j
integer, dimension(:), pointer :: ipt
nfld=size(xdont,1)
nrec=size(xdont,2)
! rank by the first field
call mrgrnk(xdont(1,:), irngt)
! if there's only one field, it's done
if (nfld==1) return
! examine the rank to see if multiple record has identical
! values for the first field
i = 1
do while(i<=nrec)
do j=i+1,nrec
if (xdont(1,irngt(i))/=xdont(1,irngt(j))) exit
enddo
! if one-to-one, do nothing
if (j-1>i) then
! if many-to-one,
! gather those many, and rank them
call ar_mrgrnk(xdont(2:,irngt(i:j-1)),iwork)
! rearrange my rank based on those fields to the right
ipt => irngt(i:j-1)
ipt = ipt(iwork(1:j-i))
endif
i = j
enddo
if(associated(ipt)) nullify(ipt)
end subroutine
end module
这篇关于在fortran中查找重复记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!