多列整数排序

扫码查看
本文介绍了多列整数排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我不知道是否为此问题选择了合适的标题(如果没有,请相应地更改它),但请考虑以下我正在使用的简化表结构:

----------------------------------------------
|  date  |  i  |  j  |  k  |  x  |  y  |  z  |
----------------------------------------------
| 100209 |  1  |  2  |  3  |  4  |  5  |  6  |
----------------------------------------------
| 100210 |  2  |  3  |  4  |  5  |  6  |  7  |
----------------------------------------------
| 100211 |  0  |  1  |  2  |  3  |  4  |  5  |
----------------------------------------------
| 100212 |  1  |  2  |  3  |  4  |  5  |  6  |
----------------------------------------------
| 100213 |  6  |  5  |  4  |  3  |  2  |  1  |
----------------------------------------------

ijkxyz都是不相关的整数/浮点数,它们都表示不同的因子,并且可以具有非常不同的数量级( i的范围是1-10,而j的范围是100-1000 .

我正在尝试选择条件相似的日期;给定一组ijkxyz值,我需要

How I should build such a query? Is this possible with SQL alone?


@Pentium10:

I'll try to answer your comment the best way I can. Here is a sample of my data:

---------------------------------------------------------------------------------
  date  |  temperature  |  humidity  |  pressure  |  windSpeed  |  moonDistance
---------------------------------------------------------------------------------
 090206 |  7            |  87        |  998.8     |  3          |  363953
---------------------------------------------------------------------------------
 ...... |  ...          |  ...       |  ....      |  ...        |  ......
---------------------------------------------------------------------------------
 100206 |  10           |  86        |  1024      |  2          |  386342
---------------------------------------------------------------------------------
 100207 |  9            |  90        |  1015      |  1          |  391750
---------------------------------------------------------------------------------
 100208 |  13           |  90        |  1005      |  2          |  396392
---------------------------------------------------------------------------------
 100209 |  12           |  89        |  1008      |  2          |  400157
---------------------------------------------------------------------------------
 100210 |  11           |  92        |  1007      |  3          |  403012
---------------------------------------------------------------------------------
 100211 |  6            |  86        |  1012      |  2          |  404984
---------------------------------------------------------------------------------
 100212 |  6            |  61        |  1010      |  3          |  406135
---------------------------------------------------------------------------------
 100213 |  7            |  57        |  1010      |  2          |  406542
---------------------------------------------------------------------------------

My table structure has more columns and thousands of rows but hopefully this will be enough to get my point clear. I'm not going to attempt to order these values like I did in my previous example because I would probably get it wrong, but I basically need to do two types of queries with this data:

  1. show me all dates, ordered by the resemblance of the conditions provided by me
  2. show me all dates, ordered by the resemblance of the conditions observed in date X

I understand that the second query can easily be archived by using the first one, but my problem lies in sorting by resemblance using several columns, that's what I meant by "ordered by closeness of all values as a whole". As in, if I was dealing with only one column it would be a lot easier to order by likeness but my head really starts spinning when dealing with several columns.

The goal is to to be able to produce results like this:

In my case I'm working with weather and atmospheric data but if it helps I guess you can think in this problem in terms of employees (having attendance, hoursPerWeek and monthlySalary columns) and order by employees that most closely resemble attendance = 100%, hoursPerWeek = 40 and monthlySalary = $5000, for instance.

PS: Now that I've given this employees example I'm not really sure anymore if it can be compared to the weather example I'm working with, since with the employees table you can compute (rating = monthlySalary / hoursPerWeek * attendance for instance) and kinda weight the columns, but I don't think the same can be done with the weather table - input is appreciated anyhow.

PS2: I'm not sure if I expressed myself well enough, if you still have doubts please let me know.


解决方案

The problem you seem to have is that each column has a different scale and so you can't easily combine them. This problem can be solved using a technique called whitening. This involves calculating the average and standard deviation of each column (you could do this in 1 SQL statement) and then rescaling each column to this when selecting:

colSortPos = (colValue-colMean) / colStdev

Doing this will give you each column ranging around 0 which +/- 1 standard deviation within the range +/- 1. The trick then is to combine these so that similar dates are together. The problem here is that this is not a 2 dimensional problem and so you need to think multidimensionally. So my suggestion its to take the the Euclidean distance as your sort order.

SELECT
    date,
    i,
    j,
    k,
    SQRT( POW((i-@iMean)/@iStdDEv, 2) + POW((j-@jMean)/@jStdDEv, 2) + POW((k-@kMean)/@kStdDEv, 2) )
AS
    sort_order
FROM
    table
ORDER BY
    sort_order

The only problem with this is that it projects your problem onto a 1 dimensional space that may make you miss some correlations. To work around this I suggest using a clustering technique like K-means which is pretty simple to implement and is really fast. This will allow you to group your dates into k clusters that display the most similarity [ http://en.wikipedia.org/wiki/K-means_clustering ]. If you have the raw data and want to play around with these (and other) techniques then I suggest trying the weka toolkit [ http://www.cs.waikato.ac.nz/ml/weka/ ] which will let you play around with these techniques.

这篇关于多列整数排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

07-18 17:45
查看更多