T-SQL MD5 hash differs from the C# .NET MD5 hash

Problem description

I've generated an md5 hash as below:

DECLARE @varchar varchar(400)

SET @varchar = 'è'

SELECT CONVERT(VARCHAR(2000), HASHBYTES( 'MD5', @varchar ), 2)

Which outputs:

785D512BE4316D578E6650613B45E934

However, generating an MD5 hash in C# from the bytes returned by:

System.Text.Encoding.UTF8.GetBytes("è")

generates:

0a35e149dbbb2d10d744bf675c7744b1

The encoding in the C# .NET method is set to UTF8, and I had assumed that varchar was also UTF8. Any ideas on what I'm doing wrong?

Solution

If you are dealing with NVARCHAR / NCHAR data (which is stored as UTF-16 Little Endian), then you would use the Unicode encoding, not BigEndianUnicode. In .NET, UTF-16 is called Unicode while other Unicode encodings are referred to by their actual names: UTF7, UTF8, and UTF32. Hence, Unicode by itself is Little Endian as opposed to BigEndianUnicode. UPDATE: Please see the section at the end regarding UCS-2 and Supplementary Characters.

On the database side:

SELECT HASHBYTES('MD5', N'è') AS [HashBytesNVARCHAR]
-- FAC02CD988801F0495D35611223782CF

On the .NET side:

System.Text.Encoding.ASCII.GetBytes("è")
// D1457B72C3FB323A2671125AEF3EAB5D

System.Text.Encoding.UTF7.GetBytes("è")
// F63A0999FE759C5054613DDE20346193

System.Text.Encoding.UTF8.GetBytes("è")
// 0A35E149DBBB2D10D744BF675C7744B1

System.Text.Encoding.UTF32.GetBytes("è")
// 86D29922AC56CF022B639187828137F8

System.Text.Encoding.BigEndianUnicode.GetBytes("è")
// 407256AC97E4C5AEBCA825DEB3D2E89C

System.Text.Encoding.Unicode.GetBytes("è")  // this one matches HASHBYTES('MD5', N'è')
// FAC02CD988801F0495D35611223782CF
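
The lines above show only the GetBytes call; the hash itself comes from running those bytes through MD5. As a minimal sketch of how such a comparison could be reproduced (the HashDemo class and Md5Hex method are hypothetical names, not part of the original answer):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not from the original answer.
public static class HashDemo
{
    // Encodes the text with the given encoding, hashes the resulting bytes
    // with MD5, and returns the digest as uppercase hex without separators.
    public static string Md5Hex(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(bytes);
            // BitConverter.ToString yields "FA-C0-2C-..."; strip the hyphens.
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```

Calling HashDemo.Md5Hex(Encoding.Unicode, "\u00E8") should reproduce the FAC02CD9... value listed above, and passing Encoding.UTF8 should give the 0A35E149... value. Writing the character as the escape "\u00E8" instead of a literal "è" keeps the result independent of how the source file itself is saved.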

However, this question pertains to VARCHAR / CHAR data, which is 8-bit and encoded according to a Code Page rather than plain ASCII, and so things are a bit more complicated.

On the database side:

SELECT HASHBYTES('MD5', 'è') AS [HashBytesVARCHAR]
-- 785D512BE4316D578E6650613B45E934

We already saw the .NET side above. Those hashed values raise two questions:

  • Why don't any of them match the HASHBYTES value?
  • Why does the "sqlteam.com" article linked in @Eric J.'s answer show that three of them (ASCII, UTF7, and UTF8) all match the HASHBYTES value?

There is one answer that covers both questions: Code Pages. The test done in the "sqlteam" article used "safe" ASCII characters that are in the 0 - 127 range (in terms of the int / decimal value) that do not vary between Code Pages. But the 128 - 255 range -- where we find the "è" character -- is the Extended set that does vary by Code Page (which makes sense as this is the reason for having Code Pages).

Now try:

SELECT HASHBYTES('MD5', 'è' COLLATE SQL_Latin1_General_CP1255_CI_AS) AS [HashBytes]
-- D1457B72C3FB323A2671125AEF3EAB5D

That matches the ASCII hashed value (and again, because the "sqlteam" article / test used values in the 0 - 127 range, they did not see any changes when using COLLATE). Great, now we finally found a way to match VARCHAR / CHAR data. All good?

Well, not really. Let's take a look-see at what we were actually hashing:

SELECT 'è' AS [TheChar],
       ASCII('è') AS [TheASCIIvalue],
       'è' COLLATE SQL_Latin1_General_CP1255_CI_AS AS [CharCP1255],
       ASCII('è' COLLATE SQL_Latin1_General_CP1255_CI_AS) AS [TheASCIIvalueCP1255];

Returns:

TheChar TheASCIIvalue   CharCP1255  TheASCIIvalueCP1255
è       232             ?           63

A "?"? Just to verify, run:

SELECT CHAR(63) AS [WhatIs63?];
-- ?

Ah, so Code Page 1255 doesn't have the "è" character, so it gets translated as everyone's favorite: "?". But then why did that match the MD5 hashed value in .NET when using the ASCII encoding? Could it be that we weren't actually matching the hashed value of "è", but instead were matching the hashed value of "?":

SELECT HASHBYTES('MD5', '?') AS [HashBytesVARCHAR]
-- 0xD1457B72C3FB323A2671125AEF3EAB5D

Yup. The true ASCII character set is just the first 128 characters (values 0 - 127). And as we just saw, the è is 232. So, using the ASCII encoding in .NET is not that helpful. Nor was using COLLATE on the T-SQL side.
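
The same substitution can be observed directly on the .NET side. The sketch below (AsciiFallbackDemo is a hypothetical name used only for illustration) returns the bytes that Encoding.ASCII produces, so the replacement can be inspected before any hashing happens:

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not from the original answer.
public static class AsciiFallbackDemo
{
    // Returns the bytes Encoding.ASCII produces for the text. Characters
    // outside the 0 - 127 range are replaced with '?' (byte 63) by default.
    public static byte[] AsciiBytes(string text)
    {
        return Encoding.ASCII.GetBytes(text);
    }
}
```

AsciiFallbackDemo.AsciiBytes("\u00E8") returns a single byte with the value 63, which is exactly what AsciiBytes("?") returns, so both strings hash to the same MD5 value.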

Is it possible to get a better encoding on the .NET side? Yes, by using Encoding.GetEncoding(Int32), which allows for specifying the Code Page. The Code Page to use can be discovered using the following query (use sys.columns when working with a column instead of a literal or variable):

SELECT sd.[collation_name],
       COLLATIONPROPERTY(sd.[collation_name], 'CodePage') AS [CodePage]
FROM   sys.databases sd
WHERE  sd.[name] = DB_NAME(); -- replace function with N'{db_name}' if not running in the DB

The query above returns (for me):

Latin1_General_100_CI_AS_SC    1252

So, let's try Code Page 1252:

System.Text.Encoding.GetEncoding(1252).GetBytes("è") // Matches HASHBYTES('MD5', 'è')
// 785D512BE4316D578E6650613B45E934

Woo hoo! We have a match for VARCHAR data that uses our default SQL Server collation :). Of course, if the data is coming from a database or field set to a different collation, then GetEncoding(1252) might not work and you will have to find the actual matching Code Page using the query shown above (a Code Page is used across many Collations, so a different Collation does not necessarily imply a different Code Page).
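
Putting the pieces together, here is a sketch of a helper that hashes a string using a given Code Page. The CodePageHash class and Md5HexForCodePage method are hypothetical names. The provider registration is an assumption about the runtime: on .NET Core and .NET 5+, Windows Code Pages such as 1252 are only available after registering CodePagesEncodingProvider, whereas on .NET Framework GetEncoding(1252) works without it and the static constructor can be dropped.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not from the original answer.
public static class CodePageHash
{
    static CodePageHash()
    {
        // Assumption: running on .NET Core / .NET 5+, where Windows Code
        // Pages must be made available explicitly.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Encodes the text using the given Code Page and returns the MD5 digest
    // as uppercase hex. Throws if a character has no mapping in that Code
    // Page instead of silently substituting '?'.
    public static string Md5HexForCodePage(int codePage, string text)
    {
        Encoding encoding = Encoding.GetEncoding(
            codePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        byte[] bytes = encoding.GetBytes(text);
        using (MD5 md5 = MD5.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(bytes)).Replace("-", "");
        }
    }
}
```

CodePageHash.Md5HexForCodePage(1252, "\u00E8") should give the 785D512B... value shown above. The exception fallback is a design choice rather than a requirement: it makes an unmappable character fail loudly, which avoids the accidental "?" match seen earlier with Code Page 1255.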

To see what the possible Code Page values are, and what culture / locale they pertain to, please see the list of Code Pages in the Microsoft documentation for the System.Text.Encoding class (the list is in the "Remarks" section).


Additional info related to what is actually stored in NVARCHAR / NCHAR fields:

Any UTF-16 character (2 or 4 bytes) can be stored, though the default behavior of the built-in functions assumes that all characters are UCS-2 (2 bytes each), which is a subset of UTF-16. Starting in SQL Server 2012, it is possible to access a set of Windows collations that support the 4 byte characters known as Supplementary Characters. Using one of these Windows collations ending in _SC, either specified for a column or directly in a query, will allow the built-in functions to properly handle the 4 byte characters.
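
The 2-byte versus 4-byte distinction is also visible on the .NET side, where strings are sequences of UTF-16 code units. A small sketch (SupplementaryDemo and Measure are hypothetical names; U+1F600 is just an arbitrary Supplementary Character picked for illustration):

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not from the original answer.
public static class SupplementaryDemo
{
    // Returns the number of UTF-16 code units in the text and the number of
    // bytes it occupies when encoded as UTF-16 Little Endian.
    public static (int CodeUnits, int ByteCount) Measure(string text)
    {
        return (text.Length, Encoding.Unicode.GetByteCount(text));
    }
}
```

SupplementaryDemo.Measure("\u00E8") returns (1, 2), while SupplementaryDemo.Measure("\U0001F600") returns (2, 4): the Supplementary Character is stored as a surrogate pair, that is, two 2-byte code units. Note that string.Length counts code units, not characters, which is the same kind of counting the paragraph above describes for the built-in functions when a collation without _SC is in use.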

-- The database's collation is set to: SQL_Latin1_General_CP1_CI_AS
SELECT  N'
