I have a method that downloads a web page and extracts the title tag, but depending on the website, the result can be HTML-encoded or in the wrong character set. Is there a bulletproof way to get a website's title when sites are encoded differently?
Some URLs I have tested, with different results:
- https://fr.wikipedia.org/wiki/Québec returns "Québec — Wikipédia". The result is good.
- http://www.remax-quebec.com/fr/index.rmx returns "Condo, chalet ou maison &agrave; vendre avec un courtier immobilier | RE/MAX Québec". The HTML entity is not decoded.
- http://www.restomontreal.ca/ returns "Restaurants Montr�al | RestoMontreal". The character set is wrong.

    private string GetUrlTitle(Uri uri)
    {
        string title = "";
        using (HttpClient client = new HttpClient())
        {
            HttpResponseMessage response = null;
            response = client.GetAsync(uri).Result;
            if (!response.IsSuccessStatusCode)
            {
                string errorMessage = "";
                try
                {
                    XmlSerializer xml = new XmlSerializer(typeof(HttpError));
                    HttpError error = xml.Deserialize(response.Content.ReadAsStreamAsync().Result) as HttpError;
                    errorMessage = error.Message;
                }
                catch (Exception)
                {
                    errorMessage = response.ReasonPhrase;
                }
                throw new Exception(errorMessage);
            }
            var html = response.Content.ReadAsStringAsync().Result;
            title = Regex.Match(html, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
        }
        if (title == string.Empty)
        {
            title = uri.ToString();
        }
        return title;
    }

The charset is not always present in the header, so we must also check the meta tags and, if it is not there either, fall back to UTF-8 (or something else). Also, the title might be HTML-encoded, so we need to decode it.
The code below comes from the GitHub project Abot. I have modified it slightly.

    // Requires: using System; System.IO; System.Net; System.Net.Http; System.Text; System.Text.RegularExpressions
    private string GetUrlTitle(Uri uri)
    {
        string title = "";
        using (HttpClient client = new HttpClient())
        {
            HttpResponseMessage response = client.GetAsync(uri).Result;
            if (!response.IsSuccessStatusCode)
            {
                throw new Exception(response.ReasonPhrase);
            }
            var contentStream = response.Content.ReadAsStreamAsync().Result;
            // ContentType is null when the server sends no Content-Type header, hence the "?."
            var charset = response.Content.Headers.ContentType?.CharSet ?? GetCharsetFromBody(contentStream);
            Encoding encoding = GetEncodingOrDefaultToUTF8(charset);
            string content = GetContent(contentStream, encoding);
            Match titleMatch = Regex.Match(content, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase);
            if (titleMatch.Success)
            {
                title = titleMatch.Groups["Title"].Value;
                // decode the title in case it has been HTML-encoded
                title = WebUtility.HtmlDecode(title).Trim();
            }
        }
        if (string.IsNullOrWhiteSpace(title))
        {
            title = uri.ToString();
        }
        return title;
    }

    private string GetContent(Stream contentStream, Encoding encoding)
    {
        contentStream.Seek(0, SeekOrigin.Begin);
        using (StreamReader sr = new StreamReader(contentStream, encoding))
        {
            return sr.ReadToEnd();
        }
    }

    /// <summary>
    /// Try getting the charset from the body content.
    /// </summary>
    /// <param name="contentStream">The response body stream; it must be seekable.</param>
    /// <returns>The charset declared in a meta tag, or null if none is found.</returns>
    private string GetCharsetFromBody(Stream contentStream)
    {
        contentStream.Seek(0, SeekOrigin.Begin);
        // Not disposed on purpose: disposing the reader would close the stream that GetContent reads again.
        StreamReader srr = new StreamReader(contentStream, Encoding.ASCII);
        string body = srr.ReadToEnd();
        string charset = null;
        if (body != null)
        {
            // Regex from: http://stackoverflow.com/questions/3458217/how-to-use-regular-expression-to-match-the-charset-string-in-html
            Match match = Regex.Match(body, @"<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s""']*)?([^>]*?)[\s""';]*charset\s*=[\s""']*([^\s""'/>]*)", RegexOptions.IgnoreCase);
            if (match.Success)
            {
                charset = string.IsNullOrWhiteSpace(match.Groups[2].Value) ? null : match.Groups[2].Value;
            }
        }
        return charset;
    }
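
    /// <summary>
    /// Try parsing the charset or fall back to UTF-8.
    /// </summary>
    /// <param name="charset">The charset name from the header or meta tag; may be null.</param>
    /// <returns>The matching encoding, or UTF-8 if the name is missing or not recognized.</returns>
    private Encoding GetEncodingOrDefaultToUTF8(string charset)
    {
        Encoding e = Encoding.UTF8;
        if (charset != null)
        {
            try
            {
                // Some servers quote the value, e.g. charset="utf-8"
                e = Encoding.GetEncoding(charset.Trim('"', '\''));
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 default
            }
        }
        return e;
    }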
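
To see why both steps matter, here is a small standalone sketch that reproduces the two symptoms from the question. It is only an illustration: `EncodingDemo` and `Decode` are made-up names, and I am assuming the page is served as ISO-8859-1, which the question does not confirm.

```csharp
using System;
using System.Net;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: decode raw bytes using the named charset.
    public static string Decode(byte[] bytes, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(bytes);
    }

    public static void Main()
    {
        // "Restaurants Montréal" as an ISO-8859-1 server would send it (assumed charset):
        // the accented e is the single byte 0xE9.
        byte[] latin1Bytes = Encoding.GetEncoding("iso-8859-1").GetBytes("Restaurants Montr\u00E9al");

        // Read as UTF-8, the lone 0xE9 byte is invalid and is replaced by U+FFFD.
        string wrong = Decode(latin1Bytes, "utf-8");
        Console.WriteLine(wrong.Contains("\uFFFD"));

        // Read with the charset the bytes were written in, the text round-trips.
        string right = Decode(latin1Bytes, "iso-8859-1");
        Console.WriteLine(right == "Restaurants Montr\u00E9al");

        // HTML entities are a separate problem, solved by HtmlDecode.
        string decoded = WebUtility.HtmlDecode("maison &agrave; vendre");
        Console.WriteLine(decoded == "maison \u00E0 vendre");
    }
}
```

Picking the right encoding fixes the replacement character, and `WebUtility.HtmlDecode` fixes the `&agrave;` case. Neither step covers the other, so the method above does both.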