Problem description
OK, I can now read the Unicode file, but the entire file ends up in a single string and I am unable to break it into lines and then into words. I am very confused. I had a previous post, but since the problem is different now I am posting a new question.
Objective:
- I want to filter some words from the file (e.g. words enclosed in double quotes).
- I have read the Unicode (UTF-16) file, and it arrives as a single string.
- I need to break it up line by line and then use wcstok to break each line into words.
Platform: Windows, Visual Studio 2010, Unicode: UTF-16. If you have different suggestions I am open to changing the code; it would also be great if you could paste sample code so I can understand.
The code is pasted below:
#include <codecvt>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>
using namespace std;

wifstream fin("profiles.txt", ios_base::binary);  // open the input file
wofstream fout("out.txt", ios_base::binary);      // receives the parsing output
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
fout.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
wstring line;
getline(fin, line);  // problem: the entire file ends up in this one wstring

// Need suggestions on how to handle the code below
while (!fin.eof())
{
    // read an entire line into memory
    // wchar_t buf[MAX_CHARS_PER_LINE];
    // fin.getline(buf, MAX_CHARS_PER_LINE);

    // parse the line into blank-delimited tokens
    int n = 0;  // a for-loop index
    // array to store the memory addresses of the tokens in buf
    const wchar_t* token[MAX_TOKENS_PER_LINE] = {};  // initialize to 0

    // parse the line
    token[0] = wcstok(buf, DELIMITER);  // first token
    if (token[0])  // zero if the line is blank
    {
        for (n = 0; n < MAX_TOKENS_PER_LINE; n++)  // n starts at 0 because the first token should be ignored
        {
            token[n] = wcstok(0, DELIMITER);  // subsequent tokens
            if (!token[n]) break;             // no more tokens
            std::wstring str2 = token[n];
        }
    }
}
Recommended answer
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff,
    std::codecvt_mode(std::little_endian | std::consume_header)>));
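After fixing this, the rest of the code worked as expected. The change is the added std::little_endian flag: std::codecvt_utf16 decodes as big-endian by default, while UTF-16 files written on Windows are usually little-endian.
Thanks to @Richard, @nv3 and @pablo for the responses. Much appreciated.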
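For anyone landing here with the same goal, here is a minimal sketch of the whole pipeline: read the UTF-16 file line by line with the corrected facet, then pick out the words enclosed in double quotes. The helper names read_utf16_lines and quoted_words are hypothetical (they are not from the original code), and the sketch splits on whitespace with std::wistringstream instead of wcstok, which is a swapped-in technique rather than what the question used. Note that <codecvt> is deprecated since C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read a little-endian UTF-16 file into a list of lines.
// The stream is opened in binary mode so the narrow-char layer does not touch
// the bytes; the codecvt facet does the UTF-16 decoding.
std::vector<std::wstring> read_utf16_lines(const std::string& path) {
    std::wifstream fin(path.c_str(), std::ios_base::binary);
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(fin, line)) {
        // In binary mode a Windows "\r\n" ending leaves '\r' on the line.
        if (!line.empty() && line[line.size() - 1] == L'\r')
            line.erase(line.size() - 1);
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical helper: split a line on whitespace and return the tokens that
// start and end with a double quote, with the quotes removed.
std::vector<std::wstring> quoted_words(const std::wstring& line) {
    std::vector<std::wstring> result;
    std::wistringstream words(line);
    std::wstring word;
    while (words >> word) {
        if (word.size() >= 2 && word[0] == L'"' &&
            word[word.size() - 1] == L'"')
            result.push_back(word.substr(1, word.size() - 2));
    }
    return result;
}
```

A caller would loop over read_utf16_lines("profiles.txt") and pass each line to quoted_words. Because the split is on whitespace, a quoted phrase that contains spaces is not treated as one word; handling that case would need a small scanner that tracks the opening and closing quote.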