Question

Is there an easy way to export Wikipedia's translated titles to get a set like this:
russian_title -> english_title?

I tried to get them from ruwiki-latest-pages-meta-current.xml.bz2 and ruwiki-latest-pages-articles.xml.bz2; however, there are fewer than 25k translations.

I found out that some are not present. E.g., one can see a link to the English wiki here, but there is no [[en:Yandex]] link in the dump.

Maybe I should try to parse English Wikipedia, but I'm sure there is a nicer solution.

BTW, I'm using wikixmlj, and I also tried to find en:Yandex with grep.

UPD: link to the data for @svick's solution: http://dumps.wikimedia.org/[language code]wiki/latest/, e.g. http://dumps.wikimedia.org/ruwiki/latest/

Recommended answer

Most of the links between Wikipedia articles in various languages are now on Wikidata. So, if you wanted to get to the source, you could download the dump of Wikidata and parse that (it's in JSON).
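As a rough illustration of the Wikidata route, here is a minimal Python sketch. It assumes the dump is one large JSON array with one entity per line, and that each entity carries a "sitelinks" object keyed by wiki name (e.g. "ruwiki", "enwiki") with a "title" field. The function names and the file name passed to load_from_dump are placeholders, not something from the answer.

```python
import bz2
import json


def iter_title_pairs(lines, src="ruwiki", dst="enwiki"):
    """Yield (src_title, dst_title) for every entity linked on both wikis.

    Assumes one JSON entity per line, with a trailing comma, and the
    array brackets "[" and "]" on lines of their own.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        entity = json.loads(line)
        sitelinks = entity.get("sitelinks", {})
        if src in sitelinks and dst in sitelinks:
            yield sitelinks[src]["title"], sitelinks[dst]["title"]


def load_from_dump(path, src="ruwiki", dst="enwiki"):
    """Stream a bz2-compressed dump (path is a placeholder) into a dict."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return dict(iter_title_pairs(handle, src, dst))
```

Reading line by line keeps memory use flat, which matters because the full Wikidata dump is far larger than a single wiki's langlinks table.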

But I think a better way would be to use the dump of the langlinks table. This contains exactly the information you want, both for links from Wikidata and links that are still in the old form.

This dump is in SQL format. You can import that dump into a MySQL database, or you can parse it directly (I have written a .Net library that does that).
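For the "parse it directly" option, a small Python sketch (not the .Net library mentioned above) could look like this. It assumes each row in the dump's INSERT statements has the shape (ll_from, 'll_lang', 'll_title') and that quotes inside strings are backslash-escaped; the regex, the function names, and the file path are illustrative assumptions.

```python
import gzip
import re

# One row: (page_id,'lang','title'), allowing backslash-escaped characters.
ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'\)")


def unescape(text):
    """Drop the backslash from escapes such as \\' and \\\\ (simplified)."""
    return re.sub(r"\\(.)", r"\1", text)


def iter_langlinks(lines, lang="en"):
    """Yield (page_id, foreign_title) for links pointing to `lang`."""
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue
        for match in ROW.finditer(line):
            page_id, ll_lang, ll_title = match.groups()
            if ll_lang == lang and ll_title:
                yield int(page_id), unescape(ll_title)


def load_langlinks(path, lang="en"):
    """Read a gzip-compressed langlinks dump (path is a placeholder)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return list(iter_langlinks(handle, lang))
```

A regex is only a shortcut here; importing the dump into MySQL is the more robust choice if the row format differs from what is assumed.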

The table contains mappings from the page id in your wiki (in your case the Russian Wikipedia) to page titles in other wikis. This means you will need the page ids of the pages you're interested in. For a small number of pages, you can look them up manually using the "Page information" link, or you could use the API. But if you need this for a large number of pages, you should download the dump of the page table, which contains this mapping.
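Once both tables are parsed, combining them is a simple join on the page id. The sketch below assumes the page rows have already been reduced to (page_id, namespace, title) tuples and the langlinks rows to (page_id, foreign_title) tuples. It also assumes that titles in the page table use underscores where the displayed title has spaces, and that namespace 0 holds the articles. build_title_map is a hypothetical helper name.

```python
def build_title_map(pages, langlinks, namespace=0):
    """Join page rows with langlinks rows into {local_title: foreign_title}.

    pages:     iterable of (page_id, namespace, title)
    langlinks: iterable of (page_id, foreign_title)
    """
    # Keep only pages in the requested namespace; show titles with spaces.
    id_to_title = {
        page_id: title.replace("_", " ")
        for page_id, ns, title in pages
        if ns == namespace
    }
    # Links whose page id is unknown (or in another namespace) are skipped.
    return {
        id_to_title[page_id]: foreign_title
        for page_id, foreign_title in langlinks
        if page_id in id_to_title
    }
```

For example, page rows (10, 0, "Яндекс") and a langlinks row (10, "Yandex") would give the pair russian_title -> english_title that the question asks for.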
