I would like to extract all citations from a text. Additionally, the name of the cited person should be extracted. DayLife does this very well.
The phrase "They think it's 'game over'" and the cited person "one senior administration official" should be extracted.
Do you think that's possible? You can only distinguish between citations and words in quotes if you check whether there's a cited person mentioned.
The passage "State of the Union" is not a quotation. But how do you detect this? a) You check whether a cited person is mentioned. b) You count the blank spaces in the supposed quotation. If there are fewer than 3 blank spaces, it won't be a quotation, right? I would prefer b), since a cited person is not always named.
I would first replace all types of quotes by a single type so that you'll have to check for only one quote mark later.
$text = ''; // the text to analyze
// Typographic quote marks to normalize to the plain ASCII double quote
$quote_marks = array('“', '”', '„', '»', '«');
$text = str_replace($quote_marks, '"', $text);
Then I would extract all phrases between quotation marks which contain more than 3 blank spaces:
function extract_quotations($text) {
    // Collect every phrase enclosed in double quotes.
    $result = preg_match_all('/"([^"]+)"/', $text, $found_quotations);
    if ($result == TRUE) {
        // TODO: check for count of blank spaces
        return $found_quotations;
    }
    return array();
}
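The missing blank-space check could be sketched roughly as follows. This is only an illustration of option b), not a tested solution: the function name `extract_long_quotations` and the `$minSpaces` parameter are made up for this example. Note that "State of the Union" contains exactly three blank spaces, so the sketch keeps only phrases with more than three.

```php
<?php
// Hypothetical helper: returns quoted phrases containing more than $minSpaces blanks.
function extract_long_quotations(string $text, int $minSpaces = 3): array {
    // Normalize typographic quote marks to the plain double quote.
    $quoteMarks = array('“', '”', '„', '»', '«');
    $text = str_replace($quoteMarks, '"', $text);

    // preg_match_all returns 0 (no match) or false (error); both mean "nothing found".
    if (!preg_match_all('/"([^"]+)"/', $text, $matches)) {
        return array();
    }

    $quotations = array();
    foreach ($matches[1] as $candidate) {
        // Short phrases such as titles are assumed not to be quotations.
        if (substr_count($candidate, ' ') > $minSpaces) {
            $quotations[] = $candidate;
        }
    }
    return $quotations;
}

$sample = 'He watched the “State of the Union”. '
        . '"I think it is serious and it is deteriorating," said Admiral Mullen.';
var_dump(extract_long_quotations($sample));
```

The threshold is a heuristic: a short real quotation like "Not necessarily" would be dropped, so it may be worth combining this with a check for a cited person.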
I hope you can help me. Thank you very much in advance!
As ceejayoz already pointed out, this won't fit into a single function. What you're describing in your question (detecting the grammatical function of a quote-escaped part of a sentence, i.e. "I think it is serious and it is deteriorating," vs "State of the Union") would be best solved with a library that can break down natural language into tokens. I am not aware of any such library in PHP, but you can have a look at the project size of something you would use in Python: http://www.nltk.org/
I think the best you can do is define a set of syntax rules that you verify manually. What about something like this:
abstract class QuotationExtractor {
    // Every extractor registers itself here on construction.
    protected static $instances = array();

    // Runs the text through all registered extractors and merges their results.
    public static function getAllPossibleQuotations($string) {
        $possibleQuotations = array();
        foreach (self::$instances as $instance) {
            $possibleQuotations = array_merge(
                $possibleQuotations,
                $instance->extractQuotations($string)
            );
        }
        return $possibleQuotations;
    }

    public function __construct() {
        self::$instances[] = $this;
    }

    public abstract function extractQuotations($string);
}

class RegexExtractor extends QuotationExtractor {
    // Each rule is array(regex, index of quote group, index of author group).
    protected $rules = array();

    public function extractQuotations($string) {
        $quotes = array();
        foreach ($this->rules as $rule) {
            preg_match_all($rule[0], $string, $matches, PREG_SET_ORDER);
            foreach ($matches as $match) {
                $quotes[] = array(
                    'quote' => trim($match[$rule[1]]),
                    'cited' => trim($match[$rule[2]])
                );
            }
        }
        return $quotes;
    }

    public function addRule($regex, $quoteIndex, $authorIndex) {
        $this->rules[] = array($regex, $quoteIndex, $authorIndex);
    }
}

$regexExtractor = new RegexExtractor();
$regexExtractor->addRule('/"(.*?)[,.]?\h*"\h*said\h*(.*?)\./', 1, 2);
$regexExtractor->addRule('/"(.*?)\h*"(.*)said/', 1, 2);
$regexExtractor->addRule('/\.\h*(.*)(once)?\h*said[\-]*"(.*?)"/', 3, 1);
class AnotherExtractor extends Quot...
If you have a structure like the above you can run the same text through any/all of them and list the possible quotations to select the correct ones. I've run the code with this thread as input for testing and the result was:
array(4) {
  [0]=>
  array(2) {
    ["quote"]=>
    string(15) "Not necessarily"
    ["cited"]=>
    string(8) "ceejayoz"
  }
  [1]=>
  array(2) {
    ["quote"]=>
    string(28) "They think it's `game over,'"
    ["cited"]=>
    string(34) "one senior administration official"
  }
  [2]=>
  array(2) {
    ["quote"]=>
    string(46) "I think it is serious and it is deteriorating,"
    ["cited"]=>
    string(14) "Admiral Mullen"
  }
  [3]=>
  array(2) {
    ["quote"]=>
    string(16) "Not necessarily,"
    ["cited"]=>
    string(0) ""
  }
}
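Since several rules can match the same passage (the last entry repeats the first one without a cited person), the candidate list still needs a selection step. One possible heuristic, sketched here under the assumption that each candidate is an array with 'quote' and 'cited' keys as above: treat quotes that differ only in trailing punctuation as the same, and prefer the entry that names a cited person. The function name `select_quotations` is hypothetical.

```php
<?php
// Hypothetical helper: keeps one entry per quote, preferring entries with a cited person.
function select_quotations(array $candidates): array {
    $selected = array();
    foreach ($candidates as $candidate) {
        // "Not necessarily" and "Not necessarily," count as the same quote.
        $key = rtrim($candidate['quote'], ' ,.');
        $isNew = !isset($selected[$key]);
        // Replace an existing entry only if it lacks a cited person and this one has it.
        if ($isNew || ($selected[$key]['cited'] === '' && $candidate['cited'] !== '')) {
            $selected[$key] = array('quote' => $key, 'cited' => $candidate['cited']);
        }
    }
    return array_values($selected);
}

$candidates = array(
    array('quote' => 'Not necessarily', 'cited' => 'ceejayoz'),
    array('quote' => "They think it's `game over,'", 'cited' => 'one senior administration official'),
    array('quote' => 'I think it is serious and it is deteriorating,', 'cited' => 'Admiral Mullen'),
    array('quote' => 'Not necessarily,', 'cited' => ''),
);
var_dump(select_quotations($candidates));
```

This only removes obvious duplicates. Deciding whether a remaining candidate is a real quotation would still have to be verified manually or with further rules.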