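NLP 100 Exercise (言語処理100本ノック) #054

54. Part-of-speech tagging

Read the analysis result XML produced by Stanford Core NLP and output each word, lemma, and part of speech in tab-separated format.

Solution

A lemma is the base (dictionary) form of a word. For example, the lemma of "is" is "be", and the lemma of "an" is "a". I think it is much the same thing as what I called base in the Japanese language processing exercise #040.

As in #040, I defined a class this time. The class Token holds the word (word), lemma (lemma), and part of speech (pos) as member variables.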

# knock_054.py
import xml.etree.ElementTree as ET

xmlfile = 'nlp.txt.xml'

class Token:
    def __init__(self, token):
        self.word = token.find('word').text
        self.lemma = token.find('lemma').text
        self.pos = token.find('POS').text

    def __str__(self):
        return '%s\t%s\t%s' % (self.word, self.lemma, self.pos)


if __name__ == '__main__':
    tree = ET.parse(xmlfile)
    root = tree.getroot()

    for token in root.findall('./document/sentences/sentence/tokens/token'):
        t = Token(token)
        print(t)

Output

# python knock_054.py >> knock_054.txt (first 20 lines)
Natural natural JJ
language    language    NN
processing  processing  NN
From    from    IN
Wikipedia   Wikipedia   NNP
,   ,   ,
the the DT
free    free    JJ
encyclopedia    encyclopedia    NN
Natural natural JJ
language    language    NN
processing  processing  NN
-LRB-   -lrb-   -LRB-
NLP nlp NN
-RRB-   -rrb-   -RRB-
is  be  VBZ
a   a   DT
field   field   NN
of  of  IN
computer    computer    NN
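
One thing to watch with the Token class above: token.find(...) returns None when a child element is absent, so .text would raise AttributeError on such a token. The following is a minimal sketch of a more forgiving variant, not the code used for the output above. The inline SAMPLE_XML, the helper name token_rows, and the placeholder '_' are all assumptions made for illustration; the real nlp.txt.xml is much larger.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a CoreNLP result file (assumption for this sketch).
# The third token deliberately lacks <lemma> and <POS>.
SAMPLE_XML = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>is</word><lemma>be</lemma><POS>VBZ</POS></token>
<token id="2"><word>an</word><lemma>a</lemma><POS>DT</POS></token>
<token id="3"><word>odd</word></token>
</tokens></sentence></sentences></document></root>"""

TOKEN_PATH = './document/sentences/sentence/tokens/token'


def token_rows(root, missing='_'):
    """Yield 'word<TAB>lemma<TAB>POS' for every token under root."""
    for token in root.findall(TOKEN_PATH):
        # findtext returns the default instead of raising when a child is absent
        fields = [token.findtext(tag, default=missing)
                  for tag in ('word', 'lemma', 'POS')]
        yield '\t'.join(fields)


root = ET.fromstring(SAMPLE_XML)
rows = list(token_rows(root))
for row in rows:
    print(row)
```

Whether a placeholder or an exception is the better behaviour depends on how much the input can be trusted; for a file written by CoreNLP itself, the stricter original is probably fine.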

NLP 100 Exercise #055

55. Named entity extraction

Extract all person names in the input text.

Solution

I added a member variable ner to the Token class written for #054 above.

NER stands for Named Entity Recognition. It tags named entities such as proper nouns, as well as numerical and temporal expressions. Everything else is tagged O.

For English, by default, this annotator recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities (12 classes).

Print the tokens whose NER tag is PERSON.

# knock_054.py
import xml.etree.ElementTree as ET

xmlfile = 'nlp.txt.xml'

class Token:
    def __init__(self, token):
        self.word = token.find('word').text
        self.lemma = token.find('lemma').text
        self.pos = token.find('POS').text
        self.ner = token.find('NER').text

    def __str__(self):
        return '%s\t%s\t%s' % (self.word, self.lemma, self.pos)


if __name__ == '__main__':
    tree = ET.parse(xmlfile)
    root = tree.getroot()

    for token in root.findall('./document/sentences/sentence/tokens/token'):
        t = Token(token)
        if t.ner == 'PERSON':
            print(t.word)

Output

Alan
Turing
Joseph
Weizenbaum
MARGIE
Schank
Wilensky
Meehan
Lehnert
Carbonell
Lehnert
Racter
Jabberwacky
Moore
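
The output above lists one token per line, so a full name such as "Alan Turing" is split in two. As a possible refinement, adjacent PERSON tokens within a sentence could be joined into one name. The following is only a sketch under assumptions: the inline SAMPLE_XML and its NER tags are made up for illustration, and person_names is a hypothetical helper, not part of the solution above.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical two-sentence sample in the CoreNLP XML layout (assumption).
SAMPLE_XML = """<root><document><sentences>
<sentence id="1"><tokens>
<token id="1"><word>Alan</word><NER>PERSON</NER></token>
<token id="2"><word>Turing</word><NER>PERSON</NER></token>
<token id="3"><word>published</word><NER>O</NER></token>
</tokens></sentence>
<sentence id="2"><tokens>
<token id="1"><word>Joseph</word><NER>PERSON</NER></token>
<token id="2"><word>Weizenbaum</word><NER>PERSON</NER></token>
<token id="3"><word>wrote</word><NER>O</NER></token>
<token id="4"><word>ELIZA</word><NER>O</NER></token>
</tokens></sentence>
</sentences></document></root>"""


def person_names(root):
    """Join runs of adjacent PERSON tokens, sentence by sentence."""
    names = []
    for sentence in root.findall('./document/sentences/sentence'):
        tokens = sentence.findall('./tokens/token')
        # groupby splits the token list into runs with the same key value
        for is_person, run in groupby(
                tokens, key=lambda t: t.findtext('NER') == 'PERSON'):
            if is_person:
                names.append(' '.join(t.findtext('word') for t in run))
    return names


root = ET.fromstring(SAMPLE_XML)
names = person_names(root)
print(names)
```

Grouping per sentence keeps a name at the end of one sentence from being merged with a name at the start of the next. Two different people mentioned back to back in the same sentence would still be merged, so this is a heuristic at best.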

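NLP 100 Exercise #056

56. Coreference resolution

Based on the coreference resolution results from Stanford Core NLP, replace each mention in the text with its representative mention. When replacing, make sure the original mention remains visible, as in "representative mention (mention)".

So what is coreference resolution? According to a glossary of language information processing terms (言語情報処理用語集):

Coreference (co-reference): identity of reference. Two or more noun phrases refer to the same referent. It is indicated by coindexing.

In the XML file, the coreference resolution results follow <sentences> ~~~ </sentences> and are enclosed in <coreference> ~~~ </coreference>.

Let's look at the CoreNLP sample "input.txt" and its analysis result XML file "input.txt.xml".

The part of the XML file enclosed in <coreference> ~~~ </coreference>: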

      <coreference>
        <mention representative="true">
          <sentence>1</sentence>
          <start>1</start>
          <end>3</end>
          <head>2</head>
          <text>Stanford University</text>
        </mention>
        <mention>
          <sentence>2</sentence>
          <start>1</start>
          <end>2</end>
          <head>1</head>
          <text>It</text>
        </mention>
      </coreference>

The text of "input.txt" (the representative mention and the mention are shown in red):

Stanford University is located in California. It is a great university, founded in 1891.

"It" refers to the same thing as "Stanford University". The mention marked representative="true" is the representative mention, so in this case It should be replaced with ~~It (Stanford University)~~ Stanford University (It).
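
To make the replacement rule concrete for this one sample, here is a rough sketch. It is not the solution to the exercise, only an illustration under assumptions: the token lists in SENTENCES are typed in by hand, replace_mentions is a hypothetical helper, and the reading of <start> as 1-based inclusive and <end> as exclusive is inferred from the sample (start=1, end=3 for the two-token "Stanford University").

```python
import xml.etree.ElementTree as ET

# The <coreference> fragment shown above, compacted.
COREF_XML = """<coreference>
  <mention representative="true">
    <sentence>1</sentence><start>1</start><end>3</end><head>2</head>
    <text>Stanford University</text>
  </mention>
  <mention>
    <sentence>2</sentence><start>1</start><end>2</end><head>1</head>
    <text>It</text>
  </mention>
</coreference>"""

# Hand-typed tokenisation of input.txt (assumption for this sketch).
SENTENCES = [
    ['Stanford', 'University', 'is', 'located', 'in', 'California', '.'],
    ['It', 'is', 'a', 'great', 'university', ',', 'founded', 'in', '1891', '.'],
]


def replace_mentions(chain, sentences):
    """Return a copy of sentences with each non-representative mention
    rewritten as 'representative (original)'."""
    mentions = chain.findall('mention')
    rep = next(m for m in mentions if m.get('representative') == 'true')
    rep_text = rep.findtext('text')
    others = [m for m in mentions if m is not rep]
    # Work right to left so earlier token positions stay valid after a splice
    others.sort(key=lambda m: (int(m.findtext('sentence')),
                               int(m.findtext('start'))), reverse=True)
    result = [list(tokens) for tokens in sentences]
    for m in others:
        sent = int(m.findtext('sentence')) - 1   # 1-based in the XML
        start = int(m.findtext('start')) - 1     # assumed 1-based, inclusive
        end = int(m.findtext('end')) - 1         # assumed exclusive
        original = ' '.join(result[sent][start:end])
        result[sent][start:end] = ['%s (%s)' % (rep_text, original)]
    return result


chain = ET.fromstring(COREF_XML)
replaced = replace_mentions(chain, SENTENCES)
print(' '.join(replaced[1]))
```

This handles a single chain only and does not deal with overlapping or nested mentions, which a full solution over nlp.txt.xml would have to consider.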

Solution next time

It is still being implemented.

isomocha (アイソモカ)

isomocha: Development Log of a Nomad of Knowledge

Development log 200127 Mon (NLP 100 Exercise #054, #055, #056 in progress)
