

Dev log 200219 Wed (100 Knocks #070, shaping machine-learning data)

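I decided to put off the second half of Chapter 7 for now and move on to Chapter 8. This time I shaped the data that will be used for machine learning.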

NLP 100 Knocks #070

言語処理100本ノック 2015 (NLP 100 Knocks 2015)

Chapter 8: Machine Learning

In this chapter, we use the sentence polarity dataset v1.0 from the Movie Review Data published by Bo Pang and Lillian Lee, and work on the task of classifying sentences as positive or negative (polarity analysis).


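70. Obtaining and shaping the data

Using the gold-standard data for sentence polarity analysis, create the labeled data (sentiment.txt) as follows.
1. Prepend the string "+1 " to each line of rt-polarity.pos (the polarity label "+1", then a space, then the positive sentence).
2. Prepend the string "-1 " to each line of rt-polarity.neg (the polarity label "-1", then a space, then the negative sentence).
3. Concatenate the results of 1 and 2 and shuffle the lines randomly.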



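Research notes: the files contain characters that are not valid UTF-8

When I tried to read the file as utf-8 in Python, I got an error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 4645: invalid continuation byte

This seems to be because the file contains characters that are not encoded as valid UTF-8.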
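
The error can be reproduced on a short byte string. The sketch below assumes the offending byte 0xf3 comes from a single-byte encoding such as Latin-1, where 0xf3 is "ó"; that is a guess based on the Spanish sentence on line 44, not something confirmed for the whole dataset. The sample bytes are hypothetical and only imitate that line.

```python
# Hypothetical sample imitating line 44: "ladrón" with ó stored as the single byte 0xf3.
raw = b"el ladr\xf3n de orqu\xeddeas"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # same kind of error as the one reported above

print(utf8_error)

# Assumption: the bytes are Latin-1. Decoding that way keeps the accented letters.
as_latin1 = raw.decode("latin-1")
print(as_latin1)

# Dropping undecodable bytes instead loses the accented letters entirely.
as_ignored = raw.decode("utf-8", errors="ignore")
print(as_ignored)
```

If the Latin-1 assumption holds, decoding with that encoding would preserve the text, whereas ignoring errors silently shortens words.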

# rt-polaritydata/rt-polaritydata/rt-polarity.pos 
# line 44
'compleja e intelectualmente retadora , el ladr�n de orqu�deas es uno de esos filmes que vale la pena ver precisamente por su originalidad . '


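To skip the characters that cannot be decoded as UTF-8, I import codecs and use it to open the files. Reference 👉 Python 3でファイル読み込み時のUnicodeDecodeErrorを回避する (Avoiding UnicodeDecodeError when reading files in Python 3) - kumilog.net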
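
As a side note, the built-in open in Python 3 also accepts an errors argument, so the same skipping behaviour should be available without codecs. The sketch below writes a hypothetical one-line sample file (not the real dataset) and reads it back three ways; the helper name read_first_line and the Latin-1 encoding are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical sample file: one line containing Latin-1 bytes (0xf3 = ó, 0xed = í).
with tempfile.NamedTemporaryFile("wb", suffix=".pos", delete=False) as tmp:
    tmp.write(b"el ladr\xf3n de orqu\xeddeas\n")
    path = tmp.name

def read_first_line(path, encoding, errors="strict"):
    """Return the first line of the file, decoded with the given settings."""
    with open(path, "r", encoding=encoding, errors=errors) as fr:
        return fr.readline().rstrip("\n")

ignored = read_first_line(path, "utf-8", errors="ignore")    # drops undecodable bytes
replaced = read_first_line(path, "utf-8", errors="replace")  # inserts U+FFFD instead
latin1 = read_first_line(path, "latin-1")                    # assumes Latin-1 input

os.remove(path)
print(ignored, replaced, latin1, sep="\n")
```

Which option fits depends on whether losing the accented characters matters for the later feature extraction.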


Reference 👉 Pythonでリストの要素をランダムソート(シャッフル) (Randomly sorting (shuffling) list elements in Python) | note.nkmk.me

import random
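
For reference, a minimal sketch of the two standard-library ways to shuffle a list: random.shuffle reorders in place and returns None, while random.sample returns a new list. The sample lines and the seed value are arbitrary choices for illustration.

```python
import random

# Small hypothetical sample in the same "label sentence" shape as sentiment.txt.
lines = [
    "+1 a good thriller . \n",
    "-1 woody , what happened ? \n",
    "+1 a very capable nailbiter . \n",
]
original_order = list(lines)

# A seeded generator makes the shuffle reproducible; the seed 0 is arbitrary.
rng = random.Random(0)

# random.sample returns a shuffled copy and leaves the input list untouched.
shuffled_copy = rng.sample(lines, k=len(lines))

# random.shuffle reorders the list in place and returns None.
in_place = list(lines)
result = rng.shuffle(in_place)

print(shuffled_copy)
print(in_place)
```

Seeding is optional; without it every run produces a different order.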


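# knock_070.py
import codecs
import random

POSFILE = './rt-polaritydata/rt-polaritydata/rt-polarity.pos'
NEGFILE = './rt-polaritydata/rt-polaritydata/rt-polarity.neg'
OUTFILE = './sentiment.txt'

def add_label(file, chars):
    # Prepend the label and a space to every line.
    # Bytes that are not valid UTF-8 are silently dropped ('ignore').
    with codecs.open(file, 'r', 'utf-8', 'ignore') as fr:
        out_list = []
        for line in fr:
            out_list.append(chars+' '+line)
    return out_list

def get_sentiment_list():
    concatenated = add_label(POSFILE, '+1') + add_label(NEGFILE, '-1')
    return concatenated

def count_posneg(file):
    # Count the lines labeled +1 and -1 in the output file.
    pos = 0
    neg = 0
    with open(file, 'r', encoding='utf-8') as fr:
        for line in fr:
            label = line.split(' ')[0]
            if label == '+1':
                pos += 1
            elif label == '-1':
                neg += 1
    return (pos, neg)

if __name__ == '__main__':
    sentiment_list = get_sentiment_list()
    # The shuffle and write steps were missing from the post as published;
    # random.shuffle and writelines are a reconstruction from the task description.
    random.shuffle(sentiment_list)
    with open(OUTFILE, 'w', encoding='utf-8') as fw:
        fw.writelines(sentiment_list)

    posneg = count_posneg(OUTFILE)
    print('pos count =', posneg[0])
    print('neg count =', posneg[1])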


$ python knock_070.py
pos count = 5331
neg count = 5331
# sentiment.txt, first 20 lines
-1 it shares the first two films' loose-jointed structure , but laugh-out-loud bits are few and far between . 
-1 shot perhaps 'artistically' with handheld cameras and apparently no movie lights by joaquin baca-asay , the low-budget production swings annoyingly between vertigo and opacity . 
-1 the result is an 'action film' mired in stasis . 
-1 the characters are based on stock clichs , and the attempt to complicate the story only defies credibility . 
+1 has enough gun battles and throwaway humor to cover up the yawning chasm where the plot should be . 
-1 the beautiful , unusual music is this film's chief draw , but its dreaminess may lull you to sleep . 
+1 a good thriller . 
-1 it's a bad sign in a thriller when you instantly know whodunit . 
+1 the sundance film festival has become so buzz-obsessed that fans and producers descend upon utah each january to ferret out the next great thing . 'tadpole' was one of the films so declared this year , but it's really more of the next pretty good thing . 
-1 woody , what happened ? 
-1 does anyone much think the central story of brendan behan is that he was a bisexual sweetheart before he took to drink ? 
-1 get out your pooper-scoopers . 
-1 fails to bring as much to the table . 
+1 a distant , even sterile , yet compulsively watchable look at the sordid life of hogan's heroes star bob crane . 
+1 macdowell . . . gives give a solid , anguished performance that eclipses nearly everything else she's ever done . 
+1 an enchanting spectacular for potter fans anxious to ride the hogwarts express toward a new year of magic and mischief . 
-1 more successful at relating history than in creating an emotionally complex , dramatically satisfying heroine
+1 a very capable nailbiter . 
+1  . . . a story we haven't seen on the big screen before , and it's a story that we as americans , and human beings , should know . 
-1 i can take infantile humor . . . but this is the sort of infantile that makes you wonder about changing the director and writer's diapers .
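
The pos/neg counts reported above could also be cross-checked with collections.Counter. This is only a sketch on a few lines copied from the excerpt; the helper name count_labels is hypothetical and not part of knock_070.py.

```python
from collections import Counter

def count_labels(lines):
    """Count the label token (the text before the first space) of each non-empty line."""
    return Counter(line.split(" ", 1)[0] for line in lines if line.strip())

# A few lines copied from the excerpt above.
sample = [
    "-1 the result is an 'action film' mired in stasis . \n",
    "+1 a good thriller . \n",
    "-1 woody , what happened ? \n",
]

counts = count_labels(sample)
print("pos count =", counts["+1"])
print("neg count =", counts["-1"])
```

Because count_labels accepts any iterable of lines, an open file object for sentiment.txt could be passed in directly. Any unexpected label would show up as an extra key in the Counter, which makes malformed lines easy to spot.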