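Chapter 7: Word Vectors
This time we work through Chapter 7 of the 2020 edition of the NLP 100 Exercise (言語処理100本ノック).
60. Loading and displaying word vectors
Download the word vectors pre-trained on the Google News dataset (about 100 billion words; 3 million words and phrases, 300 dimensions) and display the word vector for "United States". Note that "United States" is represented internally as "United_States".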
    !pip install --upgrade gensim
    from gensim.models import KeyedVectors

    # Load the pre-trained vectors stored in word2vec binary format
    filepath = "./GoogleNews-vectors-negative300.bin"
    wv_from_bin = KeyedVectors.load_word2vec_format(filepath, binary=True)
    # Print the vector itself, then its dimensionality
    print(wv_from_bin['United_States'])
    print(len(wv_from_bin['United_States']))
[-3.61328125e-02 -4.83398438e-02 2.35351562e-01 1.74804688e-01 -1.46484375e-01 -7.42187500e-02 -1.01562500e-01 -7.71484375e-02 1.09375000e-01 -5.71289062e-02 -1.48437500e-01 -6.00585938e-02 1.74804688e-01 -7.71484375e-02 2.58789062e-02 -7.66601562e-02 -3.80859375e-02 1.35742188e-01 3.75976562e-02 -4.19921875e-02 -3.56445312e-02 5.34667969e-02 3.68118286e-04 -1.66992188e-01 -1.17187500e-01 1.41601562e-01 -1.69921875e-01 -6.49414062e-02 -1.66992188e-01 1.00585938e-01 1.15722656e-01 -2.18750000e-01 -9.86328125e-02 -2.56347656e-02 1.23046875e-01 -3.54003906e-02 -1.58203125e-01 -1.60156250e-01 2.94189453e-02 8.15429688e-02 6.88476562e-02 1.87500000e-01 6.49414062e-02 1.15234375e-01 -2.27050781e-02 3.32031250e-01 -3.27148438e-02 1.77734375e-01 -2.08007812e-01 4.54101562e-02 -1.23901367e-02 1.19628906e-01 7.44628906e-03 -9.03320312e-03 1.14257812e-01 1.69921875e-01 -2.38281250e-01 -2.79541016e-02 -1.21093750e-01 2.47802734e-02 7.71484375e-02 -2.81982422e-02 -4.71191406e-02 1.78222656e-02 -1.23046875e-01 -5.32226562e-02 2.68554688e-02 -3.11279297e-02 -5.59082031e-02 -5.00488281e-02 -3.73535156e-02 1.25976562e-01 5.61523438e-02 1.51367188e-01 4.29687500e-02 -2.08007812e-01 -4.78515625e-02 2.78320312e-02 1.81640625e-01 2.20703125e-01 -3.61328125e-02 -8.39843750e-02 -3.69548798e-05 -9.52148438e-02 -1.25000000e-01 -1.95312500e-01 -1.50390625e-01 -4.15039062e-02 1.31835938e-01 1.17675781e-01 1.91650391e-02 5.51757812e-02 -9.42382812e-02 -1.08886719e-01 7.32421875e-02 -1.15234375e-01 8.93554688e-02 -1.40625000e-01 1.45507812e-01 4.49218750e-02 -1.10473633e-02 -1.62353516e-02 4.05883789e-03 3.75976562e-02 -6.98242188e-02 -5.46875000e-02 2.17285156e-02 -9.47265625e-02 4.24804688e-02 1.81884766e-02 -1.73339844e-02 4.63867188e-02 -1.42578125e-01 1.99218750e-01 1.10839844e-01 2.58789062e-02 -7.08007812e-02 -5.54199219e-02 3.45703125e-01 1.61132812e-01 -2.44140625e-01 -2.59765625e-01 -9.71679688e-02 8.00781250e-02 -8.78906250e-02 -7.22656250e-02 1.42578125e-01 -8.54492188e-02 
-3.18359375e-01 8.30078125e-02 6.34765625e-02 1.64062500e-01 -1.92382812e-01 -1.17675781e-01 -5.41992188e-02 -1.56250000e-01 -1.21582031e-01 -4.95605469e-02 1.20117188e-01 -3.83300781e-02 5.51757812e-02 -8.97216797e-03 4.32128906e-02 6.93359375e-02 8.93554688e-02 2.53906250e-01 1.65039062e-01 1.64062500e-01 -1.41601562e-01 4.58984375e-02 1.97265625e-01 -8.98437500e-02 3.90625000e-02 -1.51367188e-01 -8.60595703e-03 -1.17675781e-01 -1.97265625e-01 -1.12792969e-01 1.29882812e-01 1.96289062e-01 1.56402588e-03 3.93066406e-02 2.17773438e-01 -1.43554688e-01 6.03027344e-02 -1.35742188e-01 1.16210938e-01 -1.59912109e-02 2.79296875e-01 1.46484375e-01 -1.19628906e-01 1.76757812e-01 1.28906250e-01 -1.49414062e-01 6.93359375e-02 -1.72851562e-01 9.22851562e-02 1.33056641e-02 -2.00195312e-01 -9.76562500e-02 -1.65039062e-01 -2.46093750e-01 -2.35595703e-02 -2.11914062e-01 1.84570312e-01 -1.85546875e-02 2.16796875e-01 5.05371094e-02 2.02636719e-02 4.25781250e-01 1.28906250e-01 -2.77099609e-02 1.29882812e-01 -1.15722656e-01 -2.05078125e-02 1.49414062e-01 7.81250000e-03 -2.05078125e-01 -8.05664062e-02 -2.67578125e-01 -2.29492188e-02 -8.20312500e-02 8.64257812e-02 7.61718750e-02 -3.66210938e-02 5.22460938e-02 -1.22070312e-01 -1.44042969e-02 -2.69531250e-01 8.44726562e-02 -2.52685547e-02 -2.96630859e-02 -1.68945312e-01 1.93359375e-01 -1.08398438e-01 1.94091797e-02 -1.80664062e-01 1.93359375e-01 -7.08007812e-02 5.85937500e-02 -1.01562500e-01 -1.31835938e-01 7.51953125e-02 -7.66601562e-02 3.37219238e-03 -8.59375000e-02 1.25000000e-01 2.92968750e-02 1.70898438e-01 -9.37500000e-02 -1.09375000e-01 -2.50244141e-02 2.11914062e-01 -4.44335938e-02 6.12792969e-02 2.62451172e-02 -1.77734375e-01 1.23046875e-01 -7.42187500e-02 -1.67968750e-01 -1.08886719e-01 -9.04083252e-04 -7.37304688e-02 5.49316406e-02 6.03027344e-02 8.39843750e-02 9.17968750e-02 -1.32812500e-01 1.22070312e-01 -8.78906250e-03 1.19140625e-01 -1.94335938e-01 -6.64062500e-02 -2.07031250e-01 7.37304688e-02 8.93554688e-02 
1.81884766e-02 -1.20605469e-01 -2.61230469e-02 2.67333984e-02 7.76367188e-02 -8.30078125e-02 6.78710938e-02 -3.54003906e-02 3.10546875e-01 -2.42919922e-02 -1.41601562e-01 -2.08007812e-01 -4.57763672e-03 -6.54296875e-02 -4.95605469e-02 2.22656250e-01 1.53320312e-01 -1.38671875e-01 -5.24902344e-02 4.24804688e-02 -2.38281250e-01 1.56250000e-01 5.83648682e-04 -1.20605469e-01 -9.22851562e-02 -4.44335938e-02 3.61328125e-02 -1.86767578e-02 -8.25195312e-02 -8.25195312e-02 -4.05273438e-02 1.19018555e-02 1.69921875e-01 -2.80761719e-02 3.03649902e-03 9.32617188e-02 -8.49609375e-02 1.57470703e-02 7.03125000e-02 1.62353516e-02 -2.27050781e-02 3.51562500e-02 2.47070312e-01 -2.67333984e-02]
300
61. Word similarity
Compute the cosine similarity between "United States" and "U.S.".
    print(wv_from_bin.similarity('United_States', 'U.S.'))
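For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal, dependency-free sketch of that formula; the three toy vectors are made-up values for illustration and are not taken from the Google News model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (hypothetical values, not from the model).
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # close to 0: orthogonal
```

Because only the direction matters, scaling either vector leaves the value unchanged.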
63. Analogy via additive compositionality
    # vec("Spain") - vec("Madrid") + vec("Athens"), top 10 most similar words
    print(wv_from_bin.most_similar(positive=['Spain', 'Athens'], negative=['Madrid'], topn=10))
    [('Greece', 0.6898480653762817), ('Aristeidis_Grigoriadis', 0.5606849193572998), ('Ioannis_Drymonakos', 0.555290937423706), ('Greeks', 0.5450686812400818), ('Ioannis_Christou', 0.5400862693786621), ('Hrysopiyi_Devetzi', 0.5248445272445679), ('Heraklio', 0.5207759141921997), ('Athens_Greece', 0.516880989074707), ('Lithuania', 0.5166865587234497), ('Iraklion', 0.5146791338920593)]
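To make the vector arithmetic behind this query concrete, here is a rough sketch on a toy vocabulary. It assumes, as I understand gensim's behavior, that the input vectors are length-normalized before being combined and that the query words themselves are excluded from the results. The `analogy` helper and the 3-dimensional vectors are hypothetical and exist only for this illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (hypothetical): axis 0 ~ "Spain", axis 1 ~ "Greece", axis 2 ~ "capital".
toy = {
    "Spain": [1.0, 0.0, 0.0],
    "Madrid": [1.0, 0.0, 1.0],
    "Greece": [0.0, 1.0, 0.0],
    "Athens": [0.0, 1.0, 1.0],
    "Portugal": [0.9, 0.1, 0.0],
}

result = analogy(toy, positive=["Spain", "Athens"], negative=["Madrid"], topn=2)
print(result)
```

On this toy data the top-ranked word is "Greece", mirroring the real result above.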
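64. Experiment with the analogy data
Download the word analogy evaluation data, compute vec(word in column 2) - vec(word in column 1) + vec(word in column 3), and find the word most similar to that vector together with its similarity. Append the word and the similarity to the end of each example.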
    import re
    from tqdm import tqdm

    datas = []
    stat = '0'  # current category name (taken from the ": ..." header lines)
    with open('questions-words.txt', 'r') as f1, open('64.txt', 'w') as f2:
        for line in tqdm(f1):
            if not re.match(r'^:', line):
                vecs = line.replace('\n', '').split(' ')
                # vec(column 2) - vec(column 1) + vec(column 3)
                result = wv_from_bin.most_similar(positive=[vecs[1], vecs[2]], negative=[vecs[0]], topn=1)
                vecs.insert(0, stat)
                vecs.append(result[0][0])
                vecs.append(str(result[0][1]))
                datas.append(vecs)
                # Append both the predicted word and its similarity to the example
                f2.write(line.replace('\n', '')+' '+result[0][0]+' '+str(result[0][1])+'\n')
            else:
                stat = line.replace('\n', '').replace(': ', '')
                f2.write(line)
    19558it [1:19:32, 4.10it/s]
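The analogy file mixes category header lines (starting with ":") with four-word example lines. The sketch below shows one way to parse that layout on a small in-memory sample. The `parse_analogies` helper is hypothetical, and `SAMPLE` is only a short excerpt written in the file's format.

```python
import io

# A few lines in the layout of questions-words.txt (short illustrative sample).
SAMPLE = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: gram1-adjective-to-adverb
amazing amazingly apparent apparently
"""


def parse_analogies(lines):
    """Yield (category, [w1, w2, w3, w4]) for every example line."""
    category = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            # Header line: remember the category for the examples that follow.
            category = line[1:].strip()
        else:
            yield category, line.split()


rows = list(parse_analogies(io.StringIO(SAMPLE)))
for category, words in rows:
    print(category, words)
```

Keeping the category next to each example is what later makes it possible to separate the "gram..." (syntactic) sections from the rest.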
65. Accuracy on the analogy task
Using the results from 64, measure the accuracy on the semantic analogies and on the syntactic analogies.
    import re
    import numpy as np

    # Categories whose names start with "gram" are syntactic; the rest are semantic
    sem = [d for d in datas if not re.match(r'^gram.*', d[0])]
    syn = [d for d in datas if re.match(r'^gram.*', d[0])]

    # d[4] is the expected word, d[5] is the predicted word
    print("semantic analogy: {}".format(np.mean([e[4]==e[5] for e in sem])))
    print("syntactic analogy: {}".format(np.mean([e[4]==e[5] for e in syn])))

    semantic analogy: 0.7308602999210734
    syntactic analogy: 0.7400468384074942
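The same split can be sketched without numpy. The rows below are made up for illustration and only mimic the layout used above (category, three query words, expected word, predicted word, similarity); the `accuracy_by_group` helper is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: [category, w1, w2, w3, expected, predicted, similarity]
rows = [
    ["capital-common-countries", "Athens", "Greece", "Baghdad", "Iraq", "Iraq", "0.5"],
    ["capital-common-countries", "Athens", "Greece", "Bangkok", "Thailand", "Laos", "0.5"],
    ["gram1-adjective-to-adverb", "amazing", "amazingly", "apparent", "apparently", "apparently", "0.5"],
]


def accuracy_by_group(rows):
    """Exact-match accuracy, split into semantic and syntactic categories."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        group = "syntactic" if row[0].startswith("gram") else "semantic"
        totals[group] += 1
        hits[group] += int(row[4] == row[5])
    return {group: hits[group] / totals[group] for group in totals}


scores = accuracy_by_group(rows)
print(scores)
```

Note that this is a strict string comparison, so a prediction that differs only in case or inflection counts as wrong.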
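66. Evaluation on WordSimilarity-353
Download the evaluation data of The WordSimilarity-353 Test Collection and compute the Spearman correlation coefficient between the ranking of similarities computed from the word vectors and the ranking of human similarity judgments.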
    with open('combined.csv', 'r') as f:
        data = []
        for line in f:
            # Skip the header row
            if line=='Word 1,Word 2,Human (mean)\n':
                continue
            d = line.replace('\n', '').split(',')
            # Append the model's cosine similarity as a fourth column
            d.append(wv_from_bin.similarity(d[0], d[1]))
            data.append(d)
    import pandas as pd

    # Show the first 15 rows as a table
    pd.DataFrame(
        data[:min(15, len(data))],
        columns = ['Word1', 'Word2', 'Human(mean)', 'Similarity(Vec)']
    )
    import numpy as np
    from scipy.stats import rankdata

    def spearman(lst1, lst2):
        # Spearman's rho from two rank lists: 1 - 6 * sum(d^2) / (N^3 - N)
        lst1 = np.array(lst1)
        lst2 = np.array(lst2)
        N = len(lst1)
        return 1 - (6*sum((lst1-lst2)**2) / (N**3-N))

    print(spearman(
        rankdata([float(d[2]) for d in data]),
        rankdata([float(d[3]) for d in data])
    ))
    0.7000217838950313
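One caveat: the closed-form expression 1 - 6Σd²/(N³ - N) is exact only when neither list contains tied values. A more general definition is the Pearson correlation of the (average) ranks. The sketch below implements both on toy scores so they can be compared. All helper names and numbers here are hypothetical, and whether ties noticeably change the result on WordSimilarity-353 is not something I checked.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_general(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_no_ties(x, y):
    """Closed form; matches spearman_general only when there are no ties."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / (n ** 3 - n)


# Toy scores (hypothetical): human judgments vs. model similarities.
human = [7.0, 3.5, 9.0, 1.0, 5.0]
model = [0.6, 0.4, 0.7, 0.1, 0.2]

print(spearman_general(human, model))
print(spearman_no_ties(human, model))
```

On tie-free data like this the two functions agree; `scipy.stats.spearmanr` can be used as an independent cross-check on the real data.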
67. k-means clustering
    import re

    # Re-read the output of 64, tagging every example with its category
    data = []
    stat = '0'
    with open('64.txt', 'r') as f:
        for line in f:
            if not re.match(r'^:', line):
                vecs = line.replace('\n', '').split(' ')
                vecs.insert(0, stat)
                data.append(vecs)
            else:
                stat = line.replace('\n', '').replace(': ', '')
    # Collect the country names (columns 2 and 4) from the two capital categories
    countries = {
        c
        for d in data
        for c in [d[2], d[4]]
        if d[0] in ['capital-common-countries', 'capital-world']
    }
    countries = list(countries)
    from sklearn.cluster import KMeans

    # Cluster the country vectors into 5 groups
    kmeans = KMeans(n_clusters=5)
    kmeans.fit([wv_from_bin[c] for c in countries])
    for i in range(5):
        cluster = np.where(kmeans.labels_ == i)[0]
        print('Cluster', i)
        print(', '.join([countries[k] for k in cluster]))

    Cluster 0
    Libya, Pakistan, China, Qatar, Egypt, Bahrain, Iraq, Nepal, Bangladesh, Oman, Bhutan, Jordan, Morocco, Japan, Syria, Lebanon, Afghanistan, Greenland, Iran
    Cluster 1
    Indonesia, Taiwan, Belize, Bahamas, Honduras, Ecuador, Vietnam, Guyana, Tuvalu, Cuba, Thailand, Nicaragua, Venezuela, Chile, Jamaica, Uruguay, Dominica, Philippines, Suriname, Samoa, Fiji, Laos, Peru
    Cluster 2
    Botswana, Guinea, Zambia, Nigeria, Kenya, Somalia, Zimbabwe, Senegal, Rwanda, Mozambique, Algeria, Burundi, Sudan, Uganda, Mali, Gambia, Gabon, Tunisia, Eritrea, Madagascar, Angola, Ghana, Namibia, Liberia, Malawi, Mauritania, Niger
    Cluster 3
    Malta, Liechtenstein, Italy, Spain, Norway, Denmark, Cyprus, Greece, Switzerland, Austria, Canada, Finland, Ireland, Australia, England, Belgium, France, Germany, Sweden, Portugal
    Cluster 4
    Russia, Uzbekistan, Slovenia, Kyrgyzstan, Ukraine, Croatia, Serbia, Georgia, Azerbaijan, Albania, Macedonia, Estonia, Latvia, Turkey, Romania, Lithuania, Turkmenistan, Hungary, Belarus, Tajikistan, Poland, Moldova, Kazakhstan, Slovakia, Armenia, Montenegro, Bulgaria
    from sklearn.cluster import KMeans

    # Same procedure with 7 clusters
    kmeans = KMeans(n_clusters=7)
    kmeans.fit([wv_from_bin[c] for c in countries])
    for i in range(7):
        cluster = np.where(kmeans.labels_ == i)[0]
        print('Cluster', i)
        print(', '.join([countries[k] for k in cluster]))

    Cluster 0
    Indonesia, Taiwan, Pakistan, China, Qatar, Bahrain, Nepal, Tuvalu, Thailand, Bangladesh, Oman, Bhutan, Japan, Philippines, Samoa, Australia, Fiji, Laos
    Cluster 1
    Malta, Italy, Spain, Norway, Denmark, Cyprus, Greece, Switzerland, Canada, Morocco, Finland, Ireland, England, Belgium, Greenland, France, Germany, Sweden, Portugal
    Cluster 2
    Botswana, Guinea, Zambia, Nigeria, Kenya, Zimbabwe, Senegal, Rwanda, Mozambique, Algeria, Burundi, Sudan, Uganda, Mali, Gambia, Gabon, Tunisia, Eritrea, Madagascar, Angola, Ghana, Namibia, Liberia, Malawi, Mauritania, Niger
    Cluster 3
    Slovenia, Liechtenstein, Croatia, Serbia, Albania, Macedonia, Estonia, Latvia, Romania, Lithuania, Hungary, Austria, Poland, Slovakia, Montenegro, Bulgaria
    Cluster 4
    Belize, Bahamas, Honduras, Ecuador, Guyana, Cuba, Nicaragua, Venezuela, Chile, Jamaica, Uruguay, Dominica, Suriname, Peru
    Cluster 5
    Russia, Uzbekistan, Kyrgyzstan, Ukraine, Georgia, Azerbaijan, Turkey, Turkmenistan, Belarus, Tajikistan, Moldova, Kazakhstan, Armenia
    Cluster 6
    Libya, Somalia, Egypt, Vietnam, Iraq, Jordan, Syria, Lebanon, Afghanistan, Iran
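For intuition about what k-means is doing, here is a bare-bones sketch of Lloyd's algorithm on toy 2-D points. It is not scikit-learn's implementation: it deterministically uses the first k points as initial centroids, whereas `KMeans` by default uses randomized k-means++ initialization, so cluster numbering and membership can vary between runs unless `random_state` is fixed. The points and helper names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids


# Two well-separated toy groups (hypothetical data).
points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [0.2, 0.1], [4.8, 5.1]]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Even with a poor start (both initial centroids sit in the same group), the two steps pull the centroids apart after a couple of iterations on this data.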
68. Clustering with Ward's method
    from scipy.cluster.hierarchy import dendrogram, linkage
    import matplotlib.pyplot as plt

    plt.figure(figsize=(20, 5), dpi=200)
    plt.title("Dendrogram")
    # Hierarchical clustering of the country vectors with Ward's method
    Z = linkage([wv_from_bin[c] for c in countries], method='ward')
    dendrogram(Z, labels = countries)
    plt.show()
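Ward's method repeatedly merges the pair of clusters whose union increases the within-cluster sum of squares the least. The sketch below computes that merge cost in two ways (a closed form based on centroids, and directly from the definition) to show they agree. The helper names and toy points are hypothetical, and the heights SciPy reports in `Z` may use a different scaling of this quantity.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


# Toy clusters (hypothetical data).
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[4.0, 1.0]]

closed_form = ward_cost(cluster_a, cluster_b)
direct = sse(cluster_a + cluster_b) - sse(cluster_a) - sse(cluster_b)
print(closed_form, direct)
```

The closed form only needs cluster sizes and centroids, which is what makes the agglomerative procedure cheap to update after each merge.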
69. Visualization with t-SNE
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Project the 300-dimensional country vectors down to 2 dimensions
    tsne = TSNE()
    tsne.fit([wv_from_bin[c] for c in countries])

    cmap = plt.get_cmap('Set1')
    plt.figure(figsize=(15, 15), dpi=300)
    plt.scatter(tsne.embedding_[:, 0], tsne.embedding_[:, 1])
    # Color each label by the k-means cluster assigned in 67
    for i, ((x, y), name) in enumerate(zip(tsne.embedding_, countries)):
        plt.annotate(name, (x, y), color=cmap(kmeans.labels_[i]))
    plt.show()