Learning How to Compute TF-IDF
✨ Article Summary (AI-generated)
In this post, the author walks through the computation of TF-IDF (Term Frequency-Inverse Document Frequency). First, a transcoding function ensures that the text files are UTF-8 encoded, and the file list is read in. Next, the texts are tokenized with a regular expression to build a dictionary and count each term's frequency (TF). A TF matrix is then constructed, the inverse document frequency (IDF) of each term is computed step by step, and finally the TF and IDF values are combined to obtain the TF-IDF values.
In addition, the post shows how the Sklearn library simplifies the TF-IDF computation, and how to compute the cosine similarity between documents to assess how alike they are. Every step is illustrated with code and dataframes, so readers can follow the implementation details of TF-IDF and its applications.
Transcoding
Defining the transcoding functions
```python
# ! pip install codecs
# ! pip install chardet
import codecs
import chardet

# Convert a file to UTF-8 in place
def convert(filename, out_enc="UTF-8"):
    content = codecs.open(filename, 'rb').read()
    source_encoding = chardet.detect(content)['encoding']
    content = content.decode(source_encoding).encode(out_enc)
    codecs.open(filename, 'wb').write(content)

# Detect a file's encoding
def get_encoding(file):
    with open(file, 'rb') as f:
        return chardet.detect(f.read())['encoding']
```
```
ERROR: Could not find a version that satisfies the requirement codecs
ERROR: No matching distribution found for codecs
Requirement already satisfied: chardet in c:\users\justin3go\appdata\roaming\python\python38\site-packages (3.0.4)
```

(The first install fails because `codecs` is part of the Python standard library and cannot be installed with pip; only `chardet` needs installing.)
Reading the files and transcoding them
```python
import chardet
import codecs
import os

# Collect every file path under the corpus directory
file_list = []
for root, _, files in os.walk("./实验六所用语料库"):
    for file in files:
        # print(os.path.join(root, file))
        file_list.append(os.path.join(root, file))

# Convert every file to UTF-8
for file in file_list:
    convert(file)
get_encoding(file_list[0])
```
```
'ascii'
```
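chardet reports `ascii` here because pure-ASCII text is byte-identical in UTF-8, so the converted files are still valid UTF-8. As a quick sanity check (not part of the original run), every converted file should now decode cleanly; a minimal sketch:

```python
# Verify the conversion: reading as UTF-8 raises UnicodeDecodeError on failure
for file in file_list:
    with open(file, encoding="utf-8") as f:
        f.read()
```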
Building the dictionary
```python
import re
import pandas as pd
import numpy as np

# Tokenize each file, build the dictionary, and count term frequencies
dict_words = {}
files = []   # tokenized documents
files_ = []  # raw document texts
for file in file_list:
    with open(file, 'r', encoding='ascii') as f:
        text = f.read().lower()
        files_.append(text)
        text_ = re.findall('[a-z]+', text)
        files.append(text_)
        for t in text_:
            dict_words[t] = dict_words.get(t, 0) + 1
```
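To eyeball the result, the dictionary can be sorted by count; a small illustrative snippet (not in the original notebook):

```python
# Show the five most frequent terms across the whole corpus
top5 = sorted(dict_words.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top5)
```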
Building the TF matrix
```python
import numpy as np

words2index = {w: i for i, w in enumerate(dict_words)}
index2words = {i: w for i, w in enumerate(dict_words)}

# Rows are documents, columns are terms; each cell is a raw term count
zeros_m = np.zeros((len(files), len(words2index)))
for i, f in enumerate(files):
    for t in f:
        zeros_m[i][words2index[t]] += 1
# The TF matrix across all documents
zeros_m
```
```
array([[1., 5., 5., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 4., 0., ..., 0., 0., 0.],
       [1., 5., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 1., 1., 1.]])
```
Computing the IDF values step by step
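For reference, the inverse document frequency used below is the base-2 logarithm of the number of documents divided by the document frequency:

$$\mathrm{IDF}(t) = \log_2 \frac{N}{\mathrm{DF}(t)}$$

where $N$ is the total number of documents and $\mathrm{DF}(t)$ is the number of documents containing term $t$.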
```python
# Total count of each term across the corpus (labeled TF here)
df1 = pd.DataFrame(dict_words, index=['TF']).T
df1.head()
```
|  | TF |
| --- | --- |
| call | 2 |
| for | 20 |
| presentations | 5 |
| navy | 9 |
| scientific | 6 |
```python
# Count the document frequency (DF) of each term
dict_words_idf = {}
for key in dict_words:
    count = 0
    # `files` must already be in memory from the tokenization cell above
    for text_ in files:
        if key in text_:
            count += 1
    dict_words_idf[key] = count
df2 = pd.DataFrame(dict_words_idf, index=['DF']).T
df = pd.concat([df1, df2], axis=1)
df.head(10)
```
|  | TF | DF |
| --- | --- | --- |
| call | 2 | 2 |
| for | 20 | 8 |
| presentations | 5 | 1 |
| navy | 9 | 1 |
| scientific | 6 | 2 |
| visualization | 9 | 4 |
| and | 50 | 9 |
| virtual | 5 | 1 |
| reality | 5 | 1 |
| seminar | 5 | 1 |
```python
import math

# IDF = log2(N / DF), where N is the number of documents
df['IDF'] = df['DF'].apply(lambda x: math.log(len(files)/x, 2))
df.head(10)
```
|  | TF | DF | IDF |
| --- | --- | --- | --- |
| call | 2 | 2 | 2.321928 |
| for | 20 | 8 | 0.321928 |
| presentations | 5 | 1 | 3.321928 |
| navy | 9 | 1 | 3.321928 |
| scientific | 6 | 2 | 2.321928 |
| visualization | 9 | 4 | 1.321928 |
| and | 50 | 9 | 0.152003 |
| virtual | 5 | 1 | 3.321928 |
| reality | 5 | 1 | 3.321928 |
| seminar | 5 | 1 | 3.321928 |
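Merging the two, each cell of the TF matrix is scaled by its term's IDF weight:

$$\mathrm{TFIDF}(t,d) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t)$$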
```python
# Multiply the TF matrix by the IDF vector column-wise to get TF-IDF
idf = list(df['IDF'])
result = zeros_m * idf
result
```
```
array([[ 2.32192809,  1.60964047, 16.60964047, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.32192809,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  1.28771238,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 2.32192809,  1.60964047,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.32192809,  0.        , ...,  3.32192809,
         3.32192809,  3.32192809]])
```
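One way to read this matrix (an illustrative sketch, reusing `result` and `index2words` from above) is to pull out the highest-weighted term in each document:

```python
# For each document, print the term with the largest TF-IDF weight
for i, row in enumerate(result):
    j = int(row.argmax())
    print(f"doc {i}: {index2words[j]} ({row[j]:.3f})")
```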
Computing TF-IDF with Sklearn
```python
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
# Count term occurrences, then reweight the counts with TF-IDF
tfidf = transformer.fit_transform(vectorizer.fit_transform(files_))
word = vectorizer.get_feature_names()
print(word[40:50])
weight = tfidf.toarray().T  # transposed: rows are terms, columns are documents
print(weight)
```
```
['accepted', 'accessible', 'across', 'add', 'address', 'addresses', 'adresses', 'advance', 'advises', 'affiliated']
[[0.         0.11537929 0.         ... 0.         0.         0.        ]
 [0.03906779 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.15731715 ... 0.04130626 0.09597341 0.05024117]
 [0.         0.         0.         ... 0.         0.11918574 0.        ]]
```
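Note that sklearn's defaults differ from the manual computation above: `TfidfTransformer` uses a smoothed natural-log IDF and L2-normalizes each document vector, so the absolute values will not match the hand-rolled matrix. As a shortcut, `TfidfVectorizer` combines both steps in one object; a minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One-step equivalent of CountVectorizer + TfidfTransformer
vec = TfidfVectorizer()
tfidf2 = vec.fit_transform(files_)  # sparse (n_docs, n_terms) matrix
```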
Computing cosine similarity
```python
from sklearn.metrics.pairwise import cosine_similarity

test = weight[0]  # suppose the "other" document is just the first row
cos_sim = []
for i in range(len(weight)):
    cos_sim.append(cosine_similarity([list(test), list(weight[i])]))
# Each entry is a 2x2 matrix: cell [0][1] holds the cosine similarity
# between the first row vector and the i-th row vector
print(cos_sim[0])
print(cos_sim[5])
```
```
[[1. 1.]
 [1. 1.]]
[[1. 0.]
 [0. 1.]]
```