# Learning TF-IDF Computation
✨ Article summary (AI-generated)

In this blog post, the author walks through the computation of TF-IDF (Term Frequency-Inverse Document Frequency) in detail. First, a transcoding function ensures that every text file is UTF-8 encoded, and the file list is read in. Next, the texts are tokenized with a regular expression to build a dictionary and count each word's term frequency (TF). A TF matrix is then constructed, the inverse document frequency (IDF) of each word is computed step by step, and finally the TF and IDF values are combined to obtain the TF-IDF values.

In addition, the author shows how the sklearn library simplifies the TF-IDF computation, and how to compute the cosine similarity between documents to assess how alike they are. Every step is illustrated with code and DataFrame output, so the implementation details of TF-IDF and its applications can be followed directly.
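For reference, the formulation used in the manual part of this post (log base 2, matching the code below) is

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log_2 \frac{N}{\mathrm{df}(t)},
$$

where $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.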
## Transcoding
### Defining the transcoding function
```python
# ! pip install chardet
# Note: codecs is part of the Python standard library and cannot be
# installed with pip (trying to do so fails with "No matching
# distribution found for codecs").
import codecs
import chardet

# Rewrite a file in place using the target encoding (UTF-8 by default)
def convert(filename, out_enc="UTF-8"):
    with codecs.open(filename, 'rb') as f:
        content = f.read()
    source_encoding = chardet.detect(content)['encoding']
    content = content.decode(source_encoding).encode(out_enc)
    with codecs.open(filename, 'wb') as f:
        f.write(content)

# Detect a file's encoding
def get_encoding(file):
    with open(file, 'rb') as f:
        return chardet.detect(f.read())['encoding']
```
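A quick way to sanity-check the pair of helpers (the file name here is hypothetical, not part of the corpus):

```python
# Hypothetical round-trip check: write a GBK-encoded file, then convert it
with open("demo.txt", "w", encoding="gbk") as f:
    f.write("编码测试 encoding test")

print(get_encoding("demo.txt"))  # chardet typically reports a GB* encoding here
convert("demo.txt")
print(get_encoding("demo.txt"))  # now UTF-8 (or ascii for pure-ASCII content)
```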
### Reading the files and transcoding them
```python
import chardet
import codecs
import os

# Collect the paths of every file in the corpus directory
file_list = []
for root, _, files in os.walk("./实验六所用语料库"):
    for file in files:
        # print(os.path.join(root, file))
        file_list.append(os.path.join(root, file))

# Convert every file to UTF-8
for file in file_list:
    convert(file)

get_encoding(file_list[0])
```

```
'ascii'
```

The detector reports `ascii` rather than `UTF-8` because these files contain only ASCII characters, and ASCII is a subset of UTF-8.
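To confirm the whole corpus ended up in a consistent encoding, one can check every file rather than just the first (a small sketch, not in the original post):

```python
# Collect the set of detected encodings across the corpus;
# ideally this prints something like {'ascii'} or {'ascii', 'utf-8'}
print({get_encoding(f) for f in file_list})
```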
## Building the dictionary
```python
import re
import pandas as pd
import numpy as np

# Tokenize each document, build the dictionary, and count term frequencies
dict_words = {}
files = []   # tokenized documents
files_ = []  # raw document texts (used later for sklearn)
for file in file_list:
    with open(file, 'r', encoding='ascii') as f:
        text = f.read().lower()
        files_.append(text)

    # Keep only runs of alphabetic characters as tokens
    text_ = re.findall('[a-z]+', text)
    files.append(text_)
    for t in text_:
        dict_words[t] = dict_words.get(t, 0) + 1
```
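On a toy input, the tokenization and counting above behave like this (the example string is mine, not from the corpus):

```python
# Toy illustration of re.findall('[a-z]+', ...) plus dict counting
toy = "The Navy, the navy!"
tokens = re.findall('[a-z]+', toy.lower())  # ['the', 'navy', 'the', 'navy']
counts = {}
for t in tokens:
    counts[t] = counts.get(t, 0) + 1
print(counts)  # {'the': 2, 'navy': 2}
```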
## Generating the TF matrix

```python
import numpy as np

# Map each word to a column index and back
words2index = {w: i for i, w in enumerate(dict_words)}
index2words = {i: w for i, w in enumerate(dict_words)}

# One row per document, one column per word in the dictionary
zeros_m = np.zeros((len(files), len(words2index)))
for i, f in enumerate(files):
    for t in f:
        zeros_m[i][words2index[t]] += 1

# TF matrix: raw count of each word in each document
zeros_m
```

```
array([[1., 5., 5., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 4., 0., ..., 0., 0., 0.],
       [1., 5., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 1., 1., 1.]])
```
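The same count matrix can be cross-checked against scikit-learn's `CountVectorizer` (a sketch; I pass `token_pattern=r'[a-z]+'` so its tokenizer matches the regex above, since its default pattern drops single-character tokens):

```python
# Optional cross-check: build the document-term count matrix with sklearn
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern=r'[a-z]+')
counts = cv.fit_transform(files_)   # sparse matrix, documents x vocabulary
print(counts.shape, zeros_m.shape)  # same shape, possibly different column order
```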
## Computing the IDF values step by step
```python
# Total term frequency of each word across the whole corpus
df1 = pd.DataFrame(dict_words, index=['TF']).T
df1.head()
```

|  | TF |
|---|---|
| call | 2 |
| for | 20 |
| presentations | 5 |
| navy | 9 |
| scientific | 6 |

Note that this `TF` column is the corpus-wide total count of each word; the per-document term frequencies used for the final TF-IDF matrix are the ones stored in `zeros_m` above.
```python
# print(dict_words)
# Document frequency (DF): the number of documents containing each word
dict_words_idf = {}
for key in dict_words:
    count = 0
    # `files` must already be in memory from the cell above
    for text_ in files:
        if key in text_:
            count += 1
    dict_words_idf[key] = count

df2 = pd.DataFrame(dict_words_idf, index=['DF']).T
df = pd.concat([df1, df2], axis=1)
df.head(10)
```

|  | TF | DF |
|---|---|---|
| call | 2 | 2 |
| for | 20 | 8 |
| presentations | 5 | 1 |
| navy | 9 | 1 |
| scientific | 6 | 2 |
| visualization | 9 | 4 |
| and | 50 | 9 |
| virtual | 5 | 1 |
| reality | 5 | 1 |
| seminar | 5 | 1 |
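Since `key in text_` scans a Python list token by token, this loop costs O(vocabulary size × total corpus length). A sketch of the same DF computation using per-document vocabularies as sets, which scales much better on larger corpora:

```python
# Same DF computation, with set membership instead of list scans
doc_vocab = [set(doc) for doc in files]
dict_words_idf = {key: sum(key in vocab for vocab in doc_vocab)
                  for key in dict_words}
```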
```python
import math

# IDF = log2(N / DF), where N is the total number of documents
df['IDF'] = df['DF'].apply(lambda x: math.log(len(files) / x, 2))
df.head(10)
```

|  | TF | DF | IDF |
|---|---|---|---|
| call | 2 | 2 | 2.321928 |
| for | 20 | 8 | 0.321928 |
| presentations | 5 | 1 | 3.321928 |
| navy | 9 | 1 | 3.321928 |
| scientific | 6 | 2 | 2.321928 |
| visualization | 9 | 4 | 1.321928 |
| and | 50 | 9 | 0.152003 |
| virtual | 5 | 1 | 3.321928 |
| reality | 5 | 1 | 3.321928 |
| seminar | 5 | 1 | 3.321928 |
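As a spot check: the corpus contains 10 documents, and "call" appears in 2 of them, so

$$
\mathrm{IDF}(\text{call}) = \log_2 \frac{10}{2} = \log_2 5 \approx 2.321928,
$$

which matches the first row of the table.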
```python
# Scale each column of the TF matrix by the corresponding IDF value
idf = list(df['IDF'])
result = zeros_m * idf
result
```

```
array([[ 2.32192809,  1.60964047, 16.60964047, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.32192809,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  1.28771238,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 2.32192809,  1.60964047,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.32192809,  0.        , ...,  3.32192809,
         3.32192809,  3.32192809]])
```
## Computing TF-IDF with sklearn
```python
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()    # builds the document-term count matrix
transformer = TfidfTransformer()  # rescales the counts to TF-IDF
tfidf = transformer.fit_transform(vectorizer.fit_transform(files_))
word = vectorizer.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.2
print(word[40:50])
weight = tfidf.toarray().T  # transposed: one row per term, one column per document
print(weight)
```

```
['accepted', 'accessible', 'across', 'add', 'address', 'addresses', 'adresses', 'advance', 'advises', 'affiliated']
[[0.         0.11537929 0.         ... 0.         0.         0.        ]
 [0.03906779 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.15731715 ... 0.04130626 0.09597341 0.05024117]
 [0.         0.         0.         ... 0.         0.11918574 0.        ]]
```
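These values do not match the manual `result` matrix above, and that is expected: by default `TfidfTransformer` uses a smoothed natural-log IDF, $\ln\frac{1+N}{1+\mathrm{df}} + 1$, and then L2-normalizes each document vector, whereas the manual version uses a plain $\log_2\frac{N}{\mathrm{df}}$ with no normalization. A sketch of bringing sklearn closer to the manual setup (it still adds 1 to the IDF and uses the natural log, so the scales remain different):

```python
# Disable smoothing and row normalization to approximate the manual computation
raw = TfidfTransformer(norm=None, smooth_idf=False)
tfidf_raw = raw.fit_transform(vectorizer.fit_transform(files_))
```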
## Computing cosine similarity
```python
from sklearn.metrics.pairwise import cosine_similarity

# Treat the first row as the "other" document to compare against.
# Caveat: because of the .T above, each row of `weight` is actually a
# term's vector across documents; drop the transpose to compare documents.
test = weight[0]
cos_sim = []
for i in range(len(weight)):
    # cosine_similarity returns a 2x2 matrix whose off-diagonal entries
    # are the similarity between `test` and `weight[i]`
    cos_sim.append(cosine_similarity([list(test), list(weight[i])]))
print(cos_sim[0])
print(cos_sim[5])
```

```
[[1. 1.]
 [1. 1.]]
[[1. 0.]
 [0. 1.]]
```
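The pairwise loop is unnecessary in practice: `cosine_similarity` accepts a whole matrix and returns all pairwise similarities at once. To compare documents (rather than terms), pass the un-transposed TF-IDF matrix (a sketch):

```python
# All document-to-document similarities in one call; `tfidf` is the
# sparse documents x terms matrix, so row i vs. row j compares documents
sim_matrix = cosine_similarity(tfidf)
print(sim_matrix.shape)  # (number of documents, number of documents)
```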