Plagiarism Checking for Papers: Text Similarity in Python with SimHash (Source Code)

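Scenario (plagiarism checking for papers, text similarity in Python with SimHash):
1. Compute the SimHash value of each text, then the Hamming distance between the two values.
2. SimHash is suited to comparing longer texts (more than roughly 300 to 500 characters); the shorter the text, the higher the misjudgment rate.
Python implementation: the code is as follows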

# -*- encoding:utf-8 -*-
import math

import jieba
import jieba.analyse


class SimHash(object):

    def getBinStr(self, source):
        # Hash one keyword into a 64-character binary string.
        if source == "":
            # Return a string, not the int 0, so callers can iterate over it.
            return "0" * 64
        x = ord(source[0]) << 7
        m = 1000003
        mask = 2 ** 128 - 1
        for c in source:
            x = ((x * m) ^ ord(c)) & mask
        x ^= len(source)
        if x == -1:
            x = -2
        # Keep the low 64 bits, left-padded with zeros.
        return bin(x).replace('0b', '').zfill(64)[-64:]

    def getWeight(self, source):
        return ord(source)

    def unwrap_weight(self, arr):
        # Map each entry to "1" if it is positive, otherwise "0".
        ret = ""
        for item in arr:
            tmp = 0
            if int(item) > 0:
                tmp = 1
            ret += str(tmp)
        return ret

    def sim_hash(self, rawstr):
        # Segment the text, then take up to 100 keywords with their weights.
        seg = jieba.cut(rawstr)
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=100, withWeight=True)
        ret = []
        for keyword, weight in keywords:
            binstr = self.getBinStr(keyword)
            weight = int(math.ceil(weight))
            # A 1 bit contributes +weight, a 0 bit contributes -weight.
            keylist = []
            for c in binstr:
                if c == "1":
                    keylist.append(weight)
                else:
                    keylist.append(-weight)
            ret.append(keylist)
        if not ret:
            # No keywords were extracted (for example, empty input).
            return "0" * 64
        # Dimensionality reduction: sum each bit column; a positive sum gives 1, otherwise 0.
        rows = len(ret)
        cols = len(ret[0])
        result = []
        for i in range(cols):
            tmp = 0
            for j in range(rows):
                tmp += int(ret[j][i])
            result.append("1" if tmp > 0 else "0")
        return "".join(result)

    def distince(self, hashstr1, hashstr2):
        # Hamming distance: the number of positions where the two hash strings differ.
        length = 0
        for index, char in enumerate(hashstr1):
            if char != hashstr2[index]:
                length += 1
        return length


if __name__ == "__main__":
    simhash = SimHash()
    str1 = '咱哥俩谁跟谁啊'
    str2 = '咱们俩谁跟谁啊'
    hash1 = simhash.sim_hash(str1)
    print(hash1)
    hash2 = simhash.sim_hash(str2)
    distince = simhash.distince(hash1, hash2)
    # Distance threshold: texts at or below it are treated as similar.
    value = 5
    print("simhash", distince, "distance threshold:", value, "similar:", distince <= value)
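
For readers who want to try the two steps from the scenario (fingerprint, then Hamming distance) without installing jieba, here is a minimal sketch under a few assumptions that are not part of the code above: tokens come from a plain whitespace split, each token's weight is simply its count, and the per-token hash is the first 8 bytes of an MD5 digest. The names `token_hash64`, `simhash64` and `hamming` are made up for this illustration. The fingerprint is kept as an integer, so the Hamming distance reduces to an XOR followed by a bit count.

```python
import hashlib
from collections import Counter


def token_hash64(token):
    """Stable 64-bit hash of a token (assumption: first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(tokens):
    """Weighted SimHash: every token votes +count or -count on each of the 64 bits."""
    votes = [0] * 64
    for token, count in Counter(tokens).items():
        h = token_hash64(token)
        for bit in range(64):
            votes[bit] += count if (h >> bit) & 1 else -count
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:  # a positive column sum sets the bit, as in the code above
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()
print(hamming(simhash64(doc1), simhash64(doc2)))
```

As with the jieba version, the result is only meaningful for longer texts; on a handful of tokens a single changed word can flip many bits.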
