Group Project: Sentiment Prediction with Python

2021-12-12
5 min read

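Preface

As our Python course wrapped up, the instructor handed out the final group project. Carrying an entire group by yourself really does not feel great :( Meanwhile, the deadline is getting close. The Python course at my university is fairly lax and the lectures are only loosely connected to the homework, so getting the other group members to contribute is extremely hard. In the end I had no choice but to tackle this somewhat difficult "group" (read: solo) project on my own.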

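A First Look at NLP

As a Python beginner, I first had to get familiar with the basic concepts of NLP.

Natural language processing is a subfield of artificial intelligence and linguistics. It studies how to process and use natural language, and covers many aspects and steps, chiefly cognition, understanding, and generation. Natural language cognition and understanding turn input language into meaningful symbols and relations, which are then processed according to the goal. A natural language generation system converts computer data into natural language. (Wikipedia)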

Breaking Down the Project

What we need to do

Python for Finance Project: Financial Statement Sentiment Analysis

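Taken literally, this project is closely tied to financial mathematics: the goal is to connect sentiment with financial trends. No wonder the course is called Python Programming for Beginners (Financial Mathematics).

Sentiment analysis identifies and extracts subjective attitude from text. Also known as opinion mining, it uses natural language processing, text analysis, and computational linguistics to identify and extract subjective information from source material. (Wikipedia)

The project brief also explains: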

A simple solution is, every word is given a score based on its extent of positiveness, negativeness or neutral. The Sentiment Analysis is done by calculating the algorithmic score of each word, and returning with the combined score for the given set of text.

In other words, we need to compute a score for every word and return the combined score for the whole piece of text.
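A minimal sketch of that idea, using only the standard library. The names `score_text` and `label_text`, the toy dictionary, and the ±0.01 threshold are assumptions made up for illustration (the threshold simply mirrors the one used later in this post); the real scores come from senti_dict.csv.

```python
import re

# Toy dictionary for illustration only; real scores come from senti_dict.csv.
toy_dict = {"profit": 0.03, "loss": -0.04, "growth": 0.02}

def score_text(text, word_scores):
    """Sum the dictionary scores of all words; unknown words count as 0."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return sum(word_scores.get(w, 0.0) for w in words)

def label_text(text, word_scores, threshold=0.01):
    """Map the combined score to one of the three labels."""
    total = score_text(text, word_scores)
    if total > threshold:
        return "positive"
    if total < -threshold:
        return "negative"
    return "neutral"

print(label_text("Profit and growth beat expectations", toy_dict))
print(label_text("The company reported a loss", toy_dict))
print(label_text("The meeting is on Monday", toy_dict))
```

With the toy scores above, the three sentences come out as positive, negative, and neutral respectively.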

What we are given

The catch in this project is:

Please noted, you are not allow to use packages that are from outside of this course

Yes! This turned out to be the biggest obstacle to coasting through the project.

So, straight to the point:

3. What you need to do in this project

3.1 What is already given

train.txt: contains 3877 rows with 3 columns “index”, “sentiment”, “text”, where “text” denotes the financial news sentence, “sentiment” denotes the human annotated sentiment result (3 categories “positive”, “negative”, “neutral”), and “index” denotes the ID of the news sentence.

test_x.txt: contains 970 rows with 2 columns “index” and “text” whose explanation is the same as “train.txt”.

senti_dict.csv: The sentiment word dictionary, where the value of each word indicates how positive of a word is. Negative score means the word tends to be negative. Note that this word sentiment dictionary is not guaranteed to be thorough and complete.

fyi folder: resources you may be interested in.

What we need to submit

3.2 What you need to submit

You need to submit a zipped package containing following files:

test_y.txt. You need to put your prediction results in a file named as “test_y.txt”. The file contains 970 rows with 2 columns, in the form of index,label. “index” denotes the ID of corresponding sentence in “text_x.txt”, and “label” denotes the sentiment you predict with respect to that sentence, which should be “positive”, “negative”, or “neutral”.

.ipynb file. You need to implement your solution in this jupyter notebook file, and add sufficient introduction to your thoughts and program (e.g. use highlight, bulletin, etc, just like the lecture slides), using the markdown language.

Important: your .ipynb file should be able to output (write) the “test_y.txt” to hard disk. If there is only the prediction result (test_y.txt) and your .ipynb file does not include the code which can output the prediction result, you will get zero marks. I will go into the .ipynb file and run each cell of your program. So, make sure your .ipynb is well organized so that I can easily find the part which can output your prediction result.

sentiment_words_dict.txt (optional). If you use your own sentiment dictionary in the solution, you need to submit the dictionary used in your project as well.

In all, you need to submit all the materials so that I can successfully run your code and obtain the prediction result.

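Getting Started

Handling the CSV-style files

While working through the provided material, I noticed that train.txt and test_x.txt roughly follow the CSV format, so I process them together with the dictionary file: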

import csv
import re

# Module-level state shared by the loaders and the analysis
senti_dict = {}
idf_dict = {}
test_x = []
test_y = []
train = []
custom_dict = {}
merged_dict = {}
accuCount = 0.0
class handleData:

    def __init__(self, senti_dict_path, test_x_path, test_y_path, train_path=None, custom_dict_path=None):
        # Load each input; the first three paths are required
        if not senti_dict_path:
            print('请输入 senti_dict_path')
            quit()
        else:
            self.handle_senti_dict(senti_dict_path)
        if not test_x_path:
            print('请输入 test_x_path')
            quit()
        else:
            self.handle_test_x(test_x_path)
        if not test_y_path:
            print('请输入 test_y_path')
            quit()
        if train_path:
            self.handle_train(train_path)
        if custom_dict_path:
            self.handle_custom_dict(custom_dict_path)
            merged_dict.update(custom_dict)
        merged_dict.update(senti_dict)

    def handle_senti_dict(self, path):
        global senti_dict
        # Load senti_dict as a dict (word -> score); the header row is skipped
        with open(path) as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                if csv_reader.line_num == 1:
                    continue
                senti_dict[line[1]] = float(line[2])
            print('senti_dict 库共有 %s 行' %len(senti_dict))

    def handle_test_x(self, path):
        global test_x
        # Load test_x as a list of rows: [index, text]
        with open(path) as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                line[0] = int(line[0])
                test_x.append(line)
            #test_x = sorted(test_x)
            print('test_x 库共有 %s 行' %len(test_x))

    def handle_train(self, path):
        global train
        # Load train as a list of rows: [index, sentiment, text]; the header row is skipped
        with open(path) as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                if csv_reader.line_num == 1:
                    continue
                line[0] = int(line[0])
                train.append(line)
            #train = sorted(train)
            print('train 库共有 %s 行' %len(train))

    def handle_custom_dict(self, path):
        global custom_dict
        # Load custom_dict (word -> score) and idf_dict (word -> Idf); the header row is skipped
        with open(path) as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                if csv_reader.line_num == 1:
                    continue
                custom_dict[line[0]] = float(line[1])
                idf_dict[line[0]] = float(line[2])
            print('custom_dict 库共有 %s 行' %len(custom_dict))

def listToCSV(rows, path):
    # Write the rows to path as CSV (rows avoids shadowing the built-in list)
    with open(path, 'w', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(rows)
        print('已将结果保存于 %s' %path)

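# Entry point: in the actual file, main() and sentimentAnalysis() must be defined before this call runs
main()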

To run the whole thing, just call it like this:

def main():
    print('Python for Finance Project: Financial Statement Sentiment Analysis')
    handleData('./senti_dict.csv', './test_x.txt', './test_y.txt', './train.txt', './sentiment_words_dict.txt')
    if train != []:
        listToCSV(sentimentAnalysis(train, 2), './test_train.txt')
    else:
        listToCSV(sentimentAnalysis(test_x, 1), './test_y.txt')

Here './train.txt' and './sentiment_words_dict.txt' are the two optional arguments.

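The sentimentAnalysis() part

This was the part that gave me the biggest headache. The course has always stressed the basics and never went deep into algorithms (recursion was the only one covered), so I had to feel my way forward and improvise the calculation.

And no external packages allowed!

Interestingly, though, the TA provided a dictionary that looks like it contains next to nothing: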

Words Scores
0 abil 0.00012416859533089645
1 actual 0.0006961488543271923
2 advertis -0.005592582215163179
3 agenc 0.0006353814403691496
4 aggreg -0.0010330308303088183
5 agreement 7.96548813763754e-05
6 allow 0.002174006034169924
7 although -0.003593985395296731

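So what do the numbers in the third column mean?

(I still have not figured it out.)

After some comparison, though, I found that the data comes from here: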

https://github.com/nproellochs/SentimentDictionaries/blob/master/Dictionary8K.csv

The source file also has a third column holding Idf (inverse document frequency) values.

Words Scores Idf
abil 0.013451985735806887 2.440750228642685
abl -0.004787871091608642 1.9253479378492828
absolut 0.003360788277744489 2.728432301094466
academi 0.007129395655638781 2.9089206768067597
accent -0.003550084101686155 2.894374965804381
accept -0.010454735567790118 1.5262960445758316
accomplish 0.004308459321682365 2.7162740966146566
act -0.022561708229224143 0.89426188633043
actor -0.011279482911515365 1.0184159310395977
actual -0.022918668083275685 1.5116972451546788

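This gave me an idea: could we use word frequency to measure how rare each word in the text is?

An obvious approach is to find the words that occur most often. If a word is important, it should appear many times in the article. So we count the "term frequency" (TF).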

http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
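To make "term frequency" concrete, here is a small sketch that counts how often each word occurs in a sentence and divides by the sentence length. The function name `term_frequencies` and the sample sentence are made up for illustration, and a plain dict is used instead of any helper class to stay within the "no outside packages" rule.

```python
import re

def term_frequencies(text):
    """Return {word: occurrences / total words} for one piece of text."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    total = len(words)
    # For empty text, counts is empty and no division happens
    return {w: c / total for w, c in counts.items()}

print(term_frequencies("Sales up, profit up"))
# {'sales': 0.25, 'up': 0.5, 'profit': 0.25}
```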

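Now there is a plan. First extract the words from a given sentence, then count the total number of words in it. The implementation:

# Returns a list of lowercase words
def getWords(text):
    # Replace every non-letter character with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase everything and split into a list of words
    words = text.lower().split()
    return words

# Returns an integer: the number of words in the text
def countWords(text):
    return len(getWords(text))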
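A quick check of what this tokenizer actually produces. The sample sentence and the two-entry `stem_keys` set are made up for illustration; `getWords` is repeated here only so the snippet runs on its own.

```python
import re

def getWords(text):
    text = re.sub("[^a-zA-Z]", " ", text)
    return text.lower().split()

sample = "Operating profit rose 12% to EUR 3.5 mn in Q3."
print(getWords(sample))
# ['operating', 'profit', 'rose', 'to', 'eur', 'mn', 'in', 'q']

# Dictionary keys such as 'abil' and 'actual' look like word stems,
# so full word forms do not match them without extra processing.
stem_keys = {"abil", "actual"}
hits = [w for w in getWords("Ability actually matters") if w in stem_keys]
print(hits)
# []
```

Two things stand out: digits are dropped entirely (so "Q3" becomes "q"), and because the dictionary entries appear to be stems, words like "ability" will not hit "abil" with an exact lookup.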

Then, while matching words against the dictionary, compute TF-IDF for each hit and keep the word with the highest TF-IDF value:

def sentimentAnalysis(data, num):
    global accuCount
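    # -= Process the data row by row =-
    # For test_x: line[0] is the id, line[1] is the sentence
    # For train: line[0] is the id, line[1] is the annotated sentiment, line[2] is the sentence
    # That is why main() passes a different column number for each data source
    # index counts the rows starting from 0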
    for index, line in enumerate(data):
        # Track the dictionary word with the highest TF-IDF in this sentence
        max_tf_idf = 0
        max_word = ''
        sentiType = 'neutral'
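        # -= Check the sentence word by word =-
        # line[num] is the whole sentence as a string
        for word in getWords(line[num]):
            # The word is in the dictionary:
            # merged_dict[word] is its polarity score, idf_dict[word] is its Idf value.
            # idf_dict is only filled from the custom dictionary, so words without an Idf are skipped
            if word in merged_dict and word in idf_dict:
                tf_idf_val = tf_idf(word, line[num], idf_dict[word])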
                if tf_idf_val > max_tf_idf:
                    max_tf_idf = tf_idf_val
                    max_word = word
        if max_word != '':
            if merged_dict[max_word] > 0.01:
                sentiType = 'positive'
            elif merged_dict[max_word] < -0.01:
                sentiType = 'negative'
            else:
                sentiType = 'neutral'
        else:
            sentiType = 'neutral'
        test_y.append([line[0], sentiType])
        if sentiType == line[1]:
            accuCount += 1
    # Report accuracy when running on train.txt (num == 2)
    if num == 2:
        print('精确度为', accuCount / len(data))
    return test_y
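The `tf_idf` helper called inside `sentimentAnalysis()` is not shown in this post. The sketch below assumes the usual definition (the word's frequency within the sentence, multiplied by the Idf value passed in) and matches the call signature `tf_idf(word, text, idf)`; the body is an assumption, not necessarily the code that produced the numbers reported here.

```python
import re

def getWords(text):
    return re.sub("[^a-zA-Z]", " ", text).lower().split()

def tf_idf(word, text, idf):
    """Term frequency of word in text, weighted by a precomputed Idf value."""
    words = getWords(text)
    if not words:
        # Avoid dividing by zero for empty or letter-free text
        return 0.0
    tf = words.count(word) / len(words)
    return tf * idf

sentence = "Profit rose as profit margins improved"
# 'profit' occurs 2 times out of 6 words, so tf = 2/6
print(tf_idf("profit", sentence, 2.0))
```

Since the Idf values come straight from the dictionary file, a rare word (high Idf) can outrank a common one even when both appear once in the sentence, which is what the "pick the max TF-IDF word" step relies on.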

At this point we already get something usable (the 精确度为 line is the accuracy on train.txt, about 50.9%):

kirin@KirindeMacBook-Pro Senti % python3 main.py
Python for Finance Project: Financial Statement Sentiment Analysis
senti_dict 库共有 172 行
test_x 库共有 970 行
train 库共有 3876 行
custom_dict 库共有 683 行
精确度为 0.5092879256965944
已将结果保存于 ./test_train.txt
kirin@KirindeMacBook-Pro Senti %
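To put an accuracy figure like this in context, it helps to compare it with the share of the most common label in the annotated data. The sketch below is only an illustration: `accuracy`, `majority_baseline`, and the four-item toy lists are made up, and no claim is made here about the actual label distribution in train.txt.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the annotated one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)

def majority_baseline(actual):
    """Most frequent label and the accuracy of always predicting it."""
    counts = {}
    for label in actual:
        counts[label] = counts.get(label, 0) + 1
    best = max(counts, key=counts.get)
    return best, counts[best] / len(actual)

# Toy data for illustration only
actual = ["neutral", "neutral", "positive", "negative"]
predicted = ["neutral", "positive", "positive", "neutral"]

print(accuracy(predicted, actual))   # 0.5
print(majority_baseline(actual))     # ('neutral', 0.5)
```

Keeping the accuracy computation in its own function also avoids relying on the global accuCount, which keeps accumulating if sentimentAnalysis() is called more than once.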