73. Fake News Classification with Deep Learning

73.1. Introduction

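Deep learning has important applications in natural language processing, and the earlier material on recurrent neural networks already touched on some of them. This challenge asks you to draw on what you have learned so far to improve the accuracy of fake news text classification.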

73.2. Key Topics

Text classification
Deep neural networks

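In the text classification lab, we used the WSDM fake news classification dataset to walk through the text classification workflow. The results were not particularly good, though: test accuracy was only around \(65\%\). In this challenge, you will reclassify the fake news data using the data preprocessing techniques from that lab together with what you have learned about deep learning.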

Exercise 73.1

Open-ended challenge

Challenge: use text classification preprocessing and deep learning to build a deep neural network that classifies the fake news data.

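Requirements: split the provided data \(8:2\) into training and test sets, and reach a final test accuracy \(>70\%\). You are free to choose the text preprocessing method, the feature extraction approach, and the deep neural network architecture.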

The challenge must use the fake news data provided in the lab.

https://cdn.aibydoing.com/aibydoing/files/wsdm_mini.csv  # fake news data
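The \(8:2\) split in the requirements can be sketched in plain Python as follows. This is only an illustration of the idea. The helper name `split_80_20` and the fixed seed are assumptions introduced here, and in practice a library routine such as scikit-learn's `train_test_split` with `test_size=0.2` does the same job.

```python
import random

def split_80_20(samples, seed=0):
    """Shuffle a copy of `samples` and cut it into 80% train / 20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = len(shuffled) * 8 // 10          # index where the test part starts
    return shuffled[:cut], shuffled[cut:]

# Toy data standing in for the rows of the dataset
train, test = split_80_20(range(10))
print(len(train), len(test))  # 8 2
```

Shuffling before cutting matters: if the file happens to be ordered by label, an unshuffled split would leave the test set with a different class mix than the training set.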

                    

## Add your code here ##

Reference solution: Exercise 73.1

wget -nc "https://cdn.aibydoing.com/aibydoing/files/wsdm_mini.csv"  # fake news data
wget -nc "https://cdn.aibydoing.com/aibydoing/files/stopwords.txt"  # stop-word list

import pandas as pd

df = pd.read_csv("wsdm_mini.csv")
# Concatenate the two title columns into a single text column
df['title_zh'] = df[['title1_zh', 'title2_zh']].apply(
    lambda x: ''.join(x), axis=1)
df.head()

                        

import jieba
from tqdm.notebook import tqdm  # replaces the deprecated tqdm_notebook

def load_stopwords(file_path):
    # One stop word per line; assumes the file is UTF-8 encoded.
    # A set makes the membership test below O(1) instead of O(n).
    with open(file_path, 'r', encoding='utf-8') as f:
        return {line.strip('\n') for line in f}

stopwords = load_stopwords('stopwords.txt')

corpus = []
for line in tqdm(df['title_zh']):
    # Segment the text with jieba and drop stop words
    words = [word for word in jieba.cut(line) if word not in stopwords]
    corpus.append(" ".join(words))
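The stop-word filtering step above can be tried on its own without jieba. The sketch below assumes the title has already been segmented; the token list and the toy stop-word set are made up for illustration and are not taken from `stopwords.txt`.

```python
def remove_stopwords(tokens, stopwords):
    """Drop stop words and join the remaining tokens with spaces."""
    return " ".join(tok for tok in tokens if tok not in stopwords)

# Hypothetical, already segmented title and a toy stop-word set
tokens = ["这", "是", "一", "条", "假", "新闻"]
toy_stopwords = {"这", "是", "一", "条"}
print(remove_stopwords(tokens, toy_stopwords))  # 假 新闻
```

Joining the tokens with spaces is what lets a whitespace-based tokenizer, such as the Keras `Tokenizer` with its default settings, treat each segmented Chinese word as one token.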

                        

import tensorflow as tf

# Limit the vocabulary to the most frequent words (num_words=10000)
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000)
tokenizer

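tokenizer.fit_on_texts(corpus)  # build the word index from the corpus
X_ = tokenizer.texts_to_sequences(corpus)  # map each text to a list of word indices

# Decode the first sequence back into words as a sanity check
for seq in X_[:1]:
    print([tokenizer.index_word[idx] for idx in seq])

# Pad or truncate every sequence to a fixed length of 20
X = tf.keras.preprocessing.sequence.pad_sequences(X_, maxlen=20)
X.shape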
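To make the effect of `pad_sequences` concrete, here is a small pure-Python sketch of its default behaviour: zeros are added at the front of short sequences, and long sequences lose their earliest items. The function name `pad_pre` is an assumption introduced for this illustration; it is not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic the default pad_sequences behaviour: pad and truncate at the front."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                      # keep the last `maxlen` items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[1, 2, 3], [4, 5, 6, 7, 8]], maxlen=4))
# [[0, 1, 2, 3], [5, 6, 7, 8]]
```

Index 0 is safe to use as the padding value because the Keras `Tokenizer` starts its word indices at 1.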

                        

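from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
# One-hot encode the labels; fit_transform returns a sparse matrix,
# so convert it to a dense array before passing it to Keras
y_onehot = encoder.fit_transform(df.label.values.reshape(len(df), -1)).toarray()
y_onehot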
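What the encoder produces can be illustrated with a few lines of plain Python. This is a sketch under assumptions: the helper `one_hot` is hypothetical, and the three label strings are placeholders chosen for the example, so check `df.label.unique()` for the actual values in the dataset.

```python
def one_hot(labels):
    """Return the sorted class list and one indicator row per label."""
    classes = sorted(set(labels))               # one column per class, in sorted order
    index = {c: i for i, c in enumerate(classes)}
    rows = [[1 if index[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows

# Placeholder label values, used only for illustration
classes, rows = one_hot(["agreed", "unrelated", "disagreed", "agreed"])
print(classes)  # ['agreed', 'disagreed', 'unrelated']
print(rows)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each row has exactly one 1, which is the target format that the `categorical_crossentropy` loss expects alongside a softmax output layer.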

from sklearn.model_selection import train_test_split

# 8:2 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.2)

# Embedding -> Flatten -> softmax over the 3 classes
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(10000, 16, input_length=20))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.summary()

model.compile(optimizer='Adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=64, epochs=10,
          validation_data=(X_test, y_test))
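To check a result against the \(70\%\) target by hand, accuracy can be computed from the softmax outputs and the one-hot labels by comparing the position of the largest value in each row. The sketch below uses made-up probabilities rather than real model output, and the helper names `argmax` and `accuracy` are assumptions introduced here.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(probabilities, one_hot_labels):
    """Fraction of rows whose predicted class matches the true class."""
    correct = sum(argmax(p) == argmax(y)
                  for p, y in zip(probabilities, one_hot_labels))
    return correct / len(one_hot_labels)

# Made-up softmax outputs and labels for four samples
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
labels = [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
acc = accuracy(probs, labels)
print(acc, acc > 0.70)  # 0.75 True
```

This is the same quantity that the `accuracy` metric reports during training when the labels are one-hot encoded.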

                        
