Reproducing the NLP Multi-Class Text Classification Baseline

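First, create a new virtual environment: conda create -n NLP2020 python=3.7

Activate the environment, change into the project directory, and install the dependencies: pip install -r requirement_dev.txt

This fails with: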

ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'C:\Users\18771\AppData\Local\Temp\pip-uninstall-5kehbgst\pip.exe'
Consider using the --user option or check the permissions.

Adding --user as the message suggests makes the install succeed, but with a warning:

WARNING: The script twine.exe is installed in 'C:\Users\18771\AppData\Roaming\Python\Python37\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

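So I ran Anaconda Prompt as administrator instead and reinstalled everything, upgrading pip along the way: python -m pip install --upgrade pip

Next, install the remaining dependencies listed in the README: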

pytorch==1.4.0
cudatoolkit==9.2
tensorboard==2.2.1
scikit-learn==0.22
jieba==0.42.1

I have no GPU, so I skipped cudatoolkit:

pip install pytorch tensorboard scikit-learn jieba

Installing pytorch fails:

Exception: You tried to install "pytorch". The package named for PyTorch is "torch"
ERROR: Failed building wheel for pytorch

Go to the PyTorch website and select the build you need.

Install it with the generated command:

pip install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
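To confirm that the CPU-only build is what actually ended up in the environment, a quick check like the following can help. This is a minimal sketch: describe_torch is a hypothetical helper name introduced here, and it relies only on torch.__version__ and torch.cuda.is_available().

```python
def describe_torch():
    """Return a one-line summary of the installed torch build, if any."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    # On a CPU-only build, torch.cuda.is_available() returns False.
    return "torch {} (CUDA available: {})".format(
        torch.__version__, torch.cuda.is_available()
    )


print(describe_torch())
```

If the message reports that torch is missing, the interpreter being used is probably not the one from the NLP2020 environment.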

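List the installed packages.

Oddly, the packages from the requirements file do not show up. Were they not installed properly?

Looking back at the install output, I am not sure whether the install path will cause problems later.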
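One way to investigate is to ask Python where each package would be imported from. The sketch below is a suggested diagnostic, not something from the project: locate is a hypothetical helper, and the names checked are the import names of the README dependencies (scikit-learn imports as sklearn). Packages installed with --user go to the user site-packages directory rather than into the conda environment, which could be why an environment listing does not show them.

```python
import importlib.util
import site
import sys


def locate(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    return spec.origin


print("interpreter prefix:", sys.prefix)
print("user site-packages:", site.getusersitepackages())
for name in ("torch", "tensorboard", "sklearn", "jieba"):
    print(name, "->", locate(name))
```

A path under the interpreter prefix means the package lives inside the environment; a path under the user site-packages directory means it came from a --user install.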

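Start with a trial run on the online_shopping_10_cats dataset.

Check the parameter settings in args.py, then edit scripts/train_classification.sh.

Uh-oh: it is a sh script, and I am on Windows.

Today I tried opening the project in PyCharm and came across an interesting plugin, though I was not sure it would work.

After installing it and restarting the IDE, configure the git path (so git must already be installed).

Then another problem came up: when the script is run through git, it does not run inside the virtual environment, so the dependencies cannot be imported.

Never mind, I will rewrite it as a .bat file instead. The main changes: define the path variables with set, and replace ${path} references with %path%.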

rem Quote the whole assignment so the quotes do not become part of the value
set "WORKSPACE=C:\Users\18771\Desktop\NLP_2020\NLP2020\NLP2020-classification\nlp_2020"
set "DATADIR=%WORKSPACE%\data"
python ..\nlp_2020\classification\train.py --data_dir "%DATADIR%\classification" --model_name_or_path "%DATADIR%\model" --output_dir "%DATADIR%\output" --cache_dir "%DATADIR%\cache" --embed_path "%DATADIR%\sgns.sogounews.bigram-char"
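A possible alternative to maintaining separate .sh and .bat scripts is a small Python launcher, which also sidesteps the virtual-environment problem because sys.executable is the interpreter of whichever environment is active. This is only a sketch under assumptions: build_command is a hypothetical helper, the workspace path is a placeholder, and the location of train.py under the workspace is inferred from the .bat file rather than confirmed.

```python
import sys
from pathlib import Path


def build_command(workspace):
    """Build the training command as an argument list (no shell quoting needed)."""
    workspace = Path(workspace)
    data_dir = workspace / "data"
    # Assumption: train.py sits under <workspace>/classification.
    train_script = workspace / "classification" / "train.py"
    return [
        sys.executable,  # interpreter of the currently active environment
        str(train_script),
        "--data_dir", str(data_dir / "classification"),
        "--model_name_or_path", str(data_dir / "model"),
        "--output_dir", str(data_dir / "output"),
        "--cache_dir", str(data_dir / "cache"),
        "--embed_path", str(data_dir / "sgns.sogounews.bigram-char"),
    ]


cmd = build_command(Path("nlp_2020"))  # placeholder workspace path
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would start training; it is only printed here so the paths can be inspected first.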

Opening the code, everything is underlined in red...

Install the missing package: pip install torchtext

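The torchtext preprocessing pipeline:

Define the Field: declare how the data should be processed
Define the Dataset: obtain the dataset, where each sample is a word list preprocessed as the Field declares
Build the vocab: build the vocabulary and the word vectors (word embeddings)
Construct the iterator: build an iterator that feeds the model in batches during training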
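The four stages above can be illustrated without torchtext at all. The sketch below is a plain-Python stand-in, not torchtext's API: the function names, the whitespace tokenizer, and the sample rows are all made up for illustration, and a real run would use jieba for Chinese tokenization and pretrained vectors for the embeddings.

```python
from collections import Counter


def tokenize(text):
    # Stage 1 (Field): declare how raw text becomes tokens.
    return text.lower().split()


def build_dataset(rows):
    # Stage 2 (Dataset): every sample becomes (token list, label).
    return [(tokenize(text), label) for text, label in rows]


def build_vocab(dataset, specials=("<pad>", "<unk>")):
    # Stage 3 (vocab): map each token to an integer id, most frequent first.
    counts = Counter(tok for tokens, _ in dataset for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}


def batches(dataset, vocab, batch_size):
    # Stage 4 (iterator): yield padded id matrices and labels, batch by batch.
    pad, unk = vocab["<pad>"], vocab["<unk>"]
    for start in range(0, len(dataset), batch_size):
        chunk = dataset[start:start + batch_size]
        width = max(len(tokens) for tokens, _ in chunk)
        ids = [
            [vocab.get(tok, unk) for tok in tokens] + [pad] * (width - len(tokens))
            for tokens, _ in chunk
        ]
        yield ids, [label for _, label in chunk]


rows = [("good phone", 1), ("bad", 0), ("good good screen", 1)]  # made-up samples
dataset = build_dataset(rows)
vocab = build_vocab(dataset)
for ids, labels in batches(dataset, vocab, batch_size=2):
    print(ids, labels)
```

Padding each batch to its own longest sample keeps the id lists rectangular, which is what a tensor-based model needs.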

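While fixing the paths, the same error kept coming back:

FileNotFoundError: [WinError 3] The system cannot find the path specified: '${DATADIR}\output'

Change os.mkdir(args.output_dir) on line 171 of train.py to os.makedirs(args.output_dir), so that missing parent directories are created as well.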
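The difference is easy to reproduce in isolation: os.mkdir creates only the last path component and fails when a parent directory is missing, whereas os.makedirs creates every missing level. The sketch below uses a temporary directory and made-up subdirectory names.

```python
import os
import tempfile

base = tempfile.mkdtemp()
nested = os.path.join(base, "data", "output")  # "data" does not exist yet

try:
    os.mkdir(nested)  # fails: the parent directory is missing
    mkdir_failed = False
except FileNotFoundError:
    mkdir_failed = True

# makedirs creates "data" and "output"; exist_ok avoids an error on reruns.
os.makedirs(nested, exist_ok=True)
os.makedirs(nested, exist_ok=True)

print(mkdir_failed, os.path.isdir(nested))
```

Passing exist_ok=True is optional, but it keeps a second training run from failing because the output directory is already there.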

The data file could never be found:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\18771\Desktop\NLP_2020\NLP2020\NLP2020-classification\nlp_2020\data\classification\{mode}.csv'

At this point mode = 'train'. After many failed attempts, I gave up and hard-coded the filename: os.path.join(args.data_dir, 'train.csv').
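The literal {mode} in the error path suggests the placeholder was never substituted. The source of train.py is not shown here, so this is only a guess at the cause: a string that was meant to be formatted but lacks the f prefix or a .format call behaves exactly like this. The variable names below are illustrative.

```python
import os

mode = "train"
data_dir = os.path.join("data", "classification")  # illustrative path

# Plain string: the braces are kept verbatim.
literal = os.path.join(data_dir, "{mode}.csv")

# f-string and str.format both substitute the value of mode.
with_fstring = os.path.join(data_dir, f"{mode}.csv")
with_format = os.path.join(data_dir, "{mode}.csv".format(mode=mode))

print(literal)
print(with_fstring)
print(with_format)
```

If that is what happened, formatting the name this way would avoid hard-coding train.csv and keep the other modes working.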

The word vectors finished loading.