Hung-yi Lee Machine Learning HW | Homework 1: COVID-19 Cases Prediction (Regression)

ML2021 Homework 1: COVID-19 Cases Prediction (Regression). My final optimized version:
https://github.com/Orange-yy/ML2021/blob/main/%E2%80%9CML2021Spring_HW1_ipynb%E2%80%9D%EF%BC%88%E6%94%B9%E8%BF%9B%E7%89%88%EF%BC%89.ipynb
Objectives:

  • Solve a regression problem with deep neural networks (DNN).
  • Understand basic DNN training tips.
  • Get familiar with PyTorch.
Download Data
If the Google Drive links are dead, you can download the data from Kaggle and upload it manually to the workspace.
tr_path = 'covid.train.csv'  # path to training data
tt_path = 'covid.test.csv'   # path to testing data

!gdown --id '19CCyCgJrUxtvgZF53vnctJiOJ23T5mqF' --output covid.train.csv
!gdown --id '1CE240jLm2npU-tdz81-oVKEF3T2yfT1O' --output covid.test.csv

Import Some Packages
# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# For data preprocess
import numpy as np
import csv
import os

# For plotting
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

myseed = 42069  # set a random seed for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# The next few lines seed the random number generators that may be used later
np.random.seed(myseed)
torch.manual_seed(myseed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(myseed)

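What is torch.backends.cudnn.deterministic?
As the name suggests, setting it to True makes cuDNN return a deterministic convolution algorithm on every call (the default algorithm). Combined with a fixed Torch random seed, this should ensure that the network produces the same output for the same input on every run.
torch.backends.cudnn.benchmark = False
Setting torch.backends.cudnn.benchmark = True makes the program spend a little extra time at startup searching for the fastest convolution implementation for each convolutional layer, which then speeds up the network. It suits cases where the network structure is fixed (not changing dynamically) and the input shape (batch size, image size, input channels) stays the same, which covers most ordinary setups. Conversely, if the convolution configuration keeps changing, the program keeps repeating this search and ends up slower.
For details, see: https://blog.csdn.net/byron123456sfsfsfa/article/details/96003317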
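To see what the fixed seed buys you, one option is to build the same layer twice after re-seeding and compare the initial weights. This is only a minimal sketch: the helper name `seeded_linear_weights` and the tiny `nn.Linear(4, 2)` layer are illustrative assumptions, not part of the homework code.

```python
import numpy as np
import torch
import torch.nn as nn


def seeded_linear_weights(seed):
    """Seed NumPy and PyTorch, then return the initial weights of a fresh layer."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    layer = nn.Linear(4, 2)  # weights are drawn from the seeded generator
    return layer.weight.detach().clone()


w_a = seeded_linear_weights(42069)
w_b = seeded_linear_weights(42069)
w_c = seeded_linear_weights(7)

print(torch.equal(w_a, w_b))  # same seed: identical initial weights
print(torch.equal(w_a, w_c))  # different seed: the weights differ
```

Seeding alone covers the random number generators; the two cuDNN flags discussed above additionally control which GPU convolution algorithms are chosen.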
Some Utilities (for plotting)
You do not need to modify this part.
def get_device():
    ''' Get device (if GPU is available, use GPU) '''
    return 'cuda' if torch.cuda.is_available() else 'cpu'

def plot_learning_curve(loss_record, title=''):
    ''' Plot learning curve of your DNN (train & dev loss) '''
    total_steps = len(loss_record['train'])
    x_1 = range(total_steps)
    x_2 = x_1[::len(loss_record['train']) // len(loss_record['dev'])]
    figure(figsize=(6, 4))
    plt.plot(x_1, loss_record['train'], c='tab:red', label='train')
    plt.plot(x_2, loss_record['dev'], c='tab:cyan', label='dev')
    plt.ylim(0.0, 5.)
    plt.xlabel('Training steps')
    plt.ylabel('MSE loss')
    plt.title('Learning curve of {}'.format(title))
    plt.legend()
    plt.show()

def plot_pred(dv_set, model, device, lim=35., preds=None, targets=None):
    ''' Plot prediction of your DNN '''
    if preds is None or targets is None:
        model.eval()
        preds, targets = [], []
        for x, y in dv_set:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                preds.append(pred.detach().cpu())
                targets.append(y.detach().cpu())
        preds = torch.cat(preds, dim=0).numpy()
        targets = torch.cat(targets, dim=0).numpy()

    figure(figsize=(5, 5))
    plt.scatter(targets, preds, c='r', alpha=0.5)
    plt.plot([-0.2, lim], [-0.2, lim], c='b')
    plt.xlim(-0.2, lim)
    plt.ylim(-0.2, lim)
    plt.xlabel('ground truth value')
    plt.ylabel('predicted value')
    plt.title('Ground Truth v.s. Prediction')
    plt.show()

Preprocess
We have three kinds of datasets:
  • train: for training
  • dev: for validation
  • test: for testing (w/o target value)
Dataset
The COVID19Dataset below does the following:
  • read .csv files
  • extract features
  • split covid.train.csv into train/dev sets
  • normalize features
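Finishing the TODO below might get you past the medium baseline.
Some notes on the COVID19Dataset class:
Before tackling any machine learning problem, the data has to be read and preprocessed. PyTorch provides many tools that make data loading and preprocessing easy.
torch.utils.data.Dataset is an abstract class that represents a custom dataset. You can define your own dataset class by inheriting from it; all you need to do is define the two methods __len__ and __getitem__.
By inheriting from torch.utils.data.Dataset we can define the dataset class we need and fetch samples one at a time by indexing. Doing only that, however, makes batching, shuffling, and loading data in parallel hard to implement, so PyTorch also provides torch.utils.data.DataLoader. It defines a new iterator that wraps the output of a custom dataset, or of one of PyTorch's built-in datasets, into Tensors of the given batch size. In old PyTorch versions each batch then had to be wrapped in a Variable before being fed to the model; since PyTorch 0.4, tensors can be passed to the model directly.
In short, the two classes torch.utils.data.Dataset and torch.utils.data.DataLoader make data loading simple and fast.
For details, see: https://blog.csdn.net/qq_36653505/article/details/83351808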
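The Dataset/DataLoader contract described above can be illustrated with a toy example. This is a sketch under stated assumptions: `ToyDataset` and its 10 x 3 data are made up for illustration and have nothing to do with the COVID-19 files.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples with 3 features and a scalar target."""

    def __init__(self):
        self.data = torch.arange(30, dtype=torch.float32).reshape(10, 3)
        self.target = self.data.sum(dim=1)

    def __getitem__(self, index):
        # Return one (features, target) pair
        return self.data[index], self.target[index]

    def __len__(self):
        # Number of samples in the dataset
        return len(self.data)


# The DataLoader calls __len__ and __getitem__ and stacks samples into batches
loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(batch_shapes)  # [((4, 3), (4,)), ((4, 3), (4,)), ((2, 3), (2,))]
```

With 10 samples and a batch size of 4, the last batch holds only 2 samples because drop_last defaults to False.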
class COVID19Dataset(Dataset):
    ''' Dataset for loading and preprocessing the COVID19 dataset '''
    def __init__(self, path, mode='train', target_only=False):
        self.mode = mode

        # Read data into numpy arrays
        with open(path, 'r') as fp:
            data = list(csv.reader(fp))  # read row by row into a list
            data = np.array(data[1:])[:, 1:].astype(float)

        if not target_only:
            feats = list(range(93))
        else:
            # TODO: Using 40 states & 2 tested_positive features (indices = 57 & 75)
            pass

        if mode == 'test':
            # Testing data
            # data: 893 x 93 (40 states + day 1 (18) + day 2 (18) + day 3 (17))
            data = data[:, feats]
            self.data = torch.FloatTensor(data)
        else:
            # Training data (train/dev sets)
            # data: 2700 x 94 (40 states + day 1 (18) + day 2 (18) + day 3 (18))
            target = data[:, -1]
            data = data[:, feats]

            # Splitting training data into train & dev sets
            if mode == 'train':
                indices = [i for i in range(len(data)) if i % 10 != 0]
            elif mode == 'dev':
                indices = [i for i in range(len(data)) if i % 10 == 0]

            # Convert data into PyTorch tensors
            self.data = torch.FloatTensor(data[indices])
            self.target = torch.FloatTensor(target[indices])

        # Normalize features (you may remove this part to see what will happen)
        self.data[:, 40:] = \
            (self.data[:, 40:] - self.data[:, 40:].mean(dim=0, keepdim=True)) \
            / self.data[:, 40:].std(dim=0, keepdim=True)

        self.dim = self.data.shape[1]

        print('Finished reading the {} set of COVID19 Dataset ({} samples found, each dim = {})'
              .format(mode, len(self.data), self.dim))

    def __getitem__(self, index):
        # Returns one sample at a time
        if self.mode in ['train', 'dev']:
            # For training
            return self.data[index], self.target[index]
        else:
            # For testing (no target)
            return self.data[index]

    def __len__(self):
        # Returns the size of the dataset
        return len(self.data)
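The train/dev split inside the class is a plain modulo rule on the row index. The sketch below replays it on index lists only; the helper name `split_indices` is hypothetical, and 2700 is the number of training rows quoted in the code comments.

```python
def split_indices(n_samples, mode):
    """Mimic the modulo-10 split: every 10th row goes to dev, the rest to train."""
    if mode == 'train':
        return [i for i in range(n_samples) if i % 10 != 0]
    if mode == 'dev':
        return [i for i in range(n_samples) if i % 10 == 0]
    raise ValueError("mode must be 'train' or 'dev'")


train_idx = split_indices(2700, 'train')
dev_idx = split_indices(2700, 'dev')

print(len(train_idx), len(dev_idx))  # 2430 270
print(dev_idx[:3])                   # [0, 10, 20]
```

So 10% of the rows end up in the dev set. Note that the normalization step in the class uses the mean and standard deviation of whichever split is being loaded, so train, dev, and test are each scaled with their own statistics; this may be worth revisiting when looking for the issues mentioned in the hints.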

DataLoader
A DataLoader loads data from a given Dataset into batches.
def prep_dataloader(path, mode, batch_size, n_jobs=0, target_only=False):
    ''' Generates a dataset, then is put into a dataloader. '''
    dataset = COVID19Dataset(path, mode=mode, target_only=target_only)  # Construct dataset
    dataloader = DataLoader(
        dataset, batch_size,
        shuffle=(mode == 'train'), drop_last=False,
        num_workers=n_jobs, pin_memory=True)                            # Construct dataloader
    return dataloader

Deep Neural Network
NeuralNet is an nn.Module designed for regression.
The DNN consists of 2 fully-connected layers with ReLU activation.
This module also includes a function cal_loss for calculating the loss.
class NeuralNet(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self, input_dim):
        super(NeuralNet, self).__init__()

        # Define your neural network here
        # TODO: How to modify this model to achieve better performance?
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

        # Mean squared error loss
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        ''' Given input of size (batch_size x input_dim), compute output of the network '''
        return self.net(x).squeeze(1)

    def cal_loss(self, pred, target):
        ''' Calculate loss '''
        # TODO: you may implement L1/L2 regularization here
        return self.criterion(pred, target)
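The regularization TODO in cal_loss could be approached by adding a weighted sum of squared parameters to the MSE. The following is a sketch, not the author's solution: the helper `l2_penalty`, the strength `lambda_l2`, and the small stand-in network are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


def l2_penalty(model):
    """Sum of squared values over all parameters (weights and biases)."""
    return sum(p.pow(2).sum() for p in model.parameters())


torch.manual_seed(0)
net = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.MSELoss(reduction='mean')

x = torch.randn(16, 5)           # random stand-in batch
y = torch.randn(16)
pred = net(x).squeeze(1)

lambda_l2 = 1e-4                 # hypothetical regularization strength
mse = criterion(pred, y)
loss = mse + lambda_l2 * l2_penalty(net)
loss.backward()                  # gradients now include the penalty term

print(mse.item(), loss.item())
```

An alternative that needs no change to the loss is the weight_decay argument that optimizers in torch.optim such as SGD accept; which variant works better here would have to be checked on the dev set.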

Train/Dev/Test
Training
def train(tr_set, dv_set, model, config, device):
    ''' DNN training '''

    n_epochs = config['n_epochs']  # Maximum number of epochs

    # Setup optimizer
    optimizer = getattr(torch.optim, config['optimizer'])(
        model.parameters(), **config['optim_hparas'])

    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}      # for recording training loss
    early_stop_cnt = 0
    epoch = 0
    while epoch < n_epochs:
        model.train()                           # set model to training mode
        for x, y in tr_set:                     # iterate through the dataloader
            optimizer.zero_grad()               # set gradient to zero
            x, y = x.to(device), y.to(device)   # move data to device (cpu/cuda)
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            mse_loss.backward()                 # compute gradient (backpropagation)
            optimizer.step()                    # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())

        # After each epoch, test your model on the validation (development) set.
        dev_mse = dev(dv_set, model, device)
        if dev_mse < min_mse:
            # Save model if your model improved
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, loss = {:.4f})'
                  .format(epoch + 1, min_mse))
            torch.save(model.state_dict(), config['save_path'])  # Save model to specified path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1

        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # Stop training if your model stops improving for "config['early_stop']" epochs.
            break

    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record
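The early-stopping rule in train can be replayed on a plain list of dev losses without any model. This is a minimal sketch: the function name `epochs_until_stop` and the loss values are made up, and it starts from infinity rather than the 1000. used in the training loop.

```python
def epochs_until_stop(dev_losses, patience):
    """Replay the early-stopping rule on a list of per-epoch dev losses.

    Returns (number of epochs run, best dev loss seen).
    """
    best = float('inf')
    stale = 0        # epochs since the last improvement
    epochs_run = 0
    for loss in dev_losses:
        epochs_run += 1
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
        if stale > patience:
            break
    return epochs_run, best


losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
print(epochs_until_stop(losses, patience=2))   # (5, 0.8)
print(epochs_until_stop(losses, patience=10))  # (6, 0.7)
```

Because the comparison is a strict "greater than", training stops on the (patience + 1)-th epoch without improvement. The example also shows the usual trade-off: with a patience of 2 the run stops before reaching the later, better loss of 0.7.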

Validation
def dev(dv_set, model, device):
    model.eval()                                # set model to evaluation mode
    total_loss = 0
    for x, y in dv_set:                         # iterate through the dataloader
        x, y = x.to(device), y.to(device)       # move data to device (cpu/cuda)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
        total_loss += mse_loss.detach().cpu().item() * len(x)  # accumulate loss
    total_loss = total_loss / len(dv_set.dataset)              # compute averaged loss

    return total_loss
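The reason dev multiplies each batch loss by len(x) is that the last batch can be smaller than the others, so a plain average of batch means would overweight it. A small sketch with made-up numbers (the helper `dataset_mean_loss` is hypothetical):

```python
def dataset_mean_loss(batch_losses, batch_sizes):
    """Average per-batch mean losses, weighting each by its batch size."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


batch_losses = [1.0, 2.0, 4.0]  # mean loss of each batch
batch_sizes = [4, 4, 2]         # the last batch is smaller

weighted = dataset_mean_loss(batch_losses, batch_sizes)
naive = sum(batch_losses) / len(batch_losses)

print(weighted)  # 2.0
print(naive)     # about 2.33, the small last batch counts too much
```

With the dev set of 270 samples and a batch size of 270, as in the configuration used in this notebook, there is only one batch and both averages coincide; the weighting matters once the batch size no longer divides the dataset size.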

Testing
def test(tt_set, model, device):
    model.eval()                                # set model to evaluation mode
    preds = []
    for x in tt_set:                            # iterate through the dataloader
        x = x.to(device)                        # move data to device (cpu/cuda)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            preds.append(pred.detach().cpu())   # collect prediction
    preds = torch.cat(preds, dim=0).numpy()     # concatenate all predictions and convert to a numpy array
    return preds

Setup Hyper-parameters
config contains hyper-parameters for training and the path to save your model.
device = get_device()                 # get the current available device ('cpu' or 'cuda')
os.makedirs('models', exist_ok=True)  # The trained model will be saved to ./models/
target_only = False                   # TODO: Using 40 states & 2 tested_positive features

# TODO: How to tune these hyper-parameters to improve your model's performance?
config = {
    'n_epochs': 3000,                # maximum number of epochs
    'batch_size': 270,               # mini-batch size for dataloader
    'optimizer': 'SGD',              # optimization algorithm (optimizer in torch.optim)
    'optim_hparas': {                # hyper-parameters for the optimizer (depends on which optimizer you are using)
        'lr': 0.001,                 # learning rate of SGD
        'momentum': 0.9              # momentum for SGD
    },
    'early_stop': 200,               # early stopping epochs (the number epochs since your model's last improvement)
    'save_path': 'models/model.pth'  # your model will be saved here
}
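Since train looks up the optimizer by name with getattr(torch.optim, ...), switching optimizers only requires editing the config. The sketch below shows the mechanism on a throwaway model; `build_optimizer` and the Adam settings are illustrative assumptions, not tuned values.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)  # throwaway model, just to have parameters

sgd_cfg = {'optimizer': 'SGD', 'optim_hparas': {'lr': 0.001, 'momentum': 0.9}}
adam_cfg = {'optimizer': 'Adam', 'optim_hparas': {'lr': 0.001}}  # hypothetical alternative


def build_optimizer(model, config):
    """Look up the optimizer class by name in torch.optim and instantiate it."""
    optim_cls = getattr(torch.optim, config['optimizer'])
    return optim_cls(model.parameters(), **config['optim_hparas'])


sgd = build_optimizer(model, sgd_cfg)
adam = build_optimizer(model, adam_cfg)
print(type(sgd).__name__, type(adam).__name__)  # SGD Adam
```

The keys in optim_hparas must match the constructor arguments of the chosen optimizer; for example, Adam does not accept momentum, so that key has to be removed when switching.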

Load data and model
tr_set = prep_dataloader(tr_path, 'train', config['batch_size'], target_only=target_only)
dv_set = prep_dataloader(tr_path, 'dev', config['batch_size'], target_only=target_only)
tt_set = prep_dataloader(tt_path, 'test', config['batch_size'], target_only=target_only)

model = NeuralNet(tr_set.dataset.dim).to(device)  # Construct model and move to device

Start Training!
model_loss, model_loss_record = train(tr_set, dv_set, model, config, device)

plot_learning_curve(model_loss_record, title='deep model')

del model
model = NeuralNet(tr_set.dataset.dim).to(device)
ckpt = torch.load(config['save_path'], map_location='cpu')  # Load your best model
model.load_state_dict(ckpt)
plot_pred(dv_set, model, device)  # Show prediction on the validation set

Testing
The predictions of your model on the testing set will be stored in pred.csv.
def save_pred(preds, file):
    ''' Save predictions to specified file '''
    print('Saving results to {}'.format(file))
    # newline='' is what the csv module recommends; it avoids blank rows on Windows
    with open(file, 'w', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])

preds = test(tt_set, model, device)  # predict COVID-19 cases with your model
save_pred(preds, 'pred.csv')         # save prediction file to pred.csv
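To preview the layout of the prediction file without touching the disk, the same rows can be written to an in-memory buffer. This is a sketch with made-up prediction values; `write_pred` is a hypothetical variant of save_pred that takes an open file object.

```python
import csv
import io


def write_pred(preds, fp):
    """Write predictions as rows of (id, tested_positive) to an open text file."""
    writer = csv.writer(fp, lineterminator='\n')
    writer.writerow(['id', 'tested_positive'])
    for i, p in enumerate(preds):
        writer.writerow([i, p])


buffer = io.StringIO()
write_pred([16.5, 3.25], buffer)  # made-up predictions
print(buffer.getvalue())
```

The output is a header row followed by one row per test sample, with the row index as the id.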

Hints
Simple Baseline
  • Run sample code
Medium Baseline
  • Feature selection: 40 states + 2 tested_positive (TODO in dataset)
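One possible reading of this hint, using the indices given in the dataset TODO (57 and 75), is sketched below. The helper `select_features` is hypothetical; in the notebook the resulting list would be assigned to feats in the target_only branch.

```python
def select_features(target_only):
    """Return the column indices used as model input."""
    if not target_only:
        return list(range(93))  # all 93 feature columns
    # 40 state indicator columns plus the tested_positive columns of day 1 and day 2
    return list(range(40)) + [57, 75]


feats = select_features(target_only=True)
print(len(feats))   # 42
print(feats[-2:])   # [57, 75]
```

The indices follow from the column layout stated in the code comments: 40 state columns, then 18 columns per day, so the last column of day 1 is 40 + 18 - 1 = 57 and the last column of day 2 is 57 + 18 = 75.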
Strong Baseline
  • Feature selection (what other features are useful?)
  • DNN architecture (layers? dimension? activation function?)
  • Training (mini-batch? optimizer? learning rate?)
  • L2 regularization
  • There are some mistakes in the sample code, can you find them?
Reference
This code was written entirely by Heng-Jui Chang @ NTUEE.
If you copy or reuse this code, you must credit the original author.
E.g.
Source: Heng-Jui Chang @ NTUEE (https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.ipynb)
Optimization reference: https://github.com/wolfparticle/machineLearningDeepLearning/blob/main/homework_code/hw1/HW1_local参考代码/HW1_local.ipynb
