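CH3 - Data Preparation and Feature Engineering | 2-3 Handling Missing Data

2.3 Handling Missing Data
Related video course for this section: Handling Missing Data
Checking for Missing Data
Basics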

def foo():
    pass

f = foo()
print(f)

None

type(f)

NoneType

None + 2

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

import numpy as np
np.nan + 2

nan

type(np.nan)

float
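The outputs above show that None and np.nan behave differently in arithmetic: None raises a TypeError, while np.nan is a float and propagates. A minimal sketch of two related points, assuming only NumPy and pandas: np.nan never compares equal to itself, so missing values are better detected with pd.isna than with ==. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# NaN is a float that never compares equal to anything, itself included
nan_equals_itself = (np.nan == np.nan)

# pd.isna treats both None and np.nan as missing
none_is_missing = pd.isna(None)
nan_is_missing = pd.isna(np.nan)

print(nan_equals_itself, none_is_missing, nan_is_missing)
```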

import pandas as pd
s = pd.Series([1, 2, None, np.nan])  # ①
s

0    1.0
1    2.0
2    NaN
3    NaN
dtype: float64

s.sum()

3.0

s.isna()

0    False
1    False
2     True
3     True
dtype: bool

df = pd.DataFrame({"one": [1, 2, np.nan], "two": [np.nan, 3, 4]})
df.isna()

one two
0 False True
1 False False
2 True False
Project Case Study
hitters = pd.read_csv("/home/aistudio/data/data20507/Hitters.csv")
hitters.isna().any()

AtBat        False
Hits         False
HmRun        False
Runs         False
RBI          False
Walks        False
Years        False
CAtBat       False
CHits        False
CHmRun       False
CRuns        False
CRBI         False
CWalks       False
League       False
Division     False
PutOuts      False
Assists      False
Errors       False
Salary        True
NewLeague    False
dtype: bool

(hitters.shape[0] - hitters.count()) / hitters.shape[0]

AtBat        0.00000
Hits         0.00000
HmRun        0.00000
Runs         0.00000
RBI          0.00000
Walks        0.00000
Years        0.00000
CAtBat       0.00000
CHits        0.00000
CHmRun       0.00000
CRuns        0.00000
CRBI         0.00000
CWalks       0.00000
League       0.00000
Division     0.00000
PutOuts      0.00000
Assists      0.00000
Errors       0.00000
Salary       0.18323
NewLeague    0.00000
dtype: float64

hitters.shape

(322, 20)

hitters.count()

AtBat        322
Hits         322
HmRun        322
Runs         322
RBI          322
Walks        322
Years        322
CAtBat       322
CHits        322
CHmRun       322
CRuns        322
CRBI         322
CWalks       322
League       322
Division     322
PutOuts      322
Assists      322
Errors       322
Salary       263
NewLeague    322
dtype: int64

(df.shape[1] - df.T.count()) / df.shape[1]

0    0.5
1    0.0
2    0.5
dtype: float64
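The missing-value ratios above are computed from shape and count(). As a sketch of an equivalent route, the mean of the boolean frame returned by isna() gives the same fractions directly, per column or per row. The small DataFrame is the one defined earlier in this section; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1, 2, np.nan], "two": [np.nan, 3, 4]})

# isna() gives booleans; their mean is the fraction of missing entries
col_ratio = df.isna().mean()          # per column
row_ratio = df.isna().mean(axis=1)    # per row

# the count-based formulas used in this section, for comparison
col_ratio_by_count = (df.shape[0] - df.count()) / df.shape[0]
row_ratio_by_count = (df.shape[1] - df.T.count()) / df.shape[1]

print(col_ratio)
print(row_ratio)
```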

df.dropna()

one two
1 2.0 3.0
# rebuild a DataFrame object that contains missing values
df = pd.concat([df, pd.DataFrame({"one": [np.nan], "two": [np.nan], "three": [np.nan]})], ignore_index=True, sort=False)
df

one two three
0 1.0 NaN NaN
1 2.0 3.0 NaN
2 NaN 4.0 NaN
3 NaN NaN NaN
df.dropna(axis=0, how='all')  # how specifies the condition for dropping: here, only rows where every value is missing

one two three
0 1.0 NaN NaN
1 2.0 3.0 NaN
2 NaN 4.0 NaN
df.dropna(thresh=2)  # drop rows that have fewer than 2 non-missing values

one two three
1 2.0 3.0 NaN
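The dropna calls above differ only in their how, thresh, and axis arguments. The sketch below puts them side by side on the same four-row frame, built directly here rather than with pd.concat, and adds the subset parameter, which restricts the check to chosen columns. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, np.nan, np.nan],
    "two": [np.nan, 3, 4, np.nan],
    "three": [np.nan, np.nan, np.nan, np.nan],
})

rows_not_all_missing = df.dropna(how="all")          # drops row 3 only
rows_with_two_values = df.dropna(thresh=2)           # keeps row 1 only
cols_not_all_missing = df.dropna(axis=1, how="all")  # drops column "three"
rows_with_one_present = df.dropna(subset=["one"])    # keeps rows 0 and 1

print(rows_with_two_values)
```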
new_hitters = hitters.dropna()
new_hitters.isna().any()

AtBat        False
Hits         False
HmRun        False
Runs         False
RBI          False
Walks        False
Years        False
CAtBat       False
CHits        False
CHmRun       False
CRuns        False
CRBI         False
CWalks       False
League       False
Division     False
PutOuts      False
Assists      False
Errors       False
Salary       False
NewLeague    False
dtype: bool

Hands-on Exercise
eles = pd.read_csv("/home/aistudio/data/data20507/elements.csv")
eles.isna().any()

atomic number               False
symbol                      False
name                        False
atomic mass                 False
CPK                         False
electronic configuration    False
electronegativity            True
atomic radius                True
ion radius                   True
van der Waals radius         True
IE-1                         True
EA                           True
standard state               True
bonding type                 True
melting point                True
boiling point                True
density                      True
metal                       False
year discovered             False
group                       False
period                      False
dtype: bool

(eles.shape[0] - eles.count()) / eles.shape[0]

atomic number               0.000000
symbol                      0.000000
name                        0.000000
atomic mass                 0.000000
CPK                         0.000000
electronic configuration    0.000000
electronegativity           0.177966
atomic radius               0.398305
ion radius                  0.220339
van der Waals radius        0.677966
IE-1                        0.135593
EA                          0.279661
standard state              0.161017
bonding type                0.169492
melting point               0.144068
boiling point               0.203390
density                     0.186441
metal                       0.000000
year discovered             0.000000
group                       0.000000
period                      0.000000
dtype: float64

eles_nona = eles.dropna()
eles_nona.isna().any()

atomic number               False
symbol                      False
name                        False
atomic mass                 False
CPK                         False
electronic configuration    False
electronegativity           False
atomic radius               False
ion radius                  False
van der Waals radius        False
IE-1                        False
EA                          False
standard state              False
bonding type                False
melting point               False
boiling point               False
density                     False
metal                       False
year discovered             False
group                       False
period                      False
dtype: bool

2.3.2 Filling Missing Data with Specified Values
Basics
df = pd.DataFrame({"one": [10, 11, 12], "two": [np.nan, 21, 22], "three": [30, np.nan, 33]})
df

one two three
0 10 NaN 30.0
1 11 21.0 NaN
2 12 22.0 33.0
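df = pd.DataFrame({'ColA': [1, np.nan, np.nan, 4, 5, 6, 7], 'ColB': [1, 1, 1, 1, 2, 2, 2]})
# forward fill: each NaN takes the last valid value before it
# (pandas 2.1+ deprecates this form; df['ColA'].ffill() gives the same result)
df['ColA'].fillna(method='ffill')

0    1.0
1    1.0
2    1.0
3    4.0
4    5.0
5    6.0
6    7.0
Name: ColA, dtype: float64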

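# backward fill: each NaN takes the next valid value after it
# (pandas 2.1+ deprecates this form; df['ColA'].bfill() gives the same result)
df['ColA'].fillna(method='bfill')

0    1.0
1    4.0
2    4.0
3    4.0
4    5.0
5    6.0
6    7.0
Name: ColA, dtype: float64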
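Forward and backward filling can also be written with the dedicated ffill and bfill methods, which take a limit argument that caps how many consecutive missing values are filled. A minimal sketch on the same ColA values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

col_a = pd.Series([1, np.nan, np.nan, 4, 5, 6, 7], name="ColA")

forward = col_a.ffill()              # same result as fillna(method='ffill')
backward = col_a.bfill()             # same result as fillna(method='bfill')
forward_once = col_a.ffill(limit=1)  # fill at most one consecutive NaN

print(forward_once)
```

With limit=1 the second of the two consecutive gaps stays NaN, which can be useful when long runs of missing values should not all inherit one stale observation.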

Project Case Study
# to meet the platform's requirements, the data file name differs slightly from the one in the textbook
persons = pd.read_csv("/home/aistudio/data/data20507/Person.csv")
pdf = persons.sample(20)  # ①
pdf['Height-na'] = np.where(pdf['Height'] % 5 == 0, np.nan, pdf['Height'])  # ②
pdf

Gender Height Weight Index Height-na
64 Male 175 135 5 NaN
225 Female 155 144 5 NaN
484 Female 188 115 4 188.0
293 Female 165 83 4 NaN
102 Male 161 155 5 161.0
282 Female 147 94 5 147.0
139 Male 159 124 5 159.0
66 Female 172 96 4 172.0
365 Male 141 80 5 141.0
397 Male 169 136 5 169.0
18 Male 144 145 5 144.0
172 Male 167 151 5 167.0
443 Male 152 146 5 152.0
358 Female 180 58 1 NaN
447 Female 176 121 4 176.0
251 Male 140 143 5 NaN
360 Female 193 61 1 193.0
346 Female 191 68 2 191.0
5 Male 189 104 3 189.0
294 Female 168 143 5 168.0
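# assign the result back instead of calling fillna(..., inplace=True) on a column selection,
# which may leave pdf unchanged in newer pandas versions
pdf['Height-na'] = pdf['Height-na'].fillna(pdf['Height-na'].mean())
pdf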

Gender Height Weight Index Height-na
64 Male 175 135 5 167.8
225 Female 155 144 5 167.8
484 Female 188 115 4 188.0
293 Female 165 83 4 167.8
102 Male 161 155 5 161.0
282 Female 147 94 5 147.0
139 Male 159 124 5 159.0
66 Female 172 96 4 172.0
365 Male 141 80 5 141.0
397 Male 169 136 5 169.0
18 Male 144 145 5 144.0
172 Male 167 151 5 167.0
443 Male 152 146 5 152.0
358 Female 180 58 1 167.8
447 Female 176 121 4 176.0
251 Male 140 143 5 167.8
360 Female 193 61 1 193.0
346 Female 191 68 2 191.0
5 Male 189 104 3 189.0
294 Female 168 143 5 168.0
pdf['Height'].describe()

count     20.000000
mean     166.600000
std       16.740748
min      140.000000
25%      154.250000
50%      167.500000
75%      177.000000
max      193.000000
Name: Height, dtype: float64

pdf['Height-na'].describe()

count     20.000000
mean     167.800000
std       14.882699
min      141.000000
25%      160.500000
50%      167.800000
75%      173.000000
max      193.000000
Name: Height-na, dtype: float64
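Comparing the two describe() outputs, the filled column has a smaller standard deviation (14.88 versus 16.74) than the true heights. The sketch below isolates one reason for this, using the 'Height-na' values from the sample table above before filling: filling with the mean keeps the mean of the observed values but adds points with zero deviation, so the spread shrinks. The Series is typed in by hand from that table.

```python
import numpy as np
import pandas as pd

# the 'Height-na' column from the sample shown earlier, before filling
height_na = pd.Series([np.nan, np.nan, 188, np.nan, 161, 147, 159, 172, 141, 169,
                       144, 167, 152, np.nan, 176, np.nan, 193, 191, 189, 168])

filled = height_na.fillna(height_na.mean())

print(height_na.mean(), filled.mean())  # the mean of the observed values is preserved
print(height_na.std(), filled.std())    # the spread shrinks after filling
```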

Further Study
pdf2 = persons.sample(20)
pdf2['Height-na'] = np.where(pdf2['Height'] % 5 == 0, np.nan, pdf2['Height'])  # create missing values

from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')  # ③
col_values = imp_mean.fit_transform(pdf2['Height-na'].values.reshape((-1, 1)))  # ④
col_values

array([[169.], [188.], [178.33333333], [178.33333333], [166.], [193.], [178.33333333], [178.], [142.], [178.33333333], [197.], [178.33333333], [186.], [171.], [178.], [176.], [183.], [184.], [188.], [176.]])

df = pd.DataFrame({"name": ["Google", "Huawei", "Facebook", "Alibaba"],
                   "price": [100, -1, -1, 90]})
df

name price
0 Google 100
1 Huawei -1
2 Facebook -1
3 Alibaba 90
imp = SimpleImputer(missing_values=-1, strategy='constant', fill_value=110)  # ⑤
imp.fit_transform(df['price'].values.reshape((-1, 1)))

array([[100], [110], [110], [ 90]])
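SimpleImputer is not limited to the 'mean' and 'constant' strategies used above. As a sketch on the same price data, the 'median' strategy fills the -1 placeholders with the median of the observed prices, and a double-bracket column selection supplies the 2-D input without reshape. The column name price_filled is made up for this illustration.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"name": ["Google", "Huawei", "Facebook", "Alibaba"],
                   "price": [100, -1, -1, 90]})

# -1 marks a missing price; replace it with the median of the observed prices
imp_median = SimpleImputer(missing_values=-1, strategy="median")
df["price_filled"] = imp_median.fit_transform(df[["price"]]).ravel()

print(df)
```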

2.3.3 Filling Missing Values Based on Patterns
df = pd.DataFrame({"one": np.random.randint(1, 100, 10),
                   "two": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
                   "three": [5, 9, 13, np.nan, 21, np.nan, 29, 33, 37, 41]})
df

one two three
0 38 2 5.0
1 69 4 9.0
2 86 6 13.0
3 79 8 NaN
4 73 10 21.0
5 90 12 NaN
6 77 14 29.0
7 31 16 33.0
8 66 18 37.0
9 22 20 41.0
from sklearn.linear_model import LinearRegression  # ⑥

df_train = df.dropna()  # training set: rows with no missing values
df_test = df[df['three'].isnull()]  # test set: rows where 'three' is missing

regr = LinearRegression()
regr.fit(df_train['two'].values.reshape(-1, 1), df_train['three'].values.reshape(-1, 1))  # ⑦
df_three_pred = regr.predict(df_test['two'].values.reshape(-1, 1))  # ⑧

# fill the predicted values back into the original dataset
df.loc[(df.three.isnull()), 'three'] = df_three_pred
df

one two three
0 38 2 5.0
1 69 4 9.0
2 86 6 13.0
3 79 8 17.0
4 73 10 21.0
5 90 12 25.0
6 77 14 29.0
7 31 16 33.0
8 66 18 37.0
9 22 20 41.0
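The filled values 17.0 and 25.0 follow from the pattern in the data: every known row satisfies three = 2 * two + 1. A minimal sketch that makes this explicit by inspecting the fitted coefficients; it omits the random column "one" and passes a 1-D target so the predictions can be assigned directly. The names missing, slope, and intercept are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "two": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "three": [5, 9, 13, np.nan, 21, np.nan, 29, 33, 37, 41],
})

missing = df["three"].isna()
regr = LinearRegression()
regr.fit(df.loc[~missing, ["two"]], df.loc[~missing, "three"])

# the known rows lie exactly on a line, so the fit recovers its slope and intercept
slope, intercept = regr.coef_[0], regr.intercept_

df.loc[missing, "three"] = regr.predict(df.loc[missing, ["two"]])
print(slope, intercept)
print(df)
```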
Project Case Study
import pandas as pd

train_data = pd.read_csv("/home/aistudio/data/data20507/train.csv")
train_data.info()  # ⑨

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

train_data.isna().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

df = train_data[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]  # features that may be related to age
known_age = df[df['Age'].notnull()].values
unknown_age = df[df['Age'].isnull()].values

y = known_age[:, 0]   # target: the known ages
X = known_age[:, 1:]  # predictors: the remaining columns

from sklearn.ensemble import RandomForestRegressor  # ⑩
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)  # ⑪
rfr.fit(X, y)

pred_age = rfr.predict(unknown_age[:, 1:])  # ⑬
pred_age.mean()

29.438010170664793

train_data.loc[(train_data.Age.isnull()), 'Age'] = pred_age
train_data.isna().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

!mkdir /home/aistudio/external-libraries
!pip install seaborn -t /home/aistudio/external-libraries

import sys
sys.path.append('/home/aistudio/external-libraries')

%matplotlib inline
import seaborn as sns
sns.distplot(y)


sns.distplot(train_data['Age'])


df_mean = df['Age'].fillna(df['Age'].mean())
sns.distplot(df_mean)


Further Study
!pip install missingpy -t /home/aistudio/external-libraries

from sklearn.datasets import load_iris  # load the iris dataset
import numpy as np

iris = load_iris()
X = iris.data

# create a dataset that contains missing values
rng = np.random.RandomState(0)
X_missing = X.copy()
mask = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=0.7, size=X.shape[0])) < 0.6
X_missing[mask, 3] = np.nan  # X_missing is the dataset containing missing values

from missingpy import KNNImputer  # import the KNN model for filling missing values
imputer = KNNImputer(n_neighbors=3, weights="uniform")
X_imputed = imputer.fit_transform(X_missing)

/home/aistudio/external-libraries/missingpy/utils.py:124: RuntimeWarning: invalid value encountered in sqrt
  return distances if squared else np.sqrt(distances, out=distances)
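The example above uses the third-party missingpy package. scikit-learn (version 0.22 and later) ships its own KNNImputer in sklearn.impute with the same n_neighbors and weights parameters, so a sketch of the same experiment without the extra installation could look like the following. This swaps in scikit-learn's imputer and is not the code used in the original; its imputed values are not guaranteed to match missingpy's exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer  # scikit-learn's own KNN imputer, not missingpy's

X = load_iris().data

# create missing values in the fourth column, as in the example above
rng = np.random.RandomState(0)
X_missing = X.copy()
mask = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=0.7, size=X.shape[0])) < 0.6
X_missing[mask, 3] = np.nan

# each missing entry is replaced by the mean of that feature over the 3 nearest rows
imputer = KNNImputer(n_neighbors=3, weights="uniform")
X_imputed = imputer.fit_transform(X_missing)

print(np.isnan(X_missing).sum(), np.isnan(X_imputed).sum())
```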

sns.distplot(X.reshape((-1, 1)))


sns.distplot(X_imputed.reshape((-1, 1)))  # distribution after filling the missing data

