Code Analysis: Predicting New York City Building Energy Scores

2022-01-27 20:14:16 QT-Smile

Project: Predicting New York City Building Energy Scores

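Contents

0 Introduction

1 Data cleaning and format conversion

1.1 About the data

1.2 Importing the basic packages

1.3 Data analysis

1.4 Data types and missing values

1.5 A template for handling missing values

2 Exploratory Data Analysis

2.1 Univariate plots

2.2 Removing outliers

2.3 Finding which variables affect the result

3 Feature engineering

3.1 Feature transformations

3.2 Bivariate plots

3.3 Removing collinear features

4 Splitting the dataset

4.1 Splitting the data

4.2 Establishing a baseline

4.3 Saving the results for the modeling stage

5 Building basic models and trying several algorithms

5.1 Imputing missing values

5.2 Feature scaling and normalization

6 Building basic models and trying several algorithms (regression)

6.1 Defining the loss function

6.2 Choosing machine learning algorithms

7 Model tuning

7.1 Hyperparameter tuning

7.2 Comparing loss functions

8 Evaluation and testing: plotting the difference between predicted and actual values

9 Interpreting the model: feature selection based on importance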

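Main text:

0 Introduction

This article walks through a complete solution for a machine learning project built on a real dataset, so that readers can see how all the pieces fit together.

Before writing any code, we need to understand the problem we are trying to solve and the data that is available. In this project we use publicly available building energy data for New York City. The goal is to build a model from that data that predicts a building's Energy Star Score, and then to interpret the results to find the factors that influence the score.

The data includes the Energy Star Score, which makes this a supervised regression machine learning task. Supervised: we know both the features and the target, and the aim is to train a model that learns the mapping between the two. Regression: the Energy Star Score is treated as a continuous variable. We want a model that is accurate, meaning that its predicted Energy Star Score is close to the true value.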

1 Data cleaning and format conversion

1.1 About the data

1.2 Importing the basic packages

import pandas as pd
import numpy as np

# Silence pandas' SettingWithCopyWarning (raised on chained assignment)
pd.options.mode.chained_assignment = None

# Show up to 60 columns when a DataFrame is displayed, e.g. with head()
pd.set_option('display.max_columns', 60)
import matplotlib.pyplot as plt

# %matplotlib inline is an IPython magic for Jupyter notebook or Jupyter qtconsole: it embeds plots in the output, so plt.show() can be omitted.
%matplotlib inline

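# pyplot uses rc parameters to set the default properties of figures: window size, dots per inch, line width, color, style, axes, ticks and grid, text, fonts, and so on.
# The rc parameters are stored in the dictionary-like plt.rcParams and are accessed like a dictionary
# Set the global default font size for plots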
plt.rcParams['font.size'] = 24
from IPython.core.pylabtools import figsize

# seaborn: statistical plotting library built on matplotlib
import seaborn as sns
sns.set(font_scale = 2)
from sklearn.model_selection import train_test_split

# Ignore warning messages raised by the code
import warnings
warnings.filterwarnings("ignore")

1.3 Data analysis

# Load the data
data = pd.read_csv('data/Energy.csv')

# Show the first 3 rows
data.head(3)
Order Property Id Property Name Parent Property Id Parent Property Name BBL - 10 digits NYC Borough, Block and Lot (BBL) self-reported NYC Building Identification Number (BIN) Address 1 (self-reported) Address 2 Postal Code Street Number Street Name Borough DOF Gross Floor Area Primary Property Type - Self Selected List of All Property Use Types at Property Largest Property Use Type Largest Property Use Type - Gross Floor Area (ft²) 2nd Largest Property Use Type 2nd Largest Property Use - Gross Floor Area (ft²) 3rd Largest Property Use Type 3rd Largest Property Use Type - Gross Floor Area (ft²) Year Built Number of Buildings - Self-reported Occupancy Metered Areas (Energy) Metered Areas (Water) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Weather Normalized Site Electricity Intensity (kWh/ft²) Weather Normalized Site Natural Gas Intensity (therms/ft²) Weather Normalized Source EUI (kBtu/ft²) Fuel Oil #1 Use (kBtu) Fuel Oil #2 Use (kBtu) Fuel Oil #4 Use (kBtu) Fuel Oil #5 & 6 Use (kBtu) Diesel #2 Use (kBtu) District Steam Use (kBtu) Natural Gas Use (kBtu) Weather Normalized Site Natural Gas Use (therms) Electricity Use - Grid Purchase (kBtu) Weather Normalized Site Electricity (kWh) Total GHG Emissions (Metric Tons CO2e) Direct GHG Emissions (Metric Tons CO2e) Indirect GHG Emissions (Metric Tons CO2e) Property GFA - Self-Reported (ft²) Water Use (All Water Sources) (kgal) Water Intensity (All Water Sources) (gal/ft²) Source EUI (kBtu/ft²) Release Date Water Required? DOF Benchmarking Submission Status Unnamed: 54
0 1 13286 201/205 13286 201/205 1013160001 1013160001 1037549 201/205 East 42nd st. Not Available 10017 675 3 AVENUE Manhattan 289356.0 Office Office Office 293447 Not Available Not Available Not Available Not Available 1963 2 100 Whole Building Not Available Not Available 305.6 303.1 37.8 Not Available 614.2 Not Available Not Available Not Available Not Available Not Available 51550675.1 Not Available Not Available 38139374.2 11082770.5 6962.2 0 6962.2 762051 Not Available Not Available 619.4 5/1/17 5:32 PM No In Compliance NaN
1 2 28400 NYP Columbia (West Campus) 28400 NYP Columbia (West Campus) 1021380040 1-02138-0040 1084198; 1084387;1084385; 1084386; 1084388; 10... 622 168th Street Not Available 10032 180 FT WASHINGTON AVENUE Manhattan 3693539.0 Hospital (General Medical & Surgical) Hospital (General Medical & Surgical) Hospital (General Medical & Surgical) 3889181 Not Available Not Available Not Available Not Available 1969 12 100 Whole Building Whole Building 55 229.8 228.8 24.8 2.4 401.1 Not Available 19624847.2 Not Available Not Available Not Available -391414802.6 933073441 9330734.4 332365924 96261312.1 55870.4 51016.4 4854.1 3889181 Not Available Not Available 404.3 4/27/17 11:23 AM No In Compliance NaN
2 3 4778226 MSCHoNY North 28400 NYP Columbia (West Campus) 1021380030 1-02138-0030 1063380 3975 Broadway Not Available 10032 3975 BROADWAY Manhattan 152765.0 Hospital (General Medical & Surgical) Hospital (General Medical & Surgical) Hospital (General Medical & Surgical) 231342 Not Available Not Available Not Available Not Available 1924 1 100 Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available 0 0 0 231342 Not Available Not Available Not Available 4/27/17 11:23 AM No In Compliance NaN
print(data.shape)
(11746, 55)
# Pass n in the parentheses to see the first n rows of the data
data.head(2)

1.4 Data types and missing values

data.info() # a quick overview of each column's data type and non-null count
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11746 entries, 0 to 11745
Data columns (total 55 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   Order                                                       11746 non-null  int64  
 1   Property Id                                                 11746 non-null  int64  
 2   Property Name                                               11746 non-null  object 
 3   Parent Property Id                                          11746 non-null  object 
 4   Parent Property Name                                        11746 non-null  object 
 5   BBL - 10 digits                                             11746 non-null  object 
 6   NYC Borough, Block and Lot (BBL) self-reported              11746 non-null  object 
 7   NYC Building Identification Number (BIN)                    11746 non-null  object 
 8   Address 1 (self-reported)                                   11746 non-null  object 
 9   Address 2                                                   11746 non-null  object 
 10  Postal Code                                                 11746 non-null  object 
 11  Street Number                                               11622 non-null  object 
 12  Street Name                                                 11624 non-null  object 
 13  Borough                                                     11628 non-null  object 
 14  DOF Gross Floor Area                                        11628 non-null  float64
 15  Primary Property Type - Self Selected                       11746 non-null  object 
 16  List of All Property Use Types at Property                  11746 non-null  object 
 17  Largest Property Use Type                                   11746 non-null  object 
 18  Largest Property Use Type - Gross Floor Area (ft²)          11746 non-null  object 
 19  2nd Largest Property Use Type                               11746 non-null  object 
 20  2nd Largest Property Use - Gross Floor Area (ft²)           11746 non-null  object 
 21  3rd Largest Property Use Type                               11746 non-null  object 
 22  3rd Largest Property Use Type - Gross Floor Area (ft²)      11746 non-null  object 
 23  Year Built                                                  11746 non-null  int64  
 24  Number of Buildings - Self-reported                         11746 non-null  int64  
 25  Occupancy                                                   11746 non-null  int64  
 26  Metered Areas (Energy)                                      11746 non-null  object 
 27  Metered Areas  (Water)                                      11746 non-null  object 
 28  ENERGY STAR Score                                           11746 non-null  object 
 29  Site EUI (kBtu/ft²)                                         11746 non-null  object 
 30  Weather Normalized Site EUI (kBtu/ft²)                      11746 non-null  object 
 31  Weather Normalized Site Electricity Intensity (kWh/ft²)     11746 non-null  object 
 32  Weather Normalized Site Natural Gas Intensity (therms/ft²)  11746 non-null  object 
 33  Weather Normalized Source EUI (kBtu/ft²)                    11746 non-null  object 
 34  Fuel Oil #1 Use (kBtu)                                      11746 non-null  object 
 35  Fuel Oil #2 Use (kBtu)                                      11746 non-null  object 
 36  Fuel Oil #4 Use (kBtu)                                      11746 non-null  object 
 37  Fuel Oil #5 & 6 Use (kBtu)                                  11746 non-null  object 
 38  Diesel #2 Use (kBtu)                                        11746 non-null  object 
 39  District Steam Use (kBtu)                                   11746 non-null  object 
 40  Natural Gas Use (kBtu)                                      11746 non-null  object 
 41  Weather Normalized Site Natural Gas Use (therms)            11746 non-null  object 
 42  Electricity Use - Grid Purchase (kBtu)                      11746 non-null  object 
 43  Weather Normalized Site Electricity (kWh)                   11746 non-null  object 
 44  Total GHG Emissions (Metric Tons CO2e)                      11746 non-null  object 
 45  Direct GHG Emissions (Metric Tons CO2e)                     11746 non-null  object 
 46  Indirect GHG Emissions (Metric Tons CO2e)                   11746 non-null  object 
 47  Property GFA - Self-Reported (ft²)                          11746 non-null  int64  
 48  Water Use (All Water Sources) (kgal)                        11746 non-null  object 
 49  Water Intensity (All Water Sources) (gal/ft²)               11746 non-null  object 
 50  Source EUI (kBtu/ft²)                                       11746 non-null  object 
 51  Release Date                                                11746 non-null  object 
 52  Water Required?                                             11628 non-null  object 
 53  DOF Benchmarking Submission Status                          11716 non-null  object 
 54  Unnamed: 54                                                 0 non-null      float64
dtypes: float64(2), int64(6), object(47)
memory usage: 4.9+ MB

1.5 A template for handling missing values

# Convert the placeholder 'Not Available' into np.nan
# DataFrame.replace() swaps every matching old value in the frame for the new one
data = data.replace({'Not Available': np.nan})

# Columns whose names carry units such as ft², kBtu or Metric Tons CO2e hold numeric values,
# but info() reports them as object because they contained the 'Not Available' string.
# They therefore need to be cast to float with astype.

# Cast every such column to float
for col in list(data.columns):
    # Select the columns by the unit (or 'Score') that appears in their name
    if ('ft²' in col or 'kBtu' in col or 'Metric Tons CO2e' in col or 'kWh' in 
        col or 'therms' in col or 'gal' in col or 'Score' in col):
        
        data[col] = data[col].astype(float)
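
The conversion above relies on every non-numeric entry having been replaced by NaN first. As a hedged alternative sketch, `pd.to_numeric` with `errors="coerce"` tolerates any leftover text by turning it into NaN. The toy frame and the `unit_markers` tuple below are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Site EUI (kBtu/ft²)": ["305.6", "Not Available", "229.8"],
    "Borough": ["Manhattan", "Manhattan", "Not Available"],
})

# Step 1: turn the placeholder string into a real missing value.
toy = toy.replace({"Not Available": np.nan})

# Step 2: convert the unit-bearing columns; errors="coerce" turns any
# remaining unparseable text into NaN instead of raising an error.
unit_markers = ("ft²", "kBtu")  # hypothetical subset of the markers used above
for col in toy.columns:
    if any(marker in col for marker in unit_markers):
        toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

The trade-off is that coercion silently hides unexpected strings, so it is worth checking the missing-value counts before and after.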

print(list(data.columns))
['Order', 'Property Id', 'Property Name', 'Parent Property Id', 'Parent Property Name', 'BBL - 10 digits', 'NYC Borough, Block and Lot (BBL) self-reported', 'NYC Building Identification Number (BIN)', 'Address 1 (self-reported)', 'Address 2', 'Postal Code', 'Street Number', 'Street Name', 'Borough', 'DOF Gross Floor Area', 'Primary Property Type - Self Selected', 'List of All Property Use Types at Property', 'Largest Property Use Type', 'Largest Property Use Type - Gross Floor Area (ft²)', '2nd Largest Property Use Type', '2nd Largest Property Use - Gross Floor Area (ft²)', '3rd Largest Property Use Type', '3rd Largest Property Use Type - Gross Floor Area (ft²)', 'Year Built', 'Number of Buildings - Self-reported', 'Occupancy', 'Metered Areas (Energy)', 'Metered Areas  (Water)', 'ENERGY STAR Score', 'Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)', 'Weather Normalized Site Electricity Intensity (kWh/ft²)', 'Weather Normalized Site Natural Gas Intensity (therms/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)', 'Fuel Oil #1 Use (kBtu)', 'Fuel Oil #2 Use (kBtu)', 'Fuel Oil #4 Use (kBtu)', 'Fuel Oil #5 & 6 Use (kBtu)', 'Diesel #2 Use (kBtu)', 'District Steam Use (kBtu)', 'Natural Gas Use (kBtu)', 'Weather Normalized Site Natural Gas Use (therms)', 'Electricity Use - Grid Purchase (kBtu)', 'Weather Normalized Site Electricity (kWh)', 'Total GHG Emissions (Metric Tons CO2e)', 'Direct GHG Emissions (Metric Tons CO2e)', 'Indirect GHG Emissions (Metric Tons CO2e)', 'Property GFA - Self-Reported (ft²)', 'Water Use (All Water Sources) (kgal)', 'Water Intensity (All Water Sources) (gal/ft²)', 'Source EUI (kBtu/ft²)', 'Release Date', 'Water Required?', 'DOF Benchmarking Submission Status', 'Unnamed: 54']
print(data.columns)
Index(['Order', 'Property Id', 'Property Name', 'Parent Property Id',
       'Parent Property Name', 'BBL - 10 digits',
       'NYC Borough, Block and Lot (BBL) self-reported',
       'NYC Building Identification Number (BIN)', 'Address 1 (self-reported)',
       'Address 2', 'Postal Code', 'Street Number', 'Street Name', 'Borough',
       'DOF Gross Floor Area', 'Primary Property Type - Self Selected',
       'List of All Property Use Types at Property',
       'Largest Property Use Type',
       'Largest Property Use Type - Gross Floor Area (ft²)',
       '2nd Largest Property Use Type',
       '2nd Largest Property Use - Gross Floor Area (ft²)',
       '3rd Largest Property Use Type',
       '3rd Largest Property Use Type - Gross Floor Area (ft²)', 'Year Built',
       'Number of Buildings - Self-reported', 'Occupancy',
       'Metered Areas (Energy)', 'Metered Areas  (Water)', 'ENERGY STAR Score',
       'Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)',
       'Weather Normalized Site Electricity Intensity (kWh/ft²)',
       'Weather Normalized Site Natural Gas Intensity (therms/ft²)',
       'Weather Normalized Source EUI (kBtu/ft²)', 'Fuel Oil #1 Use (kBtu)',
       'Fuel Oil #2 Use (kBtu)', 'Fuel Oil #4 Use (kBtu)',
       'Fuel Oil #5 & 6 Use (kBtu)', 'Diesel #2 Use (kBtu)',
       'District Steam Use (kBtu)', 'Natural Gas Use (kBtu)',
       'Weather Normalized Site Natural Gas Use (therms)',
       'Electricity Use - Grid Purchase (kBtu)',
       'Weather Normalized Site Electricity (kWh)',
       'Total GHG Emissions (Metric Tons CO2e)',
       'Direct GHG Emissions (Metric Tons CO2e)',
       'Indirect GHG Emissions (Metric Tons CO2e)',
       'Property GFA - Self-Reported (ft²)',
       'Water Use (All Water Sources) (kgal)',
       'Water Intensity (All Water Sources) (gal/ft²)',
       'Source EUI (kBtu/ft²)', 'Release Date', 'Water Required?',
       'DOF Benchmarking Submission Status', 'Unnamed: 54'],
      dtype='object')
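# describe() reports count, mean, std, etc. for numeric columns only; object columns are not shown
data.describe()

# Reading scientific notation: 3.20e+05 = 3.20 x 10^5 = 3.20 x 100000 = 320000
# The number before the "E" is the coefficient and the signed number after it is the power of ten, padded with a leading zero to two digits; for example 7.8 x 10^7 is written as "7.8E+07"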
Order Property Id DOF Gross Floor Area Largest Property Use Type - Gross Floor Area (ft²) 2nd Largest Property Use - Gross Floor Area (ft²) 3rd Largest Property Use Type - Gross Floor Area (ft²) Year Built Number of Buildings - Self-reported Occupancy ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Weather Normalized Site Electricity Intensity (kWh/ft²) Weather Normalized Site Natural Gas Intensity (therms/ft²) Weather Normalized Source EUI (kBtu/ft²) Fuel Oil #1 Use (kBtu) Fuel Oil #2 Use (kBtu) Fuel Oil #4 Use (kBtu) Fuel Oil #5 & 6 Use (kBtu) Diesel #2 Use (kBtu) District Steam Use (kBtu) Natural Gas Use (kBtu) Weather Normalized Site Natural Gas Use (therms) Electricity Use - Grid Purchase (kBtu) Weather Normalized Site Electricity (kWh) Total GHG Emissions (Metric Tons CO2e) Direct GHG Emissions (Metric Tons CO2e) Indirect GHG Emissions (Metric Tons CO2e) Property GFA - Self-Reported (ft²) Water Use (All Water Sources) (kgal) Water Intensity (All Water Sources) (gal/ft²) Source EUI (kBtu/ft²) Unnamed: 54
count 11746.000000 1.174600e+04 1.162800e+04 1.174400e+04 3741.000000 1484.000000 11746.000000 11746.000000 11746.000000 9642.000000 11583.000000 10281.000000 10959.000000 9783.000000 10281.000000 9.000000e+00 2.581000e+03 1.321000e+03 5.940000e+02 1.600000e+01 9.360000e+02 1.030400e+04 9.784000e+03 1.150200e+04 1.096000e+04 1.167200e+04 1.166300e+04 1.168100e+04 1.174600e+04 7.762000e+03 7762.000000 11583.000000 0.0
mean 7185.759578 3.642958e+06 1.732695e+05 1.605524e+05 22778.682010 12016.825270 1948.738379 1.289971 98.762557 59.854594 280.071484 309.747466 11.072643 1.901441 417.915709 3.395398e+06 3.186882e+06 5.294367e+06 2.429105e+06 1.193594e+06 2.868907e+08 5.048543e+07 5.364578e+05 5.965472e+06 1.768752e+06 4.553657e+03 2.477937e+03 2.076339e+03 1.673739e+05 1.591798e+04 136.172432 385.908029 NaN
std 4323.859984 1.049070e+06 3.367055e+05 3.095746e+05 55094.441422 27959.755486 30.576386 4.017484 7.501603 29.993586 8607.178877 9784.731207 127.733868 97.204587 10530.524339 2.213237e+06 5.497154e+06 5.881863e+06 4.442946e+06 3.558178e+06 3.124603e+09 3.914717e+09 4.022606e+07 3.154430e+07 9.389154e+06 2.041639e+05 1.954498e+05 5.931295e+04 3.189238e+05 1.529524e+05 1730.726938 9312.736225 NaN
min 1.000000 7.365000e+03 5.002800e+04 5.400000e+01 0.000000 0.000000 1600.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.085973e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -4.690797e+08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -2.313430e+04 0.000000e+00 0.000000e+00 0.000000 0.000000 NaN
25% 3428.250000 2.747222e+06 6.524000e+04 6.520100e+04 4000.000000 1720.750000 1927.000000 1.000000 100.000000 37.000000 61.800000 65.100000 3.800000 0.100000 103.500000 1.663594e+06 2.550378e+05 2.128213e+06 0.000000e+00 5.698020e+04 4.320254e+06 1.098251e+06 1.176952e+04 1.043673e+06 3.019974e+05 3.287000e+02 1.474500e+02 9.480000e+01 6.699400e+04 2.595400e+03 27.150000 99.400000 NaN
50% 6986.500000 3.236404e+06 9.313850e+04 9.132400e+04 8654.000000 5000.000000 1941.000000 1.000000 100.000000 65.000000 78.500000 82.500000 5.300000 0.500000 129.400000 4.328815e+06 1.380138e+06 4.312984e+06 0.000000e+00 2.070020e+05 9.931240e+06 4.103962e+06 4.445525e+04 1.855196e+06 5.416312e+05 5.002500e+02 2.726000e+02 1.718000e+02 9.408000e+04 4.692500e+03 45.095000 124.900000 NaN
75% 11054.500000 4.409092e+06 1.596140e+05 1.532550e+05 20000.000000 12000.000000 1966.000000 1.000000 100.000000 85.000000 97.600000 102.500000 9.200000 0.700000 167.200000 4.938947e+06 4.445808e+06 6.514520e+06 4.293825e+06 2.918332e+05 2.064497e+07 6.855070e+06 7.348107e+04 4.370302e+06 1.284677e+06 9.084250e+02 4.475000e+02 4.249000e+02 1.584140e+05 8.031875e+03 70.805000 162.750000 NaN
max 14993.000000 5.991312e+06 1.354011e+07 1.421712e+07 962428.000000 591640.000000 2019.000000 161.000000 100.000000 100.000000 869265.000000 939329.000000 6259.400000 9393.000000 986366.000000 6.275850e+06 1.046849e+08 7.907464e+07 4.410378e+07 1.435178e+07 7.163518e+10 3.942850e+11 3.942852e+09 1.691763e+09 4.958273e+08 2.094340e+07 2.094340e+07 4.764375e+06 1.421712e+07 6.594604e+06 96305.690000 912801.100000 NaN
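
The scientific-notation values in the describe() output can be produced, parsed, or replaced by fixed-point formatting. The snippet below is a small sketch with made-up toy numbers; `pd.option_context` changes the display format only inside the `with` block.

```python
import pandas as pd

# Format a plain number in scientific notation with two decimals.
formatted = f"{320000:.2e}"
print(formatted)

# Parse a scientific-notation string back into a float.
parsed = float("7.8E+07")
print(parsed)

# Show a toy summary with fixed-point formatting instead
# (toy values, made up for illustration).
toy_stats = pd.Series([3.2e5, 7.8e7], name="toy").describe()
with pd.option_context("display.float_format", "{:,.1f}".format):
    print(toy_stats)
```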
# A reusable template for summarizing missing values
# Takes a DataFrame and returns a table of missing values per column
def missing_values_table(df):
    # isnull() flags the missing entries; sum() counts them per column
    mis_val = df.isnull().sum()

    # Percentage of missing values in each column
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Combine the counts and the percentages into one table
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Give the two columns descriptive names
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})

    # Keep only the columns that have missing values (iloc[:,1] != 0 tests the percentage column),
    # then sort by percentage in descending order (ascending=False)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print the total number of columns and how many of them have missing values
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
        "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    return mis_val_table_ren_columns
missing_values_table(data) # the index is the column name, followed by the missing count and the missing percentage; of the 55 columns, 40 have missing values
Your selected dataframe has 55 columns.
There are 40 columns that have missing values.
Missing Values % of Total Values
Unnamed: 54 11746 100.0
Fuel Oil #1 Use (kBtu) 11737 99.9
Diesel #2 Use (kBtu) 11730 99.9
Address 2 11539 98.2
Fuel Oil #5 & 6 Use (kBtu) 11152 94.9
District Steam Use (kBtu) 10810 92.0
Fuel Oil #4 Use (kBtu) 10425 88.8
3rd Largest Property Use Type 10262 87.4
3rd Largest Property Use Type - Gross Floor Area (ft²) 10262 87.4
Fuel Oil #2 Use (kBtu) 9165 78.0
2nd Largest Property Use - Gross Floor Area (ft²) 8005 68.2
2nd Largest Property Use Type 8005 68.2
Metered Areas (Water) 4609 39.2
Water Intensity (All Water Sources) (gal/ft²) 3984 33.9
Water Use (All Water Sources) (kgal) 3984 33.9
ENERGY STAR Score 2104 17.9
Weather Normalized Site Natural Gas Intensity (therms/ft²) 1963 16.7
Weather Normalized Site Natural Gas Use (therms) 1962 16.7
Weather Normalized Source EUI (kBtu/ft²) 1465 12.5
Weather Normalized Site EUI (kBtu/ft²) 1465 12.5
Natural Gas Use (kBtu) 1442 12.3
Weather Normalized Site Electricity Intensity (kWh/ft²) 787 6.7
Weather Normalized Site Electricity (kWh) 786 6.7
Electricity Use - Grid Purchase (kBtu) 244 2.1
Site EUI (kBtu/ft²) 163 1.4
Source EUI (kBtu/ft²) 163 1.4
NYC Building Identification Number (BIN) 162 1.4
Street Number 124 1.1
Street Name 122 1.0
DOF Gross Floor Area 118 1.0
Borough 118 1.0
Water Required? 118 1.0
Direct GHG Emissions (Metric Tons CO2e) 83 0.7
Total GHG Emissions (Metric Tons CO2e) 74 0.6
Indirect GHG Emissions (Metric Tons CO2e) 65 0.6
Metered Areas (Energy) 57 0.5
DOF Benchmarking Submission Status 30 0.3
NYC Borough, Block and Lot (BBL) self-reported 11 0.1
Largest Property Use Type - Gross Floor Area (ft²) 2 0.0
Largest Property Use Type 2 0.0
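# Use 50% as the threshold for dropping columns
missing_df = missing_values_table(data);
# Collect the columns that are more than 50% missing; they are dropped below
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('We will remove %d columns.' % len(missing_columns))

# Of the 55 original columns, 40 have missing values; the 12 that are more than 50% missing will be removed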
Your selected dataframe has 55 columns.
There are 40 columns that have missing values.
We will remove 12 columns.
# Drop every column that is more than 50% missing
data = data.drop(columns = list(missing_columns))
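
The threshold rule can be checked on a tiny frame. The sketch below uses made-up column names and values; it computes the missing percentage directly with `isna().mean()` and shows that a column at exactly 50% survives a strict `> 50` test.

```python
import numpy as np
import pandas as pd

# Toy frame: 75%, 50% and 0% missing (made-up values).
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, np.nan, 2.0, 3.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100

# Columns strictly above the 50% threshold are dropped.
to_drop = missing_pct[missing_pct > 50].index.tolist()
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

The 50% cut-off itself is a judgment call; a different project might reasonably keep or drop more.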

2 Exploratory Data Analysis

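2.1 Univariate plots

# Set the figure width and height
figsize(8, 8)

# The target y is the energy score from 1 to 100; rename the column to score
data = data.rename(columns = {'ENERGY STAR Score': 'score'})

# Pick a matplotlib style sheet; each style gives a different background and look
plt.style.use('fivethirtyeight')

# dropna() filters out the missing values before plotting
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');

plt.xlabel('Score'); plt.ylabel('Number of Buildings');

plt.title('Energy Star Score Distribution');

# The histogram has spikes at 1 and 100. The underlying reports are filled in by the property owners themselves, who rate their own building's energy use,
# so the counts at those two scores look inflated. Our goal, however, is only to predict the score, not to design a better scoring method. We can note the suspicious distribution in the report and stay focused on prediction.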

# List the style names that can be passed to plt.style.use()
plt.style.available

help(plt.hist)
# The same histogram drawn with a different style sheet
figsize(8, 8)

# Rename the target column to score (has no effect if it was already renamed)
data = data.rename(columns = {'ENERGY STAR Score': 'score'})

# Switch to another matplotlib style
plt.style.use('dark_background')

# hist draws a histogram
# dropna() drops the rows whose score is missing; the remaining values are plotted
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');

plt.xlabel('Score'); plt.ylabel('Number of Buildings');

plt.title('Energy Star Score Distribution');
plt.style.available
print(data.columns)
# The same histogram again, on a larger 10 x 10 figure
figsize(10, 10)

# Rename the target column to score (has no effect if it was already renamed)
data = data.rename(columns = {'ENERGY STAR Score': 'score'})

# Back to the fivethirtyeight style
plt.style.use('fivethirtyeight')

# dropna() filters out the missing values before plotting
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');

plt.xlabel('Score'); plt.ylabel('Number of Buildings');

plt.title('Energy Star Score Distribution');

# Site EUI (kBtu/ft²): energy use intensity
figsize(8, 8)
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'black'); # black bar edges
plt.xlabel('Site EUI'); 
plt.ylabel('Count'); plt.title('Site EUI Distribution');

# This reveals another problem: a few buildings with extremely high values skew the plot badly, so the outliers have to be handled.
# The last value is clearly abnormal. Outliers have many possible causes: typos, faulty measuring equipment, wrong units, or values that are legitimate but extreme.
# In other words, when many points lie far from the mean, there are outliers.
# The same plot with red bar edges
figsize(8, 8)
# edgecolor: the color of the bar edges in the histogram
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'red'); # red bar edges
plt.xlabel('Site EUI'); 
plt.ylabel('Count'); plt.title('Site EUI Distribution');
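data['Site EUI (kBtu/ft²)'].describe() 
# The mean is small but the standard deviation is very large, so many points lie far from the mean: there are outliers. The minimum is 0 and the maximum is 869265.
# dropna() filters out the missing values
# sort_values() sorts by value in ascending order by default (index on the left, values on the right); tail(10) then shows the 10 largest
# Site EUI: energy use intensity
data['Site EUI (kBtu/ft²)'].dropna().sort_values().tail(10)
# To inspect the outlier, select the row whose Site EUI equals 869265
data.loc[data['Site EUI (kBtu/ft²)'] == 869265, :]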

# .ix was removed in pandas 1.0, and .iloc does not accept a boolean Series as a mask;
# use .loc as above, or pass .iloc a NumPy boolean array
# Select the row whose Site EUI equals 869265
data.iloc[(data['Site EUI (kBtu/ft²)'] == 869265).values, :]
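
The difference between label-based and position-based selection with a boolean mask can be seen on a toy frame. This is a minimal sketch with made-up values and a non-default index; the column name `eui` is an assumption for illustration.

```python
import pandas as pd

# Toy frame with a non-default index (made-up values).
toy = pd.DataFrame({"eui": [61.8, 78.5, 869265.0]}, index=[10, 20, 30])

# Boolean Series marking the row that holds the largest value.
mask = toy["eui"] == toy["eui"].max()

# .loc accepts the boolean Series directly (aligned on the index).
by_label = toy.loc[mask, :]

# .iloc needs a plain boolean array, so convert the Series first.
by_position = toy.iloc[mask.to_numpy(), :]

print(by_label)
```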

2.2 Removing outliers

# Take the 25% and 75% quantiles from describe()
first_quartile = data['Site EUI (kBtu/ft²)'].describe()['25%'] 
third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']

# Their difference is the interquartile range (IQR)
iqr = third_quartile - first_quartile


# Keep the normal data, Q1 - 3*IQR < EUI < Q3 + 3*IQR, and filter out everything else as outliers
# The values between these two bounds are the non-outliers we want
data = data[(data['Site EUI (kBtu/ft²)'] > (first_quartile - 3 * iqr)) &
            (data['Site EUI (kBtu/ft²)'] < (third_quartile + 3 * iqr))]
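
The same rule can be wrapped in a small helper and tried on toy data. The function name `drop_extreme_outliers`, its parameter `k`, and the values below are assumptions for illustration. Note that rows with a missing value in the column are also removed, because comparisons with NaN evaluate to False; the filter above behaves the same way.

```python
import pandas as pd

def drop_extreme_outliers(df, column, k=3.0):
    """Keep rows with Q1 - k*IQR < value < Q3 + k*IQR (hypothetical helper)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    keep = (df[column] > lower) & (df[column] < upper)
    return df[keep]

# Toy data: five ordinary values and one extreme one (made up).
toy = pd.DataFrame({"eui": [60.0, 70.0, 80.0, 90.0, 100.0, 869265.0]})
cleaned = drop_extreme_outliers(toy, "eui")

print(cleaned)
```

Using k = 3 targets only extreme outliers; the more common k = 1.5 would remove more points.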

# Site EUI (energy use intensity) after removing the outliers; the distribution should now look roughly normal
figsize(8, 8)
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'black');
plt.xlabel('Site EUI'); 
plt.ylabel('Count'); plt.title('Site EUI Distribution');

2.3 Finding which variables affect the result

types = data.dropna(subset=['score'])

# Largest Property Use Type: the building's main use
# The column has many values; the notes list those with more than 100 buildings as Multifamily Housing, Office, Hotel,
# and "Data Center, Non-Refrigerated Warehouse, Office"

types = types['Largest Property Use Type'].value_counts()
# Without .index this keeps the counts themselves, not the type names
types = list(types[types.values > 100])
print(types)
types = data.dropna(subset=['score'])

# Taking .index keeps the names of the types with more than 100 buildings, which is what the plots need

types = types['Largest Property Use Type'].value_counts()
types = list(types[types.values > 100].index)
print(types)
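
The difference between the two versions is easy to see on a toy Series. In this sketch the category names and the threshold of 1 are made up for illustration.

```python
import pandas as pd

# Toy column of building types (made-up data).
use_types = pd.Series(["Office"] * 3 + ["Hotel"] * 2 + ["Library"])

# value_counts() returns counts indexed by category, largest first.
counts = use_types.value_counts()

# Filtering the Series keeps the counts as values...
frequent_counts = list(counts[counts > 1])

# ...while the category names live in the index.
frequent_names = list(counts[counts > 1].index)

print(frequent_counts)
print(frequent_names)
```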

# Look for building types whose score distributions differ clearly
# Largest Property Use Type: the building's main use
figsize(12, 10)

# b_type loops over the building types kept in types
for b_type in types:
    # Rows whose Largest Property Use Type equals the current b_type
    subset = data[data['Largest Property Use Type'] == b_type] 
    
    # Density of this subset's scores; alpha sets the transparency
    sns.kdeplot(subset['score'].dropna(),
               label = b_type, alpha = 0.5);
    
# The x axis is the energy score, the y axis is the density
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20); 
plt.title('Density Plot of Energy Star Scores by Building Type', size = 28);
plt.legend();

# The red and yellow curves differ markedly
# Debugging: print each subset to inspect it
# Largest Property Use Type: the building's main use
figsize(12, 10)

# b_type is the loop variable; types was described as 4 types, but the list actually holds 7 elements, i.e. 7 types, of which only the first 4 were shown
for b_type in types:
    # Rows whose Largest Property Use Type equals the current b_type
    subset = data[data['Largest Property Use Type'] == b_type] 
    print(subset)
# Debugging: print only the non-missing scores of each subset
figsize(12, 10)

for b_type in types:
    # Rows whose Largest Property Use Type equals the current b_type
    subset = data[data['Largest Property Use Type'] == b_type] 
    print(subset['score'].dropna())


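# See how the score relates to the borough
boroughs = data.dropna(subset=['score'])
# Count the buildings per borough and keep boroughs with more than 100 observations
boroughs = boroughs['Borough'].value_counts()
boroughs = list(boroughs[boroughs.values > 100].index)
print(boroughs)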


# The 4 curves differ little, so the borough has only a weak effect on the score
# Borough has 5 values: Manhattan, Brooklyn, Queens, Bronx and Staten Island

figsize(12, 10)

# Plot the score density for each borough; x-axis: Energy Star Score, y-axis: density
for borough in boroughs:
    subset = data[data['Borough'] == borough]

    sns.kdeplot(subset['score'].dropna(),
                label = borough);

plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20);
plt.title('Density Plot of Energy Star Scores by Borough', size = 28);
# Needed on seaborn versions that do not add the legend automatically
plt.legend();


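# corr() returns the matrix of pairwise correlation coefficients; take the column for 'score'.
# Most correlations are negative and few are positive. Features whose coefficient is
# close to 0 (e.g. -0.046605) carry little linear signal and are candidates for removal.
# select_dtypes keeps the numeric columns only, since correlation is undefined for strings
correlations_data = data.select_dtypes('number').corr()['score'].sort_values()  # ascending order

# The 10 most negative correlations
print(correlations_data.head(10), '\n')
print("---------------------------")
# The 10 most positive correlations
print(correlations_data.tail(10))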

When corr() is called without arguments, it uses the Pearson method by default.
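To make the Pearson coefficient concrete, here is a small sketch that computes it by hand and compares it with what pandas returns. The helper name `pearson` and the toy columns `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()
    y_c = y - y.mean()
    return float((x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum()))

# Toy data (made up)
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'b': [2.0, 1.0, 4.0, 3.0, 5.0]})

manual = pearson(df['a'], df['b'])
from_pandas = df.corr()['a']['b']  # default method is 'pearson'
print(manual, from_pandas)
```

Both values agree, which is a quick way to check that the default really is Pearson rather than a rank-based method.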

3 Feature engineering

3.1 Feature transformations

import warnings
warnings.filterwarnings("ignore")

# Keep only the numeric columns; string and other non-numeric columns are left out.
# copy() so the new columns are added to an independent frame, not a view of data
numeric_subset = data.select_dtypes('number').copy()

# Add square-root and log transforms of every numeric column
for col in list(numeric_subset.columns):
    # 'score' is the label (the y of the regression); every other column is a feature x.
    # Leave the label untransformed. `continue` skips it; a bare `next` does nothing.
    if col == 'score':
        continue
    # Transform the whole column at once
    numeric_subset['sqrt_' + col] = np.sqrt(numeric_subset[col])
    numeric_subset['log_' + col] = np.log(numeric_subset[col])

# Borough and Largest Property Use Type are the categorical columns
categorical_subset = data[['Borough', 'Largest Property Use Type']]
print(categorical_subset)

# One-hot encode the categorical columns with get_dummies
categorical_subset = pd.get_dummies(categorical_subset)
print(categorical_subset)

# Join the numeric and the one-hot encoded columns side by side
print(numeric_subset)
features = pd.concat([numeric_subset, categorical_subset], axis = 1)

# Drop buildings without a score
features = features.dropna(subset = ['score'])

# Correlations with the score, sorted in ascending order
correlations = features.corr()['score'].dropna().sort_values()
print(correlations)


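Two kinds of correlation matter here: correlation between features, and correlation between each feature and the score.
A correlation can be linear or nonlinear.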
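The reason for trying sqrt and log transforms is that the Pearson coefficient only measures linear association, so a nonlinear relationship can look weaker than it is. A minimal sketch on synthetic data (the column names and the exact relationship are made up; the real dataset is not used here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 1000.0, size=500)
target = np.log(x)  # the target depends on x through a log, not linearly

df = pd.DataFrame({
    'x': x,
    'sqrt_x': np.sqrt(x),
    'log_x': np.log(x),
    'target': target,
})

# Pearson correlation of each version of the feature with the target
corr = df.corr()['target'].drop('target')
print(corr.sort_values())
```

The log-transformed column correlates perfectly with the target while the raw column does not, even though both carry exactly the same information. On real data the gain is usually far smaller, as the author observes for the EUI columns.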

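# Transformed columns carry the prefix sqrt_ or log_
# The 15 most negative correlations
correlations.head(15)

# Weather Normalized Site EUI (kBtu/ft²) and its sqrt_ transform correlate with the
# score almost equally, so the transform adds little value.
# The values are all similar, with no clear trend.
# The 15 most positive correlations
correlations.tail(15)

Generally, anything that can be done with head() can also be done with tail().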

3.2 Two-variable plots


import warnings
warnings.filterwarnings("ignore")
figsize(12, 10)

# Relationship between the Energy Star Score and Site EUI, colored by building type
features['Largest Property Use Type'] = data.dropna(subset = ['score'])['Largest Property Use Type']

# isin() takes a list; keep only the rows whose building type is in `types`
features = features[features['Largest Property Use Type'].isin(types)]

# hue colors the points by building type.
# x and y are passed by keyword and `height` replaces the older `size` argument,
# which recent seaborn releases require.
sns.lmplot(x = 'Site EUI (kBtu/ft²)', y = 'score',
           hue = 'Largest Property Use Type', data = features,
           scatter_kws = {'alpha': 0.8, 's': 60}, fit_reg = False,
           height = 12, aspect = 1.2);

# Plot labeling
plt.xlabel("Site EUI", size = 28)
plt.ylabel('Energy Star Score', size = 28)
plt.title('Energy Star Score vs Site EUI', size = 36);

3.3 Removing collinear features


# Work on a copy so the original data stay unchanged
features = data.copy()

# select_dtypes() picks columns by dtype; 'number' selects the numeric features
numeric_subset = data.select_dtypes('number').copy()

# Add a log transform of every numeric column
for col in list(numeric_subset.columns):
    # Skip the Energy Star Score, which is the target y
    if col == 'score':
        continue
    numeric_subset['log_' + col] = np.log(numeric_subset[col])

# Borough and Largest Property Use Type (multifamily housing, office, hotel,
# non-refrigerated warehouse, ...) are the categorical columns
categorical_subset = data[['Borough', 'Largest Property Use Type']]

# get_dummies is the pandas way to one-hot encode
categorical_subset = pd.get_dummies(categorical_subset)

# Join the numeric features with the encoded borough and use-type columns
features = pd.concat([numeric_subset, categorical_subset], axis = 1)

features.shape  # 110 columns, more than in the original data


# Weather Normalized Site EUI (kBtu/ft²): energy use intensity normalized for weather
# Site EUI: energy use intensity

plot_data = data[['Weather Normalized Site EUI (kBtu/ft²)', 'Site EUI (kBtu/ft²)']].dropna()
# 'bo' draws blue circle markers
plt.plot(plot_data['Site EUI (kBtu/ft²)'], plot_data['Weather Normalized Site EUI (kBtu/ft²)'], 'bo')
# x-axis: Site EUI, y-axis: Weather Normalized Site EUI
plt.xlabel('Site EUI'); plt.ylabel('Weather Norm EUI')
plt.title('Weather Norm EUI vs Site EUI, R = %0.4f' % np.corrcoef(plot_data, rowvar=False)[0][1]);
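# Remove collinear features: when two features are highly correlated with each
# other, drop one of the pair.
# threshold: the correlation cutoff, found by trial and error.
def remove_collinear_features(x, threshold):
    y = x['score']                    # keep the label aside
    x = x.drop(columns = ['score'])   # everything else is a feature
    # Repeat until no pair of features exceeds the threshold
    while True:
        # Matrix of pairwise correlation coefficients
        corr_matrix = x.corr()

        # Zero the diagonal: a feature's correlation with itself is 1 and
        # would always exceed the threshold
        for i in range(len(corr_matrix)):
            corr_matrix.iloc[i, i] = 0

        # Features to drop in this pass
        drop_cols = []

        # Iterating over a DataFrame yields its column names
        for col in corr_matrix:
            # corr(A, B) equals corr(B, A); skipping marked columns avoids dropping both
            if col not in drop_cols:
                # Absolute correlations of this column with every other feature
                v = np.abs(corr_matrix[col])
                # If the strongest correlation exceeds the threshold ...
                if np.max(v) > threshold:
                    # ... mark the partner feature. idxmax() returns its column name,
                    # whereas np.argmax() returns an integer position in recent pandas
                    name = v.idxmax()
                    drop_cols.append(name)

        # Drop the marked features to reduce model complexity,
        # or stop when nothing is left to drop
        if drop_cols:
            x = x.drop(columns=drop_cols)
        else:
            break

    # Put the label back
    x['score'] = y

    return x
help(remove_collinear_features)
# The author could not get the call below to run and left it commented out.
# A likely cause (an assumption, not confirmed by the author) is that np.argmax
# returned an integer position rather than a column name; the function above uses idxmax().
# # Threshold of 0.6: drop one feature of every pair whose absolute correlation exceeds 0.6
# features = remove_collinear_features(features, 0.6);

# Drop columns that are entirely NaN
features  = features.dropna(axis=1, how = 'all')
features.shape  # was 110 before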
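As a rough illustration of what collinearity removal does, here is a simplified one-pass variant on synthetic data. It is not the author's iterative function: the helper name `drop_collinear`, the upper-triangle trick and the toy columns are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold, target='score'):
    """Drop the later column of every feature pair whose |correlation| exceeds threshold."""
    features = df.drop(columns=[target])
    corr = features.corr().abs()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair is seen once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'a_copy': a * 2.0 + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of 'a'
    'b': rng.normal(size=200),                              # independent of 'a'
    'score': rng.normal(size=200),
})

reduced = drop_collinear(demo, threshold=0.6)
print(list(reduced.columns))
```

The near-duplicate column is removed while the independent feature and the label are kept. Which member of a correlated pair gets dropped depends only on column order here, which is a simplification.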

4 Splitting the dataset

4.1 Splitting the data

# isna() is True where the score is NaN
no_score = features[features['score'].isna()]
# notnull() is True where the score is present
score = features[features['score'].notnull()]

print(no_score.shape)
print(score.shape)
# Separate the features from the target (the building's score)
features = score.drop(columns='score')
targets = pd.DataFrame(score['score'])

# np.inf and -np.inf are positive and negative infinity (e.g. from log(0));
# replace them with NaN so they are treated as missing values
features = features.replace({np.inf: np.nan, -np.inf: np.nan})

# random_state fixes the seed of the random numbers used to pick the rows, so the same
# data always produce the same training and test sets. Without it the split changes on
# every run, which makes tuning results impossible to compare. Any value works, as long
# as the same value is reused whenever the same split is needed.
X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)

print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)
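The effect of `random_state` is easy to check on a toy array. In this sketch the arrays `X_demo` and `y_demo` are made up; only `train_test_split` and its arguments come from the code above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Every returned piece (X_train, X_test, y_train, y_test) matches between the calls
same = all(np.array_equal(p, q) for p, q in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```

With `test_size = 0.3`, three of the ten rows go to the test set and seven to the training set.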


4.2 Establishing a baseline


# MAE (mean absolute error): the mean of |true value - predicted value| over the n samples
# abs() takes the absolute value
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
# Baseline: always predict the median of the training scores
baseline_guess = np.median(y)

print('The baseline guess is a score of %0.2f' % baseline_guess) # the median is 66
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess)) # MAE = 24.5164
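The baseline can be worked through by hand: MAE = (1/n) Σ |y_i − ŷ_i|, with ŷ_i fixed at the training median for every building. A small sketch with made-up scores (none of these numbers come from the dataset):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true values and predictions."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Made-up scores for illustration
train_scores = np.array([10.0, 40.0, 60.0, 80.0, 100.0])
test_scores = np.array([20.0, 60.0, 90.0])

# Baseline: predict the training median for every test building
baseline_guess = float(np.median(train_scores))
baseline_mae = mae(test_scores, baseline_guess)

# Absolute errors are 40, 0 and 30, so the MAE is 70 / 3
print(baseline_guess, baseline_mae)
```

Any model worth keeping should beat this number on the test set; in the author's run the figure to beat is 24.5164.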


4.3 Saving the results for modeling

# Save the no scores, training, and testing data
# to_csv writes each DataFrame to a CSV file, e.g. data/no_score.csv
no_score.to_csv('data/no_score.csv', index = False)
X.to_csv('data/training_features.csv', index = False)
X_test.to_csv('data/testing_features.csv', index = False)
y.to_csv('data/training_labels.csv', index = False)
y_test.to_csv('data/testing_labels.csv', index = False)

5 Building baseline models and trying several algorithms

# The earlier sections covered data preparation; this one focuses on modeling.
# Data analysis libraries
import pandas as pd
import numpy as np

# Silence the chained-assignment warning and show up to 60 columns
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 60)

# Visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Default font size
plt.rcParams['font.size'] = 24

from IPython.core.pylabtools import figsize

# Seaborn, a higher-level visualization library
import seaborn as sns
sns.set(font_scale = 2)

# Preprocessing: missing-value imputation and min-max scaling
# The original code imported Imputer from sklearn.preprocessing:
# from sklearn.preprocessing import Imputer, MinMaxScaler
# The author replaced it with SimpleImputer from sklearn.impute
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

# Machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Hyperparameter tuning tools
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

import warnings
warnings.filterwarnings("ignore")
# Read in data into dataframes 
train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')

# Display sizes of data
print('Training Feature Size: ', train_features.shape)
print('Testing Feature Size: ', test_features.shape)
print('Training Labels Size: ', train_labels.shape)
print('Testing Labels Size: ', test_labels.shape)

5.1 Imputing missing values


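# The original code used the older Imputer class:
# imputer = Imputer(strategy='median')
# The author replaced it with SimpleImputer. The data contain outliers, both large
# and small, so the median is a better fill value than the mean.
imputer = SimpleImputer(strategy='median')
# Fit on the training features only
imputer.fit(train_features)

# Transform both the training and the test data
X = imputer.transform(train_features)      # training set with NaNs replaced by the training medians
X_test = imputer.transform(test_features)  # test set filled with the same training medians
# Check whether any missing values remain in the training and test features
# np.isnan flags NaN entries
print('Missing values in training features: ', np.sum(np.isnan(X)))  # 0 means imputation is complete
print('Missing values in testing features: ', np.sum(np.isnan(X_test)))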
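The point of fitting on the training features only is that the test set must not influence the fill values. A minimal sketch with made-up arrays (`train` and `test` here are toy data, not the CSV files above):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training and test arrays with missing entries
train = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [100.0, 50.0]])   # 100.0 is an outlier in column 0
test = np.array([[np.nan, 20.0],
                 [5.0, np.nan]])

imputer = SimpleImputer(strategy='median')
imputer.fit(train)  # medians are learned from the training data only

train_filled = imputer.transform(train)
test_filled = imputer.transform(test)

print(imputer.statistics_)  # per-column training medians
print(test_filled)
```

The outlier barely moves the median of the first column, whereas a mean would have been pulled far toward it. The test rows are filled with the training medians, never with statistics of the test set itself.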

5.2 Feature scaling and normalization


# feature_range=(0, 1) scales each feature to the range 0 to 1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
scaler.fit(X)

# Scale the training and the test data with the training minima and maxima
X = scaler.transform(X)
X_test = scaler.transform(X_test)
# The labels are stored as a single column; reshape((-1,)) flattens them into a 1-D array
y = np.array(train_labels).reshape((-1,))
y_test = np.array(test_labels).reshape((-1, ))
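A small sketch of what the scaler and the reshape do, using made-up arrays rather than the project data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)  # learns min=0 and max=10 from the training data

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
print(train_scaled.ravel(), test_scaled.ravel())

# Labels read from a one-column CSV arrive with shape (n, 1); flatten them to (n,)
labels = np.array([[66], [80], [12]])
flat = labels.reshape((-1,))
print(labels.shape, flat.shape)
```

Because the minimum and maximum come from the training data, a test value beyond the training range lands outside [0, 1]. That is expected and is not a sign of a bug.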

6 Building baseline models and trying several algorithms (regression)

6.1 Defining the loss function


# The loss used here is MAE; abs() takes the absolute value
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))


# Train a model and evaluate it on the test set
def fit_and_evaluate(model):

    # Train the model
    model.fit(X, y)

    # Predict on the test data and compute the error
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)

    return model_mae

6.2 Choosing machine learning algorithms

lr = LinearRegression()  # linear regression
lr_mae = fit_and_evaluate(lr)

print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)
svm = SVR(C = 1000, gamma = 0.1)  # support vector machine
svm_mae = fit_and_evaluate(svm)

print('Support Vector Machine Regression Performance on the test set: MAE = %0.4f' % svm_mae)
random_forest = RandomForestRegressor(random_state=60)  # random forest (ensemble)
random_forest_mae = fit_and_evaluate(random_forest)

print('Random Forest Regression Performance on the test set: MAE = %0.4f' % random_forest_mae)
gradient_boosted = GradientBoostingRegressor(random_state=60)  # gradient boosted trees
gradient_boosted_mae = fit_and_evaluate(gradient_boosted)

print('Gradient Boosted Regression Performance on the test set: MAE = %0.4f' % gradient_boosted_mae)
knn = KNeighborsRegressor(n_neighbors=10)  # k-nearest neighbors
knn_mae = fit_and_evaluate(knn)

print('K-Nearest Neighbors Regression Performance on the test set: MAE = %0.4f' % knn_mae)
plt.style.use('fivethirtyeight')
figsize(8, 6)

model_comparison = pd.DataFrame({'model': ['Linear Regression', 'Support Vector Machine',
                                           'Random Forest', 'Gradient Boosted',
                                           'K-Nearest Neighbors'],
                                 'mae': [lr_mae, svm_mae, random_forest_mae,
                                         gradient_boosted_mae, knn_mae]})

# ascending=False sorts from the largest to the smallest MAE; barh draws a horizontal bar chart
model_comparison.sort_values('mae', ascending = False).plot(x = 'model', y = 'mae', kind = 'barh',
                                                           color = 'red', edgecolor = 'black')

# y-axis: model names, x-axis: MAE; yticks and xticks set the tick label size
plt.ylabel(''); plt.yticks(size = 14); plt.xlabel('Mean Absolute Error'); plt.xticks(size = 14)
plt.title('Model Comparison on Test MAE', size = 20);


7 Hyperparameter tuning

7.1 Tuning


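# Loss functions to try. These are the names used by the scikit-learn version in the
# author's run; from scikit-learn 1.0 on, 'ls' and 'lad' are called
# 'squared_error' and 'absolute_error'.
loss = ['ls', 'lad', 'huber']

# Number of weak learners (decision trees)
n_estimators = [100, 500, 900, 1100, 1500]

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# Minimum number of samples required at a leaf node
min_samples_leaf = [1, 2, 4, 6, 8]

# Minimum number of samples required to split a node
min_samples_split = [2, 4, 6, 10]

hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split}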


model = GradientBoostingRegressor(random_state = 42)

# Randomized search: 25 sampled combinations, each scored with 4-fold cross-validation
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25,
                               scoring = 'neg_mean_absolute_error',  # metric used to rank the candidates
                               n_jobs = -1, verbose = 1,
                               return_train_score = True,
                               random_state=42)
# Note: this is slow; it took about 14 minutes in the author's run
random_cv.fit(X, y)
help(GradientBoostingRegressor)
RandomizedSearchCV(cv=4, error_score='raise-deprecating',
                   estimator=GradientBoostingRegressor(alpha=0.9,
                                                       criterion='friedman_mse',
                                                       init=None,
                                                       learning_rate=0.1,
                                                       loss='ls', max_depth=3,
                                                       max_features=None,
                                                       max_leaf_nodes=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                                                       min_weight_fraction_leaf=0.0,
                                                       n_estimators=100,
                                                       verbose=0,
                                                       warm_start=False),
                   iid='warn', n_iter=25, n_jobs=-1,
                   param_distributions={
    'loss': ['ls', 'lad', 'huber'],
                                        'max_depth': [2, 3, 5, 10, 15],
                                        'min_samples_leaf': [1, 2, 4, 6, 8],
                                        'min_samples_split': [2, 4, 6, 10],
                                        'n_estimators': [100, 500, 900, 1100,
                                                         1500]},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=True, scoring='neg_mean_absolute_error',
                   verbose=1)
random_cv.best_estimator_  # the best estimator found by the search
# In the author's session this cell raised
# "NameError: name 'random_cv' is not defined" (input 187), which suggests that the
# search cell above had not been run in that session.
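For readers who want to see the search run end to end without the 14-minute wait, here is a scaled-down sketch on synthetic data. The data, the tiny parameter lists and the variable names are assumptions for illustration; only the `RandomizedSearchCV` usage mirrors the code above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem (made up): the target depends on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

# Deliberately tiny search space so the example finishes quickly
param_distributions = {
    'n_estimators': [20, 40],
    'max_depth': [2, 3],
    'min_samples_leaf': [1, 4],
}

search = RandomizedSearchCV(estimator=GradientBoostingRegressor(random_state=42),
                            param_distributions=param_distributions,
                            n_iter=4, cv=3,
                            scoring='neg_mean_absolute_error',
                            random_state=42)
search.fit(X_demo, y_demo)

# The scorer is the negative MAE (higher is better), so flip the sign to read it as an error
best_mae = -search.best_score_
print(search.best_params_, best_mae)
```

`best_estimator_` is only available after `fit` has completed in the same session, which is the step the author's notebook was missing when the error above appeared.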
# Grid over the number of trees
trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}

# Build the model
# 'lad': least absolute deviation loss (named 'absolute_error' from scikit-learn 1.0 on)
model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,
                                  min_samples_leaf = 6,
                                  min_samples_split = 6,
                                  max_features = None,
                                  random_state = 42)

# Set up the grid search
grid_search = GridSearchCV(estimator = model, param_grid=trees_grid, cv = 4,
                           scoring = 'neg_mean_absolute_error', verbose = 1,
                           n_jobs = -1, return_train_score = True)
# Takes about 3 minutes
grid_search.fit(X, y)
GridSearchCV(cv=4, error_score='raise-deprecating',
             estimator=GradientBoostingRegressor(alpha=0.9,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.1,
                                                 loss='lad', max_depth=5,
                                                 max_features=None,
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=6,
                                                 min_samples_split=6,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100,
                                                 n_iter_no_change=None,
                                                 presort='auto',
                                                 random_state=42, subsample=1.0,
                                                 tol=0.0001,
                                                 validation_fraction=0.1,
                                                 verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={
    'n_estimators': [100, 150, 200, 250, 300, 350, 400,
                                          450, 500, 550, 600, 650, 700, 750,
                                          800]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='neg_mean_absolute_error', verbose=1)

7.2 Comparing training and testing loss

# Load the search results into a DataFrame
results = pd.DataFrame(grid_search.cv_results_)

# Plotting
figsize(8, 8)
plt.style.use('fivethirtyeight')

plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Testing Error')
plt.plot(results['param_n_estimators'], -1 * results['mean_train_score'], label = 'Training Error')
# x-axis: number of trees, y-axis: MAE
plt.xlabel('Number of Trees'); plt.ylabel('Mean Absolute Error'); plt.legend();
plt.title('Performance vs Number of Trees');
# Overfitting: the blue curve flattens out while the red curve keeps falling steeply,
# so the gap between them widens as trees are added

8 Evaluation and testing: plotting the difference between predicted and true values

# Model with default hyperparameters, for comparison
default_model = GradientBoostingRegressor(random_state = 42)
default_model.fit(X,y)
# Model with the best hyperparameters found by the grid search
final_model = grid_search.best_estimator_

final_model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)
print('Default model performance on the test set: MAE = %0.4f.' % mae(y_test, default_pred))
print('Final model performance on the test set: MAE = %0.4f.' % mae(y_test, final_pred))
# Call figsize(); assigning a tuple to the name would overwrite the imported function
figsize(6, 6)

# Residuals = predictions - true test values; most fall within +/-25
residuals = final_pred - y_test

plt.hist(residuals, color = 'red', bins = 20,
         edgecolor = 'black')
plt.xlabel('Error'); plt.ylabel('Count')
plt.title('Distribution of Residuals');


9 Interpreting the model: feature selection based on importance

import pandas as pd
import numpy as np


pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 60)


import matplotlib.pyplot as plt
%matplotlib inline


plt.rcParams['font.size'] = 24

from IPython.core.pylabtools import figsize

import seaborn as sns

sns.set(font_scale = 2)



# The original code imported Imputer from sklearn.preprocessing:
# from sklearn.preprocessing import Imputer, MinMaxScaler
# The author replaced it with SimpleImputer from sklearn.impute
from sklearn.preprocessing import  MinMaxScaler
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

from sklearn import tree



import warnings
warnings.filterwarnings("ignore")


# Replace missing values with the median
# The original code used the older Imputer class:
# imputer = Imputer(strategy='median')
imputer = SimpleImputer(strategy='median')

# Fit on the training features
imputer.fit(train_features)

X = imputer.transform(train_features)
# Missing values in the test set are also filled with the training medians
X_test = imputer.transform(test_features)


y = np.array(train_labels).reshape((-1,))
y_test = np.array(test_labels).reshape((-1,))
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                  min_samples_leaf=6, min_samples_split=6, 
                                  n_estimators=800, random_state=42)

model.fit(X, y)
# Use the gradient boosting model as the final model
model_pred = model.predict(X_test)

print('Final Model Performance on the test set: MAE = %0.4f' % mae(y_test, model_pred))
# Feature importances
feature_results = pd.DataFrame({'feature': list(train_features.columns),  # all training features
                                'importance': model.feature_importances_})

# Sort by importance in descending order and show the top 10
feature_results = feature_results.sort_values('importance', ascending = False).reset_index(drop=True)

feature_results.head(10)
figsize(12, 10)
plt.style.use('fivethirtyeight')

# Plot the 10 most important features as a horizontal bar chart
feature_results.loc[:9, :].plot(x = 'feature', y = 'importance',
                                 edgecolor = 'k',
                                 kind='barh', color = 'blue');
plt.xlabel('Relative Importance', size = 20); plt.ylabel('')
# The model is a gradient boosting regressor, so the title names it accordingly
plt.title('Feature Importances from Gradient Boosting', size = 30);
most_important_features = feature_results['feature'][:10]  # names of the top 10 features
# Column positions of those 10 features in the training data (list comprehension)
indices = [list(train_features.columns).index(x) for x in most_important_features]


X_reduced = X[:, indices]
X_test_reduced = X_test[:, indices]

print('Most important training features shape: ', X_reduced.shape)
print('Most important testing features shape: ', X_test_reduced.shape)
lr = LinearRegression()


lr.fit(X, y)
lr_full_pred = lr.predict(X_test)


lr.fit(X_reduced, y)
lr_reduced_pred = lr.predict(X_test_reduced)


print('Linear Regression Full Results: MAE = %0.4f.' % mae(y_test, lr_full_pred))
print('Linear Regression Reduced Results: MAE = %0.4f.' % mae(y_test, lr_reduced_pred))
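The selection step above can be sketched on synthetic data where the informative columns are known in advance. Everything in this example (the data, the coefficients, keeping two features instead of ten) is made up for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: only columns 0 and 3 drive the target (made-up relationship)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=300)

model = GradientBoostingRegressor(n_estimators=50, random_state=42)
model.fit(X_demo, y_demo)

# feature_importances_ sums to 1; sort the column positions from most to least important
order = np.argsort(model.feature_importances_)[::-1]
top_k = order[:2]

# Keep only the most important columns
X_reduced = X_demo[:, top_k]
print(top_k, X_reduced.shape)
```

The two informative columns are recovered as the most important ones. Whether a reduced feature set helps a given model is an empirical question, which is why the author compares the full and the reduced linear regression above.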


Copyright notice
This article was written by [QT-Smile]. Please include a link to the original when reposting. Thank you.
https://cht.chowdera.com/2022/01/202201272014155704.html
