2021-04-27 Titanic Playground EDA Study

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model

https://www.kaggle.com/hiro5299834/tps-apr-2021-voting-pseudo-labeling

!pip install catboost
Collecting catboost
  Downloading https://files.pythonhosted.org/packages/47/80/8e9c57ec32dfed6ba2922bc5c96462cbf8596ce1a6f5de532ad1e43e53fe/catboost-0.25.1-cp37-none-manylinux1_x86_64.whl (67.3MB)
     |████████████████████████████████| 67.3MB 42kB/s 
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from catboost) (1.15.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from catboost) (1.4.1)
Requirement already satisfied: plotly in /usr/local/lib/python3.7/dist-packages (from catboost) (4.4.1)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.19.5)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.1.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from catboost) (3.2.2)
Requirement already satisfied: graphviz in /usr/local/lib/python3.7/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.7/dist-packages (from plotly->catboost) (1.3.3)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2.8.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (1.3.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (2.4.7)
Installing collected packages: catboost
Successfully installed catboost-0.25.1

Kaggle study

import pandas as pd
import numpy as np
import random
import os

from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold

import lightgbm as lgb
import catboost as ctb
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz

import graphviz
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter('ignore')
!pip install kaggle
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (1.5.12)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.8.1)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.15.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle) (2020.12.5)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.0.1)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.41.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (2.10)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle) (1.3)
!mkdir ~/.kaggle
!echo '{ "username": "<your-kaggle-username>", "key": "<your-kaggle-api-key>"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
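Writing the key directly into a notebook cell means it is published along with the notebook. A minimal sketch of an alternative, assuming the credentials are supplied from outside the notebook (for example through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables). The helper name `write_kaggle_json` and the demo values are made up for illustration, and the demo writes into a temporary directory rather than `~/.kaggle`:

```python
import json
import stat
import tempfile
from pathlib import Path


def write_kaggle_json(target_dir, username, key):
    """Write kaggle.json into target_dir, readable and writable by the owner only."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / "kaggle.json"
    path.write_text(json.dumps({"username": username, "key": key}))
    # Same permissions as `chmod 600`.
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return path


# Demo with placeholder credentials in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = write_kaggle_json(tmp, "demo-user", "demo-key")
    demo_content = json.loads(demo_path.read_text())
    demo_mode = stat.S_IMODE(demo_path.stat().st_mode)
```

In a real session the call would look like `write_kaggle_json(Path.home() / ".kaggle", os.environ["KAGGLE_USERNAME"], os.environ["KAGGLE_KEY"])`, so the key never appears in the notebook source.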
!kaggle competitions list
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.12 / client 1.5.4)
ref                                            deadline             category            reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
contradictory-my-dear-watson                   2030-07-01 23:59:00  Getting Started     Prizes        132           False  
gan-getting-started                            2030-07-01 23:59:00  Getting Started     Prizes        244           False  
tpu-getting-started                            2030-06-03 23:59:00  Getting Started  Knowledge        783           False  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       4207           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge      34273            True  
house-prices-advanced-regression-techniques    2030-01-01 00:00:00  Getting Started  Knowledge       8993            True  
connectx                                       2030-01-01 00:00:00  Getting Started  Knowledge        727           False  
nlp-getting-started                            2030-01-01 00:00:00  Getting Started  Knowledge       2403            True  
competitive-data-science-predict-future-sales  2022-12-31 23:59:00  Playground           Kudos      11051           False  
jane-street-market-prediction                  2021-08-23 23:59:00  Featured          $100,000       4245           False  
hungry-geese                                   2021-07-26 23:59:00  Playground          Prizes        556           False  
coleridgeinitiative-show-us-the-data           2021-06-22 23:59:00  Featured           $90,000        707           False  
bms-molecular-translation                      2021-06-02 23:59:00  Featured           $50,000        553            True  
birdclef-2021                                  2021-05-31 23:59:00  Research            $5,000        314           False  
iwildcam2021-fgvc8                             2021-05-26 23:59:00  Research         Knowledge         25           False  
herbarium-2021-fgvc8                           2021-05-26 23:59:00  Research         Knowledge         50           False  
plant-pathology-2021-fgvc8                     2021-05-26 23:59:00  Research         Knowledge        354           False  
hotel-id-2021-fgvc8                            2021-05-26 23:59:00  Research         Knowledge         67           False  
hashcode-2021-oqr-extension                    2021-05-25 23:59:00  Playground       Knowledge        136           False  
indoor-location-navigation                     2021-05-17 23:59:00  Research           $10,000       1020           False  
!kaggle competitions download -c tabular-playground-series-apr-2021
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.12 / client 1.5.4)
Downloading train.csv.zip to /content
  0% 0.00/2.13M [00:00<?, ?B/s]
100% 2.13M/2.13M [00:00<00:00, 72.1MB/s]
Downloading sample_submission.csv to /content
  0% 0.00/879k [00:00<?, ?B/s]
100% 879k/879k [00:00<00:00, 125MB/s]
Downloading test.csv.zip to /content
  0% 0.00/2.07M [00:00<?, ?B/s]
100% 2.07M/2.07M [00:00<00:00, 141MB/s]
TARGET = 'Survived'

N_ESTIMATORS = 1000
N_SPLITS = 10
SEED = 2021
EARLY_STOPPING_ROUNDS = 100
VERBOSE = 100
# Set the random seeds
def set_seed(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

set_seed(SEED)
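A small check of what seeding buys us, as a sketch: calling `set_seed` with the same value replays the same random draws from both `random` and NumPy. Note that setting `PYTHONHASHSEED` inside a running interpreter does not change that interpreter's string hashing; it only affects processes started afterwards.

```python
import os
import random

import numpy as np


def set_seed(seed=42):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)


set_seed(2021)
first = (random.random(), np.random.rand(3).tolist())

set_seed(2021)
second = (random.random(), np.random.rand(3).tolist())

print(first == second)  # True: re-seeding replays the same draws
```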

Data preprocessing

Load data

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
submission = pd.read_csv('sample_submission.csv')
#test_df['Survived'] = pd.read_csv("../input/submission-merged3/submission_merged3.csv")['Survived']

all_df = pd.concat([train_df, test_df]).reset_index(drop=True)
# reset_index rebuilds the index as 0..n-1; drop=True discards the old index instead of keeping it as a column.
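To see why the index is reset after concatenation, here is a toy sketch (the two small frames are made up): `pd.concat` keeps each frame's own index, so labels repeat until `reset_index(drop=True)` renumbers them. The rows coming from the test set have no `Survived` value, which is what later lets the combined frame be split back by position.

```python
import pandas as pd

train = pd.DataFrame({"PassengerId": [0, 1], "Survived": [1, 0]})
test = pd.DataFrame({"PassengerId": [2, 3]})

stacked = pd.concat([train, test])
print(stacked.index.tolist())  # [0, 1, 0, 1]

combined = stacked.reset_index(drop=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
print(combined["Survived"].isna().tolist())  # [False, False, True, True]
```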

print('Rows and Columns in train dataset:', train_df.shape)
print('Rows and Columns in test dataset:', test_df.shape)
Rows and Columns in train dataset: (100000, 12)
Rows and Columns in test dataset: (100000, 11)

Print the number of missing values

print('Missing values per columns in train dataset')
for col in train_df.columns:
    temp_col = train_df[col].isnull().sum()
    print(f'{col}: {temp_col}')
print()
print('Missing values per columns in test dataset')
for col in test_df.columns:
    temp_col = test_df[col].isnull().sum()
    print(f'{col}: {temp_col}')
Missing values per columns in train dataset
PassengerId: 0
Survived: 0
Pclass: 0
Name: 0
Sex: 0
Age: 3292
SibSp: 0
Parch: 0
Ticket: 4623
Fare: 134
Cabin: 67866
Embarked: 250

Missing values per columns in test dataset
PassengerId: 0
Pclass: 0
Name: 0
Sex: 0
Age: 3487
SibSp: 0
Parch: 0
Ticket: 5181
Fare: 133
Cabin: 70831
Embarked: 277
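The per-column loop can also be written as a single vectorized call. A sketch on a made-up toy frame: `isnull().sum()` counts missing values per column, and `isnull().mean()` gives the missing share, which is easier to compare across columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 35.0, np.nan],
        "Fare": [7.25, 71.28, np.nan, 8.05],
        "Sex": ["male", "female", "female", "male"],
    }
)

# Count of missing values per column.
missing = df.isnull().sum()
missing_counts = {col: int(n) for col, n in missing.items()}
print(missing_counts)  # {'Age': 2, 'Fare': 1, 'Sex': 0}

# Share of missing values per column (0.0 to 1.0).
missing_share = df.isnull().mean()
```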

Filling missing values

# Fill missing ages with the mean age.
all_df['Age'] = all_df['Age'].fillna(all_df['Age'].mean())

# Cabin: fill missing values with 'X', then keep only the first character (the deck letter).
# x[0] is a single character, so the strip() call has no practical effect here.
all_df['Cabin'] = all_df['Cabin'].fillna('X').map(lambda x: x[0].strip())


#print(all_df['Ticket'].head(10))
# Ticket: fill missing values with 'X', split the string and keep the first part.
# split() splits on whitespace by default. Tickets with a text prefix are written as "<prefix> <number>",
# so any ticket that does not split into more than one part (number only, or missing) becomes 'X'.
all_df['Ticket'] = all_df['Ticket'].fillna('X').map(lambda x:str(x).split()[0] if len(str(x).split()) > 1 else 'X')

# Build a dictionary of the median Fare per Pclass.
fare_map = all_df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()
# Fill each missing Fare with the median Fare of that row's Pclass.
all_df['Fare'] = all_df['Fare'].fillna(all_df['Pclass'].map(fare_map['Fare']))
# Some fares are unusually high or low, so take log1p of Fare to reduce the influence of outliers.
all_df['Fare'] = np.log1p(all_df['Fare'])


# Fill missing ports of embarkation with 'X'.
all_df['Embarked'] = all_df['Embarked'].fillna('X')

# Keep only the surname (the part of the name before the comma).
all_df['Name'] = all_df['Name'].map(lambda x: x.split(',')[0])
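The same three ideas on a toy frame, as a sketch with made-up rows: the cabin's first letter, the ticket prefix, and a class-wise median fill for the fare. The helper name `ticket_prefix` is introduced here only for readability.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 3, 3, 3],
        "Fare": [80.0, 100.0, np.nan, 8.0, 10.0, np.nan],
        "Cabin": ["B15315", None, "C123", None, None, "E46"],
        "Ticket": ["PC 17599", "24745", None, "A/5 21171", "STON/O 2. 3101282", "13264"],
    }
)

# Cabin: the first character is the deck letter; 'X' marks a missing cabin.
toy["Cabin"] = toy["Cabin"].fillna("X").map(lambda x: x[0])


def ticket_prefix(ticket):
    """Return the text prefix of a ticket, or 'X' when there is none."""
    parts = str(ticket).split()
    return parts[0] if len(parts) > 1 else "X"


toy["Ticket"] = toy["Ticket"].fillna("X").map(ticket_prefix)

# Fare: fill each gap with the median fare of the passenger's class.
fare_by_class = toy.groupby("Pclass")["Fare"].median()
toy["Fare"] = toy["Fare"].fillna(toy["Pclass"].map(fare_by_class))

print(toy)
```

In this toy data the class medians are 90.0 for first class and 9.0 for third class, so those are the values that land in the two gaps.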

data_1=all_df.loc[all_df['Pclass']==1].groupby('Ticket')['Ticket'].count().sort_values(ascending=False)
print(data_1)
print()
data_2=all_df.loc[all_df['Pclass']==2].groupby('Ticket')['Ticket'].count().sort_values(ascending=False)
print(data_2)
print()
data_3=all_df.loc[all_df['Pclass']==3].groupby('Ticket')['Ticket'].count().sort_values(ascending=False)
print(data_3)
print()
Ticket
X             36336
PC            16814
C.A.            338
SC/Paris        334
SC/PARIS        260
W./C.           206
S.O.C.          192
S.C./PARIS      191
PP              186
F.C.            183
SC/AH           178
F.C.C.          167
STON/O          163
CA.             161
SOTON/O.Q.      123
A/4             115
A/5.            108
W.E.P.           94
WE/P             92
SOTON/OQ         87
STON/O2.         81
CA               81
A/5              70
C                67
A/4.             66
P/PP             66
SC               59
SOTON/O2         48
A./5.            46
S.O./P.P.        40
A.5.             33
AQ/4             27
A/S              23
SCO/W            19
S.P.             17
SC/A4            16
SW/PP            16
S.O.P.           15
SC/A.3           15
SO/C             14
S.C./A.4.        14
C.A./SOTON       14
A.               14
STON/OQ.         13
W/C              13
S.W./PP          11
LP               11
AQ/3.             8
Fa                7
A4.               6
Name: Ticket, dtype: int64

Ticket
X             31337
A.              997
C.A.            717
SC/PARIS        470
STON/O          387
PC              330
S.O.C.          313
PP              308
SC/AH           284
W./C.           259
SOTON/O.Q.      219
F.C.C.          203
A/5.            200
A/4             152
SC/Paris        135
S.C./PARIS      119
SOTON/O2        112
CA.             107
STON/O2.        106
C               104
F.C.            100
WE/P             92
SOTON/OQ         86
A/5              82
CA               66
W.E.P.           60
A./5.            60
S.O./P.P.        54
P/PP             50
A/4.             46
SCO/W            36
SC               33
A.5.             29
AQ/4             29
LP               25
SC/A.3           20
A/S              19
C.A./SOTON       19
SC/A4            17
Fa               15
S.C./A.4.        13
S.W./PP          13
SO/C             13
STON/OQ.         12
W/C              11
S.P.             10
S.O.P.            9
SW/PP             9
A4.               7
AQ/3.             6
Name: Ticket, dtype: int64

Ticket
X             84781
A.             6420
C.A.           2615
STON/O         1508
A/5.            918
SOTON/O.Q.      719
PP              679
SC/PARIS        642
W./C.           623
PC              595
F.C.C.          541
A/5             420
CA.             368
STON/O2.        363
SC/AH           331
A/4             268
SOTON/O2        264
S.O.C.          231
C               227
SC/Paris        177
S.O./P.P.       177
CA              172
SOTON/OQ        172
W.E.P.          154
F.C.            131
S.C./PARIS      127
A./5.           122
WE/P            121
SC              106
A/4.            104
SCO/W            74
A.5.             72
P/PP             68
SC/A4            67
AQ/4             56
LP               41
Fa               37
STON/OQ.         37
S.W./PP          32
SC/A.3           31
C.A./SOTON       31
SW/PP            30
SO/C             28
A/S              28
AQ/3.            26
S.P.             24
S.C./A.4.        23
S.O.P.           21
A4.              20
W/C              20
Name: Ticket, dtype: int64

Encoding

Each group of variables is encoded differently.

label_cols = ['Name', 'Ticket', 'Sex','Pclass','Embarked']
onehot_cols = [ 'Cabin',]
numerical_cols = [ 'Age', 'SibSp', 'Parch', 'Fare']
# Label-encoding helper: takes a column c and returns its integer-encoded values.
def label_encoder(c):
    le = LabelEncoder()
    return le.fit_transform(c)
# StandardScaler(): removes the mean and scales the data to unit variance.
# Outliers shift the mean and the standard deviation, so they can strongly change the spread of the transformed data.
scaler = StandardScaler()

onehot_encoded_df = pd.get_dummies(all_df[onehot_cols])
label_encoded_df = all_df[label_cols].apply(label_encoder)
numerical_df = pd.DataFrame(scaler.fit_transform(all_df[numerical_cols]), columns=numerical_cols)
target_df = all_df[TARGET]

all_df = pd.concat([numerical_df, label_encoded_df,onehot_encoded_df, target_df], axis=1)
#all_df = pd.concat([numerical_df, label_encoded_df, target_df], axis=1)
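To make the three encodings concrete, here is a sketch that reproduces each of them by hand on a made-up frame, using only NumPy and pandas. It mirrors what `LabelEncoder` (sorted unique values numbered from 0) and `StandardScaler` (mean removed, divided by the population standard deviation) compute; it is not a replacement for them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Cabin": ["B", "X", "X", "C"],
        "Age": [20.0, 30.0, 40.0, 50.0],
    }
)

# Label encoding: sorted unique values are numbered 0..n-1.
classes, sex_codes = np.unique(df["Sex"].to_numpy(), return_inverse=True)

# One-hot encoding: one indicator column per category.
cabin_dummies = pd.get_dummies(df[["Cabin"]])

# Standardization: subtract the mean, divide by the population std (ddof=0).
age_scaled = (df["Age"] - df["Age"].mean()) / df["Age"].std(ddof=0)

print(classes.tolist(), sex_codes.tolist())  # ['female', 'male'] [1, 0, 0, 1]
print(list(cabin_dummies.columns))  # ['Cabin_B', 'Cabin_C', 'Cabin_X']
```

Label codes impose an arbitrary order on the categories, which tree models tolerate well; one-hot columns avoid that order at the cost of more features.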

Modeling

drop_list=['Survived','Parch']

Without pseudo-labeling

train = all_df.iloc[:100000, :]  # rows 0 to 99,999 (train)
test = all_df.iloc[100000:, :]  # rows from 100,000 onward (test)
# iloc indexes by integer position
test = test.drop('Survived', axis=1)  # drop the target column from test
model_results = pd.DataFrame()
folds = 5
y= train.loc[:,'Survived']
X= train.drop(drop_list,axis=1)

With pseudo-labeling

# y=all_df.loc[:,'Survived']
# X=all_df.drop('Survived',axis=1)
X_train, X_valid, y_train, y_valid = train_test_split(X,y,test_size=0.25, random_state=21)

from sklearn import metrics  
from sklearn.metrics import accuracy_score
import numpy as np
params = {
    'metric': 'binary_logloss',
    'n_estimators': N_ESTIMATORS,
    'objective': 'binary',
    'random_state': SEED,
    'learning_rate': 0.01,
    'min_child_samples': 150,
    'reg_alpha': 3e-5,
    'reg_lambda': 9e-2,
    'num_leaves': 20,
    'max_depth': 16,
    'colsample_bytree': 0.8,
    'subsample': 0.8,
    'subsample_freq': 2,
    'max_bin': 240,
}
lgbm_model = lgb.LGBMClassifier(**params)
lgbm_model.fit(X_train, y_train)
lgbm_pred = lgbm_model.predict(X_valid)

# Note: despite its name, lgbm_R2 holds classification accuracy, not R^2.
lgbm_R2 = metrics.accuracy_score(y_valid, lgbm_pred)
#lgbm_rmse = np.sqrt(mean_squared_error(lgbm_pred,y_valid))
print('R2 : ', lgbm_R2)
#print("RMSE : ", lgbm_rmse)
R2 :  0.78076
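The value labelled `R2` here comes from `accuracy_score`, so it is an accuracy. A sketch on made-up labels showing that the two quantities are not interchangeable: accuracy is the share of matching predictions, while R² compares squared errors against a constant-mean baseline.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy: share of predictions that match the label.
accuracy = np.mean(y_true == y_pred)

# R^2 for comparison: 1 - SS_res / SS_tot on the same 0/1 values.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(accuracy, r2)  # 0.75 0.0
```

Six of the eight toy predictions are right (accuracy 0.75), yet R² is 0.0 because the two errors cost as much squared error as always predicting the mean.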
print(len(X_train.columns))
print(X_train.columns)
17
Index(['Age', 'SibSp', 'Fare', 'Name', 'Ticket', 'Sex', 'Pclass', 'Embarked',
       'Cabin_A', 'Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F',
       'Cabin_G', 'Cabin_T', 'Cabin_X'],
      dtype='object')
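# Adjusted R^2 formula, applied here to the accuracy score computed above.
def cal_adjust_r2(r2):
    n = 80000  # hard-coded sample count; the 75/25 split above actually trains on 75,000 rows
    k = len(X_train.columns)  # number of features
    temp = (1 - r2) * (n - 1)
    temp2 = n - k - 1
    ad_r2 = 1 - (temp / temp2)
    return ad_r2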
ad_r2_lgbm=cal_adjust_r2(lgbm_R2)
print(ad_r2_lgbm)
0.7807134010152285
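The adjustment follows the usual formula, adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), with n samples and k features. A sketch with the sample count as a parameter instead of a constant (the name `adjusted_r2` and its arguments are chosen here for illustration), showing how little the correction changes when n is this large:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - ((1 - r2) * (n_samples - 1)) / (n_samples - n_features - 1)


# n = 80000 as hard-coded in the notebook, 17 features.
with_80000 = adjusted_r2(0.78076, 80000, 17)
# The 75/25 hold-out split of 100000 rows leaves 75000 training rows.
with_75000 = adjusted_r2(0.78076, 75000, 17)

print(with_80000, with_75000)
```

With 17 features and tens of thousands of rows the penalty only shows up around the fifth decimal place. Since the input is an accuracy and not an R², the result is best read as a rough illustration.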
#NOT Pseudo
train_kf_feature=train.drop(drop_list,axis=1)
train_kf_label=train.loc[:,'Survived']
#Pseudo
# train_kf_feature=all_df.drop(drop_list,axis=1)
# train_kf_label=all_df.loc[:,'Survived']
lgbm_temp = lgbm_model.booster_
n_iter = 0
kfold = StratifiedKFold(n_splits=5)
cv_accuracy = []
feature_importances = pd.DataFrame()

for train_idx, test_idx in kfold.split(train_kf_feature, train_kf_label):

    X_train = train_kf_feature.iloc[train_idx]
    X_test = train_kf_feature.iloc[test_idx]
    y_train, y_test = train_kf_label.iloc[train_idx], train_kf_label.iloc[test_idx]
    # Train
    lgbm_model.fit(X_train, y_train)
    # Predict
    fold_pred = lgbm_model.predict(X_test)

    # Accuracy
    n_iter += 1
    fold_accuracy = metrics.accuracy_score(y_test, fold_pred)
    print("\n Fold {} CV accuracy : {} , train size : {}, validation size : {} ".
          format(n_iter, fold_accuracy, X_train.shape[0], X_test.shape[0]))
    cv_accuracy.append(fold_accuracy)

    # Feature importance
    fi_tmp = pd.DataFrame()
    fi_tmp["feature"] = lgbm_temp.feature_name()
    fi_tmp["importance"] = lgbm_model.feature_importances_
    # pd.concat instead of DataFrame.append, which newer pandas versions no longer provide
    feature_importances = pd.concat([feature_importances, fi_tmp])

print('\n Mean validation accuracy : ', np.mean(cv_accuracy))

 Fold 1 CV accuracy : 0.78015 , train size : 80000, validation size : 20000 

 Fold 2 CV accuracy : 0.7824 , train size : 80000, validation size : 20000 

 Fold 3 CV accuracy : 0.78185 , train size : 80000, validation size : 20000 

 Fold 4 CV accuracy : 0.7816 , train size : 80000, validation size : 20000 

 Fold 5 CV accuracy : 0.7809 , train size : 80000, validation size : 20000 

 Mean validation accuracy :  0.78138
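What "stratified" means in this split, as a sketch on made-up labels: `StratifiedKFold` keeps the class ratio of the target roughly the same in every fold, so each validation fold sees about the same share of survivors as the full training set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy target with 40% positives; X only needs the right number of rows.
y = np.array([1] * 40 + [0] * 60)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5)
fold_sizes = []
fold_positive_rates = []
for train_idx, test_idx in skf.split(X, y):
    fold_sizes.append(len(test_idx))
    fold_positive_rates.append(float(y[test_idx].mean()))

print(fold_sizes)
print(fold_positive_rates)
```

Each of the five validation folds holds 20 rows, 8 of them positive, matching the 40% rate of the whole toy target.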
order = list(
    feature_importances.groupby("feature").mean().sort_values("importance", ascending=False).index
)
plt.figure(figsize=(10, 10))
sns.barplot(x="importance", y="feature", data=feature_importances, order=order)
plt.title("{} importance".format("LGBMClassifier"))
plt.tight_layout()

[Figure: LightGBM feature importance bar plot]
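How the per-fold importances turn into the plot order, as a sketch with made-up numbers: one small frame per fold is stacked with `pd.concat`, then averaged per feature and sorted. `DataFrame.append` was removed in pandas 2.0, which is why `pd.concat` is used here.

```python
import pandas as pd

features = ["Sex", "Fare", "Age"]
# Hypothetical importances, one list per fold.
fold_importances = [
    [120, 90, 60],
    [110, 95, 65],
    [130, 85, 55],
]

frames = [
    pd.DataFrame({"feature": features, "importance": values, "fold": fold})
    for fold, values in enumerate(fold_importances, start=1)
]
feature_importances = pd.concat(frames, ignore_index=True)

# Mean importance per feature, largest first; this is the bar order.
mean_importance = (
    feature_importances.groupby("feature")["importance"]
    .mean()
    .sort_values(ascending=False)
)
order = mean_importance.index.tolist()
print(order)  # ['Sex', 'Fare', 'Age']
```

Passing the stacked frame to `sns.barplot` lets seaborn draw the per-feature mean with an error bar across folds.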

CatBoost

params_cat = {
    'bootstrap_type': 'Poisson',
    'loss_function': 'Logloss',
    'eval_metric': 'Logloss',
    'random_seed': SEED,
    'task_type': 'GPU',
    'max_depth': 8,
    'learning_rate': 0.01,
    'n_estimators': N_ESTIMATORS,
    'max_bin': 280,
    'min_data_in_leaf': 64,
    'l2_leaf_reg': 0.01,
    'subsample': 0.8
}
# Recreate the train/validation split (identical to the earlier one, since X, y and random_state are unchanged)
X_train, X_valid, y_train, y_valid = train_test_split(X,y,test_size=0.25, random_state=21)

cat_model = ctb.CatBoostClassifier(**params_cat)
cat_model.fit(X_train, y_train, verbose=300)
cat_pred = cat_model.predict(X_valid)
print("\nAccuracy: ", metrics.accuracy_score(y_valid, cat_pred))
# Note: despite its name, cat_R2 holds classification accuracy, not R^2.
cat_R2 = metrics.accuracy_score(y_valid, cat_pred)
#lgbm_rmse = np.sqrt(mean_squared_error(lgbm_pred,y_valid))
print('R2 : ', cat_R2)
0:    learn: 0.6881875    total: 18ms    remaining: 18s
300:    learn: 0.4671082    total: 3.33s    remaining: 7.73s
600:    learn: 0.4580212    total: 6.44s    remaining: 4.28s
900:    learn: 0.4512272    total: 9.49s    remaining: 1.04s
999:    learn: 0.4491741    total: 10.5s    remaining: 0us

Accuracy:  0.78044
R2 :  0.78044
cv_accuracy = []
feature_importances = pd.DataFrame()

# n_iter is not reset here, so the fold counter continues from the LightGBM loop (folds 6 to 10).
for train_idx, test_idx in kfold.split(train_kf_feature, train_kf_label):

    X_train = train_kf_feature.iloc[train_idx]
    X_test = train_kf_feature.iloc[test_idx]
    y_train, y_test = train_kf_label.iloc[train_idx], train_kf_label.iloc[test_idx]
    # Train
    cat_model.fit(X_train, y_train, verbose=500)
    # Predict
    fold_pred = cat_model.predict(X_test)

    # Accuracy
    n_iter += 1
    fold_accuracy = metrics.accuracy_score(y_test, fold_pred)
    print("\n Fold {} CV accuracy : {} , train size : {}, validation size : {} ".
          format(n_iter, fold_accuracy, X_train.shape[0], X_test.shape[0]))
    cv_accuracy.append(fold_accuracy)

    # Feature importance. The call differs from LightGBM's.
    fi_tmp = pd.DataFrame()
    fi_tmp["feature"] = X_test.columns.to_list()
    fi_tmp["importance"] = cat_model.get_feature_importance()
    # pd.concat instead of DataFrame.append, which newer pandas versions no longer provide
    feature_importances = pd.concat([feature_importances, fi_tmp])

print('\n Mean validation accuracy : ', np.mean(cv_accuracy))
0:    learn: 0.6881430    total: 11.2ms    remaining: 11.2s
500:    learn: 0.4620724    total: 5.17s    remaining: 5.15s
999:    learn: 0.4513527    total: 10.2s    remaining: 0us

 Fold 6 CV accuracy : 0.77945 , train size : 80000, validation size : 20000 
0:    learn: 0.6881914    total: 12.4ms    remaining: 12.3s
500:    learn: 0.4635447    total: 5.02s    remaining: 5s
999:    learn: 0.4529141    total: 10.2s    remaining: 0us

 Fold 7 CV accuracy : 0.78335 , train size : 80000, validation size : 20000 
0:    learn: 0.6881970    total: 13.6ms    remaining: 13.6s
500:    learn: 0.4635994    total: 5.2s    remaining: 5.18s
999:    learn: 0.4529137    total: 10.3s    remaining: 0us

 Fold 8 CV accuracy : 0.78265 , train size : 80000, validation size : 20000 
0:    learn: 0.6882583    total: 11.2ms    remaining: 11.2s
500:    learn: 0.4622575    total: 5.08s    remaining: 5.06s
999:    learn: 0.4513804    total: 10.1s    remaining: 0us

 Fold 9 CV accuracy : 0.7821 , train size : 80000, validation size : 20000 
0:    learn: 0.6882789    total: 15.4ms    remaining: 15.3s
500:    learn: 0.4630108    total: 5.1s    remaining: 5.08s
999:    learn: 0.4522854    total: 10.1s    remaining: 0us

 Fold 10 CV accuracy : 0.7802 , train size : 80000, validation size : 20000 

 Mean validation accuracy :  0.78155
# just to get ideas to improve
order = list(feature_importances.groupby("feature").mean().sort_values("importance", ascending=False).index)
plt.figure(figsize=(10, 10))
sns.barplot(x="importance", y="feature", data=feature_importances, order=order)
plt.title("{} importance".format("CatBoostClassifier"))
plt.tight_layout()

[Figure: CatBoost feature importance bar plot]

Submission

def create_submission(model, test, test_passenger_id, model_name):
    y_pred_test = model.predict_proba(test)[:, 1]
    submission = pd.DataFrame(
        {
            'PassengerId': test_passenger_id,
            'Survived': (y_pred_test >= 0.5).astype(int),
        }
    )
    submission.to_csv(f"submission_{model_name}.csv", index=False)

    return y_pred_test
test_df.head()

   PassengerId  Pclass              Name     Sex   Age  SibSp  Parch  Ticket   Fare   Cabin  Embarked
0       100000       3  Holliday, Daniel    male  19.0      0      0   24745  63.01     NaN         S
1       100001       3  Nguyen, Lorraine  female  53.0      0      0   13264   5.81     NaN         S
2       100002       1   Harris, Heather  female  19.0      0      0   25990  38.91  B15315         C
3       100003       2      Larsen, Eric    male  25.0      0      0  314011  12.93     NaN         S
4       100004       1     Cleary, Sarah  female  17.0      0      2   26203  26.89  B22515         C
#X_test=test.drop('Pclass',axis=1)
test = all_df.iloc[100000:, :]  # rows from 100,000 onward (test)
X_test = test.drop(drop_list, axis=1)
X_test.head()

              Age      SibSp       Fare   Name  Ticket  Sex  Pclass  Embarked  Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_X
100000  -0.937422  -0.539572   0.949786  10830      49    1       2         2        0        0        0        0        0        0        0        0        1
100001   1.123570  -0.539572  -1.273379  17134      49    0       2         2        0        0        0        0        0        0        0        0        1
100002  -0.937422  -0.539572   0.481059   9978      49    0       0         0        0        1        0        0        0        0        0        0        0
100003  -0.573717  -0.539572  -0.563310  13303      49    1       1         2        0        0        0        0        0        0        0        0        1
100004  -1.058657  -0.539572   0.125497   4406      49    0       0         0        0        1        0        0        0        0        0        0        0
test_pred_lightgbm = create_submission(
    lgbm_model, X_test, test_df["PassengerId"], "lightgbm"
)
test_pred_catboost = create_submission(
    cat_model, X_test, test_df["PassengerId"], "catboost"
)
test_pred_merged = (
    test_pred_lightgbm +
    test_pred_catboost
)
test_pred_merged = np.round(test_pred_merged / 2)
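The merge step is a simple soft vote: average the two predicted probabilities, then turn the average into a 0/1 label. A sketch with made-up probabilities, which also shows one small difference between `np.round` and the `>= 0.5` threshold used in `create_submission`: NumPy rounds an exact 0.5 to the nearest even number, which is 0.

```python
import numpy as np

# Hypothetical survival probabilities from the two models.
proba_lightgbm = np.array([0.90, 0.20, 0.75, 0.40])
proba_catboost = np.array([0.80, 0.10, 0.25, 0.55])

mean_proba = (proba_lightgbm + proba_catboost) / 2

by_round = np.round(mean_proba).astype(int)
by_threshold = (mean_proba >= 0.5).astype(int)

print(by_round.tolist())  # [1, 0, 0, 0]
print(by_threshold.tolist())  # [1, 0, 1, 0]
```

The two rules disagree only on the third passenger, whose averaged probability is exactly 0.5. On real model outputs such ties are rare, so the choice seldom matters, but an explicit threshold makes the rule easier to read.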
submission = pd.DataFrame(
    {
        'PassengerId': test_df["PassengerId"],
        'Survived': test_pred_merged.astype(int),
    }
)
submission.to_csv("submission_merged3.csv", index=False)

Score

kaggle public score : 0.80354

Author

이현정

Posted on

2021-04-27

Updated on

2021-04-28
