
Must-know commands for the Big Data Analysis Engineer exam, collected for my own reference

by 공불러 2023. 6. 24.

Type 1

Min-max scaling: MinMaxScaler()

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Rescales the column to the default [0, 1] range
scaler.fit_transform(data[['qsec']])
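To see what the scaler computes, here is a minimal sketch of the min-max formula in plain Python. The helper name `min_max_scale` and the sample values are made up for illustration; it assumes the default target range of [0, 1].

```python
def min_max_scale(values):
    """Rescale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: return zeros instead of dividing by zero
        # (an illustrative choice for this sketch)
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


sample = [10.0, 15.0, 20.0]  # made-up values
scaled = min_max_scale(sample)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.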

Standardization: StandardScaler | scaler.fit_transform(data)

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('data/mtcars.csv', index_col=0)

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data[['qsec']])
print(scaled_data)
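As a rough check on what standardization does, the sketch below applies (x - mean) / std by hand. The helper name `standardize` and the sample values are assumptions for illustration. It uses the population standard deviation (dividing by n), which is what StandardScaler uses.

```python
import statistics


def standardize(values):
    """Return (x - mean) / std for each value, using the population std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std, i.e. ddof=0
    return [(x - mean) / std for x in values]


sample = [2.0, 4.0, 6.0]  # made-up values
z = standardize(sample)
print(z)
```

After the transformation the values have mean 0 and population standard deviation 1.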

Z-score

from scipy.stats import zscore

# Convert 'data' to z-scores
z_scores = zscore(data)

Standard deviation

import numpy as np

# Standard deviation of 'data' (np.std defaults to ddof=0, the population formula)
std_dev = np.std(data)
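One thing that is easy to trip over: np.std divides by n by default (ddof=0), while the pandas .std() method divides by n - 1 (ddof=1). The sketch below shows the two formulas side by side using only the standard library; the sample values are made up.

```python
import statistics

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values

population_std = statistics.pstdev(sample)  # divides by n (like ddof=0)
sample_std = statistics.stdev(sample)       # divides by n - 1 (like ddof=1)

print(population_std, sample_std)
```

If an exam question asks for the sample standard deviation, check which formula the call you picked actually uses.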

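Selecting data

# Rows in the top 60% of column 'a' (at or above the 40th percentile)
top_60_percent = df[df['a'] >= df['a'].quantile(0.4)]

# Rows at or below the mean of column 'a'
below_mean = df[df['a'] <= df['a'].mean()]

# Rows where column 'b' equals 'gk'
gk_data = df[df['b'] == 'gk']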
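Here is the same filtering run on a tiny DataFrame so the results can be checked by eye. The data is made up for illustration; only the column names 'a' and 'b' come from the snippets above.

```python
import pandas as pd

# Made-up data with the same column names as above
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': ['gk', 'df', 'gk', 'mf', 'fw'],
})

cutoff = df['a'].quantile(0.4)              # 40th percentile of 'a'
top_60_percent = df[df['a'] >= cutoff]      # rows at or above the cutoff
below_mean = df[df['a'] <= df['a'].mean()]  # rows at or below the mean
gk_data = df[df['b'] == 'gk']               # rows where 'b' is 'gk'

print(len(top_60_percent), len(below_mean), len(gk_data))
```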

 

 

 

Replacing missing values with 0: data['column_to_fill'].fillna(0)

data['column_to_fill'] = data['column_to_fill'].fillna(0)
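The assignment matters because fillna returns a new object rather than changing the column in place. A small sketch with a made-up 'score' column:

```python
import pandas as pd

data = pd.DataFrame({'score': [1.0, None, 3.0]})  # made-up data

filled = data['score'].fillna(0)  # returns a new Series
# The original column still has its missing value at this point
still_missing = int(data['score'].isnull().sum())

data['score'] = filled  # assign back to keep the change
print(still_missing, data['score'].tolist())
```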

 

Type 2

Inspecting the data

X.shape, y.shape, test.shape
X.head()

Missing values

X.isnull().sum()

Preprocessing

# Replace missing values
train_data = train_data.fillna(0)

Date conversion: to_datetime | date_obj.month

import pandas as pd

# date_str: any date string, e.g. '2018-01-15' (illustrative value)
date_obj = pd.to_datetime(date_str)

year = date_obj.year
month = date_obj.month
day = date_obj.day

data = pd.read_csv("data/nf.csv")
data['date_added'] = pd.to_datetime(data['date_added'])  # remember pd.to_datetime
date = data['date_added']

con1 = data['date_added'].dt.year == 2018
con2 = data['date_added'].dt.month == 1
con3 = data['country'] == "United Kingdom"

result = data[con1 & con2 & con3 ]
print(len(result))
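The same date filter can be tried on a few made-up rows. The column names match the snippet above, but the values are invented. If the conditions are written inline instead of stored in variables, each comparison needs its own parentheses because & binds more tightly than ==.

```python
import pandas as pd

# Made-up rows; only the column names come from the snippet above
data = pd.DataFrame({
    'date_added': ['2018-01-05', '2018-01-20', '2018-03-01', '2019-01-03'],
    'country': ['United Kingdom', 'United States',
                'United Kingdom', 'United Kingdom'],
})
data['date_added'] = pd.to_datetime(data['date_added'])

# Inline version of con1 & con2 & con3, with parentheses around each condition
result = data[
    (data['date_added'].dt.year == 2018)
    & (data['date_added'].dt.month == 1)
    & (data['country'] == 'United Kingdom')
]
print(len(result))
```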

Dropping unneeded columns

X = X.drop(['cust_id'], axis=1)
cust_id = test.pop('cust_id')

 

Feature engineering: LabelEncoder | fit_transform

# Encode categorical variables (one-hot)
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
train_data_encoded = encoder.fit_transform(train_data)

# Encode categorical variables (integer labels)
from sklearn.preprocessing import LabelEncoder
cols = ['주구매상품', '주구매지점']
for col in cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    test[col] = le.transform(test[col])

X.head()
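To make the fit/transform split concrete, here is a plain-Python sketch of what label encoding does. The helper names and sample categories are made up. Like LabelEncoder, it numbers the categories in sorted order and raises an error when the test data contains a category that was not seen during fitting.

```python
def fit_label_mapping(values):
    """Map each distinct category to an integer, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(values)))}


def transform_labels(values, mapping):
    """Apply a fitted mapping; categories missing from it raise an error."""
    unseen = sorted(set(values) - set(mapping))
    if unseen:
        raise ValueError(f"unseen labels: {unseen}")
    return [mapping[v] for v in values]


train = ['snack', 'bag', 'snack', 'shoes']  # made-up categories
test = ['bag', 'shoes']

mapping = fit_label_mapping(train)           # fit on the training data only
train_encoded = transform_labels(train, mapping)
test_encoded = transform_labels(test, mapping)
print(mapping, train_encoded, test_encoded)
```

This is why the loop above calls fit_transform on X and only transform on test: both must share one mapping.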

 

 

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Modeling, hyperparameters, and ensembles: model.fit | model.score | model.predict

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=2022)
model.fit(X, y['gender'])
print(model.score(X, y['gender']))
predictions = model.predict_proba(test)



# Recap of the arguments and calls to remember
# RandomForestClassifier(n_estimators=100, max_depth=5, random_state=2022)
model.fit(X, y['gender'])
model.score(X, y['gender'])
model.predict_proba(test)

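Selected data (categorical target)

# Column 1 holds the predicted probability of the second class in model.classes_
predictions[:,1]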

Classification algorithm

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(train_data, train_labels)

ROC AUC score: from sklearn.metrics import roc_auc_score

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_tr, y_tr)
pred = model.predict_proba(X_val)


from sklearn.metrics import roc_auc_score
roc_auc_score(y_val, pred[:,1])
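For intuition about what the score measures: ROC AUC is the chance that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes that directly in plain Python. The helper name `pairwise_auc` and the sample values are made up, and this brute-force version is only meant for small inputs.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_val = [0, 0, 1, 1]               # made-up labels
pred_pos = [0.1, 0.4, 0.35, 0.8]   # made-up positive-class probabilities
auc = pairwise_auc(y_val, pred_pos)
print(auc)
```

Because the score depends on ranking, it needs probabilities such as pred[:,1] rather than hard 0/1 predictions.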

Exporting to CSV

output = pd.DataFrame({'cust_id': cust_id, 'gender': predictions[:,1]})
output.to_csv("123456789.csv", index=False)

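For regression

# Time conversion
x_train['datetime'] = pd.to_datetime(x_train['datetime'])

# Combine features and target to inspect them together
data_check = pd.concat([x_train, y_train], axis=1)

# Check relationships
print(data_check.groupby('계절')['count'].sum())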

 

Regression

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # Root mean squared error: square root of the MSE
    return mean_squared_error(y_true, y_pred) ** 0.5
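A worked example of the RMSE arithmetic, written without scikit-learn so each step is visible. The numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Square the errors, average them, then take the square root."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [3.0, 5.0, 8.0]      # made-up targets
predicted = [1.0, 5.0, 10.0]  # made-up predictions

# Squared errors are 4, 0, 4, so the mean is 8/3
score = rmse(actual, predicted)
print(score)
```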

 

Validation

from sklearn.metrics import roc_auc_score

import pandas as pd
from sklearn.linear_model import LogisticRegression

Type 3

t-test

Independent samples: ttest_ind

from scipy.stats import ttest_ind
t_statistic, p_value = ttest_ind(group1, group2)

Paired samples (matched groups): ttest_rel

from scipy.stats import ttest_rel
t_statistic, p_value = ttest_rel(before, after)

One sample: ttest_1samp

from scipy.stats import ttest_1samp
t_statistic, p_value = ttest_1samp(data, 5)
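A handy way to remember how the paired and one-sample tests relate: a paired t-test on (before, after) is a one-sample t-test on the differences before - after against 0. The sketch below computes only the t statistic by hand (no p-value). The helper name `one_sample_t` and the sample values are made up.

```python
import math
import statistics


def one_sample_t(values, popmean):
    """t = (mean - popmean) / (sample std / sqrt(n))."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)  # n - 1 formula
    return (statistics.mean(values) - popmean) / standard_error


before = [10.0, 12.0, 9.0, 11.0]  # made-up measurements
after = [12.0, 14.0, 12.0, 12.0]

# Paired test statistic: one-sample t on the differences against 0
diffs = [b - a for b, a in zip(before, after)]
t_paired = one_sample_t(diffs, 0.0)
print(t_paired)
```

A negative statistic here means the "before" values are lower on average than the "after" values.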

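F-test

# One-way ANOVA
from scipy.stats import f_oneway
f_statistic, p_value = f_oneway(group1, group2, group3)

# Two-way ANOVA
# f_oneway tests a single factor only, so extra arguments are just treated as more groups.
# One option is statsmodels; df, 'value', 'factor1', and 'factor2' are placeholder names.
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('value ~ C(factor1) * C(factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)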

Correlation analysis

from scipy.stats import pearsonr
correlation, p_value = pearsonr(data['x'], data['y'])
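For reference, the correlation coefficient itself is the covariance divided by the product of the two standard deviations. A plain-Python sketch follows; the helper name `pearson_r` and the sample values are made up, and it returns only the coefficient, not a p-value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-deviation over the product of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


x = [1.0, 2.0, 3.0, 4.0]  # made-up values
r_perfect = pearson_r(x, [2.0, 4.0, 6.0, 8.0])   # y rises with x
r_negative = pearson_r(x, [8.0, 6.0, 4.0, 2.0])  # y falls as x rises
print(r_perfect, r_negative)
```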

Regression analysis

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

Shapiro-Wilk test

from scipy import stats

data = [75, 83, 81, 92, 68, 77, 78, 80, 85, 95, 79, 89]

# Run the Shapiro-Wilk test
statistic, p_value = stats.shapiro(data)
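Reading the result: the null hypothesis of the Shapiro-Wilk test is that the data come from a normal distribution, so a small p-value is evidence against normality. A tiny helper for the decision, assuming a significance level of 0.05 (the helper name and the threshold are assumptions, so use whatever level the question specifies):

```python
def interpret_normality(p_value, alpha=0.05):
    """Turn a Shapiro-Wilk p-value into a decision at significance level alpha."""
    if p_value < alpha:
        return "reject normality"
    return "fail to reject normality"


# Made-up p-values for illustration
print(interpret_normality(0.01))
print(interpret_normality(0.30))
```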

 

