[ML] Exploratory Data Analysis (EDA)

https://ppcexpo.com/blog/exploratory-data-analysis

Exploratory Data Analysis (EDA)

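EDA (Exploratory Data Analysis) is the first step of analyzing data in an experiment or data project.
Through EDA, analysts come to understand the data and form hypotheses.
EDA is also used to check given assumptions before moving on to more formal analysis methods.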

Main components of EDA

Descriptive Statistics:
- Measures of Central Tendency: mean, median, mode.
- Measures of Dispersion: variance, standard deviation (std), range, interquartile range (IQR).
- Skewness and Kurtosis.
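
As a minimal sketch of the descriptive statistics listed above, the following computes each of them with pandas. The sample values and variable names are made up for illustration; they do not come from any real data set.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
values = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 7, 20], name="x")

# Measures of central tendency
mean = values.mean()
median = values.median()
mode = values.mode().tolist()  # mode() may return more than one value

# Measures of dispersion
variance = values.var()        # sample variance (ddof=1 by default)
std = values.std()             # sample standard deviation (ddof=1 by default)
value_range = values.max() - values.min()
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1                  # interquartile range

# Shape of the distribution
skewness = values.skew()       # positive values suggest a longer right tail
kurtosis = values.kurt()       # excess kurtosis (about 0 for a normal distribution)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.3f}, std={std:.3f}, range={value_range}, IQR={iqr}")
print(f"skewness={skewness:.3f}, kurtosis={kurtosis:.3f}")
```

Here the single large value (20) pulls the mean above the median, which is the kind of asymmetry that skewness is meant to flag.
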
Data Visualization:
- Histogram: shows the distribution of the data.
- Box Plots: show outliers and the spread of the data.
- Scatter Plots: reveal relationships (e.g. correlation) between variables.
- Bar Charts and Pie Charts: used for categorical data.
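
The plot types above can be sketched with matplotlib as follows. This is only an illustration: the data is synthetic, and the column names (`height`, `weight`, `group`) are assumptions chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data, generated for illustration only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "height": rng.normal(170, 10, n),
    "group": rng.choice(["A", "B", "C"], size=n),
})
df["weight"] = 0.5 * df["height"] + rng.normal(0, 5, n)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: distribution of a numerical variable
axes[0, 0].hist(df["height"].to_numpy(), bins=20)
axes[0, 0].set_title("Histogram of height")

# Box plot: spread of the data and potential outliers
axes[0, 1].boxplot(df["weight"].to_numpy())
axes[0, 1].set_title("Box plot of weight")

# Scatter plot: relationship between two numerical variables
axes[1, 0].scatter(df["height"], df["weight"], s=10)
axes[1, 0].set_xlabel("height")
axes[1, 0].set_ylabel("weight")
axes[1, 0].set_title("Scatter plot: height vs. weight")

# Bar chart: counts per category for a categorical variable
counts = df["group"].value_counts()
axes[1, 1].bar(counts.index.tolist(), counts.to_numpy())
axes[1, 1].set_title("Bar chart of group counts")

fig.tight_layout()
# plt.show()  # uncomment in an interactive session
```
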
Data Quality Assessment:
- Check for missing values.
- Identify errors and outliers in the data.
- Assess the data type of each variable (e.g. numerical, categorical).
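
A rough sketch of these quality checks with pandas is shown below. The DataFrame is a made-up example, and the 1.5 × IQR rule used for outliers is one common convention, not the only way to flag them.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in two columns and one suspicious income.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52, 38],
    "city": ["Seoul", "Busan", "Seoul", None, "Daegu", "Seoul"],
    "income": [3600, 4100, 3900, 4500, 99000, 4000],
})

# Missing values per column
missing_per_column = df.isna().sum()

# Data type of each variable
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Outliers by the 1.5 * IQR rule (an assumed convention for this sketch)
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
outliers = df.loc[is_outlier, "income"].tolist()

print(missing_per_column)
print("numeric:", numeric_cols, "categorical:", categorical_cols)
print("income outliers:", outliers)
```
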
Correlation Analysis:
- Check the strength and direction of the linear relationship between variables.
- Quantify the degree of (linear) correlation with a correlation coefficient.
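
As an illustration, the Pearson correlation coefficient can be computed for every pair of columns with `DataFrame.corr`. The data below is synthetic and the column names are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Synthetic data: y_pos rises with x, y_neg falls with x, noise is unrelated.
rng = np.random.default_rng(42)
x = rng.normal(size=300)
df = pd.DataFrame({
    "x": x,
    "y_pos": 2.0 * x + rng.normal(scale=0.5, size=300),
    "y_neg": -1.0 * x + rng.normal(scale=0.5, size=300),
    "noise": rng.normal(size=300),
})

# Pairwise Pearson correlation coefficients, each in [-1, 1].
# The sign gives the direction, the magnitude gives the strength
# of the linear relationship.
corr = df.corr(method="pearson")
print(corr.round(2))
```

Note that a coefficient near 0 only rules out a linear relationship; a scatter plot is still worth checking for non-linear patterns.
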
Initial Data Preparation (~Data Preprocessing):
- Data Cleaning (handling missing values, removing duplicates).
- Data Transformation (normalization, scaling).
- Feature Engineering (and/or Feature Selection).
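
The preparation steps above might look like the following in pandas. This is a sketch under assumptions: the data is invented, median imputation is just one way to handle missing values, and BMI is used only as a familiar example of a derived feature.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing height and one duplicated row.
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 180.0, 170.0],
    "weight_kg": [65.0, 55.0, 70.0, 80.0, 65.0],
})

# Data cleaning: remove duplicates, then fill missing values with the median
clean = df.drop_duplicates().copy()
clean["height_cm"] = clean["height_cm"].fillna(clean["height_cm"].median())

# Feature engineering: derive a new variable from existing ones
features = clean.assign(
    bmi=clean["weight_kg"] / (clean["height_cm"] / 100) ** 2
)

# Data transformation
# Min-max normalization: rescale every column to [0, 1]
normalized = (features - features.min()) / (features.max() - features.min())
# Standardization (z-score): zero mean and unit variance per column
standardized = (features - features.mean()) / features.std(ddof=0)

print(features.round(2))
print(normalized.round(2))
print(standardized.round(2))
```
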

Purpose of EDA

The main goals of EDA are:

- Maximize insight into the data set.
- Uncover the underlying structure.
- Extract important factors (~independent variables).
- Detect outliers and anomalies.
- Test underlying assumptions.

Related reading

2025.05.16 - [Python] - [ML] Summary of pandas.DataFrame methods suited to EDA

ds31x.tistory.com

https://dsaint31.tistory.com/256

[Statistics] Moment (Probability Moment)

1. Moment (Probability Moment): in statistics, a moment is a characteristic value computed from a probability distribution. It generalizes the representative values (or statistics) of a random variable that are obtained from its probability distribution.

dsaint31.tistory.com

https://dsaint31.tistory.com/818

[Statistics] Tail, Head, and Distribution (w/ Moment)

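A summary of terms that are easy to confuse when discussing probability distributions: tail and head, heavy-tailed and light-tailed, and right-skewed and left-skewed. 1. Head: the central part of a data distribution; treating it as the location of the median is almost always correct. Using the mean ...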