1. Data Preparation
Data is typically split into three parts: training, validation, and test.
The best practice is to keep a test set that is separate from the training set. The training set is then split further into training and validation. Over time, the training set keeps growing. Keeping the test set separate and fixed from start to finish makes comparisons between models more accurate.
So at this preparation step, we only need to care about splitting the data into sets.
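The split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the 20% fractions, and the fixed seed are assumptions made for the example, not values from these notes.

```python
import random


def split_dataset(rows, test_frac=0.2, val_frac=0.2, seed=42):
    """Hold out a test set first, then split the rest into training and validation."""
    shuffled = list(rows)
    # A fixed seed keeps the shuffle reproducible, so the test set does not change between runs.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    test = shuffled[:n_test]
    remainder = shuffled[n_test:]

    # The validation set is carved out of the remaining training pool, not out of the test set.
    n_val = int(len(remainder) * val_frac)
    val = remainder[:n_val]
    train = remainder[n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

When new data arrives, one option is to add it only to the training pool and redo the training/validation split, leaving the stored test set untouched.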
2. Data Exploration
Exploratory data analysis (EDA) is a method that helps us understand the meaning of the data we are working with.
Why EDA matters:
- It is more than just visualizing charts.
- It draws out insights.
- It is not a one-time evaluation: as our data grows, we need to revisit it to assess factors such as distribution shift, …
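As one way to revisit the data as it grows, a very rough check for distribution shift on a single numeric feature is to compare the mean of a new batch with the mean of a reference batch. The sketch below is an assumption-laden illustration: the function name `mean_shift_in_std_units` and the sample values are hypothetical, and a real check would usually look at more than the mean.

```python
import statistics


def mean_shift_in_std_units(reference, current):
    """Distance between the two means, measured in reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.pstdev(reference)
    if ref_std == 0:
        raise ValueError("reference values are constant; the shift is undefined")
    return abs(statistics.mean(current) - ref_mean) / ref_std


# Hypothetical feature values: an earlier batch and two later batches.
reference = [10.0, 12.0, 11.0, 13.0, 9.0]
similar = [11.0, 10.0, 12.0, 9.0, 13.0]
shifted = [20.0, 22.0, 21.0, 23.0, 19.0]

print(mean_shift_in_std_units(reference, similar))
print(mean_shift_in_std_units(reference, shifted))
```

A value near zero suggests the feature's center has not moved; a large value is a cue to look at the data again rather than a verdict on its own.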
3. Data Preprocessing
Data preprocessing consists of two kinds of work: preparation and transformation.
Preparation covers organizing and cleaning the data. Transformation covers feature encoding and feature engineering.
In more detail, the transformation step includes the following operations:
3.1. Scaling
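A few things to know:
- Scaling is required for models that are sensitive to the scale of their input data
- You have to choose which features to scale
- Standardization: a technique that rescales the input values so that they have mean = 0 and std = 1
- Min-max: rescales the values into a fixed range, typically [0, 1]
- Binning: groups continuous values into discrete intervals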
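The three techniques listed above can be sketched in plain Python. This is a minimal illustration that assumes the feature is not constant (otherwise the divisions below fail); the function names and the sample `ages` values are hypothetical, and equal-width bins are just one possible binning scheme.

```python
import statistics


def standardize(values):
    """Rescale so the result has mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def bin_equal_width(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals, numbered from 0."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum would otherwise fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 30, 45, 60]
print(standardize(ages))
print(min_max_scale(ages))
print(bin_equal_width(ages, 3))
```

To avoid leaking information, the statistics used for scaling (mean, std, min, max) are normally computed on the training set and then reused unchanged on the validation and test sets.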
3.2. Encoding
Encoding lets us represent the data efficiently while preserving its signal, so that patterns can still be learned from it. Typical methods include:
- label
- one-hot
- embeddings
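Label and one-hot encoding can be sketched as below. This is a minimal illustration with hypothetical function names and a made-up `colors` feature; assigning ids in sorted order is an arbitrary choice for the example.

```python
def fit_label_encoder(categories):
    """Map each distinct category to an integer id, in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(categories)))}


def label_encode(categories, mapping):
    """Replace each category with its integer id."""
    return [mapping[c] for c in categories]


def one_hot_encode(categories, mapping):
    """Replace each category with a 0/1 vector that has a single 1 at its id."""
    rows = []
    for c in categories:
        row = [0] * len(mapping)
        row[mapping[c]] = 1
        rows.append(row)
    return rows


colors = ["red", "green", "blue", "green"]
mapping = fit_label_encoder(colors)
labels = label_encode(colors, mapping)
one_hot = one_hot_encode(colors, mapping)
print(mapping)
print(labels)
print(one_hot)
```

An embedding goes one step further: each label id indexes a dense vector that is usually learned during training, which is not shown here.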