Project 2
Regression and Classification with the Ames Housing Data
This project uses the Ames housing data, recently made available on Kaggle.
Data science often involves modelling and prediction based on a dataset. This project explores techniques such as regression and classification. The Python packages used include:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- imbalanced-learn (imblearn)
1. Estimating the value of homes from fixed characteristics
1.1 Overview of dataset using pandas .describe()
house_res.describe(include='all')
1.2 Fixed characteristics as observed from the data description file:
1.3 Feature engineering
A good practice before any modelling or classification is to examine your dataset for salient features that can be aggregated, and to factorise or binarise (one-hot encode) qualitative data. A brief feature engineering workflow may be as follows (a sketch is shown after the list):
- Sum columns that can be aggregated
- Binarise columns
- Drop columns with low variance
- Get dummies for categorical columns
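A minimal sketch of such a workflow, assuming the DataFrame is named house_res1 as in the later examples (the specific columns aggregated and binarised here are illustrative choices, not the project's exact steps):

# aggregate related porch-area columns into a single total (illustrative)
house_res1['TotalPorchSF'] = (house_res1['OpenPorchSF'] + house_res1['EnclosedPorch']
                              + house_res1['3SsnPorch'] + house_res1['ScreenPorch'])

# binarise: 1 if the house has a pool, 0 otherwise
house_res1['HasPool'] = (house_res1['PoolArea'] > 0).astype(int)

# one-hot encode the remaining categorical columns
house_res1 = pd.get_dummies(house_res1, drop_first=True)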
A useful feature engineering step is to remove quantitative columns with near-zero variance: such a column carries little information and has minimal impact on your prediction or classification. Here is sample Python code to identify columns with near-zero variance.
# near zero variance
def nearZeroVariance(X, freqCut=95 / 5, uniqueCut=10):
    '''
    Determine predictors with near zero or zero variance.

    Inputs:
        X: pandas data frame
        freqCut: the cutoff for the ratio of the most common value to the
                 second most common value
        uniqueCut: the cutoff for the percentage of distinct values out of
                   the number of total samples

    Returns a tuple containing two lists of column names: (zeroVar, nzVar)
    '''
    colNames = X.columns.values.tolist()
    freqRatio = dict()
    uniquePct = dict()

    for names in colNames:
        # value counts sorted from most to least frequent
        counts = (
            X[names]
            .value_counts()
            .sort_values(ascending=False)
            .values
        )
        # a column with a single unique value has zero variance
        if len(counts) == 1:
            freqRatio[names] = -1
            uniquePct[names] = (len(counts) / len(X[names])) * 100
            continue
        freqRatio[names] = counts[0] / counts[1]
        uniquePct[names] = (len(counts) / len(X[names])) * 100

    zeroVar = list()
    nzVar = list()
    for k in uniquePct.keys():
        if freqRatio[k] == -1:
            zeroVar.append(k)
        if uniquePct[k] < uniqueCut and freqRatio[k] > freqCut:
            nzVar.append(k)

    return zeroVar, nzVar
zerovartest = house_res1.loc[:,['LotFrontage','LotArea','MasVnrArea', 'TotalBsmtSF','GrLivArea',\
'GarageArea','WoodDeckSF', 'PoolArea', 'MiscVal']]
nearZeroVariance(zerovartest)
([], ['MasVnrArea', 'PoolArea', 'MiscVal'])
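Columns flagged this way can then be dropped before modelling; a minimal sketch (whether to drop every flagged column is a judgement call for your own analysis):

# drop the zero- and near-zero-variance columns identified above
zeroVar, nzVar = nearZeroVariance(zerovartest)
house_res1 = house_res1.drop(columns=zeroVar + nzVar)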
2. Preparing your dataset for prediction/classification
2.1 Train-test split and normalisation
Before running your model, it is important to split your data into train and test sets. It is also important to normalise data columns where necessary. In our case, we can use scikit-learn's train_test_split to split the dataset and its StandardScaler to normalise the features.
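A minimal sketch of this step, assuming X holds the engineered features and y the target (variable names and the split proportion are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

scaler = StandardScaler()
Xs_train = scaler.fit_transform(X_train)  # fit the scaler on the training data only
Xs_test = scaler.transform(X_test)        # apply the same scaling to the test data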
# Lasso regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

lasso = Lasso(alpha=optimal_lasso.alpha_)
lasso_scores = cross_val_score(lasso, Xs, y, cv=10)

print(lasso_scores)
print(np.mean(lasso_scores))
[ 0.88388639 0.8378792 0.83228373 0.73855977 0.7982018 0.7768981
0.82231355 0.77187934 0.51414779 0.83227006]
0.780831973138
3. Model for regression
3.1 Lasso regression to predict house prices
Lasso regression applies L1 regularisation, shrinking the coefficients of columns that are not informative for predicting the variable of interest towards zero. This is particularly useful for datasets with a large number of (often one-hot encoded) qualitative features.
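The optimal_lasso object referenced above is assumed to come from a cross-validated search over regularisation strengths; a minimal sketch using scikit-learn's LassoCV (settings are illustrative):

from sklearn.linear_model import LassoCV

# search candidate alphas with 10-fold cross-validation
optimal_lasso = LassoCV(n_alphas=200, cv=10)
optimal_lasso.fit(Xs, y)
print(optimal_lasso.alpha_)  # the alpha reused in Lasso(alpha=...) above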
lasso.fit(Xs, y)
Lasso(alpha=1098.3164314643716, copy_X=True, fit_intercept=True,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
# top 20 features after lasso
lasso_coefs = pd.DataFrame({'variable':X.columns,
'coef':lasso.coef_,
'abs_coef':np.abs(lasso.coef_)})
lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)
lasso_coefs.head(20)
|    | abs_coef     | coef          | variable             |
|----|--------------|---------------|----------------------|
| 3  | 27801.501124 | 27801.501124  | GrLivArea            |
| 52 | 13227.923817 | 13227.923817  | Neighborhood_NridgHt |
| 93 | 13146.851630 | 13146.851630  | GarageCars_3         |
| 2  | 9522.574031  | 9522.574031   | TotalBsmtSF          |
| 58 | 7605.301512  | 7605.301512   | Neighborhood_StoneBr |
| 51 | 7503.234427  | 7503.234427   | Neighborhood_NoRidge |
| 57 | 6852.984978  | 6852.984978   | Neighborhood_Somerst |
| 7  | 5878.558372  | -5878.558372  | BsmtFullBath_0       |
| 19 | 5098.838853  | -5098.838853  | MSSubClass_90        |
| 82 | 4719.729393  | 4719.729393   | TotRmsAbvGrd_10      |
| 53 | 4426.366451  | -4426.366451  | Neighborhood_OldTown |
| 4  | 4359.232517  | 4359.232517   | GarageArea           |
| 72 | 4212.180965  | -4212.180965  | BedroomAbvGr_5       |
| 43 | 4058.672481  | -4058.672481  | Neighborhood_Edwards |
| 8  | 3902.417073  | -3902.417073  | fullbath_2           |
| 21 | 3650.970712  | -3650.970712  | MSSubClass_160       |
| 5  | 3590.483741  | 3590.483741   | WoodDeckSF           |
| 1  | 3211.026312  | 3211.026312   | LotArea              |
| 41 | 2830.346250  | 2830.346250   | Neighborhood_CollgCr |
| 9  | 2691.396233  | -2691.396233  | halfbath_0           |
4. Classify records into abnormal or normal sale
4.1 Caveat: Imbalanced dataset
In some cases, your dataset may be imbalanced. This has significant implications for model building: in classification, over-representation of one class can bias the classifier towards the majority class. To mitigate this, it is advisable to resample the data so that the classes are better balanced.
For this dataset, we will explore two resampling techniques for imbalanced data:
- SMOTE (Synthetic Minority Over-sampling Technique)
- A combination of SMOTE oversampling and Tomek link undersampling
4.1.1 SMOTE
Synthetic Minority Over-sampling Technique (a sketch follows this list):
- For each minority-class point, compute its nearest minority-class neighbours
- Draw a line from the point to each of those neighbours
- Synthetically generate new minority points along those lines
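A minimal sketch using imbalanced-learn's SMOTE, assuming the training split from earlier (X_train and y_train are assumed names; older imbalanced-learn versions use fit_sample instead of fit_resample):

from imblearn.over_sampling import SMOTE

# oversample the minority class on the training data only
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)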
4.1.2 Combination of SMOTE and Tomek links
Tomek link undersampling (a sketch follows this list):
- A pair of samples forms a Tomek link if they are each other's nearest neighbours and belong to different classes
- The majority-class sample of each Tomek link is then removed (undersampling)
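A minimal sketch using imbalanced-learn's SMOTETomek, which chains SMOTE oversampling with Tomek link cleaning (again assuming X_train and y_train):

from imblearn.combine import SMOTETomek

# oversample with SMOTE, then remove Tomek links
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X_train, y_train)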
4.2 Logistic Regression Classification
We next use logistic regression to classify each sale as normal or abnormal, tuning the regularisation settings with a grid search.

# grid search over the regularisation penalty and strength
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param = {'penalty': ['l1', 'l2'],
         'C': list(np.linspace(0.1, 1, num=10))}

clf = GridSearchCV(LogisticRegression(), param, cv=5)
clf.fit(X_res, y_res)
GridSearchCV(cv=5, error_score='raise',
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
fit_params={}, iid=True, n_jobs=1,
param_grid={'penalty': ['l1', 'l2'], 'C': [0.10000000000000001, 0.20000000000000001, 0.30000000000000004, 0.40000000000000002, 0.5, 0.59999999999999998, 0.70000000000000007, 0.80000000000000004, 0.90000000000000002, 1.0]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
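The fitted grid search exposes the chosen hyperparameters and can score the held-out test set used in the reports below (a sketch; X_test is an assumed name for the untouched test features):

print(clf.best_params_)       # best penalty and C found by the grid search
y_pred = clf.predict(X_test)  # predictions evaluated in the classification reports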
4.3 Logistic Regression after SMOTE
Classification report
from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, y_pred, target_names=['normal', 'abnormal']))
pre rec spe f1 geo iba sup
normal 0.94 0.79 0.40 0.86 0.35 0.13 444
abnormal 0.13 0.40 0.79 0.20 0.35 0.11 35
avg / total 0.88 0.76 0.43 0.81 0.35 0.13 479
4.4 Logistic Regression after SMOTE + Tomek
Classification report
print(classification_report_imbalanced(y_test, y_pred, target_names=['normal', 'abnormal']))
pre rec spe f1 geo iba sup
normal 0.95 0.77 0.49 0.85 0.37 0.15 444
abnormal 0.14 0.49 0.77 0.22 0.37 0.12 35
avg / total 0.89 0.75 0.51 0.80 0.37 0.14 479