Data

Data#

Setup#

%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant  

sns.set_theme()

Import data#

ROOT = "https://raw.githubusercontent.com/kirenz/modern-statistics/main/data/"
DATA = "duke-forest.csv"

df = pd.read_csv(ROOT + DATA)

Data inspection#

df

	address	price	bed	bath	area	type	year_built	heating	cooling	parking	lot	hoa	url
0	1 Learned Pl, Durham, NC 27705	1520000	3	4.0	6040	Single Family	1972	Other, Gas	central	0 spaces	0.97	NaN	https://www.zillow.com/homedetails/1-Learned-P...
1	1616 Pinecrest Rd, Durham, NC 27705	1030000	5	4.0	4475	Single Family	1969	Forced air, Gas	central	Carport, Covered	1.38	NaN	https://www.zillow.com/homedetails/1616-Pinecr...
2	2418 Wrightwood Ave, Durham, NC 27705	420000	2	3.0	1745	Single Family	1959	Forced air, Gas	central	Garage - Attached, Covered	0.51	NaN	https://www.zillow.com/homedetails/2418-Wright...
3	2527 Sevier St, Durham, NC 27705	680000	4	3.0	2091	Single Family	1961	Heat pump, Other, Electric, Gas	central	Carport, Covered	0.84	NaN	https://www.zillow.com/homedetails/2527-Sevier...
4	2218 Myers St, Durham, NC 27707	428500	4	3.0	1772	Single Family	2020	Forced air, Gas	central	0 spaces	0.16	NaN	https://www.zillow.com/homedetails/2218-Myers-...
...	...	...	...	...	...	...	...	...	...	...	...	...	...
93	2507 Sevier St, Durham, NC 27705	541000	4	4.0	2740	Single Family	1960	Forced air, Heat pump, Gas	central	Carport, Covered	0.51	NaN	https://www.zillow.com/homedetails/2507-Sevier...
94	1207 Woodburn Rd, Durham, NC 27705	473000	3	3.0	2171	Single Family	1955	Forced air, Electric, Gas	other	0 spaces	0.61	NaN	https://www.zillow.com/homedetails/1207-Woodbu...
95	3008 Montgomery St, Durham, NC 27705	490000	4	4.0	2972	Single Family	1984	Forced air, Electric, Gas	central	Garage - Attached, Off-street, Covered	0.65	NaN	https://www.zillow.com/homedetails/3008-Montgo...
96	1614 Pinecrest Rd, Durham, NC 27705	815000	4	4.0	3904	Single Family	1970	Forced air, Gas	other	Garage - Attached, Garage - Detached, Covered	1.47	NaN	https://www.zillow.com/homedetails/1614-Pinecr...
97	2708 Circle Dr, Durham, NC 27705	674500	4	4.0	3766	Single Family	1955	Forced air, Electric, Gas	other	0 spaces	0.73	NaN	https://www.zillow.com/homedetails/2708-Circle...

98 rows × 13 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   address     98 non-null     object 
 1   price       98 non-null     int64  
 2   bed         98 non-null     int64  
 3   bath        98 non-null     float64
 4   area        98 non-null     int64  
 5   type        98 non-null     object 
 6   year_built  98 non-null     int64  
 7   heating     98 non-null     object 
 8   cooling     98 non-null     object 
 9   parking     98 non-null     object 
 10  lot         97 non-null     float64
 11  hoa         1 non-null      object 
 12  url         98 non-null     object 
dtypes: float64(2), int64(4), object(7)
memory usage: 10.1+ KB

# show missing values (missing values - if present - will be displayed in yellow)
sns.heatmap(df.isnull(), 
            yticklabels=False,
            cbar=False, 
            cmap='viridis');

../_images/6faa903226cfde5b06306ad6fd5b96c3ea2b4c1a4f79887a55d7b8e751c14269.png

print(df.isnull().sum())

address        0
price          0
bed            0
bath           0
area           0
type           0
year_built     0
heating        0
cooling        0
parking        0
lot            1
hoa           97
url            0
dtype: int64

Data transformation#

# drop column with too many missing values
df = df.drop(['hoa'], axis=1)

# drop remaining row with one missing value
df = df.dropna()

# Drop irrelevant features
df = df.drop(['url', 'address'], axis=1)

print(df.isnull().sum())

price         0
bed           0
bath          0
area          0
type          0
year_built    0
heating       0
cooling       0
parking       0
lot           0
dtype: int64

# Convert data types
categorical_list = ['type', 'heating', 'cooling', 'parking']

for i in categorical_list:
    df[i] = df[i].astype("category")

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 0 to 97
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   price       97 non-null     int64   
 1   bed         97 non-null     int64   
 2   bath        97 non-null     float64 
 3   area        97 non-null     int64   
 4   type        97 non-null     category
 5   year_built  97 non-null     int64   
 6   heating     97 non-null     category
 7   cooling     97 non-null     category
 8   parking     97 non-null     category
 9   lot         97 non-null     float64 
dtypes: category(4), float64(2), int64(4)
memory usage: 7.3 KB

# summary statistics for all categorical columns
df.describe(include=['category']).transpose()

	count	unique	top	freq
type	97	1	Single Family	97
heating	97	19	Forced air, Gas	34
cooling	97	2	other	52
parking	97	19	0 spaces	42

Variable type has zero veriation (only single family) and therefore can be exluded from the analysis and the model.
We will also exclude heatingand parking to keep this example as simple as possible.

df = df.drop(['type', 'heating', 'parking'], axis=1)
df

	price	bed	bath	area	year_built	cooling	lot
0	1520000	3	4.0	6040	1972	central	0.97
1	1030000	5	4.0	4475	1969	central	1.38
2	420000	2	3.0	1745	1959	central	0.51
3	680000	4	3.0	2091	1961	central	0.84
4	428500	4	3.0	1772	2020	central	0.16
...	...	...	...	...	...	...	...
93	541000	4	4.0	2740	1960	central	0.51
94	473000	3	3.0	2171	1955	other	0.61
95	490000	4	4.0	2972	1984	central	0.65
96	815000	4	4.0	3904	1970	other	1.47
97	674500	4	4.0	3766	1955	other	0.73

97 rows × 7 columns

Data splitting#

train_dataset = df.sample(frac=0.8, random_state=0)
test_dataset = df.drop(train_dataset.index)

train_dataset

	price	bed	bath	area	year_built	cooling	lot
26	385000	3	2.0	1831	1951	central	0.29
85	485000	4	3.0	2609	1962	other	0.52
2	420000	2	3.0	1745	1959	central	0.51
55	150000	3	1.0	1734	1945	other	0.16
69	105000	2	1.0	1094	1940	other	0.26
...	...	...	...	...	...	...	...
96	815000	4	4.0	3904	1970	other	1.47
70	520000	4	3.0	2637	1968	other	0.65
20	270000	3	3.0	1416	1990	other	0.36
92	590000	5	3.0	3323	1980	other	0.43
73	592000	3	2.0	2378	1960	other	0.75

78 rows × 7 columns

Exploratory data analysis#

# summary statistics for all numerical columns
round(train_dataset.describe(),2).transpose()

	count	mean	std	min	25%	50%	75%	max
price	78.0	560762.18	243254.08	95000.00	421250.00	537500.00	650000.00	1520000.00
bed	78.0	3.81	0.74	2.00	3.00	4.00	4.00	6.00
bath	78.0	3.10	0.92	1.00	2.50	3.00	4.00	5.00
area	78.0	2831.40	986.38	1094.00	2095.25	2745.00	3261.75	6178.00
year_built	78.0	1965.82	16.80	1923.00	1956.25	1961.50	1971.50	2020.00
lot	78.0	0.59	0.23	0.15	0.45	0.56	0.69	1.47

sns.pairplot(train_dataset);

../_images/874e5ec030bb4ec32a261d81627f9d508c51c0abd301bf34c3789d5c24050f9c.png

Correlation analysis#

# Create correlation matrix for numerical variables
corr_matrix = train_dataset.corr()
corr_matrix

	price	bed	bath	area	year_built	lot
price	1.000000	0.446668	0.593686	0.680012	0.248102	0.537264
bed	0.446668	1.000000	0.599660	0.560258	0.216696	0.248166
bath	0.593686	0.599660	1.000000	0.659879	0.351917	0.335490
area	0.680012	0.560258	0.659879	1.000000	0.165495	0.412836
year_built	0.248102	0.216696	0.351917	0.165495	1.000000	-0.047352
lot	0.537264	0.248166	0.335490	0.412836	-0.047352	1.000000

# Simple heatmap
heatmap = sns.heatmap(corr_matrix)

../_images/6ff41025351d711c35a17882b48965e8bea67dbfd40cb6dbce00234cee3f753c.png

# Make a pretty heatmap

# Use a mask to plot only part of a matrix
mask = np.zeros_like(corr_matrix)
mask[np.triu_indices_from(mask)]= True

# Change size
plt.subplots(figsize=(11, 15))

# Build heatmap with additional options
heatmap = sns.heatmap(corr_matrix, 
                      mask = mask, 
                      square = True, 
                      linewidths = .5,
                      cmap = 'coolwarm',
                      cbar_kws = {'shrink': .6,
                                'ticks' : [-1, -.5, 0, 0.5, 1]},
                      vmin = -1,
                      vmax = 1,
                      annot = True,
                      annot_kws = {"size": 10})

../_images/edd9e4064f476e4e686e3d9de57b722c892d8798d3849224e45b03a413f83091.png

Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF). Note that we ignore the intercept in this test.

The smallest possible value for VIF is 1, which indicates the complete absence of collinearity.
Typically in practice there is a small amount of collinearity among the predictors.
As a rule of thumb, a VIF value that exceeds 5 indicates a problematic amount of collinearity and the parameter estimates will have large standard errors because of this.

Note that the function variance_inflation_factor expects the presence of a constant in the matrix of explanatory variables. Therefore, we use add_constant from statsmodels to add the required constant to the dataframe before passing its values to the function.

# choose features and add constant
features = add_constant(df[['bed', 'bath', 'area', 'lot']])
# create empty DataFrame
vif = pd.DataFrame()
# calculate vif
vif["VIF Factor"] = [variance_inflation_factor(features.values, i) for i in range(features.shape[1])]
# add feature names
vif["Feature"] = features.columns

vif.round(2)

	VIF Factor	Feature
0	28.52	const
1	1.74	bed
2	2.17	bath
3	2.14	area
4	1.19	lot

We don’t have a problematic amount of collinearity in our data.

Modeling#

See separate notebooks.