30 Nov 2017

Exploratory Data Analysis

This project is focused on exploratory data analysis, aka “EDA”, of an SAT scores dataset and a drug use dataset from the United States. EDA is an essential part of the data science analysis pipeline.

Exploratory Data Analysis (EDA)


Your hometown mayor just created a new data analysis team to give policy advice, and the administration recruited you via LinkedIn to join it. Unfortunately, due to budget constraints, for now the “team” is just you…

The mayor wants to start a new initiative to move the needle on one of two separate issues: high school education outcomes, or drug abuse in the community.

Also unfortunately, that is the entirety of what you’ve been told. And the mayor just went on a lobbyist-funded fact-finding trip in the Bahamas. In the meantime, you got your hands on two national datasets: one on SAT scores by state, and one on drug use by age. Start exploring these to look for useful patterns and possible hypotheses!


This project is focused on exploratory data analysis, aka “EDA”. EDA is an essential part of the data science analysis pipeline. Failure to perform EDA before modeling is almost guaranteed to lead to bad models and faulty conclusions. What you do in this project is good practice for all projects going forward!

This jupyter notebook lab includes a variety of plotting problems.

As a data scientist, one will use visualization and plotting every single day.

Package imports

import numpy as np
import scipy.stats as stats
import csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# this line tells jupyter notebook to display the plots inline in the notebook rather than in a separate window.
%matplotlib inline

# this line makes plots prettier on mac retina screens. If you don't have one it shouldn't do anything.
%config InlineBackend.figure_format = 'retina'

1. Load the sat_scores.csv dataset and describe it


Replace the placeholder path to the sat_scores.csv dataset below with the specific path to the file on your machine.

1.1 Load the file with the csv module and put it in a Python dictionary


The dictionary format for the data will be the column names as keys, and the data under each column as the values.

Toy example:

data = {
    'column1':[0,1,2,3],
    'column2':['a','b','c','d']
    }
#Create dictionary

with open('sat_scores.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    data = list(reader)

d = {}

# data[0] holds the column names; pair each header with its column index
for n, key in enumerate(data[0]):
    d[key] = [row[n] for row in data[1:]]
d
{'Math': ['510', '513', '515', '505', '516', '499', '499', '506', '500', '501', ...],
 'Rate': ['82', '81', '79', '77', '72', '71', '71', '69', '69', '68', ...],
 'State': ['CT', 'NJ', 'MA', 'NY', 'NH', 'RI', 'PA', 'VT', 'ME', 'VA', ...],
 'Verbal': ['509', '499', '511', '495', '520', '501', '500', '511', '506', '510', ...]}

1.2 Make a pandas DataFrame object with the SAT dictionary, and another with the pandas .read_csv() function


Compare the DataFrames using the .dtypes attribute in the DataFrame objects. What is the difference between loading from file and inputting this dictionary (if any)?

sat = pd.read_csv('sat_scores.csv')
sat.dtypes
State     object
Rate       int64
Verbal     int64
Math       int64
dtype: object
sat2 = pd.DataFrame(d)
sat2.dtypes
Math      object
Rate      object
State     object
Verbal    object
dtype: object

Answer:

The difference between creating a DataFrame from a dictionary (in Q1.1) and using the pandas read_csv function lies in the resulting column types. The elements within the csv file keep their types when using read_csv, so the numeric columns come in as int64. However, creating a DataFrame from the dictionary of strings leaves every column as object. This would pose issues during the data cleaning and munging process.

TIP:

If you don’t convert the string column values to float in your dictionary, the columns in the DataFrame are of type object (which are string values, essentially).
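A minimal sketch of that conversion, reusing the d dictionary from Q1.1 (State stays as strings; the numeric columns become floats):

# Convert the numeric columns from strings to floats before building the DataFrame
d_numeric = {key: (vals if key == 'State' else [float(v) for v in vals])
             for key, vals in d.items()}
pd.DataFrame(d_numeric).dtypes  # Rate, Verbal and Math are now float64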

1.3 Look at the first ten rows of the DataFrame: What does our data describe?


From now on, use the DataFrame loaded from the file using the .read_csv() function.

Use the .head(num) built-in DataFrame function, where num is the number of rows to print out.

You are not given a “codebook” with this data, so some (very minor) inference must be made.

sat.head(10)
State Rate Verbal Math
0 CT 82 509 510
1 NJ 81 499 513
2 MA 79 511 515
3 NY 77 495 505
4 NH 72 520 516
5 RI 71 501 499
6 PA 71 500 499
7 VT 69 511 506
8 ME 69 506 500
9 VA 68 510 501
# One of the rows ('All') aggregates the scores of all the states.
# Check the unique State values to confirm which rows are actual states
sat['State'].unique()
array(['CT', 'NJ', 'MA', 'NY', 'NH', 'RI', 'PA', 'VT', 'ME', 'VA', 'DE',
       'MD', 'NC', 'GA', 'IN', 'SC', 'DC', 'OR', 'FL', 'WA', 'TX', 'HI',
       'AK', 'CA', 'AZ', 'NV', 'CO', 'OH', 'MT', 'WV', 'ID', 'TN', 'NM',
       'IL', 'KY', 'WY', 'MI', 'MN', 'KS', 'AL', 'NE', 'OK', 'MO', 'LA',
       'WI', 'AR', 'UT', 'IA', 'SD', 'ND', 'MS', 'All'], dtype=object)
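A quick count of the unique values confirms the 52 entries: the 50 states, plus DC and the aggregate 'All' row.

sat['State'].nunique()
52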

2. Create a “data dictionary” based on the data


A data dictionary is an object that describes your data. This should contain the name of each variable (column), the type of the variable, your description of what the variable is, and the shape (rows and columns) of the entire dataset.
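As a sketch, the data dictionary itself can be written as a plain Python dict; the descriptions are inferences, since no codebook is provided:

data_dictionary = {
    'shape': (52, 4),
    'State': {'type': 'object', 'description': 'US state abbreviation, plus an aggregate "All" row'},
    'Rate': {'type': 'int64', 'description': 'Rate of SAT test takers in the state, in percent'},
    'Verbal': {'type': 'int64', 'description': 'Mean SAT Verbal score for the state'},
    'Math': {'type': 'int64', 'description': 'Mean SAT Math score for the state'}
}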

#Checking the type of data for each column
sat.dtypes
State     object
Rate       int64
Verbal     int64
Math       int64
dtype: object
#Checking the data size
sat.shape
(52, 4)
#Checking if there are null values
sat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 4 columns):
State     52 non-null object
Rate      52 non-null int64
Verbal    52 non-null int64
Math      52 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.7+ KB
#Familiarise with the column names
sat.columns
Index(['State', 'Rate', 'Verbal', 'Math'], dtype='object')
sat.tail()
State Rate Verbal Math
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
51 All 45 506 514


Update of the SAT Dataset:

  • No null values within the DataFrame
  • The last row (‘All’) attempts to summarise the readings of each feature.
  • Assess whether the ‘All’ row is a correct summary, then remove the last row so that the analysis is not affected by this aggregate ‘summary’ of the table
sats = sat.drop(51)
sats.describe()
Rate Verbal Math
count 51.000000 51.000000 51.000000
mean 37.000000 532.529412 531.843137
std 27.550681 33.360667 36.287393
min 4.000000 482.000000 439.000000
25% 9.000000 501.000000 503.000000
50% 33.000000 527.000000 525.000000
75% 64.000000 562.000000 557.500000
max 82.000000 593.000000 603.000000


From the describe() output of the remaining 51 rows, the means of Rate, Verbal, and Math are 37.0, 532.5, and 531.8. These numbers do not match the values in the ‘All’ row (45, 506, 514), so the ‘All’ row is not a simple average of the states, and we proceed with the sats DataFrame that has the last row dropped.

3. Plot the data using seaborn


3.1 Using seaborn’s distplot, plot the distributions for each of Rate, Math, and Verbal

Set the keyword argument kde=False. This way you can actually see the counts within bins. You can adjust the number of bins to your liking.

rate = sats['Rate']
sns.distplot(rate, bins= 10, kde=False)
plt.show()

math= sats['Math']
sns.distplot(math, bins= 50, kde=False)
plt.show()

verbal = sats['Verbal']
sns.distplot(verbal, bins= 50, kde=False)
plt.show()

[Histogram of Rate]

[Histogram of Math]

[Histogram of Verbal]

3.2 Using seaborn’s pairplot, show the joint distributions for each of Rate, Math, and Verbal


Explain what the visualization tells you about your data.

sns.pairplot(sats, vars=['Rate','Math','Verbal'] , hue='State')
plt.show()


From the pairplot, we observe that states which fare very well on the Math and Verbal tests tend to have a low SAT participation Rate. We can infer that in the high-scoring states, students take the SAT only when they are confident of scoring well.

4. Plot the data using built-in pandas functions.


Pandas is very powerful and contains a variety of nice, built-in plotting functions for your data.

4.1 Plot a stacked histogram with Verbal and Math using pandas

verbal_math = sats[['Verbal','Math']]
verbal_math.plot.hist(stacked=True, bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x1a185fa828>

[Stacked histogram of Verbal and Math]

4.2 Plot Verbal and Math on the same chart using boxplots

What are the benefits of using a boxplot as compared to a scatterplot or a histogram?

The boxplot compactly shows the median, the upper and lower quartiles, the whisker limits, and any outliers for each variable. A scatterplot or histogram, by contrast, only gives a generic picture of the distribution of the data.

color = dict(boxes='DarkGreen', whiskers='DarkOrange',medians='DarkBlue', caps='Gray')
verbal_math.plot.box(color=color, sym='r+')
<matplotlib.axes._subplots.AxesSubplot at 0x1a172db860>

[Boxplot of Verbal and Math]

4.3 Plot Verbal, Math, and Rate appropriately on the same boxplot chart

Think about how you might change the variables so that they would make sense on the same chart. Explain your rationale for the choices on the chart. You should strive to make the chart as intuitive as possible.

verbal_math_rate = sats[['Verbal','Math','Rate']]
color = dict(boxes='DarkGreen', whiskers='DarkOrange',medians='DarkBlue', caps='Gray')
verbal_math_rate.plot.box(color=color, sym='r+')
<matplotlib.axes._subplots.AxesSubplot at 0x1a18e7f978>

[Boxplot of Verbal, Math, and Rate]


What’s wrong with plotting a box-plot of Rate on the same chart as Math and Verbal?

The initial boxplot is a misrepresentation: Rate is not on the same scale as Verbal and Math. Therefore, the variables must be standardised, z = (x − mean) / std, before they share one chart.

#Rate descriptive data
rate_value = sats.Rate.values
rate_mean = np.mean(rate_value)
rate_std = np.std(rate_value)
print(rate_mean, rate_std)
37.0 27.27923867605359
#Standardisation of rate
rate_stand = (rate_value - rate_mean) / rate_std
print(np.mean(rate_stand), np.std(rate_stand))
# not exactly a mean of 0 but excruciatingly close
-8.707631565687502e-18 1.0
#Plot histogram of the standardised rate to observe any changes
fig = plt.figure(figsize=(8,4))
ax = fig.gca()
ax = sns.distplot(rate_stand, bins=30, kde=False)
plt.show()

[Histogram of standardised Rate]

Notice that nothing changes about the distribution except for the location and the scale

#Maths descriptive data
math_value = sats.Math.values
math_mean = np.mean(math_value)
math_std = np.std(math_value)
print(math_mean, math_std)
531.843137254902 35.92987317311408
#Standardisation of math
math_stand = (math_value - math_mean) / math_std
print(np.mean(math_stand), np.std(math_stand))
# not exactly a mean of 0 but excruciatingly close
-8.272249987403127e-16 1.0
#Plot histogram of the standardised math scores to observe any changes
fig = plt.figure(figsize=(8,4))
ax = fig.gca()
ax = sns.distplot(math_stand, bins=30, kde=False)
plt.show()

[Histogram of standardised Math]

#Verbal descriptive data
verbal_value = sats.Verbal.values
verbal_mean = np.mean(verbal_value)
verbal_std = np.std(verbal_value)
print(verbal_mean, verbal_std)
532.5294117647059 33.03198268415228
#Standardisation of Verbal
verbal_stand = (verbal_value - verbal_mean) / verbal_std
print(np.mean(verbal_stand), np.std(verbal_stand))
# not exactly a mean of 0 but excruciatingly close
8.098097356089377e-16 0.9999999999999998
#Plot histogram of the standardised verbal scores to observe any changes
fig = plt.figure(figsize=(8,4))
ax = fig.gca()
ax = sns.distplot(verbal_stand, bins=30, kde=False)
plt.show()

[Histogram of standardised Verbal]

sats.head()
State Rate Verbal Math
0 CT 82 509 510
1 NJ 81 499 513
2 MA 79 511 515
3 NY 77 495 505
4 NH 72 520 516
sats_without_state = sats.drop(['State'],axis=1)
sats_without_state.head()
Rate Verbal Math
0 82 509 510
1 81 499 513
2 79 511 515
3 77 495 505
4 72 520 516
sats_without_state_stand = (sats_without_state - sats_without_state.mean()) / sats_without_state.std()
sats_without_state_stand.plot.box(color=color, sym='r+')
<matplotlib.axes._subplots.AxesSubplot at 0x1a18e7bf28>

[Boxplot of standardised Rate, Verbal, and Math]
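One subtlety worth noting: pandas’ .std() uses the sample standard deviation (ddof=1), while np.std() defaults to the population version (ddof=0), which is why the manual z-scores above use a slightly different denominator than the pandas one-liner. A quick check:

print(sats['Rate'].std(), np.std(sats['Rate'].values))  # 27.550681 (ddof=1) vs 27.279239 (ddof=0)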

5. Create and examine subsets of the data


For these questions masking will be used in pandas. Masking uses conditional statements to select portions of your DataFrame (through boolean operations under the hood.)

Remember the distinction between DataFrame indexing functions in pandas:

.iloc[row, col] : row and column are specified by index, which are integers
.loc[row, col]  : row and column are specified by string "labels" (boolean arrays are allowed; useful for rows)
.ix[row, col]   : row and column indexers can be a mix of labels and integer indices (.ix is deprecated in later pandas versions)

For detailed reference and tutorial make sure to read over the pandas documentation:

http://pandas.pydata.org/pandas-docs/stable/indexing.html
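As a small sketch before the exercises, a boolean mask on the sats DataFrame looks like this:

# The condition produces a boolean Series; .loc keeps only the True rows
mask = sats['Rate'] > 50
sats.loc[mask, ['State', 'Rate']].head(3)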

5.1 Find the list of states that have Verbal scores greater than the average of Verbal scores across states


How many states are above the mean? What does this tell you about the distribution of Verbal scores?

verbal_mean = sats['Verbal'].mean()
print(verbal_mean)
sats.loc[sats['Verbal'] > verbal_mean]
532.5294117647059
State Rate Verbal Math
26 CO 31 539 542
27 OH 26 534 439
28 MT 23 539 539
30 ID 17 543 542
31 TN 13 562 553
32 NM 13 551 542
33 IL 12 576 589
34 KY 12 550 550
35 WY 11 547 545
36 MI 11 561 572
37 MN 9 580 589
38 KS 9 577 580
39 AL 9 559 554
40 NE 8 562 568
41 OK 8 567 561
42 MO 8 577 577
43 LA 7 564 562
44 WI 6 584 596
45 AR 6 562 550
46 UT 5 575 570
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
sats.loc[sats['Verbal'] > verbal_mean].count()
State     24
Rate      24
Verbal    24
Math      24
dtype: int64

There are a total of 24 states above the mean score of the Verbal test. Since that is close to half of the 51 rows, the distribution of Verbal scores is roughly symmetric.

5.2 Find the list of states that have Verbal scores greater than the median of Verbal scores across states


How does this compare to the list of states greater than the mean of Verbal scores? Why?

verbal_median = sats['Verbal'].median() 
print(verbal_median)
sats[sats['Verbal'] > verbal_median]
527.0
State Rate Verbal Math
26 CO 31 539 542
27 OH 26 534 439
28 MT 23 539 539
30 ID 17 543 542
31 TN 13 562 553
32 NM 13 551 542
33 IL 12 576 589
34 KY 12 550 550
35 WY 11 547 545
36 MI 11 561 572
37 MN 9 580 589
38 KS 9 577 580
39 AL 9 559 554
40 NE 8 562 568
41 OK 8 567 561
42 MO 8 577 577
43 LA 7 564 562
44 WI 6 584 596
45 AR 6 562 550
46 UT 5 575 570
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
sats[sats['Verbal'] > verbal_median].count()
State     24
Rate      24
Verbal    24
Math      24
dtype: int64


There are a total of 24 states above the median score of the Verbal test, and the list is identical to the list of states above the mean. This is because the mean (532.5) and the median (527) are close together, which again points to a fairly symmetric distribution of scores.

5.3 Create a column that is the difference between the Verbal and Math scores


Specifically, this should be Verbal - Math.

sats['diff_verbal_math'] = sats['Verbal'] - sats['Math']
sats.head(5)
State Rate Verbal Math diff_verbal_math
0 CT 82 509 510 -1
1 NJ 81 499 513 -14
2 MA 79 511 515 -4
3 NY 77 495 505 -10
4 NH 72 520 516 4

5.4 Create two new DataFrames showing states with the greatest difference between scores


  1. Your first DataFrame should be the 10 states with the greatest gap between Verbal and Math scores where Verbal is greater than Math. It should be sorted appropriately to show the ranking of states.
  2. Your second DataFrame will be the inverse: states with the greatest gap between Verbal and Math such that Math is greater than Verbal. Again, this should be sorted appropriately to show rank.
  3. Print the header of both variables, only showing the top 3 states in each.

Question 5.4.1 Find the states with the highest difference between Verbal and Math scores


state_diff_score = sats[['State','diff_verbal_math']]
# Sort descending so the states where Verbal most exceeds Math rank first
top_10_verbal = state_diff_score.sort_values('diff_verbal_math', ascending=False).head(10)
print('The state is ' + top_10_verbal.State.head(3))

Question 5.4.2 Find the states with the lowest difference between Verbal and Math scores


# Sort ascending so the states where Math most exceeds Verbal rank first
top_10_math = state_diff_score.sort_values('diff_verbal_math').head(10)
print('The state is ' + top_10_math.State.head(3))
21    The state is HI
23    The state is CA
1     The state is NJ
Name: State, dtype: object

6. Examine summary statistics


Checking the summary statistics for data is an essential step in the EDA process!

6.1 Create the correlation matrix of your variables (excluding State).


What does the correlation matrix tell you?

# Create dataframe to analyse the different factors on a heat map
features = sats[['Math','Verbal', 'Rate']]
# create correlation data
correlation = features.corr()

# plot heatmap
ax = sns.heatmap(correlation, annot=True)

# turn the axis label
for item in ax.get_yticklabels():
    item.set_rotation(0)

for item in ax.get_xticklabels():
    item.set_rotation(90)

[Correlation heatmap of Math, Verbal, and Rate]


Based on the absolute value of the Pearson correlation coefficients, we observe high correlation within two pairs: Verbal and Math scores (0.90), and Verbal scores and Rate (−0.89). This implies that candidates with high Verbal scores are likely to attain high Math scores as well, while states with high Verbal scores tend to have low SAT participation rates.

6.2 Use pandas’ .describe() built-in function on your DataFrame


Write up what each of the rows returned by the function indicate.

math.describe()
count     51.000000
mean     531.843137
std       36.287393
min      439.000000
25%      503.000000
50%      525.000000
75%      557.500000
max      603.000000
Name: Math, dtype: float64


Based on the data of the 51 states (including DC), we observe that the mean Math score is 531.8, with a minimum score of 439 and a maximum score of 603. States in the upper quartile score more than 557.5.

verbal.describe()
count     51.000000
mean     532.529412
std       33.360667
min      482.000000
25%      501.000000
50%      527.000000
75%      562.000000
max      593.000000
Name: Verbal, dtype: float64


Based on the data of the 51 states, we observe that the mean Verbal score is 532.5, with a minimum score of 482 and a maximum score of 593. States in the upper quartile score more than 562.

rate.describe()
count    51.000000
mean     37.000000
std      27.550681
min       4.000000
25%       9.000000
50%      33.000000
75%      64.000000
max      82.000000
Name: Rate, dtype: float64


Based on the data of the 51 states, we observe that the mean Rate is 37, with a minimum rate of 4 and a maximum rate of 82. States in the upper quartile have a Rate of more than 64.

6.3 Assign and print the covariance matrix for the dataset


  1. Describe how the covariance matrix is different from the correlation matrix.
  2. What is the process to convert the covariance into the correlation?
  3. Why is the correlation matrix preferred to the covariance matrix for examining relationships in your data?

6.3.1 & 6.3.2 Understanding the concept of Covariance and Pearson Correlation


The covariance matrix indicates “relatedness” between variables. For a pair of variables X and Y, it is the sum of deviations from the mean of X times deviations from the mean of Y, adjusted by the sample size N:

cov(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / (N − 1)

From this formula, it is challenging to deduce what the value means, as X and Y might have different units. Therefore, we rely on the correlation matrix, which normalises the covariance by dividing it by the standard deviations of both variables:

corr(X, Y) = cov(X, Y) / (std(X) · std(Y))

The correlation matrix allows us to easily interpret the “relatedness” on a scale of −1 to 1, as it takes the spread of the data into account.

6.3.3 Covariance of Verbal, Math, and Rate


The correlation matrix is much more effective for assessing the relationship between two different variables, because the deviations of both variables are divided out and every entry lives on the same −1 to 1 scale.

covariance = features.cov()
print(covariance)
               Math       Verbal    Rate
Math    1316.774902  1089.404706 -773.22
Verbal  1089.404706  1112.934118 -816.28
Rate    -773.220000  -816.280000  759.04
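As a sketch of the conversion process from 6.3.2, we can recover the correlation matrix from the covariance matrix by dividing each entry by the product of the corresponding standard deviations:

# The diagonal of the covariance matrix holds the variances
std = np.sqrt(np.diag(covariance))
corr_from_cov = covariance / np.outer(std, std)
print(corr_from_cov)  # matches features.corr()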

7. Performing EDA on “drug use by age” data.


You will now switch datasets to one with many more variables. This section of the project is more open-ended.

We’ll work with the “drug-use-by-age.csv” data, sourced from and described here: https://github.com/fivethirtyeight/data/tree/master/drug-use-by-age.

7.1 Load the data using pandas.


Does this data require cleaning? Are variables missing? How will this affect your approach to EDA on the data?

#Load the data using pandas
drug = pd.read_csv('drug-use-by-age.csv', na_values='-')
drug.head()

Data quality check:

  • Check for null values
  • Check the column labels
  • Check the dtypes of the columns
# Check for null values in each column
drug.isnull().sum()
age                        0
n                          0
alcohol-use                0
alcohol-frequency          0
marijuana-use              0
marijuana-frequency        0
cocaine-use                0
cocaine-frequency          1
crack-use                  0
crack-frequency            3
heroin-use                 0
heroin-frequency           1
hallucinogen-use           0
hallucinogen-frequency     0
inhalant-use               0
inhalant-frequency         1
pain-releiver-use          0
pain-releiver-frequency    0
oxycontin-use              0
oxycontin-frequency        1
tranquilizer-use           0
tranquilizer-frequency     0
stimulant-use              0
stimulant-frequency        0
meth-use                   0
meth-frequency             2
sedative-use               0
sedative-frequency         0
dtype: int64


We can observe that there are null values in the dataset under cocaine-frequency, crack-frequency, heroin-frequency, inhalant-frequency, oxycontin-frequency, and meth-frequency; these were ‘-’ entries in the csv, which we read in as NaN.

drug.fillna(0, inplace=True)
drug.isnull().sum()
age                        0
n                          0
alcohol-use                0
alcohol-frequency          0
marijuana-use              0
marijuana-frequency        0
cocaine-use                0
cocaine-frequency          0
crack-use                  0
crack-frequency            0
heroin-use                 0
heroin-frequency           0
hallucinogen-use           0
hallucinogen-frequency     0
inhalant-use               0
inhalant-frequency         0
pain-releiver-use          0
pain-releiver-frequency    0
oxycontin-use              0
oxycontin-frequency        0
tranquilizer-use           0
tranquilizer-frequency     0
stimulant-use              0
stimulant-frequency        0
meth-use                   0
meth-frequency             0
sedative-use               0
sedative-frequency         0
dtype: int64
drug.columns
Index(['age', 'n', 'alcohol-use', 'alcohol-frequency', 'marijuana-use',
       'marijuana-frequency', 'cocaine-use', 'cocaine-frequency', 'crack-use',
       'crack-frequency', 'heroin-use', 'heroin-frequency', 'hallucinogen-use',
       'hallucinogen-frequency', 'inhalant-use', 'inhalant-frequency',
       'pain-releiver-use', 'pain-releiver-frequency', 'oxycontin-use',
       'oxycontin-frequency', 'tranquilizer-use', 'tranquilizer-frequency',
       'stimulant-use', 'stimulant-frequency', 'meth-use', 'meth-frequency',
       'sedative-use', 'sedative-frequency'],
      dtype='object')


We replace the ‘-’ (hyphen) in the column names of the drug DataFrame: hyphens prevent attribute-style access (e.g. drug.alcohol_use) and can be misread as the subtraction operator in expression strings. Each hyphen is therefore replaced with a ‘_’. From the column names above, we can also observe that ‘pain-releiver’ is spelled wrongly (it should be ‘pain-reliever’).
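For reference, the hyphens could also be swapped programmatically in one line; that is not used here because it would keep the ‘pain-releiver’ misspelling, which the explicit list below renames at the same time.

# One-line alternative (not run here): swap every hyphen for an underscore
# drug.columns = drug.columns.str.replace('-', '_')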

# List Replacement Method
new_names = ['age', 'n','alcohol_use', 'alcohol_frequency', 'marijuana_use',
        'marijuana_frequency', 'cocaine_use', 'cocaine_frequency',
        'crack_use', 'crack_frequency', 'heroin_use', 'heroin_frequency',
        'hallucinogen_use', 'hallucinogen_frequency', 'inhalant_use',
        'inhalant_frequency', 'painrelieve_use', 'painrelieve_frequency',
        'oxycontin_use', 'oxycontin_frequency', 'tranquilizer_use',
        'tranquilizer_frequency', 'stimulant_use', 'stimulant_frequency',
        'meth_use', 'meth_frequency', 'sedative_use', 'sedative_frequency']
drug.columns = new_names
drug.head()


We add a new column near the front of the table to capture each age group’s share of the sample. This will help us understand the age distribution across the various types of drug use.

new_col = (drug['n']/sum(drug['n'])*100)
idx = 2
drug.insert(loc=idx, column='age_percentage', value=new_col)
drug.head(2)


We also make a new dataset without the age column: the age groups are unevenly sized bins stored as strings, so dropping the column makes the data easier to process with numeric techniques.

druggie = drug.drop(['age'], axis=1)
druggie.head(2)

7.2 Do a high-level, initial overview of the data


Get a feel for what this dataset is all about.

Use whichever techniques you’d like, including those from the SAT dataset EDA. The final response to this question should be a written description of what you infer about the dataset.

Some things to consider doing:

  • Look for relationships between variables and subsets of those variables’ values
  • Derive new features from the ones available to help your analysis
  • Visualize everything!


To understand the relationships between variables, we will visualise the data in the form of a heat map and boxplots. The boxplots are used to identify outliers, whereas the heat map will point us to the more closely related variables.

Before doing so, we must separate the data points into two groups: 1) Drug use and 2) Drug Frequency

# Drop n (a count, not a percentage) along with age and age_percentage, so
# that the remaining columns alternate use, frequency, use, frequency, ...
drug_only = drug.drop(['age', 'n', 'age_percentage'], axis=1)
use_drug = list(drug_only.columns[::2])
use = drug_only[use_drug]
freq_drug = list(drug_only.columns[1::2])
freq = drug_only[freq_drug]
#Configure the shape of the boxplot
fig = plt.figure(figsize=(12,8))
ax = fig.gca()

# Plot the boxplot
ax = sns.boxplot(data=use, orient='h', ax=ax)

# Add titles and axes labels to the boxplot
ax.set_title('Different drug use\n')
plt.xlabel(' % drug use\n')
plt.ylabel('Drug type\n')

# Display the boxplot
plt.show()


The drug use boxplot tells us that there are no outliers. Alcohol and marijuana are much more widely used than the other drugs.

#Configure the shape of the boxplot
fig = plt.figure(figsize=(12,8))
ax = fig.gca()

# Plot the boxplot
ax = sns.boxplot(data=freq, orient='h', ax=ax)

# Add titles and axes labels to the boxplot
ax.set_title('Different drug frequency\n')
plt.xlabel(' Drug frequency in 12 months\n')
plt.ylabel('Drug type\n')

# Display the boxplot
plt.show()


The drug frequency boxplot tells us that there are outliers in the hallucinogen, pain reliever, tranquilizer, stimulant, and sedative categories. Alcohol, heroin, and marijuana are also used much more frequently than the other drugs.

# create correlation data
plt.figure(figsize=(20,16))
drug_correlation = drug.corr()

# plot heatmap
ax = sns.heatmap(drug_correlation, linewidth=0.5, cmap="RdBu_r", annot=True)

# turn the axis label
for item in ax.get_yticklabels():
    item.set_rotation(0)

for item in ax.get_xticklabels():
    item.set_rotation(90)


Based on the EDA correlation heatmap, we see strong positive correlations (above 0.95) between the usage rates of certain drugs:

  1. Stimulant and marijuana use
  2. Oxycontin and marijuana use
  3. Hallucinogen and marijuana use
  4. Pain-reliever and marijuana use
  5. Oxycontin and pain-reliever use
  6. Stimulant and pain-reliever use
  7. Tranquilizer and pain-reliever use
  8. Stimulant and oxycontin use
  9. Tranquilizer and oxycontin use

7.3 Create a testable hypothesis about this data


Requirements for the question:

  1. Write a specific question you would like to answer with the data (that can be accomplished with EDA).
  2. Write a description of the “deliverables”: what will you report after testing/examining your hypothesis?
  3. Use EDA techniques of your choice, numeric and/or visual, to look into your question.
  4. Write up your report on what you have found regarding the hypothesis about the data you came up with.

Your hypothesis could be on:

  • Difference of group means
  • Correlations between variables
  • Anything else you think is interesting, testable, and meaningful!

Important notes:

You should be only doing EDA relevant to your question here. It is easy to go down rabbit holes trying to look at every facet of your data, and so we want you to get in the practice of specifying a hypothesis you are interested in first and scoping your work to specifically answer that question.

Some of you may want to jump ahead to “modeling” data to answer your question. This is a topic addressed in the next project and you should not do this for this project. We specifically want you to not do modeling to emphasize the importance of performing EDA before you jump to statistical analysis.

Question:


Is there a difference between the mean percentage of oxycontin use and pain-reliever use?

Deliverables:


  1. Mean percentage of oxycontin use
  2. Mean percentage of pain-reliever use
  3. t-statistic
  4. p-value

The “null hypothesis”


The null hypothesis is a fundamental concept of Frequentist statistical tests. We typically denote the null hypothesis with H0.

If users take oxycontin and pain relievers concurrently, their mean use percentages should not differ.

H0: The mean difference between pain-reliever and oxycontin use is zero.

The “alternative hypothesis”


The alternative hypothesis is the outcome of the experiment that we hope to show. In our example the alternative hypothesis is that there is in fact a mean difference in oxycontin and pain-reliever use.

H1: The parameter of interest, our mean difference between oxycontin and pain-reliever use, is different from zero.

NOTE: The null hypothesis and alternative hypothesis are concerned with the true values, or in other words the parameter of the overall population. Through the process of experimentation / hypothesis testing and statistical analysis of the results we will make an inference about this population parameter.

mean2 = drug['oxycontin_use'].mean()
distri_2 = drug['oxycontin_use'] - mean2
mean2
0.9352941176470588
mean1 = drug['painrelieve_use'].mean()
distri_1 = drug['painrelieve_use'] - mean1
mean1
6.270588235294118
mean_difference = mean1 - mean2
mean_difference
5.3352941176470585

Evaluating our experiment with a t-test and p-value


Say in our experiment we measure the following results:

  • The subjects in the oxycontin group have mean percentage of 0.935
  • The subjects in the pain-reliever group have mean percentage of 6.271

The difference between the oxycontin and pain-reliever groups is 5.335 percentage points. We can perform what is known as a t-test to evaluate this.

First we will calculate a t-statistic. The t-statistic is a measure of the degree to which our groups differ standardized by the variance of our measurements.

Secondly we will calculate a p-value. The p-value is a metric that indicates a probability that our measured difference was due to random chance in the sampling of subjects.
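For two independent samples with unequal variances (Welch’s t-test, which is what scipy’s ttest_ind computes when equal_var=False, as used below), the t-statistic is

t = (mean1 − mean2) / sqrt(s1²/n1 + s2²/n2)

where mean, s², and n are each group’s sample mean, sample variance, and sample size.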

The likelihood of the data given the null hypothesis


For our experiment we will set up a null hypothesis and an alternative hypothesis:

H0: The difference between pain-reliever and oxycontin use is 0.

H1: The difference between pain-reliever and oxycontin use is not 0.

Likewise, our measured difference is 5.335.

Recall that as Frequentists we want to know:

What is the probability that we observed this data GIVEN a specified mean difference in oxycontin and pain-reliever use?

We obviously don’t know the true mean difference in use. The whole point of the analysis is to evaluate it. Instead, we will assume that the true mean difference is zero: the null hypothesis H0 is assumed to be true:

# alpha = 0.05 corresponds to a 95% confidence level
alpha = 0.05

# Compare the raw use columns. Note: testing the demeaned series (distri_1,
# distri_2) would be meaningless, since both have mean 0 by construction.
# scipy's ttest_ind returns a two-sided p-value, so compare it to alpha directly.
results = stats.ttest_ind(drug['painrelieve_use'], drug['oxycontin_use'], equal_var=False)

if results.pvalue < alpha:
    print(f'As p-value={results.pvalue} is less than {alpha}, we reject the null hypothesis and conclude that the mean difference between oxycontin and pain-reliever use is not 0')
else:
    print(f'As p-value={results.pvalue} is not less than {alpha}, we do not reject the null hypothesis')

8. Introduction to dealing with outliers


Outliers are an interesting problem in statistics, in that there is not an agreed upon best way to define them. Subjectivity in selecting and analyzing data is a problem that will recur throughout the course.

  1. Pull out the rate variable from the sat dataset.
  2. Are there outliers in the dataset? Define, in words, how you numerically define outliers.
  3. Print out the outliers in the dataset.
  4. Remove the outliers from the dataset.
  5. Compare the mean, median, and standard deviation of the “cleaned” data without outliers to the original. What is different about them and why?

8.1.1 and 8.1.2 Identify an outlier


An outlier is a data point which is numerically distant from the majority of the data points. A common numeric definition flags points that lie more than two standard deviations from the mean, or more than 1.5 times the IQR beyond the quartiles.

sats['Rate']
0     82
1     81
2     79
3     77
4     72
5     71
6     71
7     69
8     69
9     68
10    67
11    65
12    65
13    63
14    60
15    57
16    56
17    55
18    54
19    53
20    53
21    52
22    51
23    51
24    34
25    33
26    31
27    26
28    23
29    18
30    17
31    13
32    13
33    12
34    12
35    11
36    11
37     9
38     9
39     9
40     8
41     8
42     8
43     7
44     6
45     6
46     5
47     5
48     4
49     4
50     4
Name: Rate, dtype: int64
sats['Rate'].describe()
count    51.000000
mean     37.000000
std      27.550681
min       4.000000
25%       9.000000
50%      33.000000
75%      64.000000
max      82.000000
Name: Rate, dtype: float64


Since the maximum value (82) and the minimum value (4) are both within two standard deviations (2 × 27.55 ≈ 55.1) of the mean (37), we conclude that there are no outliers under this definition. It is not necessary to remove any data points.
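The same conclusion holds under the 1.5 × IQR rule, sketched below:

# Flag points more than 1.5 * IQR outside the quartiles
q1, q3 = sats['Rate'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sats[(sats['Rate'] < q1 - 1.5 * iqr) | (sats['Rate'] > q3 + 1.5 * iqr)]
print(outliers)  # empty DataFrame: no Rate outliers under this rule either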

9. Percentile scoring and spearman rank correlation


9.1 Calculate the spearman correlation of sat Verbal and Math


  1. How does the spearman correlation compare to the pearson correlation?
  2. Describe clearly in words the process of calculating the spearman rank correlation.
    • Hint: the word “rank” is in the name of the process for a reason!
stats.pearsonr(sat['Verbal'], sat['Math'])
(0.899870852544429, 1.192002673306768e-19)
verbal_math.corr()
Verbal Math
Verbal 1.000000 0.899909
Math 0.899909 1.000000
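Note that both cells above compute the Pearson correlation. The Spearman version can be obtained from scipy; a minimal sketch:

# Spearman rank correlation of Verbal and Math
stats.spearmanr(sat['Verbal'], sat['Math'])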


Both the Pearson and Spearman correlation coefficients range between −1 and 1. The Pearson coefficient measures how closely the two variables are linearly related using their raw values, while the Spearman coefficient is based on the rank order of the values rather than their measured quantities. To calculate the Spearman rank correlation, convert each variable’s values into ranks (1 for the smallest, and so on), then compute the Pearson correlation of the two rank series.

9.2 Percentile scoring


Look up percentile scoring of data. In other words, the conversion of numeric data to their equivalent percentile scores.

http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.percentileofscore.html

  1. Convert Rate to percentiles in the sat scores as a new column.
  2. Show the percentile of California in Rate.
  3. How is percentile related to the spearman rank correlation?
sat['percentile'] = sat['Rate'].map(lambda x: stats.percentileofscore(sat['Rate'],x))
sat.head()
State Rate Verbal Math percentile
0 CT 82 509 510 100.000000
1 NJ 81 499 513 98.076923
2 MA 79 511 515 96.153846
3 NY 77 495 505 94.230769
4 NH 72 520 516 92.307692
sat[sat['State']== 'CA']
State Rate Verbal Math percentile
23 CA 51 498 517 56.730769

9.3 Percentiles and outliers


  1. Why might percentile scoring be useful for dealing with outliers?
  2. Plot the distribution of a variable of your choice from the drug use dataset.
  3. Plot the same variable but percentile scored.
  4. Describe the effect, visually, of converting raw scores to percentiles.

9.3.1 Why might percentile scoring be useful for dealing with outliers?


Percentile scoring compresses outliers: however extreme a raw value is, its percentile score is capped at 100, so an outlier can no longer dominate the scale, and we can still see where it sits relative to the rest of the data.

9.3.2 Plot the distribution of a variable of your choice from the drug use dataset


ax = drug['stimulant_frequency'].plot.hist(bins=50)
ax.set_xlabel('Frequency of Stimulant use')
Text(0.5,0,'Frequency of Stimulant use')

[Histogram of stimulant use frequency]

9.3.3 Plot the same variable but percentile scored


# Percentile-score every value against the whole column
stim_per = [stats.percentileofscore(drug['stimulant_frequency'], n)
            for n in drug['stimulant_frequency']]
ax = sns.distplot(stim_per, kde=False, bins=50)
ax.set_xlabel('Percentile of Stimulant use frequency')
ax.set_ylabel('Frequency')
Text(0,0.5,'Frequency')

[Histogram of percentile-scored stimulant use frequency]

9.3.4 Describe the effect, visually, of converting raw scores to percentiles


From the histogram of stimulant use frequency, we can identify an outlier, and the distribution does not follow a normal distribution, as the graph is not symmetrical. After percentile scoring, the outlier no longer stretches the axis; the mean percentile score of stimulant use frequency lies between 60 and 80.