The Olympic medal data for the following exercises comes from The Guardian. It comprises records of all events held at the Olympic games between 1896 and 2012.¶

https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data

import pandas as pd

medals = pd.read_csv('Summer Olympic medallists 1896 to 2008 - ALL MEDALISTS.csv',skiprows=4)
print medals.shape
medals.head()

(29216, 10)

medals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
City            29216 non-null object
Edition         29216 non-null int64
Sport           29216 non-null object
Discipline      29216 non-null object
Athlete         29216 non-null object
NOC             29216 non-null object
Gender          29216 non-null object
Event           29216 non-null object
Event_gender    29216 non-null object
Medal           29216 non-null object
dtypes: int64(1), object(9)
memory usage: 2.2+ MB

Using .value_counts() for ranking¶

pandas Series method .value_counts()
- determine the top 15 countries ranked by total number of medals.
  - .value_counts() sorts by values by default.
    - The result is returned as a Series of counts indexed by unique entries from the original Series with values (counts) ranked in descending order.

# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']
print type(country_names)
country_names.head()

<class 'pandas.core.series.Series'>

0    HUN
1    AUT
2    GRE
3    GRE
4    GRE
Name: NOC, dtype: object

# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()
print type(medal_counts)
medal_counts.head()

<class 'pandas.core.series.Series'>

USA    4335
URS    2049
GBR    1594
FRA    1314
ITA    1228
Name: NOC, dtype: int64

Using .pivot_table() to count medals by type¶

Rather than ranking countries by total medals won and showing that list, you may want to see a bit more detail.
- You can use a pivot table to compute how many separate bronze, silver and gold medals each country won.
  - That pivot table can then be used to repeat the previous computation to rank by total medals won.

medals.head(3)

# Construct the pivot table: counted
counted = medals.pivot_table(index='NOC', values='Athlete', columns='Medal', aggfunc='count')
counted.head(7)

# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')
counted.head(7)

# Sort counted by the 'totals' column
counted = counted.sort_values('totals', ascending=False)
counted.head(7)

Applying .drop_duplicates()¶

What could be the difference between the 'Event_gender' and 'Gender' columns?

you should be able to evaluate your guess by looking at the unique values of the pairs (Event_gender, Gender)

The duplicates can be dropped using the .drop_duplicates() method, leaving behind the unique observations.

# Select columns: ev_gen
ev_gen = medals[['Event_gender', 'Gender']]

print ev_gen.shape
ev_gen.head()

(29216, 2)

# Drop duplicate pairs: ev_gen_uniques
ev_gen_uniques = ev_gen.drop_duplicates()
ev_gen_uniques

Finding possible errors with .groupby()¶

you will now use .groupby() to continue your exploration. Your job is to group by 'Event_gender' and 'Gender' and count the rows.

You will see that there is only one suspicious row: This is likely a data error.

# Group medals by the two columns: medals_by_gender
medals_by_gender = medals.groupby(['Event_gender', 'Gender'])

# Create a DataFrame with a group count: medal_count_by_gender
medal_count_by_gender = medals_by_gender.count()

medal_count_by_gender

Locating suspicious data¶

# Create the Boolean Series: sus
sus = (medals.Event_gender == 'W') & (medals.Gender == 'Men')
sus.head()

0    False
1    False
2    False
3    False
4    False
dtype: bool

# Create a DataFrame with the suspicious row: suspect
suspect = medals[sus] 
suspect

Constructing alternative country rankings¶

Counting distinct events¶

print len(medals['Sport'].unique())
medals['Sport'].unique()

42

array(['Aquatics', 'Athletics', 'Cycling', 'Fencing', 'Gymnastics',
       'Shooting', 'Tennis', 'Weightlifting', 'Wrestling', 'Archery',
       'Basque Pelota', 'Cricket', 'Croquet', 'Equestrian', 'Football',
       'Golf', 'Polo', 'Rowing', 'Rugby', 'Sailing', 'Tug of War',
       'Boxing', 'Lacrosse', 'Roque', 'Hockey', 'Jeu de paume', 'Rackets',
       'Skating', 'Water Motorsports', 'Modern Pentathlon', 'Ice Hockey',
       'Basketball', 'Canoe / Kayak', 'Handball', 'Judo', 'Volleyball',
       'Table Tennis', 'Badminton', 'Baseball', 'Softball', 'Taekwondo',
       'Triathlon'], dtype=object)

# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby('NOC')

# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped['Sport'].nunique()
Nsports.head()

NOC
AFG     1
AHO     1
ALG     3
ANZ     5
ARG    17
Name: Sport, dtype: int64

country_grouped['Sport'].count().head()

NOC
AFG      1
AHO      1
ALG     14
ANZ     29
ARG    239
Name: Sport, dtype: int64

# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending=False)
Nsports.head()

NOC
USA    34
GBR    31
FRA    28
GER    26
CHN    24
Name: Sport, dtype: int64

country_grouped['Sport'].count().sort_values(ascending = False).head()

NOC
USA    4335
URS    2049
GBR    1594
FRA    1314
ITA    1228
Name: Sport, dtype: int64

Ranking of distinct events¶

Top five countries that have won medals in the most sports
Compare medal counts of USA and USSR from 1952 to 1988

Two new DataFrame methods¶

idxmax(): Row or column label where maximum value is located
idxmin(): Row or column label where minimum value is located

count = medals.groupby('NOC')['Medal'].count()
count.head()

NOC
AFG      1
AHO      1
ALG     14
ANZ     29
ARG    239
Name: Medal, dtype: int64

count.index

Index([u'AFG', u'AHO', u'ALG', u'ANZ', u'ARG', u'ARM', u'AUS', u'AUT', u'AZE',
       u'BAH',
       ...
       u'URS', u'URU', u'USA', u'UZB', u'VEN', u'VIE', u'YUG', u'ZAM', u'ZIM',
       u'ZZX'],
      dtype='object', name=u'NOC', length=138)

count.idxmax()

'USA'

count.idxmin()

'AFG'

Using idxmax() along columns¶

import numpy as np

count2 = medals.groupby('NOC')['Medal','Event'].count()
count2['Event']=count2['Event'].apply(lambda x: x+np.random.randint(-10,5))
count2.head()

count2.head()

count2.idxmax(1).value_counts()

Medal    105
Event     33
dtype: int64

count2.idxmax(axis='columns').head()

NOC
AFG    Medal
AHO    Medal
ALG    Medal
ANZ    Event
ARG    Medal
dtype: object

filtering¶

Counting USA vs. USSR Cold War Olympic Sports¶

The Olympic competitions between 1952 and 1988 took place during the height of the Cold War between the United States of America (USA) & the Union of Soviet Socialist Republics (USSR).
- Your goal in this exercise is to aggregate the number of distinct sports in which the USA and the USSR won medals during the Cold War years.

# Extract all rows for which the 'Edition' is between 1952 & 1988: during_cold_war
during_cold_war = (medals.Edition>=1952) & (medals.Edition<=1988)
print during_cold_war.value_counts()
during_cold_war.head()

False    17637
True     11579
Name: Edition, dtype: int64

0    False
1    False
2    False
3    False
4    False
Name: Edition, dtype: bool

# Extract rows for which 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(['USA', 'URS'])
print is_usa_urs.value_counts()
is_usa_urs.head()

False    22832
True      6384
Name: NOC, dtype: int64

0    False
1    False
2    False
3    False
4    False
Name: NOC, dtype: bool

# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]
print cold_war_medals.shape
cold_war_medals.head()

(3544, 10)

Create a Series Nsports from country_grouped using indexing & chained methods:¶

Extract the column 'Sport'; ### Use .nunique() to get the number of unique elements in each group;

Apply .sort_values(ascending=False) to rearrange the Series.

# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby('NOC')

# Create Nsports
Nsports = country_grouped['Sport'].nunique().sort_values(ascending=False)

print Nsports.shape
Nsports

(2L,)

NOC
URS    21
USA    20
Name: Sport, dtype: int64

country_grouped['Sport'].nunique().head()

NOC
URS    21
USA    20
Name: Sport, dtype: int64

cold_war_medals.groupby('NOC').count()

Counting USA vs. USSR Cold War Olympic Medals¶

see which country, the USA or the USSR, won the most medals consistently over the Cold War period.
- ou'll need a pivot table with years ('Edition') on the index and countries ('NOC') on the columns. The entries will be the total number of medals each country won that year. If the country won no medals in a given edition, expect a NaN in that entry of the pivot table.

# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count')
medals_won_by_country.head()

### You'll need to slice the Cold War period and subset the 'USA' and 'URS' columns.

# Slice medals_won_by_country: cold_war_usa_usr_medals
cold_war_usa_usr_medals = medals_won_by_country.loc[1952:1988, ['USA','URS']]
cold_war_usa_usr_medals

### You'll need to make a Series from this slice of the pivot table that tells which country won the most medals in that edition using .idxmax(axis='columns').

If .max() returns the maximum value of Series or 1D array, .idxmax() returns the index of the maximizing element.
- The argument axis=columns or axis=1 is required because, by default, this aggregation would be done along columns for a DataFrame.
  - The final Series contains either 'USA' or 'URS' according to which country won the most medals in each Olympic edition.
    - You can use .value_counts() to count the number of occurrences of each.

# Create most_medals
most_medals = cold_war_usa_usr_medals.idxmax(axis='columns')
most_medals

Edition
1952    USA
1956    URS
1960    URS
1964    URS
1968    URS
1972    URS
1976    URS
1980    URS
1984    USA
1988    URS
dtype: object

Visualizing¶

USA Medal Counts by Edition: Line Plot

import matplotlib.pyplot as plt
%matplotlib inline

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
usa_medals_by_year.head(10)

Edition  Medal 
1896     Bronze      2
         Gold       11
         Silver      7
1900     Bronze     14
         Gold       27
         Silver     14
1904     Bronze    111
         Gold      146
         Silver    137
1908     Bronze     15
Name: Athlete, dtype: int64

unstack data before plotting¶

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
usa_medals_by_year.head(10)

# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
plt.show()

Visualizing¶

USA Medal Counts by Edition: Area Plot

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by 'Edition', 'Medal', and 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area(alpha=.5)
plt.show()

Visualizing¶

USA Medal Counts by Edition: Area Plot with Ordered Medals
- You may have noticed that the medals are ordered according to a lexicographic (dictionary) ordering: Bronze < Gold < Silver.
  - However, you would prefer an ordering consistent with the Olympic rules: Bronze < Silver < Gold.
    - You can achieve this using Categorical types.

# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values=medals.Medal, categories=['Bronze', 'Silver', 'Gold'], ordered=True)
medals.Medal.head()

0      Gold
1    Silver
2    Bronze
3      Gold
4    Silver
Name: Medal, dtype: category
Categories (3, object): [Bronze < Silver < Gold]

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by 'Edition', 'Medal', and 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area(alpha=.6)
plt.show()

Medal	Bronze	Gold	Silver
NOC
AFG	1.0	NaN	NaN
AHO	NaN	NaN	1.0
ALG	8.0	4.0	2.0
ANZ	5.0	20.0	4.0
ARG	88.0	68.0	83.0
ARM	7.0	1.0	1.0
AUS	413.0	293.0	369.0

Medal	Bronze	Gold	Silver	totals
NOC
AFG	1.0	NaN	NaN	2.0
AHO	NaN	NaN	1.0	2.0
ALG	8.0	4.0	2.0	28.0
ANZ	5.0	20.0	4.0	58.0
ARG	88.0	68.0	83.0	478.0
ARM	7.0	1.0	1.0	18.0
AUS	413.0	293.0	369.0	2150.0

Medal	Bronze	Gold	Silver	totals
NOC
USA	1052.0	2088.0	1195.0	8670.0
URS	584.0	838.0	627.0	4098.0
GBR	505.0	498.0	591.0	3188.0
FRA	475.0	378.0	461.0	2628.0
ITA	374.0	460.0	394.0	2456.0
GER	454.0	407.0	350.0	2422.0
AUS	413.0	293.0	369.0	2150.0

	Event_gender	Gender
0	M	Men
348	X	Men
416	W	Women
639	X	Women
23675	W	Men

		City	Edition	Sport	Discipline	Athlete	NOC	Event	Medal
Event_gender	Gender
M	Men	20067	20067	20067	20067	20067	20067	20067	20067
W	Men	1	1	1	1	1	1	1	1
W	Women	7277	7277	7277	7277	7277	7277	7277	7277
X	Men	1653	1653	1653	1653	1653	1653	1653	1653
X	Women	218	218	218	218	218	218	218	218

	City	Edition	Sport	Discipline	Athlete	NOC	Gender	Event	Event_gender	Medal
0	Athens	1896	Aquatics	Swimming	HAJOS, Alfred	HUN	Men	100m freestyle	M	Gold
1	Athens	1896	Aquatics	Swimming	HERSCHMANN, Otto	AUT	Men	100m freestyle	M	Silver
2	Athens	1896	Aquatics	Swimming	DRIVAS, Dimitrios	GRE	Men	100m freestyle for sailors	M	Bronze
3	Athens	1896	Aquatics	Swimming	MALOKINIS, Ioannis	GRE	Men	100m freestyle for sailors	M	Gold
4	Athens	1896	Aquatics	Swimming	CHASAPIS, Spiridon	GRE	Men	100m freestyle for sailors	M	Silver

	City	Edition	Sport	Discipline	Athlete	NOC	Gender	Event	Event_gender	Medal
8019	Helsinki	1952	Aquatics	Diving	LEE, Samuel	USA	Men	10m platform	M	Gold
8021	Helsinki	1952	Aquatics	Diving	STOVER-IRWIN, Juno Roslays	USA	Women	10m platform	W	Bronze
8022	Helsinki	1952	Aquatics	Diving	MCCORMICK, Patricia	USA	Women	10m platform	W	Gold
8023	Helsinki	1952	Aquatics	Diving	MYERS-POPE, Paula Jean	USA	Women	10m platform	W	Silver
8024	Helsinki	1952	Aquatics	Diving	CLOTWORTHY, Robert Lynn	USA	Men	3m springboard	M	Bronze

	City	Edition	Sport	Discipline	Athlete	Gender	Event	Event_gender	Medal
NOC
URS	2049	2049	2049	2049	2049	2049	2049	2049	2049
USA	1495	1495	1495	1495	1495	1495	1495	1495	1495

NOC	USA	URS
Edition
1952	130.0	117.0
1956	118.0	169.0
1960	112.0	169.0
1964	150.0	174.0
1968	149.0	188.0
1972	155.0	211.0
1976	155.0	285.0
1980	NaN	442.0
1984	333.0	NaN
1988	193.0	294.0

Medal	Bronze	Gold	Silver
Edition
1896	2	11	7
1900	14	27	14
1904	111	146	137
1908	15	34	14
1912	31	45	25
1920	38	110	45
1924	51	110	37
1928	17	43	24
1932	59	78	44
1936	15	51	26