Importing Data in Pandas

Pandas supports a number of data formats out of the box including:

CSV, Excel
JSON
HDF5
SQL databases
and others

The major benefit of using Pandas to load this data is that it provides a simple, consistent mechanism for each format and loads the data directly into a Pandas DataFrame, reducing the need to go elsewhere to perform the same operations with more code or overhead.

Pandas I/O supports loading these data formats directly from local storage or using a URL containing such data.

Importing Pandas

You will most often load the Pandas library with the following line:

In [1]:

import pandas as pd

Loading CSV and Excel

CSV

CSV files are still a staple in data file formats. They're portable, flexible, flat, usually easy to parse and ubiquitous. We will begin by showing how to use Pandas to load CSV directly into a DataFrame.

DATA SOURCE

US Baseball Statistics Archive by Sean Lahman (CC BY-SA 3.0)

We have put the dataset for batting data into our local datasets folder.

Loading this into a Pandas DataFrame will require us to use the read_csv function, which will attempt to load the CSV data directly into the DataFrame.

In [2]:

df = pd.read_csv("./datasets/Batting.csv")

If we inspect this DataFrame, we will get exactly what we expect -- each row corresponds to a line in the file. NOTE: where values are missing, Pandas automatically fills them in with NaN.

In [3]:

df

Out[3]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
0	abercda01	1871	1	TRO	NaN	1	4	0	0	0	...	0.0	0.0	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN
1	addybo01	1871	1	RC1	NaN	25	118	30	32	6	...	13.0	8.0	1.0	4	0.0	NaN	NaN	NaN	NaN	NaN
2	allisar01	1871	1	CL1	NaN	29	137	28	40	4	...	19.0	3.0	1.0	2	5.0	NaN	NaN	NaN	NaN	NaN
3	allisdo01	1871	1	WS3	NaN	27	133	28	44	10	...	27.0	1.0	1.0	0	2.0	NaN	NaN	NaN	NaN	NaN
4	ansonca01	1871	1	RC1	NaN	25	120	29	39	11	...	16.0	6.0	2.0	2	1.0	NaN	NaN	NaN	NaN	NaN
5	armstbo01	1871	1	FW1	NaN	12	49	9	11	2	...	5.0	0.0	1.0	0	1.0	NaN	NaN	NaN	NaN	NaN
6	barkeal01	1871	1	RC1	NaN	1	4	0	1	0	...	2.0	0.0	0.0	1	0.0	NaN	NaN	NaN	NaN	NaN
7	barnero01	1871	1	BS1	NaN	31	157	66	63	10	...	34.0	11.0	6.0	13	1.0	NaN	NaN	NaN	NaN	NaN
8	barrebi01	1871	1	FW1	NaN	1	5	1	1	1	...	1.0	0.0	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN
9	barrofr01	1871	1	BS1	NaN	18	86	13	13	2	...	11.0	1.0	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN
10	bassjo01	1871	1	CL1	NaN	22	89	18	27	1	...	18.0	0.0	1.0	3	4.0	NaN	NaN	NaN	NaN	NaN
11	battijo01	1871	1	CL1	NaN	1	3	0	0	0	...	0.0	0.0	0.0	1	0.0	NaN	NaN	NaN	NaN	NaN
12	bealsto01	1871	1	WS3	NaN	10	36	6	7	0	...	1.0	2.0	0.0	2	0.0	NaN	NaN	NaN	NaN	NaN
13	beaveed01	1871	1	TRO	NaN	3	15	7	6	0	...	5.0	2.0	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN
14	bechtge01	1871	1	PH1	NaN	20	94	24	33	9	...	21.0	4.0	0.0	2	2.0	NaN	NaN	NaN	NaN	NaN
15	bellast01	1871	1	TRO	NaN	29	128	26	32	3	...	23.0	4.0	4.0	9	2.0	NaN	NaN	NaN	NaN	NaN
16	berkena01	1871	1	PH1	NaN	1	4	0	0	0	...	0.0	0.0	0.0	0	3.0	NaN	NaN	NaN	NaN	NaN
17	berryto01	1871	1	PH1	NaN	1	4	0	1	0	...	0.0	0.0	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN
18	berthha01	1871	1	WS3	NaN	17	73	17	17	1	...	8.0	3.0	1.0	4	2.0	NaN	NaN	NaN	NaN	NaN
19	biermch01	1871	1	FW1	NaN	1	2	0	0	0	...	0.0	0.0	0.0	1	0.0	NaN	NaN	NaN	NaN	NaN
20	birdge01	1871	1	RC1	NaN	25	106	19	28	2	...	13.0	1.0	0.0	3	2.0	NaN	NaN	NaN	NaN	NaN
21	birdsda01	1871	1	BS1	NaN	29	152	51	46	3	...	24.0	6.0	0.0	4	4.0	NaN	NaN	NaN	NaN	NaN
22	brainas01	1871	1	WS3	NaN	30	134	24	30	4	...	21.0	4.0	0.0	7	2.0	NaN	NaN	NaN	NaN	NaN
23	brannmi01	1871	1	CH1	NaN	3	14	2	1	0	...	0.0	0.0	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN
24	burrohe01	1871	1	WS3	NaN	12	63	11	15	2	...	14.0	0.0	0.0	1	1.0	NaN	NaN	NaN	NaN	NaN
25	careyto01	1871	1	FW1	NaN	19	87	16	20	2	...	10.0	5.0	0.0	2	1.0	NaN	NaN	NaN	NaN	NaN
26	carleji01	1871	1	CL1	NaN	29	127	31	32	8	...	18.0	2.0	1.0	8	3.0	NaN	NaN	NaN	NaN	NaN
27	conefr01	1871	1	BS1	NaN	19	77	17	20	3	...	16.0	12.0	1.0	8	2.0	NaN	NaN	NaN	NaN	NaN
28	connone01	1871	1	TRO	NaN	7	33	6	7	0	...	2.0	0.0	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN
29	cravebi01	1871	1	TRO	NaN	27	118	26	38	8	...	26.0	6.0	3.0	3	0.0	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
102786	wittgni01	2016	1	MIA	NL	48	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102787	wolteto01	2016	1	COL	NL	71	205	27	53	15	...	30.0	4.0	1.0	21	53.0	2.0	0.0	4.0	0.0	1.0
102788	wongko01	2016	1	SLN	NL	121	313	39	75	7	...	23.0	7.0	0.0	34	52.0	2.0	9.0	0.0	5.0	3.0
102789	woodal02	2016	1	LAN	NL	15	16	2	4	0	...	2.0	0.0	0.0	1	9.0	0.0	0.0	2.0	0.0	0.0
102790	woodbl01	2016	1	CIN	NL	70	2	0	0	0	...	0.0	0.0	0.0	0	2.0	0.0	0.0	0.0	0.0	0.0
102791	woodtr01	2016	1	CHN	NL	81	11	0	2	0	...	1.0	0.0	0.0	1	5.0	0.0	0.0	0.0	0.0	0.0
102792	worleva01	2016	1	BAL	AL	35	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102793	worthda01	2016	1	HOU	AL	16	39	4	7	2	...	1.0	0.0	0.0	1	6.0	0.0	0.0	0.0	0.0	1.0
102794	wrighda03	2016	1	NYN	NL	37	137	18	31	8	...	14.0	3.0	2.0	26	55.0	0.0	0.0	0.0	0.0	0.0
102795	wrighda04	2016	1	CIN	NL	4	5	0	0	0	...	0.0	0.0	0.0	0	2.0	0.0	0.0	1.0	0.0	0.0
102796	wrighda04	2016	2	LAA	AL	5	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102797	wrighmi01	2016	1	BAL	AL	18	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102798	wrighst01	2016	1	BOS	AL	25	4	0	0	0	...	0.0	0.0	0.0	0	3.0	0.0	0.0	0.0	0.0	0.0
102799	yateski01	2016	1	NYA	AL	41	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102800	yelicch01	2016	1	MIA	NL	155	578	78	172	38	...	98.0	9.0	4.0	72	138.0	4.0	4.0	0.0	5.0	20.0
102801	ynoaga01	2016	1	NYN	NL	10	3	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102802	ynoami01	2016	1	CHA	AL	23	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102803	ynoara01	2016	1	COL	NL	3	5	0	0	0	...	0.0	0.0	0.0	0	2.0	0.0	0.0	0.0	0.0	0.0
102804	youngch03	2016	1	KCA	AL	34	1	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102805	youngch04	2016	1	BOS	AL	76	203	29	56	18	...	24.0	4.0	2.0	21	50.0	0.0	3.0	0.0	0.0	4.0
102806	younger03	2016	1	NYA	AL	6	1	2	0	0	...	0.0	1.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102807	youngma03	2016	1	ATL	NL	8	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102808	zastrro01	2016	1	CHN	NL	8	3	0	0	0	...	0.0	0.0	0.0	0	2.0	0.0	0.0	0.0	0.0	0.0
102809	zieglbr01	2016	1	ARI	NL	36	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102810	zieglbr01	2016	2	BOS	AL	33	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102811	zimmejo02	2016	1	DET	AL	19	4	0	1	0	...	0.0	0.0	0.0	0	2.0	0.0	0.0	1.0	0.0	0.0
102812	zimmery01	2016	1	WAS	NL	115	427	60	93	18	...	46.0	4.0	1.0	29	104.0	1.0	5.0	0.0	6.0	12.0
102813	zobribe01	2016	1	CHN	NL	147	523	94	142	31	...	76.0	6.0	4.0	96	82.0	6.0	4.0	4.0	4.0	17.0
102814	zuninmi01	2016	1	SEA	AL	55	164	16	34	7	...	31.0	0.0	0.0	21	65.0	0.0	6.0	0.0	1.0	0.0
102815	zychto01	2016	1	SEA	AL	12	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0

102816 rows × 22 columns
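The NaN filling noted above can be seen on a much smaller input. The following is a minimal sketch that assumes a made-up three-row CSV held in memory (the player IDs and values are hypothetical, not taken from the Lahman data). It also suggests why columns such as RBI display as 0.0 rather than 0 in the output above: a column containing NaN is stored as floating point.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the second row has no value for RBI.
csv_text = "playerID,G,RBI\naaa01,10,3\nbbb01,5,\nccc01,7,1\n"

# read_csv accepts a file-like object as well as a path or URL.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)                 # G stays int64; RBI becomes float64
print(sample["RBI"].isna().sum())    # number of missing RBI values
```

Here isna() is used to count the missing entries; the same check works on any column of the batting DataFrame.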

We will soon learn that Pandas supports some typical "Pythonic" use cases for accessing data. The first we will encounter is len(): the standard Python len() function returns the size of this dataset (in rows), exactly as we would expect.

In [4]:

len(df)

Out[4]:

102816

Every DataFrame will have a columns attribute, which contains the column index for our dataset. Thus, getting the length of that attribute returns, again, what we expect.

In [5]:

df.columns

Out[5]:

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H',
       '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH',
       'SF', 'GIDP'],
      dtype='object')

In [6]:

len(df.columns)

Out[6]:

22

If we want both the row and column counts, DataFrame.shape returns them together as a tuple:

In [7]:

df.shape

Out[7]:

(102816, 22)

This returns what we expect (yet again): the row count followed by the column count.
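As a quick check of how these three calls relate, here is a small sketch on a hypothetical three-row DataFrame (the values are illustrative only). It shows that shape is simply the pair of the two lengths.

```python
import pandas as pd

# Hypothetical miniature of the batting data
small = pd.DataFrame({"yearID": [1871, 1871, 2016], "G": [1, 25, 155]})

n_rows, n_cols = small.shape          # shape unpacks as (rows, columns)

print(n_rows == len(small))           # len() of a DataFrame counts rows
print(n_cols == len(small.columns))   # len() of the column index counts columns
```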

Accessing column data by label

One of the nice things about Pandas is that we can reference columns of data by their names (or labels). For example, we have a yearID label, a teamID label, a G label for game counts, and so on. To learn in detail what each label in our dataset means, see the documentation that accompanies the data source.

In [10]:

df.yearID[:10]

Out[10]:

0    1871
1    1871
2    1871
3    1871
4    1871
5    1871
6    1871
7    1871
8    1871
9    1871
Name: yearID, dtype: int64

In [11]:

df.G[-10:]

Out[11]:

102806      6
102807      8
102808      8
102809     36
102810     33
102811     19
102812    115
102813    147
102814     55
102815     12
Name: G, dtype: int64
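Attribute-style access such as df.G is convenient, but it only works when the label is a valid Python identifier. The sketch below uses a hypothetical two-row DataFrame to show the bracket form, which works for any label, including 2B from our batting columns.

```python
import pandas as pd

# Hypothetical two-row frame reusing three of the batting column names
small = pd.DataFrame({"yearID": [1871, 2016], "G": [1, 155], "2B": [0, 38]})

by_attr = small.G        # attribute access: fine for identifier-like labels
by_label = small["G"]    # bracket access: works for every label

# "2B" starts with a digit, so small.2B would be a SyntaxError;
# bracket access is the only option here.
doubles = small["2B"]

print(by_attr.equals(by_label))
print(doubles.tolist())
```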

Let's say we want all the player data for the Washington Nationals from 2015 and 2016 where a player played in 100 or more games:

In [12]:

df_was = df[(df.yearID > 2014) & (df.teamID=='WAS') & (df.G > 99)]
df_was

Out[12]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
100193	desmoia01	2015	1	WAS	NL	156	583	69	136	27	...	62.0	13.0	5.0	45	187.0	0.0	3.0	6.0	4.0	9.0
100250	escobyu01	2015	1	WAS	NL	139	535	75	168	25	...	56.0	2.0	2.0	45	70.0	0.0	8.0	1.0	2.0	24.0
100251	espinda01	2015	1	WAS	NL	118	367	59	88	21	...	37.0	5.0	2.0	33	106.0	5.0	6.0	3.0	3.0	6.0
100422	harpebr03	2015	1	WAS	NL	153	521	118	172	38	...	99.0	6.0	4.0	124	131.0	15.0	5.0	0.0	4.0	15.0
100950	ramoswi01	2015	1	WAS	NL	128	475	41	109	16	...	68.0	0.0	0.0	21	101.0	2.0	0.0	0.0	8.0	16.0
100993	robincl01	2015	1	WAS	NL	126	309	44	84	15	...	34.0	0.0	0.0	37	52.0	4.0	5.0	0.0	1.0	6.0
101176	taylomi02	2015	1	WAS	NL	138	472	49	108	15	...	63.0	16.0	3.0	35	158.0	9.0	1.0	1.0	2.0	5.0
101725	espinda01	2016	1	WAS	NL	157	516	66	108	15	...	72.0	9.0	2.0	54	174.0	12.0	20.0	7.0	4.0	4.0
101895	harpebr03	2016	1	WAS	NL	147	506	84	123	24	...	86.0	21.0	10.0	108	117.0	20.0	3.0	0.0	10.0	11.0
102245	murphda08	2016	1	WAS	NL	142	531	88	184	47	...	104.0	5.0	3.0	35	57.0	10.0	8.0	0.0	8.0	4.0
102429	ramoswi01	2016	1	WAS	NL	131	482	58	148	25	...	80.0	0.0	0.0	35	79.0	2.0	2.0	0.0	4.0	17.0
102449	rendoan01	2016	1	WAS	NL	156	567	91	153	38	...	85.0	12.0	6.0	65	117.0	2.0	7.0	0.0	8.0	5.0
102451	reverbe01	2016	1	WAS	NL	103	350	44	76	9	...	24.0	14.0	5.0	18	34.0	0.0	3.0	2.0	2.0	12.0
102472	robincl01	2016	1	WAS	NL	104	196	16	46	4	...	26.0	0.0	0.0	20	38.0	0.0	2.0	1.0	5.0	4.0
102763	werthja01	2016	1	WAS	NL	143	525	84	128	28	...	69.0	5.0	1.0	71	139.0	0.0	4.0	0.0	6.0	17.0
102812	zimmery01	2016	1	WAS	NL	115	427	60	93	18	...	46.0	4.0	1.0	29	104.0	1.0	5.0	0.0	6.0	12.0

16 rows × 22 columns
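The filter above combines three boolean Series. To make the mechanics easier to follow, here is a sketch on a hypothetical four-row DataFrame (player IDs and values are made up), together with an equivalent DataFrame.query() call as one possible alternative spelling.

```python
import pandas as pd

# Hypothetical rows; only the second one satisfies all three conditions
small = pd.DataFrame({
    "playerID": ["a01", "b01", "c01", "d01"],
    "yearID": [2014, 2015, 2016, 2016],
    "teamID": ["WAS", "WAS", "WAS", "NYN"],
    "G": [150, 120, 90, 140],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and ==.
mask = (small.yearID > 2014) & (small.teamID == "WAS") & (small.G > 99)
filtered = small[mask]

# The same filter written as a query string
queried = small.query("yearID > 2014 and teamID == 'WAS' and G > 99")

print(filtered)
```

Both forms select the same rows; which one reads better is largely a matter of taste.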

Excel

Loading Excel data is nearly as easy as CSV data. This time we'll use a different data source and show how to access it in a slightly different manner. Instead of the local file source, we will use a remote URL for the resource. This will show us exactly how easy it is to seamlessly interchange various data resources.

DATA SOURCES

US Bureau of Transportation Statistics | Airline Employment Data which includes data for year-over-year percentage change in employment for workers in the passenger airline industry

To read the data set, we will access it by URL with the pandas.read_excel() function. Note that we pass sheet_name=None (spelled sheetname in older Pandas releases) so that every sheet is read and stored under its own key in a dictionary for easy lookup by sheet name.

In [13]:

xl = pd.read_excel(
    "https://www.bts.gov/sites/bts.dot.gov/files/docs/newsroom/206581/airline-employment-press-tables-web.xlsx",
    sheet_name=None)

Notice now, if we want to access the sheet called Table1 we can easily do this in a Pythonic way much like any other dictionary. The result is the DataFrame representation of that sheet.

In [14]:

xl_tbl1 = xl['Table1']
xl_tbl1

Out[14]:

	Table 1: Yearly Change in Scheduled Passenger Airline Full-time Equivalent Employees* by Airline Group	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5
0	Most recent 13 months - percent change from sa...	NaN	NaN	NaN	NaN	NaN
1	NaN	Network Airlines	Low-Cost Airlines	Regional Airlines	Other Airlines	All Passenger Airlines **
2	May 2015 - May 2016	2.3	10.7	0.2	9.3	3.7
3	Jun 2015 - Jun 2016	2.3	11	0.9	10.6	3.9
4	Jul 2015 - Jul 2016	2.4	11.3	3.3	11.2	4.3
5	Aug 2015 - Aug 2016	2.5	11	3.3	11.9	4.3
6	Sep 2015 - Sep 2016	2.6	10.6	2.9	13	4.3
7	Oct 2015 - Oct 2016	2.7	10.3	0.3	12.7	4
8	Nov 2015 - Nov 2016	2.3	9.8	0.2	13.5	3.7
9	Dec 2015 - Dec 2016	2.4	9.5	0.2	13.7	3.7
10	Jan 2016 - Jan 2017	2.3	9.7	1.9	12.7	3.9
11	Feb 2016 - Feb 2017	2.4	9.4	2.4	11.8	3.9
12	Mar 2016 - Mar 2017	2.7	9.1	2	11.7	4
13	Apr 2016 - Apr 2017	2.6	8.5	2.1	10.7	3.9
14	May 2016 - May 2017	2.4	8.3	2.5	4.2	3.6
15	Source: Bureau of Transportation Statistics	NaN	NaN	NaN	NaN	NaN
16	* Full-time Equivalent Employee (FTE) calculat...	NaN	NaN	NaN	NaN	NaN
17	** Includes network, low-cost, regional and ot...	NaN	NaN	NaN	NaN	NaN
18	Note: Percent changes based on numbers prior t...	NaN	NaN	NaN	NaN	NaN
19	Note: See Table 2 for all passenger airlines, ...	NaN	NaN	NaN	NaN	NaN

One problem we have here is that the data is not as clean as we want it to be. We'll spend more time talking about the iloc indexer in the next section, but for now, let's get a flavor for how we might clean this up so it is more usable.

In [15]:

# let's select the cells that will become the row index
idx = xl_tbl1.iloc[2:15, 0:1]

# let's select the cells that will become the column labels
col = xl_tbl1.iloc[1,1:]

print(idx)
print(col)

   Table 1: Yearly Change in Scheduled Passenger Airline Full-time Equivalent Employees* by Airline Group
2                                 May 2015 - May 2016                                                    
3                                 Jun 2015 - Jun 2016                                                    
4                                 Jul 2015 - Jul 2016                                                    
5                                 Aug 2015 - Aug 2016                                                    
6                                 Sep 2015 - Sep 2016                                                    
7                                 Oct 2015 - Oct 2016                                                    
8                                 Nov 2015 - Nov 2016                                                    
9                                 Dec 2015 - Dec 2016                                                    
10                                Jan 2016 - Jan 2017                                                    
11                                Feb 2016 - Feb 2017                                                    
12                                Mar 2016 - Mar 2017                                                    
13                                Apr 2016 - Apr 2017                                                    
14                                May 2016 - May 2017                                                    
Unnamed: 1             Network Airlines
Unnamed: 2            Low-Cost Airlines
Unnamed: 3            Regional Airlines
Unnamed: 4               Other Airlines
Unnamed: 5    All Passenger Airlines **
Name: 1, dtype: object

In [16]:

# we'll create the index object
idxs = pd.Index([v[0] for v in idx.values])
idxs

Out[16]:

Index(['May 2015 - May 2016', 'Jun 2015 - Jun 2016', 'Jul 2015 - Jul 2016',
       'Aug 2015 - Aug 2016', 'Sep 2015 - Sep 2016', 'Oct 2015 - Oct 2016',
       'Nov 2015 - Nov 2016', 'Dec 2015 - Dec 2016', 'Jan 2016 - Jan 2017',
       'Feb 2016 - Feb 2017', 'Mar 2016 - Mar 2017', 'Apr 2016 - Apr 2017',
       'May 2016 - May 2017'],
      dtype='object')

In [17]:

# set the columns
cols = [v for v in col.values]
cols

Out[17]:

['Network Airlines',
 'Low-Cost Airlines',
 'Regional Airlines',
 'Other Airlines',
 'All Passenger Airlines **']

In [18]:

# now for the data
data = xl_tbl1.iloc[2:15,1:].values
data

Out[18]:

array([[2.3, 10.7, 0.2, 9.3, 3.7],
       [2.3, 11, 0.9, 10.6, 3.9],
       [2.4, 11.3, 3.3, 11.2, 4.3],
       [2.5, 11, 3.3, 11.9, 4.3],
       [2.6, 10.6, 2.9, 13, 4.3],
       [2.7, 10.3, 0.3, 12.7, 4],
       [2.3, 9.8, 0.2, 13.5, 3.7],
       [2.4, 9.5, 0.2, 13.7, 3.7],
       [2.3, 9.7, 1.9, 12.7, 3.9],
       [2.4, 9.4, 2.4, 11.8, 3.9],
       [2.7, 9.1, 2, 11.7, 4],
       [2.6, 8.5, 2.1, 10.7, 3.9],
       [2.4, 8.3, 2.5, 4.2, 3.6]], dtype=object)

In [19]:

# putting it all together ...
df_tbl1 = pd.DataFrame(data=data,      # values from In [18]
                       columns=cols,   # column labels from In [17]
                       index=idxs)     # row index from In [16]
df_tbl1

Out[19]:

	Network Airlines	Low-Cost Airlines	Regional Airlines	Other Airlines	All Passenger Airlines **
May 2015 - May 2016	2.3	10.7	0.2	9.3	3.7
Jun 2015 - Jun 2016	2.3	11	0.9	10.6	3.9
Jul 2015 - Jul 2016	2.4	11.3	3.3	11.2	4.3
Aug 2015 - Aug 2016	2.5	11	3.3	11.9	4.3
Sep 2015 - Sep 2016	2.6	10.6	2.9	13	4.3
Oct 2015 - Oct 2016	2.7	10.3	0.3	12.7	4
Nov 2015 - Nov 2016	2.3	9.8	0.2	13.5	3.7
Dec 2015 - Dec 2016	2.4	9.5	0.2	13.7	3.7
Jan 2016 - Jan 2017	2.3	9.7	1.9	12.7	3.9
Feb 2016 - Feb 2017	2.4	9.4	2.4	11.8	3.9
Mar 2016 - Mar 2017	2.7	9.1	2	11.7	4
Apr 2016 - Apr 2017	2.6	8.5	2.1	10.7	3.9
May 2016 - May 2017	2.4	8.3	2.5	4.2	3.6

In [20]:

df_tbl1['Network Airlines']

Out[20]:

May 2015 - May 2016    2.3
Jun 2015 - Jun 2016    2.3
Jul 2015 - Jul 2016    2.4
Aug 2015 - Aug 2016    2.5
Sep 2015 - Sep 2016    2.6
Oct 2015 - Oct 2016    2.7
Nov 2015 - Nov 2016    2.3
Dec 2015 - Dec 2016    2.4
Jan 2016 - Jan 2017    2.3
Feb 2016 - Feb 2017    2.4
Mar 2016 - Mar 2017    2.7
Apr 2016 - Apr 2017    2.6
May 2016 - May 2017    2.4
Name: Network Airlines, dtype: object
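Notice that the column above still reports dtype: object, because the values came from a sheet that mixed text and numbers. The following sketch shows one way to get numeric columns, using a hypothetical miniature of the raw sheet (a title row, a header row, two data rows and a footnote) rather than the real download. The frame raw and its contents are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of the raw sheet layout
raw = pd.DataFrame([
    ["Most recent months", None, None],
    [None, "Network Airlines", "Low-Cost Airlines"],
    ["May 2015 - May 2016", 2.3, 10.7],
    ["Jun 2015 - Jun 2016", 2.3, 11],
    ["Source: footnote text", None, None],
])

body = raw.iloc[2:4]                      # keep the data rows only
tidy = body.set_index(0)                  # first column becomes the row index
tidy.columns = raw.iloc[1, 1:].tolist()   # header row becomes the column labels
tidy.index.name = None
tidy = tidy.apply(pd.to_numeric)          # convert each object column to float64

print(tidy.dtypes)
```

With numeric columns, operations such as mean() or plotting behave as expected instead of treating the values as generic Python objects.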

JSON

JSON has become a standard format for many web data sources. It is succinct, readable and very portable -- there are libraries in nearly every modern language that can parse JSON, Python being no exception. We'll load a remote JSON data source to demonstrate remote access as well as the capabilities of using Pandas to load such a source.

JSON DATA SOURCE

Quotes for developers by fortrabbit

If you have noticed the pattern by now, it will come as no surprise that JSON data is loaded with pandas.read_json().

With JSON data you may get the best results with relatively flat JSON objects. If you need different results (or you're getting results that are not as expected), you might instead look into the orient parameter, which changes how the JSON structure is mapped onto the resulting DataFrame. We'll load the data as-is and reshape our DataFrame for some extra practice.

In [21]:

df = pd.read_json(
    "https://raw.githubusercontent.com/fortrabbit/quotes/master/quotes.json")
df

Out[21]:

	author	text
0	Martin Golding	Always code as if the guy who ends up maintain...
1	Unknown	All computers wait at the same speed.
2	Unknown	A misplaced decimal point will always end up w...
3	Unknown	A good programmer looks both ways before cross...
4	Unknown	A computer program does what you tell it to do...
5	Unknown	"Intel Inside" is a Government Warning require...
6	Arthur Godfrey	Common sense gets a lot of credit that belongs...
7	Unknown	Chuck Norris doesn’t go hunting. Chuck Norris ...
8	Unknown	Chuck Norris counted to infinity... twice.
9	Unknown	C is quirky, flawed, and an enormous success.
10	Unknown	Beta is Latin for still doesn’t work.
11	Unknown	ASCII stupid question, get a stupid ANSI!
12	Unknown	Artificial Intelligence usually beats natural ...
13	Ted Nelson	Any fool can use a computer. Many do.
14	Unknown	Hey! It compiles! Ship it!
15	Martin Luther King Junior	Hate cannot drive out hate; only love can do t...
16	Unknown	Guns don’t kill people. Chuck Norris kills peo...
17	Unknown	God is real, unless declared integer.
18	John Johnson	First, solve the problem. Then, write the code.
19	Oscar Wilde	Experience is the name everyone gives to their...
20	Miguel de Icaza	Every piece of software written today is likel...
21	Unknown	Computers make very fast, very accurate mistakes.
22	Unknown	Computers do not solve problems, they execute ...
23	Unknown	I have NOT lost my mind—I have it backed up on...
24	Unknown	If brute force doesn’t solve your problems, th...
25	Unknown	It works on my machine.
26	Unknown	Java is, in many ways, C++??.
27	Unknown	Keyboard not found...Press any key to continue.
28	Unknown	Life would be so much easier if we only had th...
29	Unknown	Mac users swear by their Mac, PC users swear a...
...	...	...
159	Paul Graham	OO programming offers a sustainable way to wri...
160	Nikita Popov	Ruby is rubbish! PHP is phpantastic!
161	Douglas Adams	So long and thanks for all the fish!
162	Cicero	If I had more time, I would have written a sho...
163	Jeff Atwood	The best reaction to "this is confusing, where...
164	Jeff Atwood	The older I get, the more I believe that the o...
165	Douglas Crockford	"That hardly ever happens" is another way of s...
166	Anna Debenham	Hello, PHP, my old friend.
167	Melvin Conway	Organizations which design systems are constra...
168	Melvin Conway	In design, complexity is toxic.
169	Jeffrey Zeldman	Good is the enemy of great, but great is the e...
170	Rick Lemons	Don't make the user provide information that t...
171	Donald E. Knuth	You're bound to be unhappy if you optimize eve...
172	Anna Nachesa	If the programmers like each other, they play ...
173	Edsger W. Dijkstra	Simplicity is prerequisite for reliability.
174	Jordi Boggiano	Focus on WHY instead of WHAT in your code will...
175	Andrei Herasimchuk	The best engineers I know are artists at heart...
176	Barry Boehm	Poor management can increase software costs mo...
177	Daniel Bryant	If you can't deploy your services independentl...
178	Daniel Bryant	If you can't deploy your services independentl...
179	Jeff Atwood	No one hates software more than software devel...
180	Robert C. Martin	The proper use of comments is to compensate fo...
181	Cory House	Code is like humor. When you have to explain i...
182	Steve Maguire	Fix the cause, not the symptom.
183	David Heinemeier Hansson	Programmers are constantly making things more ...
184	Linus Torvalds	People will realize that software is not a pro...
185	Ron Fein	Design is choosing how you will fail.
186	Steve Jobs	Focus is saying no to 1000 good ideas.
187	Ron Jeffries	Code never lies, comments sometimes do.
188	Unknown	Be careful with each other, so you can be dang...

189 rows × 2 columns
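To give a feel for the orient parameter mentioned earlier, here is a small sketch with two hypothetical JSON payloads that hold the same two quotes in different layouts. The payloads, keys and values are made up for illustration; only read_json() and orient come from Pandas.

```python
import io
import pandas as pd

# The same two hypothetical quotes, laid out in two different ways
records_json = '[{"author": "A", "text": "first"}, {"author": "B", "text": "second"}]'
index_json = '{"q1": {"author": "A", "text": "first"}, "q2": {"author": "B", "text": "second"}}'

# A list of objects: one row per object (the layout of the quotes file above)
by_records = pd.read_json(io.StringIO(records_json))

# A dict of objects with orient="index": outer keys become row labels
by_index = pd.read_json(io.StringIO(index_json), orient="index")

# The same dict with the default orient: outer keys become columns
by_columns = pd.read_json(io.StringIO(index_json))

print(by_index)
print(by_columns)
```

If a loaded DataFrame looks transposed relative to what you expected, trying a different orient is usually the first thing to check.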

Though it is not a best practice, say we wanted to use the author as the index and the quote text as the value. In this dataset the index values will repeat, which may make sense if we want to access the data this way, but be very careful doing this in practice.

In [22]:

df1 = df.set_index('author')  # moves the author column into the index
df1

Out[22]:

	text
author
Martin Golding	Always code as if the guy who ends up maintain...
Unknown	All computers wait at the same speed.
Unknown	A misplaced decimal point will always end up w...
Unknown	A good programmer looks both ways before cross...
Unknown	A computer program does what you tell it to do...
Unknown	"Intel Inside" is a Government Warning require...
Arthur Godfrey	Common sense gets a lot of credit that belongs...
Unknown	Chuck Norris doesn’t go hunting. Chuck Norris ...
Unknown	Chuck Norris counted to infinity... twice.
Unknown	C is quirky, flawed, and an enormous success.
Unknown	Beta is Latin for still doesn’t work.
Unknown	ASCII stupid question, get a stupid ANSI!
Unknown	Artificial Intelligence usually beats natural ...
Ted Nelson	Any fool can use a computer. Many do.
Unknown	Hey! It compiles! Ship it!
Martin Luther King Junior	Hate cannot drive out hate; only love can do t...
Unknown	Guns don’t kill people. Chuck Norris kills peo...
Unknown	God is real, unless declared integer.
John Johnson	First, solve the problem. Then, write the code.
Oscar Wilde	Experience is the name everyone gives to their...
Miguel de Icaza	Every piece of software written today is likel...
Unknown	Computers make very fast, very accurate mistakes.
Unknown	Computers do not solve problems, they execute ...
Unknown	I have NOT lost my mind—I have it backed up on...
Unknown	If brute force doesn’t solve your problems, th...
Unknown	It works on my machine.
Unknown	Java is, in many ways, C++??.
Unknown	Keyboard not found...Press any key to continue.
Unknown	Life would be so much easier if we only had th...
Unknown	Mac users swear by their Mac, PC users swear a...
...	...
Paul Graham	OO programming offers a sustainable way to wri...
Nikita Popov	Ruby is rubbish! PHP is phpantastic!
Douglas Adams	So long and thanks for all the fish!
Cicero	If I had more time, I would have written a sho...
Jeff Atwood	The best reaction to "this is confusing, where...
Jeff Atwood	The older I get, the more I believe that the o...
Douglas Crockford	"That hardly ever happens" is another way of s...
Anna Debenham	Hello, PHP, my old friend.
Melvin Conway	Organizations which design systems are constra...
Melvin Conway	In design, complexity is toxic.
Jeffrey Zeldman	Good is the enemy of great, but great is the e...
Rick Lemons	Don't make the user provide information that t...
Donald E. Knuth	You're bound to be unhappy if you optimize eve...
Anna Nachesa	If the programmers like each other, they play ...
Edsger W. Dijkstra	Simplicity is prerequisite for reliability.
Jordi Boggiano	Focus on WHY instead of WHAT in your code will...
Andrei Herasimchuk	The best engineers I know are artists at heart...
Barry Boehm	Poor management can increase software costs mo...
Daniel Bryant	If you can't deploy your services independentl...
Daniel Bryant	If you can't deploy your services independentl...
Jeff Atwood	No one hates software more than software devel...
Robert C. Martin	The proper use of comments is to compensate fo...
Cory House	Code is like humor. When you have to explain i...
Steve Maguire	Fix the cause, not the symptom.
David Heinemeier Hansson	Programmers are constantly making things more ...
Linus Torvalds	People will realize that software is not a pro...
Ron Fein	Design is choosing how you will fail.
Steve Jobs	Focus is saying no to 1000 good ideas.
Ron Jeffries	Code never lies, comments sometimes do.
Unknown	Be careful with each other, so you can be dang...

189 rows × 1 columns
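One reason to be careful with repeated index values is that a lookup no longer has a predictable shape. The sketch below uses a hypothetical three-row frame (the quote texts are placeholders) to show the difference between a label that occurs once and one that repeats.

```python
import pandas as pd

# Hypothetical quotes; "Unknown" appears twice in the index
quotes = pd.DataFrame({
    "author": ["Unknown", "Ted Nelson", "Unknown"],
    "text": ["first", "second", "third"],
}).set_index("author")

print(quotes.index.is_unique)           # False, because of the repeated label

unique_hit = quotes.loc["Ted Nelson"]   # label occurs once: a Series (one row)
repeated_hit = quotes.loc["Unknown"]    # label repeats: a DataFrame (two rows)

print(type(unique_hit).__name__, type(repeated_hit).__name__)
```

Code that consumes the result of .loc therefore has to handle both cases, which is part of why a unique index is usually preferred.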

Though we haven't talked about it yet, the apply() method offers a very useful mechanism for filtering data. In this case, we'll write a small anonymous function that finds all the quotes attributed to Unknown that contain the substring jav (and therefore java) in any letter case.

In [23]:

df1.loc["Unknown"][df1.loc["Unknown"]["text"]
                   .apply(lambda v: "jav" in v.lower())]

Out[23]:

	text
author
Unknown	Java is, in many ways, C++??.
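The same kind of filter can also be written without apply(), using the vectorized string methods on a column. This is a sketch of that alternative on a hypothetical three-row frame built from quotes shown earlier; it is one option, not a replacement for the approach above.

```python
import pandas as pd

# Three rows reusing quotes from the dataset shown above
quotes = pd.DataFrame({
    "author": ["Unknown", "Unknown", "Ted Nelson"],
    "text": [
        "Java is, in many ways, C++??.",
        "It works on my machine.",
        "Any fool can use a computer. Many do.",
    ],
})

# str.contains does a case-insensitive substring test on every row at once
mask = (quotes["author"] == "Unknown") & quotes["text"].str.contains(
    "jav", case=False, regex=False)
matches = quotes[mask]

print(matches)
```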

SQL

Loading SQL data into a DataFrame is also supported by Pandas. You might need to take a look at SQLAlchemy and its documentation on creating database engines, as this is the framework supported directly by Pandas.

SQL DATA SOURCE

Jeopardy! Data Analysis - a sqlite database by cmohamma

This file contains a number of tables that contain the Jeopardy! game clues, players, wins, categories, etc. We will only use a fraction of the data to demonstrate the SQL capabilities.

Our example will use a SQLite database so we can demonstrate the example in a standalone context. We'll show how to read a table in full using read_sql_table() and then how to run ad hoc queries using read_sql_query().

In [24]:

from sqlalchemy import create_engine
engine = create_engine('sqlite:///datasets/database.sqlite')

with engine.connect() as conn, conn.begin():
    data = pd.read_sql_table('final', conn)

In [25]:

data[:10]

Out[25]:

	game_id	clue_id	value	category	clue	strike1	strike2	strike3	answer
0	280	16720	100	HIGH ROLLERS	After an 1891 roulette run, Charles Wells was ...	What is Atlantic City?	What is Las Vegas?	What is Monaco?	Monte Carlo
1	429	25403	100	OH, CRAPS!	The combo that totals one shy of "boxcars"	What is 11?	What is 10?	What is 9?	5 & 6
2	866	51549	100	ROCK & POP	It was the last decade in which Cher didn't ha...	What are the 1980s?	What are the 1970s?	What are the 1990s?	1950s
3	1018	60582	100	LET'S HAVE A BALL	Sink it & you've scratched	Um...	What is the pinball?	What is the 8-ball?	the cue ball
4	1069	63644	100	WHAT A YEAR!	Dewaele won the Tour de France, Coco Chanel wa...	What is 1933?	What is 1987?	What is 1927?	1929
5	1473	84364	100	EUROPEAN HISTORY	A former Socialist, he formed the anti-Communi...	Who was Lenin?	Who was Franco?	Who was Hitler?	Benito Mussolini
6	1635	93864	100	CHRISTIANITY	According to tradition, Dismas & Gestas were t...	Who are the thieves?	What is Cavalry?	What is Mt. Olive?	Calvary
7	4166	242419	100	NAME THE DECADE	Paul Revere & William Dawes warn colonists tha...	What is the 16th century?	What is the 18th century?	What is the 18th century?	the 1770s
8	112	6679	200	ODD ALPHABETS	In alphabet radio code, "B" is Bravo and "F" s...	What's the Flamingo?	What's a Fandango?	What's the Flamenco? - you have it written the...	Foxtrot
9	354	20984	200	SPORTS	A filly becomes a mare at this age	What is 3?	What is 1?	What is 2?	4

Now say we want to find out the distribution of occupations of players over the years. Looking at the players table, we can write a query that aggregates these occupations easily.

Using read_sql_query() we can get the job done and dump this into a DataFrame.

In [26]:

query = """
    SELECT occupation, count(occupation) as freq FROM players
    WHERE occupation != ''
    GROUP BY occupation 
    ORDER BY count(occupation) DESC 
    """

with engine.connect() as conn, conn.begin():
    occupation_data = pd.read_sql_query(query, conn)

In [27]:

occupation_data[:10]

Out[27]:

	occupation	freq
0	attorney	380
1	senior	228
2	graduate student	212
3	writer	176
4	teacher	159
5	junior	158
6	law student	120
7	lawyer	112
8	homemaker	101
9	actor	97
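For experimenting without the Jeopardy! file, the same query pattern can be tried against an in-memory SQLite database. The sketch below assumes a made-up players table with four rows and uses a plain sqlite3 connection from the standard library, which read_sql_query() also accepts. It also passes the filter value through params instead of writing it into the SQL string.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the players table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, occupation TEXT)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("p1", "attorney"), ("p2", "teacher"), ("p3", "attorney"), ("p4", "")],
)

query = """
    SELECT occupation, COUNT(*) AS freq FROM players
    WHERE occupation != ?
    GROUP BY occupation
    ORDER BY freq DESC
"""

# The ? placeholder is filled from params by the database driver
counts = pd.read_sql_query(query, conn, params=("",))
conn.close()

print(counts)
```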

PROBLEM

Many occupations are the same but labeled differently:
- "attorney" and "lawyer"
- the various kinds of "teachers"

What would happen if we wanted to know whether the selection of players was fair across occupations?

Let's find all occupations with teach in the name ...

In [28]:

freq_all_occupations = occupation_data.freq.sum()

combined_teacher_freq = \
        occupation_data[
            occupation_data['occupation']
                .str.contains('teach')]\
        .sum()

In [29]:

combined_teacher_freq

Out[29]:

occupation    teacherhigh school teacherhigh school English ...
freq                                                        830
dtype: object

Notice the occupation is the concatenation of all those teachers. We want to change that to a single label "teacher".

In [30]:

combined_teacher_freq['occupation'] = 'teacher'

In [31]:

combined_teacher_freq

Out[31]:

occupation    teacher
freq              830
dtype: object

We now need only append the data to our original DataFrame:

In [32]:

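# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
occupation_data = pd.concat(
    [occupation_data[~occupation_data['occupation'].str.contains('teach')],
     combined_teacher_freq.to_frame().T],   # the Series as a one-row DataFrame
    ignore_index=True).astype({'freq': 'int64'})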

In [33]:

occupation_data[-10:]

Out[33]:

	occupation	freq
4205	writer for an online magazine	1
4206	writer's assistant	1
4207	writer-producer	1
4208	writing instructor	1
4209	yoga instructor	1
4210	yogurt franchise operator	1
4211	youth ministry consultant	1
4212	zoo docent	1
4213	zoo educator	1
4214	teacher	830
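The steps above handle the teacher labels one at a time. A more general sketch, which would also cover the "attorney" and "lawyer" pair, is to normalize the labels first and then aggregate with groupby. The frame and its counts below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical frequencies, not the real Jeopardy! counts
occ = pd.DataFrame({
    "occupation": ["attorney", "lawyer", "teacher", "high school teacher", "writer"],
    "freq": [5, 3, 4, 3, 6],
})

# Step 1: map every label containing "teach" to "teacher",
# then fold "lawyer" into "attorney"
label = occ["occupation"].mask(occ["occupation"].str.contains("teach"), "teacher")
label = label.replace({"lawyer": "attorney"})

# Step 2: sum the frequencies per normalized label
combined = (
    occ.assign(occupation=label)
       .groupby("occupation", as_index=False)["freq"].sum()
       .sort_values("freq", ascending=False, ignore_index=True)
)

print(combined)
```

Adding another synonym then only means extending the mapping, rather than repeating the filter, sum and append steps.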

Now let's add the percentage column and call it pct:

In [34]:

occupation_data['pct'] = occupation_data['freq']/occupation_data.freq.sum()

In [35]:

occupation_data.sort_values(by='pct', ascending=False)[:10]

Out[35]:

	occupation	freq	pct
4214	teacher	830	0.078905
0	attorney	380	0.036125
1	senior	228	0.021675
2	graduate student	212	0.020154
3	writer	176	0.016732
4	junior	158	0.015020
5	law student	120	0.011408
6	lawyer	112	0.010647
7	homemaker	101	0.009602
8	actor	97	0.009221
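When the data is still one row per player rather than pre-aggregated, the counting and the percentage can be done in a single step. The following sketch assumes a hypothetical Series of four occupation labels and uses value_counts(normalize=True), which returns fractions that sum to 1.

```python
import pandas as pd

# Hypothetical one-row-per-player occupation labels
players = pd.Series(["attorney", "teacher", "attorney", "writer"], name="occupation")

# normalize=True returns each label's share instead of its raw count
pct = players.value_counts(normalize=True)

print(pct)
```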

... on to Part III: Manipulating DataFrames.