{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 6: Pandas\n", "## Chapter 16 from the Alex DeCaria textbook\n", "\n", "Pandas started as a way to manipulate financial data but obviously can be used for diverse applications.\n", "\n", "Pandas introduces two new data type structures. The first data structure type is a Pandas *series*, which is a one dimensional array, but it is indexed. The second data structure is a *'dataframe'*, which is similar to a 2-D NumPy array, except that it can hold multiple data types within it (strings, floats, integers, datetime objects, ect...) and also has an index. Dataframes are really just collections of series, and dataframes are nominally equivalent to excel spreadsheets.\n", "\n", "Pandas is particularly good for working with time series data. In addition, Pandas also provides easier ways to view data either time series or the equivalent of spreadsheets, 2D arrays. For today, we will: \n", "- Read csv-formatted data into Python \n", "- Learn how to subset and slice dataframes \n", "- Aggegrate and group data\n", "- Work with time objects in Pandas\n", "\n", "**Before starting:** Make sure that you open up a Jupyter notebook session using OnDemand so you can interactively follow along with today's lecture! Also be sure to copy this Jupyter Notebook and then create a new file with the current date to use to experiment with the syntax. \n", "\n", "\n", " " ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import datetime as dt\n", "#note the following can be confusing. datetime now refers to datetime.datetime, timedelta refers to datetime.timedelta\n", "from datetime import datetime, timedelta, timezone\n", "import zoneinfo\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import matplotlib.dates as mdates\n", "from matplotlib.dates import DateFormatter\n", "#access a function from the urllib module\n", "from urllib.request import urlretrieve\n", "#also let's use a linux command to see the file size\n", "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas \n", "\n", "You have seen Pandas being used extensively before in the Chapter 2 code \n", "and also when dealing with reading data into Pandas in module 4. So let's go back and use the slc_temp.ipynb notebook code but change it to read different variables, in this case pressure and wind speed and direction, at a different location, in this case, above us, WBB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be accessing public data from a commercial site for which I need to disclose that I am involved with the company, Synoptic Data:\n", "\n", "I am on the Board of Directors\n", "I am a shareholder\n", "I have a grant from that company to help with their customer support and research and development" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saved wbb.csv 1301.374 KB\n" ] } ], "source": [ "#grab station pressure and wind speed and direction data at WBB in October\n", "url = \"https://api.synopticdata.com/v2/stations/timeseries?&token=bace3f05279d4de1bb2f03011843709e&start=202210010600&end=202210200600&obtimezone=local&output=csv&stid=wbb&vars=pressure,wind_speed,wind_direction\"\n", "# define the file to write the data into\n", "filename = \"wbb.csv\"\n", "#let's try to get the file from the web\n", "try:\n", " #get the file over the web\n", " urlretrieve(url, filename)\n", " print(\"Saved\", filename, os.path.getsize(filename)/1000., 'KB')\n", "except:\n", " print(\"something wrong grabbing the file\")\n", " print(\"but the program continues, so may be in error\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# STOP!\n", "Launch a terminal window and look at the file uaing more\n", "\n", "How many header lines are there?\n", "\n", "Are there any footer lines that need to be removed? How do you check for those?\n", "\n", "How many columns in each row of data?" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "#define the headers\n", "headers = ['STID','Date','PRES','SPED','DRCT']\n", "#define the data types as a dictionary\n", "dtypes = { 'STID': 'str','Date': 'str', 'PRES': 'float','SPED': 'float','DRCT': 'float'}\n", "\n", "\n", "#there may be times with no actual values \n", "#those will get changed to NaN by the pandas read_csv function by the option na_filter\n", "#Create a pandas dataframe. \n", "\n", "#There is a lot packed into the following read, so let's explain a bit first\n", "# 1) the parse_dates option indicates take the 2nd column, Date, and then use the \"date_parser\"\n", "# now we have to confess to what the lambda operation is" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "#See page 113\n", "#lambda is a one-line function for really simple operations\n", "# you need to define it before you use it" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.141592653589793\n" ] } ], "source": [ "# for example\n", "# compute area of a circle\n", "area = lambda radius : np.pi * radius**2\n", "print(area(-1))" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "# so in the following date_parser is a one-line function that uses the argument col (all the times)\n", "# in this case, the col is processed using the pandas to_datetime function and uses the argument utc=True \n", "# since the data are available in UTC time" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | PRES | \n", "SPED | \n", "DRCT | \n", "
---|---|---|---|
Date | \n", "\n", " | \n", " | \n", " |
2022-10-01 06:00:00+00:00 | \n", "853.1 | \n", "0.0 | \n", "NaN | \n", "
2022-10-01 06:01:00+00:00 | \n", "853.1 | \n", "0.9 | \n", "233.2 | \n", "
2022-10-01 06:02:00+00:00 | \n", "853.1 | \n", "1.3 | \n", "221.7 | \n", "
2022-10-01 06:03:00+00:00 | \n", "853.1 | \n", "1.2 | \n", "213.9 | \n", "
2022-10-01 06:04:00+00:00 | \n", "853.1 | \n", "1.2 | \n", "218.1 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
2022-10-20 05:56:00+00:00 | \n", "859.9 | \n", "2.3 | \n", "32.9 | \n", "
2022-10-20 05:57:00+00:00 | \n", "859.9 | \n", "2.3 | \n", "42.9 | \n", "
2022-10-20 05:58:00+00:00 | \n", "859.9 | \n", "1.8 | \n", "45.9 | \n", "
2022-10-20 05:59:00+00:00 | \n", "859.9 | \n", "2.6 | \n", "50.8 | \n", "
2022-10-20 06:00:00+00:00 | \n", "859.9 | \n", "3.2 | \n", "42.4 | \n", "
27351 rows × 3 columns
\n", "\n", " | PRES | \n", "SPED | \n", "DRCT | \n", "
---|---|---|---|
Date | \n", "\n", " | \n", " | \n", " |
2022-10-01 00:00:00-06:00 | \n", "853.1 | \n", "0.0 | \n", "NaN | \n", "
2022-10-01 00:01:00-06:00 | \n", "853.1 | \n", "0.9 | \n", "233.2 | \n", "
2022-10-01 00:02:00-06:00 | \n", "853.1 | \n", "1.3 | \n", "221.7 | \n", "
2022-10-01 00:03:00-06:00 | \n", "853.1 | \n", "1.2 | \n", "213.9 | \n", "
2022-10-01 00:04:00-06:00 | \n", "853.1 | \n", "1.2 | \n", "218.1 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
2022-10-19 23:56:00-06:00 | \n", "859.9 | \n", "2.3 | \n", "32.9 | \n", "
2022-10-19 23:57:00-06:00 | \n", "859.9 | \n", "2.3 | \n", "42.9 | \n", "
2022-10-19 23:58:00-06:00 | \n", "859.9 | \n", "1.8 | \n", "45.9 | \n", "
2022-10-19 23:59:00-06:00 | \n", "859.9 | \n", "2.6 | \n", "50.8 | \n", "
2022-10-20 00:00:00-06:00 | \n", "859.9 | \n", "3.2 | \n", "42.4 | \n", "
27351 rows × 3 columns
\n", "\n", " | PRES | \n", "SPED | \n", "DRCT | \n", "PRES_SMTH | \n", "SPED_SMTH | \n", "
---|---|---|---|---|---|
Date | \n", "\n", " | \n", " | \n", " | \n", " | \n", " |
2022-10-01 00:00:00-06:00 | \n", "853.1 | \n", "0.0 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
2022-10-01 00:01:00-06:00 | \n", "853.1 | \n", "0.9 | \n", "233.2 | \n", "NaN | \n", "NaN | \n", "
2022-10-01 00:02:00-06:00 | \n", "853.1 | \n", "1.3 | \n", "221.7 | \n", "NaN | \n", "NaN | \n", "
2022-10-01 00:03:00-06:00 | \n", "853.1 | \n", "1.2 | \n", "213.9 | \n", "NaN | \n", "NaN | \n", "
2022-10-01 00:04:00-06:00 | \n", "853.1 | \n", "1.2 | \n", "218.1 | \n", "NaN | \n", "NaN | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
2022-10-19 23:56:00-06:00 | \n", "859.9 | \n", "2.3 | \n", "32.9 | \n", "NaN | \n", "NaN | \n", "
2022-10-19 23:57:00-06:00 | \n", "859.9 | \n", "2.3 | \n", "42.9 | \n", "NaN | \n", "NaN | \n", "
2022-10-19 23:58:00-06:00 | \n", "859.9 | \n", "1.8 | \n", "45.9 | \n", "NaN | \n", "NaN | \n", "
2022-10-19 23:59:00-06:00 | \n", "859.9 | \n", "2.6 | \n", "50.8 | \n", "NaN | \n", "NaN | \n", "
2022-10-20 00:00:00-06:00 | \n", "859.9 | \n", "3.2 | \n", "42.4 | \n", "NaN | \n", "NaN | \n", "
27351 rows × 5 columns
\n", "\n", " | PRES | \n", "SPED | \n", "DRCT | \n", "PRES_SMTH | \n", "SPED_SMTH | \n", "
---|---|---|---|---|---|
Date | \n", "\n", " | \n", " | \n", " | \n", " | \n", " |
2022-10-01 00:00:00-06:00 | \n", "853.1 | \n", "0.0 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
2022-10-01 00:01:00-06:00 | \n", "853.1 | \n", "0.9 | \n", "233.2 | \n", "NaN | \n", "NaN | \n", "
2022-10-01 00:02:00-06:00 | \n", "853.1 | \n", "1.3 | \n", "221.7 | \n", "NaN | \n", "NaN | \n", "
2022-10-01 00:03:00-06:00 | \n", "853.1 | \n", "1.2 | \n", "213.9 | \n", "NaN | \n", "NaN | \n", "
2022-10-01 00:04:00-06:00 | \n", "853.1 | \n", "1.2 | \n", "218.1 | \n", "NaN | \n", "NaN | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
2022-10-19 23:56:00-06:00 | \n", "859.9 | \n", "2.3 | \n", "32.9 | \n", "NaN | \n", "NaN | \n", "
2022-10-19 23:57:00-06:00 | \n", "859.9 | \n", "2.3 | \n", "42.9 | \n", "NaN | \n", "NaN | \n", "
2022-10-19 23:58:00-06:00 | \n", "859.9 | \n", "1.8 | \n", "45.9 | \n", "NaN | \n", "NaN | \n", "
2022-10-19 23:59:00-06:00 | \n", "859.9 | \n", "2.6 | \n", "50.8 | \n", "NaN | \n", "NaN | \n", "
2022-10-20 00:00:00-06:00 | \n", "859.9 | \n", "3.2 | \n", "42.4 | \n", "NaN | \n", "NaN | \n", "
27351 rows × 5 columns
\n", "\n", " | PRES | \n", "SPED | \n", "DRCT | \n", "PRES_SMTH | \n", "SPED_SMTH | \n", "
---|---|---|---|---|---|
Date | \n", "\n", " | \n", " | \n", " | \n", " | \n", " |
2022-10-16 07:25:00-06:00 | \n", "854.6 | \n", "18.7 | \n", "55.4 | \n", "855.6 | \n", "13.0 | \n", "
2022-10-16 09:00:00-06:00 | \n", "855.9 | \n", "17.2 | \n", "67.9 | \n", "856.0 | \n", "12.6 | \n", "
2022-10-16 04:28:00-06:00 | \n", "853.6 | \n", "16.7 | \n", "63.4 | \n", "854.8 | \n", "11.9 | \n", "
2022-10-16 07:27:00-06:00 | \n", "854.8 | \n", "16.4 | \n", "53.8 | \n", "855.6 | \n", "13.0 | \n", "
2022-10-16 07:02:00-06:00 | \n", "854.8 | \n", "16.3 | \n", "55.3 | \n", "855.4 | \n", "12.8 | \n", "
2022-10-16 06:49:00-06:00 | \n", "854.9 | \n", "16.1 | \n", "53.0 | \n", "855.4 | \n", "12.4 | \n", "
2022-10-16 09:44:00-06:00 | \n", "856.5 | \n", "16.1 | \n", "58.9 | \n", "856.2 | \n", "11.6 | \n", "
2022-10-16 07:26:00-06:00 | \n", "854.8 | \n", "16.0 | \n", "52.2 | \n", "855.6 | \n", "13.0 | \n", "
2022-10-16 08:54:00-06:00 | \n", "855.9 | \n", "15.9 | \n", "73.6 | \n", "856.0 | \n", "12.5 | \n", "
2022-10-16 09:01:00-06:00 | \n", "856.0 | \n", "15.9 | \n", "74.7 | \n", "856.0 | \n", "12.6 | \n", "
2022-10-16 02:07:00-06:00 | \n", "853.8 | \n", "15.9 | \n", "67.9 | \n", "854.5 | \n", "11.6 | \n", "
2022-10-16 08:56:00-06:00 | \n", "856.0 | \n", "15.8 | \n", "74.1 | \n", "856.0 | \n", "12.5 | \n", "
2022-10-16 09:45:00-06:00 | \n", "856.5 | \n", "15.8 | \n", "58.5 | \n", "856.2 | \n", "11.6 | \n", "
2022-10-16 09:04:00-06:00 | \n", "856.2 | \n", "15.6 | \n", "71.7 | \n", "856.0 | \n", "12.6 | \n", "
2022-10-16 07:18:00-06:00 | \n", "854.7 | \n", "15.6 | \n", "48.8 | \n", "855.5 | \n", "13.0 | \n", "
2022-10-16 08:57:00-06:00 | \n", "856.0 | \n", "15.6 | \n", "74.2 | \n", "856.0 | \n", "12.5 | \n", "
2022-10-16 07:15:00-06:00 | \n", "854.8 | \n", "15.6 | \n", "51.5 | \n", "855.5 | \n", "13.0 | \n", "
2022-10-16 06:48:00-06:00 | \n", "855.0 | \n", "15.5 | \n", "58.0 | \n", "855.4 | \n", "12.4 | \n", "
2022-10-16 06:41:00-06:00 | \n", "854.9 | \n", "15.5 | \n", "64.6 | \n", "855.4 | \n", "12.1 | \n", "
2022-10-16 06:57:00-06:00 | \n", "854.9 | \n", "15.5 | \n", "54.2 | \n", "855.4 | \n", "12.7 | \n", "