From geospatial information to a pandas dataframe for time series analysis
Time series analysis of geospatial data allows us to analyze and understand how events and attributes of a place change over time. Its use cases are wide ranging, particularly in social, demographic, environmental and meteorology/climate studies. In environmental sciences, for example, time series analysis helps analyze how the land cover/land use of an area changes over time and what drives those changes. It is also useful in meteorological studies for understanding spatial-temporal changes in weather patterns (I’ll shortly demonstrate one such case study using rainfall data). Social and economic sciences benefit hugely from such analysis in understanding the dynamics of temporal and spatial phenomena such as demographic, economic and political patterns.
Spatial representation of data is quite powerful. However, it can be a challenging task to analyze geospatial data and extract interesting insights, especially for a data scientist/analyst who is not trained in geographical information science. Fortunately, there are tools to simplify this process, and that is what I’m attempting in this article. My previous article covers some of the basics of geospatial data wrangling, so feel free to check that out.
In this article I’ll go through a series of steps: downloading raster data, transferring the data into a pandas dataframe, and setting it up for traditional time series analysis tasks.
Data source
For this case study I’m using the spatial distribution of rainfall in Hokkaido prefecture, Japan, from 01 January to 31 December 2020, which covers all 366 days of the year. I downloaded the data from ClimateServe, an open access spatial data platform that is a product of a joint NASA/USAID partnership. Anyone with internet access can easily download the data. I’ve uploaded them to GitHub along with the code if you want to follow along. Here’s a snapshot of some raster images in my local directory:
Setup
First, I set up a folder where the raster dataset is stored so I can loop through the files later on.
# specify folder path for raster dataset
tsFolderPath = './data/hokkaido/'
Next, I’m importing a few libraries, most of which will be familiar to data scientists. To work with raster data I’m using the rasterio library.
# import libraries
import os
import rasterio
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Visualize data
Let’s check out how the raster images look in a plot. I’ll first load in a random image using rasterio and then plot it using matplotlib functionality.
# load in raster data
rf = rasterio.open('./data/hokkaido/20201101.tif')

# plot the first band with a colorbar
fig, ax = plt.subplots(figsize=(15,5))
im = ax.imshow(rf.read()[0], cmap = 'inferno')
fig.colorbar(im, ax=ax)
plt.axis('off')
plt.title('Daily rainfall Jan-Dec 2020, Hokkaido, Japan');
As you can see, this image is a grid of pixels, and the value of each pixel represents rainfall for that particular location. Brighter pixels have higher rainfall values. In the next section I’m going to extract these values and transfer them into a pandas dataframe.
Extract data from raster files
Now onto the key step: extracting pixel values for each of the 366 raster images. The process is simple: we will loop through each image, read its pixel values and store their mean in a list.
We will separately keep track of dates in another list. Where are we getting the date information? If you take a closer look at the file names, you’ll notice they are named after each respective day.
# create empty lists to store data
date = []
rainfall_mm = []

# loop through each raster
for file in os.listdir(tsFolderPath):
    # read the file
    rf = rasterio.open(tsFolderPath + file)
    # convert raster data (band 1) to an array
    array = rf.read(1)
    # store the date (file name without the '.tif' extension)
    date.append(file[:-4])
    # store the mean of the non-negative pixel values
    rainfall_mm.append(array[array>=0].mean())
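The expression `array[array>=0].mean()` in the loop relies on boolean masking: only non-negative pixels enter the average, which presumably keeps no-data fill values out of the daily mean. Here is a minimal sketch on a made-up 3x3 array, assuming a fill value of -9999 (a hypothetical value chosen for illustration; check the actual no-data value of your own rasters):

```python
import numpy as np

# Toy "raster": rainfall in mm, with -9999 standing in for no-data pixels.
# The fill value is an assumption for illustration only.
array = np.array([
    [1.0, 2.0, -9999.0],
    [3.0, -9999.0, 4.0],
    [0.0, 5.0, 6.0],
], dtype=np.float32)

# Boolean mask: True where the pixel holds a valid (non-negative) value
valid = array >= 0

# Average over valid pixels only; zero-rainfall pixels still count
masked_mean = float(array[valid].mean())

# Naive average over every pixel, dragged down by the fill values
naive_mean = float(array.mean())

print(masked_mean, naive_mean)
```

The masked mean here is 3.0, while the naive mean is a large negative number, which is why the filter matters whenever a raster contains fill values.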
Note that it didn’t take long to loop through 366 rasters because of the low image resolution (i.e. large pixel size). However, it can be computationally intensive for high resolution datasets.
So we just created two lists: one stores the dates from the file names and the other holds the mean rainfall for each day. Here are the first five items of the two lists:
print(date[:5])
print(rainfall_mm[:5])
>> ['20200904', '20200910', '20200723', '20200509', '20200521']
>> [4.4631577, 6.95278, 3.4205956, 1.7203209, 0.45923564]
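The dates above come back in no particular order because `os.listdir` makes no ordering guarantee, and any stray non-raster file in the folder would also be picked up by the loop. One possible way to tighten this is sketched below with the standard library only. `list_rasters` and `date_from_filename` are hypothetical helper names, and the `YYYYMMDD.tif` naming is assumed from the files used in this article.

```python
from datetime import date, datetime
from pathlib import Path


def list_rasters(folder):
    """Return the .tif files in `folder`, sorted by file name."""
    return sorted(Path(folder).glob('*.tif'))


def date_from_filename(path):
    """Parse a name like '20201101.tif' into a date object."""
    # Path.stem drops the extension, e.g. '20201101'
    return datetime.strptime(Path(path).stem, '%Y%m%d').date()


print(date_from_filename('20201101.tif'))  # 2020-11-01
```

Because the names are zero-padded `YYYYMMDD` strings, sorting by file name also sorts the rasters chronologically.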
Next, on to transferring the lists into a pandas dataframe. We will then take an extra step to turn the dataframe into a time series object.
Convert to a time series dataframe
Transferring lists to a dataframe format is an easy task in pandas:
# convert lists to a dataframe
df = pd.DataFrame(zip(date, rainfall_mm), columns = ['date', 'rainfall_mm'])
df.head()
We now have a pandas dataframe, but notice that the ‘date’ column holds its values as strings; pandas does not yet know that they represent dates. So we need to tweak it a little bit:
# convert the 'date' column from strings to datetime
df['date'] = pd.to_datetime(df['date'])
df.head()
df['date'].info()
Now the ‘date’ column holds datetime values.
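`pd.to_datetime` infers the format here, which works for compact `YYYYMMDD` strings. Passing the format explicitly is a slightly more defensive option, since a string that does not match then raises an error rather than being parsed by inference. A small sketch with made-up values (the `demo` dataframe is for illustration only):

```python
import pandas as pd

# Made-up rows mimicking the structure of the dataframe above
demo = pd.DataFrame({
    'date': ['20200904', '20200910', '20200723'],
    'rainfall_mm': [4.46, 6.95, 3.42],
})

# Explicit format: four-digit year, two-digit month, two-digit day
demo['date'] = pd.to_datetime(demo['date'], format='%Y%m%d')

# The column now has a datetime dtype instead of object (string)
print(demo['date'].dtype)
```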
It is also a good idea to set the date column as the index. This facilitates slicing and filtering data by different dates and date ranges and makes plotting tasks easy. We will first sort the dates into the right order and then set the column as the index.
df = df.sort_values('date')
df.set_index('date', inplace=True)
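With a `DatetimeIndex` in place, pandas supports partial-string selection and resampling. The sketch below uses synthetic values rather than the Hokkaido data (the `demo` frame and its numbers are made up) to show a one-month selection, a date-range slice and a monthly mean:

```python
import numpy as np
import pandas as pd

# Synthetic daily series for 2020; values are placeholders, not real rainfall
idx = pd.date_range('2020-01-01', '2020-12-31', freq='D')
demo = pd.DataFrame({'rainfall_mm': np.arange(len(idx), dtype=float)}, index=idx)
demo.index.name = 'date'

# All rows for a single month (partial-string indexing)
feb = demo.loc['2020-02']

# A date range; both endpoints are included
summer = demo.loc['2020-06-01':'2020-08-31']

# Monthly mean rainfall, one row per calendar month
monthly = demo.resample('MS').mean()

print(len(feb), len(summer), len(monthly))  # 29 92 12
```

The same calls should work on the `df` built above once its index is the sorted date column.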
Okay, all processing done. You are now ready to use this time series data however you wish. I’ll just plot the data to see how it looks.
# plot
df.plot(figsize=(12,3), grid =True);
Beautiful plot! I have written a few articles in the past on how to analyze time series data.
Extracting interesting and actionable insights from geospatial time series data can be very powerful, as it shows data in both spatial and temporal dimensions. However, for data scientists without training in geospatial information this can be a daunting task. In this article I demonstrated with a case study how this difficult task can be done easily with minimal effort. The data and code are available on my GitHub if you want to replicate this exercise or take it to the next level.