
Python, AWS and JavaScript for data analytics

August 2021

Recently I spent some time building a web app based on NSW property sales data (available from the NSW Valuer General), which allows you to view average property sales by month and by suburb.

The dataset was large and complex, stretching back to the 1990s, and its file structure changed over that time. Because of this change, I decided to focus only on data from 2002 onwards, as it was more consistent. The data still required cleaning and handling of invalid values, and some years had inconsistent file paths after unzipping. I had to find efficient ways of handling all of this.

The data was separated into weekly zip files, with each underlying data file split by district (or local government area). I eventually wanted to separate the data by suburb, and suburbs can change LGA over the years, so I decided to first pull all of the data together and then split it by suburb/postcode.

I used Python to loop through and unzip all of the files, then used pandas to read the data files into a dataframe for processing. Since all of the data between 2002 and 2021 followed a consistent file structure, this worked well apart from a few invalid values, and setting the quoting parameter handled those without needing to remove them.
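
The gist of that step was something like the following sketch, assuming the weekly archives sit in a raw/ folder and the extracted files are semicolon-delimited (the folder names, file extension, delimiter and exact quoting value here are illustrative rather than the precise setup):

```python
import csv
import glob
import zipfile

import pandas as pd

# Unzip every weekly archive into a working folder.
for zip_path in glob.glob("raw/*.zip"):
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall("unzipped")

# Read each extracted file into one dataframe. quoting=csv.QUOTE_NONE stops
# stray quote characters in free-text fields from breaking the parse.
frames = [
    pd.read_csv(path, sep=";", header=None, dtype=str, quoting=csv.QUOTE_NONE)
    for path in glob.glob("unzipped/**/*.DAT", recursive=True)
]
sales = pd.concat(frames, ignore_index=True)
# In practice the positional columns then get mapped to meaningful names.
```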

Once I had a dataframe, I was able to filter it down to only the records I needed using query: the sale record, residential properties only, and where the percent interest of sale was 100%. Initially I had tried to do this on file read (i.e. putting only the data I needed into the dataframe), but this proved less efficient than reading all of the data and filtering it afterwards.
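
Something like the sketch below, assuming the positional columns have already been given names; the column names and the filter values shown are illustrative of the dataset's record type, nature of property and interest-of-sale fields:

```python
import pandas as pd

# Make the interest-of-sale field numeric before filtering on it.
sales["percent_interest"] = pd.to_numeric(sales["percent_interest"], errors="coerce")

# Keep only sale records, residential properties, and 100% interest of sale.
residential = sales.query(
    "record_type == 'B' and nature_of_property == 'R' and percent_interest == 100"
)
```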

Next was to split the dataframe by suburb/postcode and export each of these dataframes to a JSON file (one for each suburb/postcode combination). Originally I was doing this very inefficiently by looping through a list of suburb names and creating a new dataframe by querying the original dataframe each time. I was able to improve the overall process by around 70% by instead using pandas groupby and then get_group for each suburb/postcode.
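
The improved version looked roughly like this (again, the column names and output paths are illustrative):

```python
import os

os.makedirs("output", exist_ok=True)

# Group once, then pull each suburb/postcode combination out of the grouped
# object instead of re-querying the full dataframe every time.
grouped = residential.groupby(["suburb", "postcode"])

for suburb, postcode in grouped.groups:
    group = grouped.get_group((suburb, postcode))
    filename = f"output/{suburb.lower().replace(' ', '-')}-{postcode}.json"
    group.to_json(filename, orient="records")
```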

Because I wanted to be able to search by property area, I couldn't aggregate the files ahead of time, so I decided to do the aggregation on the fly using a combination of AWS Lambda and JavaScript. The Lambda function (written in Python) reads the JSON file and pulls out only the data matching the suburb and area filters, and then JavaScript calculates the total sales per month and the average sale price per month. Finally, these values are presented in a chart using Chart.js, which allows for easy dynamic refreshes when further areas or suburbs are selected.
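
A rough sketch of the Lambda side, assuming the per-suburb JSON files are stored in S3 and the filters arrive as query-string parameters (the bucket name, key pattern and field names are all placeholders):

```python
import json

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # Return only the sale records matching the requested suburb/postcode file
    # and land-area range; the front end aggregates them into monthly figures.
    params = event.get("queryStringParameters") or {}
    key = f"data/{params['suburb']}-{params['postcode']}.json"
    min_area = float(params.get("min_area", 0))
    max_area = float(params.get("max_area", float("inf")))

    obj = s3.get_object(Bucket="property-sales-data", Key=key)
    records = json.loads(obj["Body"].read())

    matching = [
        r for r in records
        if r.get("area") is not None and min_area <= float(r["area"]) <= max_area
    ]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(matching),
    }
```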

There are some inefficiencies with this setup which I'm still working through. For example, the Python script processes all of the historical data at once, meaning every time new data becomes available, all of the data must be processed again. I looked at splitting it into individual weekly files to make the update process simpler, but this resulted in over 150,000 files, which is a lot to manage for a small personal project. As a workaround, I use filecmp to compare the old output to the new output, so that I only need to manage the files which have actually changed.
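
The comparison itself is simple enough; the folder names here are illustrative:

```python
import filecmp
import shutil
from pathlib import Path

# Compare the freshly generated files against the previous run and only keep
# track of the ones that actually changed.
changed = []
for new_file in Path("output_new").glob("*.json"):
    old_file = Path("output_old") / new_file.name
    # shallow=False compares file contents, not just size and mtime.
    if not old_file.exists() or not filecmp.cmp(new_file, old_file, shallow=False):
        changed.append(new_file)
        shutil.copy2(new_file, old_file)

print(f"{len(changed)} files changed since the last run")
```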

Once I had set up this framework (Python/pandas, AWS, and JavaScript), I was able to easily reuse it for smaller, less complex datasets, such as analysing the NSW COVID-19 data from Data.NSW, which takes less than a minute to process four datasets.