Companion Card Directory

8 similar datasets, 8 different presentation implementations. Why? Government.

✍️ Jacob Mulquin
πŸ“… 11/08/2022

I was cleaning out my Code folder the other day and came across an old Python project titled companion_card_directory.

Oh wow, I'd totally forgotten about this!

This project was intended to unify all the different Companion Card directories from around Australia into a single place. The idea was that if you lived in Wollongong but were going on a holiday up to the Gold Coast, you wouldn't have to wade through another website with different UI and mechanics. The project would consist of a scraper and a webpage to display the data. It looks like I stopped working on it after the scraper started spitting out JSON datasets.

So I fired up the code:

cd companion_card_directory
python3 companion_card_directory.py

Things seemed to be going smoothly, until I was greeted with this lovely error:

Downloading: https://www.sa.gov.au/__data/assets/pdf_file/0009/684828/I051-Companion-Card-Affiliate-List-07_2021.pdf
Traceback (most recent call last):
  File "companion_card_directory.py", line 9, in <module>
    scrape()
  File "/home/jacob/Code/companion_card_directory/companion_card_directory/scrape.py", line 465, in sa
    pdf = pdfplumber.open(file)
  File "/home/jacob/.local/lib/python3.8/site-packages/pdfplumber/pdf.py", line 60, in open
    return cls(path_or_fp, **kwargs)
  File "/home/jacob/.local/lib/python3.8/site-packages/pdfplumber/pdf.py", line 33, in __init__
    self.doc = PDFDocument(PDFParser(stream), password=password)
  File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/pdfparser.py", line 39, in __init__
    PSStackParser.__init__(self, fp)
  File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 502, in __init__
    PSBaseParser.__init__(self, fp)
  File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 172, in __init__
    self.seek(0)
  File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 514, in seek
    PSBaseParser.seek(self, pos)
  File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 202, in seek
    self.fp.seek(pos)
AttributeError: 'bytes' object has no attribute 'seek'

Also, of the 5 states/territories that had been scraped previously, only 2 were still populating results. Thank you ACT and Queensland for not fixing what wasn't broken.

*facepalm* So that's why I abandoned it: it's a pain in the ass to have to update the scraping method every time one of the state governments does a departmental restructure or decides on a web refresh. This shouldn't be surprising at all; as with all web-scraping projects, diligent upkeep is required or things fall apart.

Anyway, I wanted to see it in action again, so I decided to make updates and document the process here. Fingers crossed some of the states/territories will come to their senses and make the data available in CSV/JSON/XML, but I'm not feeling hopeful. I will be lodging feedback with each entity to advocate that the data be made available in more accessible formats.

What is the Companion Card?

A Sample Companion Card

The Companion Card is a card provided to some people with a disability that enables them to take a support person with them to eligible venues without incurring a cost for that support person. It was introduced because it is discriminatory to expect a person to pay for a companion if that companion is required due to their disability. While it's not compulsory, businesses are encouraged to adopt its usage where the cost would not be prohibitive (e.g. it makes sense for a museum, but not so much for a restaurant).

It was an endeavour introduced by the Victorian government, and now all states and territories have implemented the Companion Card program, with each state responsible for issuing cards. I don't know why there wasn't a push for a federal companion card, but thankfully the cards can be used between jurisdictions.

There used to be a National site available at https://companioncard.gov.au but it seems to have been decommissioned from 1 February 2022.

Each state and territory now runs its own site, which I'll go through below.

Fixing what's broken

New South Wales

Ho' boy the NSW government has had a massive overhaul of their digital stuff lately.

For their search results, they are using Elasticsearch, exposed at https://www.nsw.gov.au/api/v1/elasticsearch/prod_content/_search: a simple POST request with the correct query in the body and BAM, JSON data with name, description and category. Unfortunately we still need to scrape each individual page to get contact details and address.

Thiiis close NSW, you almost got 5 stars.
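
For the curious, the search call is roughly this shape. A minimal sketch only: the query body is my guess at what the index expects, not the exact payload my scraper sends, and the field names inside each hit vary per index.

import requests

SEARCH_URL = "https://www.nsw.gov.au/api/v1/elasticsearch/prod_content/_search"

# Assumed query body: the site sends its own filters, so treat the structure
# (and the "companion card" search term) as an illustration only.
query = {
    "query": {"query_string": {"query": "companion card"}},
    "from": 0,
    "size": 100,
}

response = requests.post(SEARCH_URL, json=query, timeout=30)
response.raise_for_status()

# Standard Elasticsearch response shape: the documents live in hits.hits[]._source.
for hit in response.json().get("hits", {}).get("hits", []):
    source = hit.get("_source", {})
    # Field names differ per index, so just dump the keys for inspection.
    print(sorted(source.keys()))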

Fixed! :)

Northern Territory

It looks like the NT government has updated their website in the past year. They used to have a very basic one-page website that listed all the businesses, but now they have a dedicated website.

The result list is all well and good, but for some odd reason they don't include the affiliate "venue type" within the result set itself, so in order to get category data, multiple redundant requests needed to be made.

But we got there, and now we get a nice JSON object of the results instead of HTML.
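
Something like this, anyway. The endpoints and field names below are purely hypothetical stand-ins (the new site's API isn't documented here); the point is just to cache the per-category lookups so the redundant requests don't multiply.

import requests

# Hypothetical endpoints and field names: placeholders for whatever the new
# NT directory actually exposes, just to illustrate the two-step fetch.
RESULTS_URL = "https://example.nt.gov.au/companion-card/affiliates.json"
CATEGORY_URL = "https://example.nt.gov.au/companion-card/categories/{id}.json"

affiliates = requests.get(RESULTS_URL, timeout=30).json()

# The "venue type" isn't in the result set, so fetch each category once and
# cache it, instead of issuing one request per affiliate.
category_cache = {}
for affiliate in affiliates:
    cat_id = affiliate.get("category_id")
    if cat_id not in category_cache:
        category_cache[cat_id] = requests.get(
            CATEGORY_URL.format(id=cat_id), timeout=30
        ).json().get("name", "")
    affiliate["category"] = category_cache[cat_id]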

Through playing around with it I discovered that the site was developed by TropicsNet.

Fixed! :)

South Australia

South Australia is an interesting case because they don't provide the list of affiliates through their webpage, they only offer a PDF file. Obviously it's broken because the PDF I was extracting from previously doesn't exist anymore.

I updated the code to look at the page where the PDF is linked and find the URL to the PDF that way. I need to run the remote-or-cache fetch function twice, but I really can't be bothered to figure out why. I chalk it up to my n00b Python skills, and I'm too lazy to search out why it's not working.
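
Roughly, the new flow looks like this. A sketch only: the landing-page URL is a placeholder (the real one keeps moving), and I'm assuming the project's data/sa cache directory already exists.

import requests
import pdfplumber
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Placeholder URL: the real sa.gov.au page hosting the affiliate-list PDF moves around.
PAGE_URL = "https://www.sa.gov.au/topics/care-and-support/disability/companion-card"

# Find the first link on the page that points at a PDF, wherever it lives this month.
soup = BeautifulSoup(requests.get(PAGE_URL, timeout=30).text, "html.parser")
pdf_href = next(a["href"] for a in soup.find_all("a", href=True)
                if a["href"].lower().endswith(".pdf"))
pdf_url = urljoin(PAGE_URL, pdf_href)

# Cache the PDF locally, then pull the affiliate names out of each page's text.
cache_path = "data/sa/affiliates.pdf"
with open(cache_path, "wb") as fh:
    fh.write(requests.get(pdf_url, timeout=30).content)

with pdfplumber.open(cache_path) as pdf:
    for page in pdf.pages:
        for line in (page.extract_text() or "").splitlines():
            print(line)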

Come on South Australia, up your game mate. You only provide a list of business names in a PDF file. There is no contact information, no addresses, no other formats. You can do better, I believe in you.

Fixed! :)

Tasmania

Tasmania's website is interesting: they have a page titled "Tasmanian businesses that accept the Companion Card", but also a "Tasmanian Companion Card directory". Hmmm...

I was originally using the former, but for some reason it has stopped working with an error:

tas
Cached: /home/jacob/Code/companion_card_directory/data/tas/tasmanianbusinessesthatacceptthecompanioncard.html
Traceback (most recent call last):
  File "companion_card_directory.py", line 10, in <module>
    scrape()
  File "/home/jacob/Code/companion_card_directory/companion_card_directory/scrape.py", line 399, in tas
    name = strong.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

It is failing because I was originally extracting the text of strong elements within the 8th+ paragraphs within the #main-content div. Yuck. Very finicky... especially since that page looks to be manually curated.

My new strategy is to look at the other page, the "directory".

The URL: https://www.companioncard.communities.tas.gov.au/affiliates/directory/search?queries_region_query_posted=1

They also appear to have an array variable, queries_region_query, that gets populated with the selected regions.

Now let's see if we can add them all together and get an output of all affiliates: plz plz plz

This page is so much easier to parse with BeautifulSoup.

But then I scrapped it and refactored it to grab from each individual category page. Sure, it's not as efficient, but these government departments don't give me much choice since they don't provide easily digestible data formats.

Fixed! :)

Victoria

There's a certain sense of irony in the fact that the state that originally created the Companion Card is one of the only ones still making their data available in PDF only. At least the PDF contains an address and description.

I was hoping they would update since last year, but no luck.

Yep, the PDF still says updated 2016, lol.

Not updated in 6 years? Doesn't seem right...

This one was actually the easiest to fix: it seems like an idiosyncrasy with the pdfplumber library I was using. After adding a simple parameter, it worked just as it did before.
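
For what it's worth, the traceback earlier in this post suggests the parser was being handed raw bytes rather than a file-like object. I won't swear this is the exact change I made, but one way around that error is to wrap the bytes in BytesIO (the cached file path below is a stand-in for the project's remote-or-cache helper):

import io
import pdfplumber

# Stand-in for the cache helper: assume it hands back the PDF as raw bytes.
with open("data/vic/companion-card-affiliates.pdf", "rb") as fh:  # hypothetical cached file
    pdf_bytes = fh.read()

# Wrapping the bytes in BytesIO gives pdfplumber the seekable, file-like object
# it expects, avoiding "'bytes' object has no attribute 'seek'".
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
    print((pdf.pages[0].extract_text() or "")[:200])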

Fixed! :)

Western Australia

It's really easy to trigger a few different out-of-memory errors on this page: simply navigate to https://www.wacompanioncard.org.au/directory-affiliates/?_page=1&num=50 directly. In other words, all you have to do is open the page, select "Show 50" at the bottom and then refresh.

But if you look closely, is it really an out of memory error?

SchrΓΆdinger's Out of Memory Error 1

Another one happens when you provide &_ajax_= as an empty parameter. That gives us an error in a different file. They probably could have done with some more testing on this site.

SchrΓΆdinger's Out of Memory Error 2

Rather than try to scrape the links to individual affiliates through this search page, I decided to go a different route. Thankfully the site is using the "Yoast SEO" plugin, which automatically generates sitemaps that we can use, including this one: https://www.wacompanioncard.org.au/affiliates_dir_ltg-sitemap.xml. Oh look, a beautiful list of all the affiliate URLs.

Then it's just business-as-usual BeautifulSoup scrapy-fun-times.
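
The whole flow is roughly this. The per-page extraction is left as a placeholder, since the selectors for name, address and phone depend on the affiliate page template.

import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://www.wacompanioncard.org.au/affiliates_dir_ltg-sitemap.xml"

# Yoast sitemaps are flat XML: every <loc> element holds one affiliate page URL.
# html.parser is good enough here since the tags are all lowercase.
sitemap = BeautifulSoup(requests.get(SITEMAP_URL, timeout=30).text, "html.parser")
affiliate_urls = [loc.get_text(strip=True) for loc in sitemap.find_all("loc")]

for url in affiliate_urls:
    page = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    # Placeholder for the real extraction: just print the page title for now.
    print(page.title.get_text(strip=True) if page.title else url)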

Fixed! :)

Stats

State   Entries   Time Taken (Cached, s)   Time Per Entry (s)
nt         54        0.12                     0.0022
act        90        5.68                     0.0631
nsw      1105       28.52                     0.0258
qld       844        2.66                     0.0032
wa        589       29.37                     0.0499
sa        764        1.93                     0.0025
tas       226        0.15                     0.0007
vic       661        9.72                     0.0147

Of course, these stats aren't very useful; they're more an indication of how poorly optimized the code is and how the data is sourced.

Merging all the data together

So the fun part begins: how do we put this information together so that it's more meaningful and easier to use?

In the NSW scrape, I originally included a "state region" key, which was a region as defined by the NSW government. I thought it was a good idea to have the regions listed, as somebody may be going on a holiday to a particular destination.

To do this, postcode information is taken from the awesome Matthew Proctor. I've used this list before and it's extremely valuable; Auspost charges you for this information otherwise.

I extracted the SA3 regions into separate state files, organised by postcode, e.g. in postcodes/nsw.json:

"2500": {
    "postcode": "2500",
    "region": "Wollongong",
    "state": "nsw"
}

This makes it easier to look up later.
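
The lookup itself is then trivial. A sketch only: the lookup_region helper and the postcode regex are mine, not the project's, and they assume addresses carry a four-digit postcode somewhere in the string.

import json
import re

def lookup_region(address: str, state: str) -> str:
    """Return the SA3 region for an affiliate address, or '' if no postcode is found.

    Assumes postcodes/<state>.json is keyed by postcode, as in the snippet above.
    """
    with open(f"postcodes/{state}.json") as fh:
        postcodes = json.load(fh)
    match = re.search(r"\b(\d{4})\b", address)  # Australian postcodes are four digits
    if not match:
        return ""
    return postcodes.get(match.group(1), {}).get("region", "")

# e.g. using one of the sample records further down:
print(lookup_region("255 Keira Street, Wollongong, 2500 NSW", "nsw"))  # -> Wollongong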

After battling my way through Python, getting frustrated that I knew exactly how to overcome my problem using PHP, and reminding myself that I'm using Python to stretch myself, I finally got the bits of information together. Each business is now associated with an SA3 region! There are a few empty records coming from Victoria, but that's a task for another time.

Hooray, now anyone who wants to know about Companion Card affiliate businesses around the country can do so :)

I had also considered normalizing the category names, as each state and territory does it slightly differently, but that's a task for another time.

The data and script

You can download the minimized dataset here (I have not linked because webcrawlers):

You can find the source here: mulquin/companion_card_directory

Here's a few sample records:

{
    "address": "255 Keira Street, Wollongong, 2500 NSW",
    "category": "Museums and galleries",
    "email": "",
    "facebook": "",
    "instagram": "",
    "name": "Project Contemporary Artspace",
    "phone": "+61 431 542 309",
    "region": "",
    "state": "nsw",
    "twitter": "",
    "website": "http://www.projectgallery.com.au/"
},
{
    "address": "PO Box 142, Wonthaggi, 3995",
    "category": "",
    "email": "",
    "facebook": "",
    "instagram": "",
    "name": "Wonthaggi Agricultural Show Society",
    "phone": "",
    "region": "Gippsland - South West",
    "state": "vic",
    "twitter": "",
    "website": ""
},
{
    "address": "Champions Way, WILLOWBANK",
    "category": "Sport and Recreation",
    "email": "",
    "facebook": "",
    "instagram": "",
    "name": "Willowbank Raceway",
    "phone": "(07) 5461 5461",
    "region": "",
    "state": "qld",
    "twitter": "",
    "website": "http://www.willowbankraceway.com.au"
},
{
    "address": "",
    "category": "",
    "email": "",
    "facebook": "",
    "instagram": "",
    "name": "ABC Collinswood Centre, Adelaide",
    "phone": "",
    "region": "",
    "state": "sa",
    "twitter": "",
    "website": ""
},
{
    "address": "",
    "category": "Events and Festivals",
    "email": "",
    "facebook": "",
    "instagram": "",
    "name": "Darwin Festival",
    "phone": "08 8943 4200",
    "region": "",
    "state": "nt",
    "twitter": "",
    "website": "http://www.darwinfestival.org.au"
},
{
    "address": "Level 5, 2 Kavanagh Street, Southbank VIC 3006",
    "category": "",
    "email": "",
    "facebook": "",
    "instagram": "",
    "name": "Australian Ballet",
    "phone": "1300 369 741",
    "region": "",
    "state": "act",
    "twitter": "",
    "website": "http://www.australianballet.com.au/"
},
{
    "address": "154 CONNELL RD, WEST END, WA 6530",
    "category": "Family activities, Tourist attractions",
    "email": "",
    "facebook": "",
    "instagram": "",
    "name": "Abrolhos Adventures",
    "phone": "(08) 9942 4515",
    "region": "Mid West",
    "state": "wa",
    "twitter": "",
    "website": "https://www.abrolhosadventures.com.au/"
},
{
    "address": "",
    "category": "Entertainment and the arts",
    "email": "",
    "facebook": "",
    "instagram": "",
    "name": "Circus Oz",
    "phone": "",
    "region": "",
    "state": "tas",
    "twitter": "",
    "website": "http://www.circusoz.com/"
}

In Summary

I'd like to give each state and territory a score out of 5 for their data and a brief judgement from my perspective.

And to all the state and territory employees who are undoubtedly perusing the hotbed of action that is my website: please, please, please fight to have this data made genuinely accessible. There's nothing to gain from being coy with it; it's publicly available, and both the people with disabilities and the businesses would thank you for making it easier to get at. Remember, we want them to connect!

Until next time!