This is the companion blog post for my DrupalSouth Brisbane 2022 session.
Have you ever wondered how popular Drupal is in your local state and at the Australian Federal Government level? This blog post helps answer that question using open source tooling. The hope is that you gain some insight into the relative popularity of Drupal, and a better appreciation of the impact you and Drupal have in Australia.
Existing solutions
A number of websites claim to be able to give you this information. However, all of them will likely ask you to pay at some point.
- wappalyzer.com
- semrush.com
- builtwith.com
- whatcms.com
- similartech.com
- larger.io
“I wanted an open way to do this”
As it turns out, you can plug a few things together to detect the technologies in use:
- Wappalyzer library (open source)
- Puppeteer (open source)
Problem #1: How to get a list of all Australian Government domains
If anything, there are too many sources of this information:
Crawling would also work, as there are loads of suitable seed sites, e.g.:
The main issue is that a plain list of sites does not convey the importance of one site relative to another.
Enter DomCop, which publishes a list of the top 10 million domains on the internet, including a rank and an Open PageRank for each.
DomCop's top 10 million websites with a filter of .gov.au applied.
On top of supplying 5,795 Australian Government domains, there is also an "Open Page Rank" field. The PageRanks are calculated based on the open data provided by Common Crawl and Common Search.
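The filtering step above can be sketched in a few lines of Python. This is a minimal sketch, not the tooling used for this post: the column names ("Rank", "Domain", "Open Page Rank") are an assumption about the layout of the DomCop CSV, and the sample rows are made up for illustration.

```python
import csv
import io


def filter_by_suffix(csv_file, suffix=".gov.au"):
    """Yield (domain, open_page_rank) for every row whose domain ends with suffix.

    Assumes a header row with "Domain" and "Open Page Rank" columns
    (an assumption about the DomCop export, check your copy of the file).
    """
    bare_suffix = suffix.lstrip(".")
    for row in csv.DictReader(csv_file):
        domain = row["Domain"].strip().lower()
        # Match on a full label boundary so "notgov.au" is not picked up.
        if domain == bare_suffix or domain.endswith(suffix):
            yield domain, float(row["Open Page Rank"])


# Made-up sample rows standing in for the real 10 million line file.
sample = io.StringIO(
    '"Rank","Domain","Open Page Rank"\n'
    '"1","example.com","10.00"\n'
    '"2","example.gov.au","5.10"\n'
    '"3","notgov.au","3.00"\n'
)
gov_sites = list(filter_by_suffix(sample))
print(gov_sites)
```

With the real file you would pass an open file handle instead of the `io.StringIO` sample, which keeps memory use flat because rows are streamed one at a time.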
Problem #2: What is PageRank?
PageRank is a system for ranking web pages that Google's founders developed in 1996. A PageRank score of 0 typically indicates a low-quality website, whereas a score of 10 represents only the most authoritative sites on the web. The scale is logarithmic (assumed here to have a base of 5).
A site with PageRank 3 is 5 times more authoritative than a site with PageRank 2.
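To make the base-5 scale concrete, here is a minimal sketch that turns a PageRank into a linear weight by raising 5 to that power. The function name is illustrative, and the formula is an inference rather than the confirmed method, although it appears consistent with the Score column in the tables later in this post (a Page Rank of 5.51 sits next to a Score of 7,101).

```python
import math


def pagerank_to_score(page_rank, base=5.0):
    """Convert a logarithmic PageRank into a linear weight (assumed base of 5)."""
    return base ** page_rank


# One whole PageRank step is a factor of 5 in weight.
ratio = pagerank_to_score(3) / pagerank_to_score(2)
print(ratio)

# Compare with the Page Rank / Score pairs published in this post.
for page_rank in (5.51, 4.85, 4.25):
    print(page_rank, round(pagerank_to_score(page_rank)))
```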
Problem #3: Machinery of Government (MoG) and link rot
Australian Government sites are never static; they are constantly evolving. Sometimes several sites merge into 1, and sometimes 1 site splits into more sites.
DHS Victoria is now closed. 3 sites now replace this 1 site.
DESE is also now closed. 2 sites replace this 1 site.
Just show me the graphs
Disclaimer:
- This is based on Sept 22, 2022 data
- The scoring is based on PageRank data, so the percentages are not raw counts of websites, but an approximation of how important the respective sites are compared to others (assumes a logarithmic base of 5).
- Wappalyzer detection is not perfect (see the end of this blog post for upstreamed PRs), and there is still a fairly large portion of sites where the CMS cannot be identified
- MoGs make this tricky (PageRank relies on incoming links, which break due to MoGs)
- Only *.gov.au domains considered (some Government sites use other TLDs)
- Unlikely newly created websites are in the top 10 million just yet (due to how PageRank works)
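As a rough illustration of the second point, the sketch below weights each site by 5 raised to its PageRank and then reports each CMS's share of the total weight. It is a sketch under that assumption, not the exact code behind the graphs, and the sample data is made up.

```python
from collections import defaultdict


def cms_share(sites, base=5.0):
    """Return each CMS's percentage share of the total PageRank-derived weight.

    sites is an iterable of (cms_name, page_rank) pairs.
    """
    totals = defaultdict(float)
    for cms, page_rank in sites:
        totals[cms] += base ** page_rank
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cms: 100.0 * weight / grand_total for cms, weight in totals.items()}


# Made-up sample: Drupal is 2 of 4 sites (50% by count),
# but one of them has a higher PageRank.
sample = [
    ("drupal", 3.0),
    ("unknown", 2.0),
    ("squiz-matrix", 2.0),
    ("drupal", 2.0),
]
shares = cms_share(sample)
print(shares)  # {'drupal': 75.0, 'unknown': 12.5, 'squiz-matrix': 12.5}
```

In this sample Drupal accounts for half of the sites by raw count but three quarters of the weighted score, which is why the percentages in the graphs should not be read as site counts.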
All sites (*.gov.au)
Figure: All sites (*.gov.au)
Federal sites (not state based domains)
Programmes like GovCMS are having an impact here.
Figure: Federal sites (every non-state based domain)
Victoria *.vic.gov.au
The Single Digital Presence (SDP) programme makes a mark in Victoria.
Figure: Victoria (*.vic.gov.au)
New South Wales *.nsw.gov.au
Large Drupal sites like https://www.nsw.gov.au/ and https://www.service.nsw.gov.au/ help to make Drupal dominant in NSW.
Figure: New South Wales (*.nsw.gov.au)
South Australia *.sa.gov.au
Figure: South Australia (*.sa.gov.au)
Western Australia *.wa.gov.au
WA has a lot of unknown CMSs, including sites like https://ww2.health.wa.gov.au/, where I still have no idea which CMS is used.
Figure: Western Australia (*.wa.gov.au)
Tasmania *.tas.gov.au
The lowest usage of Drupal of any Australian state or territory, and the highest percentage of WordPress.
Figure: Tasmania (*.tas.gov.au)
Queensland *.qld.gov.au
Figure: Queensland (*.qld.gov.au)
Australian Capital Territory *.act.gov.au
The highest percentage of Squiz of any Australian state or territory.
Figure: Australian Capital Territory (*.act.gov.au)
Northern Territory *.nt.gov.au
Figure: Northern Territory (*.nt.gov.au)
Open Source Software (OSS) CMS vs Proprietary CMS
For the CMSs that can be identified, this graph splits them into 2 categories: OSS and proprietary.
Figure: Open Source Software (OSS) CMS vs Proprietary CMS
Drupal sites by major version
For sites reporting as Drupal, Drupal 9 and 7 are the most popular.
Figure: Drupal by major version
Observations and other unusual findings
#1 - Drupal usage
“Drupal powers roughly 27% of all digital experiences that you use in the Australian government”
#2 - Top contender
“Squiz Matrix is the top contender with 15%, and has a clear state led mandate in certain states/territories”
#3 - TLS coverage
TLS coverage is not 100% - 129 domains found with no TLS
| Domain | CMS | Page Rank | Score |
|---|---|---|---|
|  | unknown | 5.51 | 7,101 |
|  | unknown | 4.47 | 1,332 |
| http://www.mbsonline.gov.au/internet/mbsonline/publishing.nsf/Content/Home | hcl-notes | 4.45 | 1,289 |
| http://onesearch.slq.qld.gov.au/primo-explore/search?vid=SLQ | unknown | 4.41 | 1,209 |
| http://www.majorprojects.planning.nsw.gov.au/ | unknown | 4.41 | 1,209 |
The second most trafficked site in the Australian Government does not support TLS. Instead, this awkward redirect page is used. And a sad face emoji. Sad face indeed.
#4 - If in doubt, add a number
19 domains found with ww[number] as a subdomain.
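A hedged sketch of how such hosts could be spotted: take the first label of the hostname and check whether it is "ww" or "www" followed by digits. The exact pattern is an assumption on my part, chosen so that both ww2 and www0 style hosts from this post match.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern: two or three "w" characters followed by one or more digits.
NUMBERED_WWW = re.compile(r"w{2,3}\d+")


def has_numbered_www(url):
    """Return True when the first hostname label looks like ww2, www0, etc."""
    host = urlsplit(url).hostname or ""
    first_label = host.split(".")[0]
    return NUMBERED_WWW.fullmatch(first_label) is not None


for url in (
    "https://www0.landgate.wa.gov.au/",
    "https://ww2.health.wa.gov.au/",
    "https://www.nsw.gov.au/",
):
    print(url, has_numbered_www(url))
```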
| Domain | CMS | Page Rank | Score |
|---|---|---|---|
|  | drupal | 4.85 | 2,455 |
|  | unknown | 4.62 | 1,695 |
|  | unknown | 4.49 | 1,375 |
|  | unknown | 4.3 | 1,013 |
| https://www0.landgate.wa.gov.au/ | squiz-matrix | 4.25 | 935 |
When you run out of subdomains, just add a number.
#5 - You cannot kill Dreamweaver
15 sites found in 2022.
Extending this for the future
- Crawl other domain spaces, e.g. the New Zealand government domain space *.govt.nz
- Make a website and publish this data quarterly (DomCop's data updates around this frequency)
- Measure trends over time
Upstreamed enhancements
These all aim to make the detection of CMSs and JavaScript frameworks more accurate for Australian Government sites.
- Ripple #6827
- Dreamweaver #6828
- HCL DX #6829
- MODX #6832
- Squiz Matrix #6833
- Optimizely CMS #6834
- Umbraco #6835
- Elcom #6836
- Lagoon #6861
- OSS flags and links #6902
- Squiz Matrix #2 #6925
- Ektron CMS #6924
- SiteCore #6926
Raw data
If you want to do your own analysis, here is a link to a full CSV dump.
Comments
I am keen to hear feedback on this data and on what can be done to improve the scoring. Also, if you can help fill in some of the 'unknown' data, let me know; I am happy to craft another PR into Wappalyzer.