-
Notifications
You must be signed in to change notification settings - Fork 777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Translator: Australian Parliament House (Bills & Legislation) #3396
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Some comments.
"translatorID": "ee31d168-6213-4436-acf6-3e08513ba501", | ||
"label": "Australian Parliament House (Bills & Legislation)", | ||
"creator": "Simon Elvery", | ||
"target": "https:\\/\\/www\\.aph\\.gov\\.au\\/Parliamentary_Business\\/Bills_Legislation\\/Bills_Search_Results.+", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"target": "https:\\/\\/www\\.aph\\.gov\\.au\\/Parliamentary_Business\\/Bills_Legislation\\/Bills_Search_Results.+", | |
"target": "^https://www\\.aph\\.gov\\.au/Parliamentary_Business/Bills_Legislation/Bills_Search_Results", |
The regex passes if it matches any part of the string, so we need to explicitly anchor to the start and don't need padding at the end. And no need to escape slashes except in a JS regex literal.
const results = doc.querySelector('.search-filter-results'); | ||
if (!status && !results) return false; | ||
if (results) return 'multiple'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check getSearchResults(doc, true)
explicitly before returning multiple
for consistency. Can do that instead of this (similar but not identical) check.
for (var i = 0; i < rows.length; i++) { | ||
var href = attr(rows[i], 'href'); | ||
var title = text(rows[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (var i = 0; i < rows.length; i++) { | |
var href = attr(rows[i], 'href'); | |
var title = text(rows[i]); | |
for (let row of rows) { | |
var href = row.href; | |
var title = row.textContent; |
Can use for..of
, and I think you're misunderstanding what attr()
and text()
do: they get an attribute/the text of a child matching a selector, not the element itself. text(elem, '.some-child')
is roughly equivalent to elem.querySelector('.some-child')?.textContent || ''
.
const results = getSearchResults(doc, false); | ||
if (!Object.keys(results).length) return; | ||
let items = await Zotero.selectItems(results); | ||
if (!items) return; | ||
for (let url of Object.keys(items)) { | ||
await scrape(await requestDocument(url), url); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just follow the web translator template:
const results = getSearchResults(doc, false); | |
if (!Object.keys(results).length) return; | |
let items = await Zotero.selectItems(results); | |
if (!items) return; | |
for (let url of Object.keys(items)) { | |
await scrape(await requestDocument(url), url); | |
} | |
let items = await Zotero.selectItems(getSearchResults(doc, false)); | |
if (!items) return; | |
for (let url of Object.keys(items)) { | |
await scrape(await requestDocument(url)); | |
} |
} | ||
|
||
function scrape(doc, url) { | ||
let item = new Zotero.Item(detectWeb(doc,url)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let item = new Zotero.Item(detectWeb(doc,url)); | |
let item = new Zotero.Item(detectWeb(doc, url)); |
item.url = url; | ||
var permalink = attr(doc, '#main_0_billSummary_permalink', 'href'); | ||
if (permalink.length > 0) { | ||
item.url = permalink; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
item.url = url; | |
var permalink = attr(doc, '#main_0_billSummary_permalink', 'href'); | |
if (permalink.length > 0) { | |
item.url = permalink; | |
} | |
item.url = attr(doc, '#main_0_billSummary_permalink', 'href') || url; |
item.history = Array.from(doc.querySelectorAll('#main_0_mainDiv thead tr')).reduce((all, el) => { | ||
const chamber = text(el, 'th'); | ||
return all.concat(Array.from(el.querySelectorAll('tbody tr')).map((el) => { | ||
return `${chamber.length ? chamber + ' ' : ''}${text(el, 'td:first-child')} ${text(el, 'td:last-child')}`; | ||
})); | ||
}, []); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I really have no idea what's going on here. Can we rewrite this in a more imperative style with comments? I like functional programming, but a web scraper isn't usually the right place for it - we just end up with way too many nested layers of logic and it's pretty hard to follow.
item.attachments = Array.from(doc.querySelectorAll('div.bill-details tr')).reduce((acc, rowEl) => { | ||
const name = text(rowEl, 'li', 0).replace(/\\n/g,'').replace(/\s+/g, ' '); | ||
Array.from(rowEl.querySelectorAll('.format a')).forEach((aEl) => { | ||
const formatName = attr(aEl, 'img', 'alt'); | ||
const mimeType = aEl.href.split(';').map(d => d.split('=').map(dd => decodeURIComponent(dd))).find(d => d[0] === 'fileType')[1]; | ||
acc.push({ | ||
title: `${name} (${formatName})`, | ||
url: aEl.href, | ||
mimeType | ||
}) | ||
}) | ||
return acc; | ||
},[]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rewrite this using map
or just iteration?
{ | ||
"title": "First reading (Word Format)", | ||
"mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" | ||
}, | ||
{ | ||
"title": "First reading (PDF Format)", | ||
"mimeType": "application/pdf" | ||
}, | ||
{ | ||
"title": "Explanatory memorandum (Word Format)", | ||
"mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" | ||
}, | ||
{ | ||
"title": "Explanatory memorandum (PDF Format)", | ||
"mimeType": "application/pdf" | ||
}, | ||
{ | ||
"title": "2nd reading amendment STEGGALL, Zali, MP (Word Format)", | ||
"mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" | ||
}, | ||
{ | ||
"title": "2nd reading amendment STEGGALL, Zali, MP (PDF Format)", | ||
"mimeType": "application/pdf" | ||
}, | ||
{ | ||
"title": "2nd reading amendment DANIEL, Zoe, MP (Word Format)", | ||
"mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" | ||
}, | ||
{ | ||
"title": "2nd reading amendment DANIEL, Zoe, MP (PDF Format)", | ||
"mimeType": "application/pdf" | ||
}, | ||
{ | ||
"title": "2nd reading amendment SCAMPS, Sophie, MP (Word Format)", | ||
"mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" | ||
}, | ||
{ | ||
"title": "2nd reading amendment SCAMPS, Sophie, MP (PDF Format)", | ||
"mimeType": "application/pdf" | ||
}, | ||
{ | ||
"title": "2nd reading amendment TINK, Kylea, MP (Word Format)", | ||
"mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" | ||
}, | ||
{ | ||
"title": "2nd reading amendment TINK, Kylea, MP (PDF Format)", | ||
"mimeType": "application/pdf" | ||
}, | ||
{ | ||
"title": "Detail - Crossbench SHARKIE, Rebekha, MP (Word Format)", | ||
"mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" | ||
}, | ||
{ | ||
"title": "Detail - Crossbench SHARKIE, Rebekha, MP (PDF Format)", | ||
"mimeType": "application/pdf" | ||
}, | ||
{ | ||
"title": "Bills Digest (PDF Format)", | ||
"mimeType": "application/pdf" | ||
}, | ||
{ | ||
"title": "Snapshot", | ||
"mimeType": "text/html" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We definitely don't want all of these. At the very least, we only want one format per attachment.
- The "[...] Format" part of the title is unnecessary.
"items": [ | ||
{ | ||
"itemType": "statute", | ||
"nameOfAct": "Aboriginal Land Grant (Jervis Bay Territory) Amendment (Strengthening Land and Governance Provisions) Bill 2022", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we aren't really filling in most of the bill
/statute
fields that we need to make a good citation. We should be able to get most of them from the page. See Help -> List All Fields for Item Type (or hover over an item type name in the code editor)
@AbeJellinek — thanks for all your comments on this, it's cleared a few things up for me. I'll work on fixing/committing your suggestions when I can. As a first time translator author I found it very difficult to find good documentation about the structure and functions available to translators and the field mappings for different citation types. Can you recommend any good documentation I might have missed? |
There's general documentation and web translator-specific documentation, but it honestly all could use an update. As I mentioned in the comments, you should use Scaffold and build around the web translator template (+ button in the toolbar). Also take a look at some recent(ish)ly added/updated translators, like https://github.com/zotero/translators/blob/b947ce66c9da49c9019bafd6098dafe8335a98ea/PubFactory%20Journals.js and https://github.com/zotero/translators/blob/35fc547908c39f842e09b653f9f51f41b7326798/Retsinformation.js. |
(For reference, the web translator template is https://github.com/zotero/zotero/blob/ed045e640a74e2b2075e7b16fa32ea856f704ce5/chrome/content/scaffold/templates/newWeb.js) |
@drzax: I've updated the translator development docs at https://www.zotero.org/support/dev/translators somewhat. Still more work to do, but I hope they're a little clearer now. |
This translator adds support for importing documents related to bills and enacted legislation from the Australian
Parliament House website (aph.gov.au).