Bump Wikipedia export version #738

Merged
merged 1 commit into from
Jan 31, 2025

Conversation

maxjakob
Contributor

@maxjakob maxjakob commented Jan 31, 2025

To fix:

> python _tools/parse_documents.py dewiki-20250120-pages-articles.xml.bz2

Traceback (most recent call last):
  File "/Users/maxjakob/src/rally-tracks/wikipedia/_tools/parse_documents.py", line 59, in <module>
    to_json(file_name)
  File "/Users/maxjakob/src/rally-tracks/wikipedia/_tools/parse_documents.py", line 28, in to_json
    for doc_data in doc_generator(fp):
  File "/Users/maxjakob/src/rally-tracks/wikipedia/_tools/parse_documents.py", line 16, in doc_generator
    yield parse_page(element, namespaces)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/maxjakob/src/rally-tracks/wikipedia/_tools/parse_documents.py", line 42, in parse_page
    "title": element.find("title", XML_NAMESPACES).text,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'text'

See https://www.mediawiki.org/wiki/Help:Export#Export_format
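
For context, here is a minimal sketch of how a stale export namespace can produce this `AttributeError`. The namespace URIs, the sample page, and the `find_title` helper are assumptions for illustration; they are not copied from `parse_documents.py` or from this PR's diff. It relies on ElementTree resolving unprefixed tags against the `""` entry of the namespace map (Python 3.8+).

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace maps; the real XML_NAMESPACES value and the exact
# versions involved in this PR are not shown in the conversation.
OLD_NS = {"": "http://www.mediawiki.org/xml/export-0.10/"}
NEW_NS = {"": "http://www.mediawiki.org/xml/export-0.11/"}

# Made-up page element declaring the newer export namespace.
PAGE_XML = (
    '<page xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    "<title>Beispiel</title>"
    "</page>"
)


def find_title(page_xml, namespaces):
    """Return the page title, or None if <title> is not found in that namespace."""
    element = ET.fromstring(page_xml)
    node = element.find("title", namespaces)
    # Guard against the None that caused the traceback above.
    return None if node is None else node.text


stale = find_title(PAGE_XML, OLD_NS)    # namespace mismatch: no match
current = find_title(PAGE_XML, NEW_NS)  # namespace matches the dump

print(stale, current)
```

With the mismatched map, `find` returns `None`, so calling `.text` on the result directly would fail the same way as in the traceback.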

Member

@kderusso kderusso left a comment

:rubber-stamp:

@maxjakob maxjakob merged commit 6e2396f into master Jan 31, 2025
25 checks passed