[Wikibase] Working with JSON dump
In some cases, you might wish to import and analyze Wikibase entities from serialized JSON. For example, you may be processing the JSON dump generated by dumpJson.php (extensions/Wikibase/repo/maintenance/dumpJson.php), which is an array of serialized entities. The SerializableEntity class makes this possible.
First, from the root of your MediaWiki installation, dump your Wikibase entities with the following command:
php ./extensions/Wikibase/repo/maintenance/dumpJson.php --no-cache --output=entities.json
Now entities.json contains a huge JSON array of dumped Wikibase entities, with one minified JSON entity per line. Each array item has the following structure:
{
    "type": "item",
    "sitelinks": [
        // ...
    ],
    "descriptions": {
        "en": {
            "language": "en",
            "value": "totality of space and all matter and radiation in it, including planets, galaxies, light, and us; may include their properties such as energy; may include time/spacetime"
        },
        "zh": {
            "language": "zh",
            "value": "一切空间、时间、物质和能量构成的总体"
        },
        // ...
    },
    "id": "Q2",
    "claims": {
        "P2": [{
            "type": "statement",
            "references": [],
            "mainsnak": {
                "snaktype": "value",
                "property": "P2",
                "datavalue": {
                    "value": "Q1",
                    "type": "string"
                },
                "datatype": "external-id"
            },
            "qualifiers": [],
            "id": "Q2$997A7A7B-8737-49B6-9386-BD934CE9E2A7",
            "rank": "normal"
        }],
        "P3": [{
            "type": "statement",
            "references": [{
                "hash": "0e556569b6638a2a8a6ee29edef2644b2fc29c15",
                "snaks-order": ["P2"],
                "snaks": {
                    "P2": [{
                        "snaktype": "value",
                        "property": "P2",
                        "datavalue": {
                            "value": "Q1$8983b0ea-4a9c-0902-c0db-785db33f767c",
                            "type": "string"
                        },
                        "datatype": "external-id"
                    }]
                }
            }],
            "mainsnak": {
                "snaktype": "somevalue",
                "property": "P3",
                "datatype": "wikibase-item"
            },
            "qualifiers": [],
            "id": "Q2$47BA934E-9A36-42C5-8767-C4D8D6A3F333",
            "rank": "normal"
        }],
        // ...
    },
    "aliases": {
        "en": [{
            "language": "en",
            "value": "Our Universe"
        }, {
            "language": "en",
            "value": "The Universe"
        }, {
            "language": "en",
            "value": "Universe (Ours)"
        }, {
            "language": "en",
            "value": "The Cosmos"
        }, {
            "language": "en",
            "value": "cosmos"
        }]
    },
    "labels": {
        "en": {
            "language": "en",
            "value": "Universe"
        },
        "zh": {
            "language": "zh",
            "value": "宇宙"
        },
        // ...
    }
}
You can use SerializableEntity.Parse(string) to create a SerializableEntity instance from the JSON contained in a string, or one of the SerializableEntity.Load overloads to create a SerializableEntity instance from a TextReader, JsonReader, or JObject.
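As a minimal sketch of the two entry points described above, the following program parses a small hand-written entity from a string and then loads the same JSON through the TextReader overload. The JSON literal is made-up sample data in the dump's shape, and the `WikiClientLibrary.Wikibase` namespace is an assumption about where SerializableEntity and EntityType live in your version of the library.

```csharp
using System;
using System.IO;
using WikiClientLibrary.Wikibase; // assumed namespace of SerializableEntity

internal static class ParseExample
{
    private static void Main()
    {
        // Hand-written sample entity in the dump's JSON shape (hypothetical data).
        const string json = @"{
            ""type"": ""item"",
            ""id"": ""Q2"",
            ""labels"": { ""en"": { ""language"": ""en"", ""value"": ""Universe"" } },
            ""descriptions"": {},
            ""aliases"": {},
            ""claims"": {},
            ""sitelinks"": {}
        }";

        // Parse directly from a string.
        SerializableEntity fromString = SerializableEntity.Parse(json);
        Console.WriteLine($"{fromString.Id}: {fromString.Labels["en"]}");

        // Load the same JSON through the TextReader overload.
        using (var reader = new StringReader(json))
        {
            SerializableEntity fromReader = SerializableEntity.Load(reader);
            Console.WriteLine(fromReader.Type == EntityType.Item);
        }
    }
}
```

Parse is convenient when the JSON is already in memory. The Load overloads fit better when the JSON comes from a file or a network stream.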
It is possible to convert an Entity into a SerializableEntity with the SerializableEntity.Load(IEntity) overload.
You can use SerializableEntity.ToJsonString or SerializableEntity.ToJObject to persist the entity as Wikibase-compatible JSON.
You can also use SerializableEntity.WriteTo to write the JSON serialization to a TextWriter or JsonWriter.
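The serialization methods can be sketched as a simple round trip: parse an entity, turn it back into a JSON string, and stream the same serialization to a file. This sketch assumes that ToJsonString has a parameterless overload and that WriteTo accepts a TextWriter. The sample JSON and the output file name Q2.json are made up for illustration.

```csharp
using System;
using System.IO;
using WikiClientLibrary.Wikibase; // assumed namespace of SerializableEntity

internal static class WriteExample
{
    private static void Main()
    {
        // Hypothetical, nearly empty entity used only to have something to serialize.
        SerializableEntity entity = SerializableEntity.Parse(
            @"{ ""type"": ""item"", ""id"": ""Q2"", ""labels"": {}, ""descriptions"": {}, ""aliases"": {}, ""claims"": {}, ""sitelinks"": {} }");

        // Serialize to an in-memory string.
        string json = entity.ToJsonString();
        Console.WriteLine(json);

        // Stream the same serialization to a writer, here a file on disk.
        using (var writer = new StreamWriter("Q2.json"))
        {
            entity.WriteTo(writer);
        }
    }
}
```

Writing to a writer avoids building the whole JSON string in memory first, which matters when many entities are exported in sequence.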
To work with a huge JSON dump as exported by dumpJson.php, you may use one of the SerializableEntity.LoadAll overloads to load the array of entities by file name, from a TextReader, or from a JsonReader. This method returns IEnumerable<SerializableEntity>, so if you iterate over it with a foreach loop, only the entity currently being processed is held in memory. This makes it possible to process a large number of entities in a forward-only manner.
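A minimal sketch of this forward-only pattern, using the file-name overload to count entities by type. It assumes the dump was written to entities.json in the working directory, as in the command near the top of this page, and that the types live in the `WikiClientLibrary.Wikibase` namespace.

```csharp
using System;
using WikiClientLibrary.Wikibase; // assumed namespace of SerializableEntity

internal static class CountExample
{
    private static void Main()
    {
        int items = 0, properties = 0;

        // Entities are consumed one at a time inside the loop,
        // so the whole dump is never materialized as a list.
        foreach (SerializableEntity entity in SerializableEntity.LoadAll("entities.json"))
        {
            if (entity.Type == EntityType.Item) items++;
            else if (entity.Type == EntityType.Property) properties++;
        }

        Console.WriteLine($"{items} items, {properties} properties");
    }
}
```

Avoid calling ToList or ToArray on the result, since that would pull every entity into memory at once and defeat the purpose of streaming.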
The following code example is taken from DataModulesExporter.cs in crystal-pool/WikibaseClientLite, where the input JSON dump file is converted into a set of Lua modules:
foreach (var entity in SerializableEntity.LoadAll(itemsDumpReader))
{
    if (entity.Type == EntityType.Item) items++;
    else if (entity.Type == EntityType.Property)
        properties++;

    // Preprocess
    entity.Labels = FilterMonolingualTexts(entity.Labels, languages);
    entity.Descriptions = FilterMonolingualTexts(entity.Descriptions, languages);
    entity.Aliases = FilterMonolingualTexts(entity.Aliases, languages);

    // Persist
    using (var module = moduleFactory.GetModule(entity.Id))
    {
        using (var writer = module.Writer)
        {
            WriteProlog(writer, $"Entity: {entity.Id} ({entity.Labels["en"]})");
            using (var luawriter = new JsonLuaWriter(writer) {CloseOutput = false})
            {
                entity.WriteTo(luawriter);
            }
            WriteEpilog(writer);
        }
        await module.SubmitAsync($"Export entity {entity.Id}.");
    }

    if ((items + properties) % 500 == 0)
    {
        Logger.Information("Exported LUA modules for {Items} items and {Properties} properties.", items, properties);
    }
}