Syntax
The syntax of GraphQuery consists of fieldname, pipeline and children. They define text extraction rules and data return formats:
- fieldname defines the name of the return field.
- pipeline defines the rules of text extraction and processing.
- children defines the contents and structure of child elements.
- 1. Fieldname
- 2. Pipeline
- 1. css | Use CSS selector to select elements
- 2. json | Use JSON selector to select elements
- 3. xpath | Use XPath selector to select elements
- 4. regex | Use regular expressions to select elements
- 5. trim | Remove blanks from both ends of text
- 6. template | Use Template Strings
- 7. attr | Get an attribute of the current element
- 8. eq | Get an element in the element collection
- 9. string | Get the native string of the element
- 10. text | Get the inner text of the element
- 11. link | Reference the text of field
- 12. replace | Replace the text in the element
- 13. absolute | Absolute URL
- 3. Children
Fieldname defines the name of the field that returns the result. A fieldname may contain English letters, numbers and underscores.
When a fieldname begins and ends with two consecutive underscores, the field is not output in the result. This is usually used for temporary compute nodes.
For example:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars</title>
</book>
</library>
{
title1 `css("title")`
isbn `css("isbn")`
__isbn__ `css("isbn")`
}
Result:
{
"isbn": "0836217462",
"title1": "Star Wars"
}
title1, isbn and __isbn__ are all fieldnames. Because __isbn__ is wrapped in two consecutive underscores, it is not exported.
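The hiding rule can be sketched in Python. This is only an illustration of the rule described above, not GraphQuery's implementation; the helper name visible_fields is hypothetical.

```python
def visible_fields(result: dict) -> dict:
    """Drop every key that is wrapped in two consecutive underscores."""
    return {
        name: value
        for name, value in result.items()
        # len > 4 ensures there is at least one character between the wrappers.
        if not (len(name) > 4 and name.startswith("__") and name.endswith("__"))
    }


# Values computed for all three fields, including the temporary one.
computed = {"title1": "Star Wars", "isbn": "0836217462", "__isbn__": "0836217462"}
print(visible_fields(computed))
# {'title1': 'Star Wars', 'isbn': '0836217462'}
```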
Pipeline is defined by a set of functions wrapped in backquotes. It defines the text extraction and processing rules for the current field. Functions in a pipeline are case-sensitive, and parameters need to be wrapped in double quotes.
If a function parameter in a pipeline contains double quotes, you need to escape them with a backslash (\).
For example, the document and the GraphQuery definition are as follows:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en"> Star Wars </title>
</book>
</library>
bookname `css("title");text();trim()`
In the above definition, the pipeline of the field bookname is css("title");text();trim(). It tells the processor to process the field bookname as follows:
- Use the CSS selector title to select the node <title lang="en"> Star Wars </title> from the document and pass it to the next function.
- The text() function receives the <title lang="en"> Star Wars </title> passed by the previous function and extracts the text it contains, " Star Wars ".
- The trim() function receives " Star Wars ", removes the whitespace at both ends of the text, and returns the result "Star Wars".
Therefore, the final return result is "Star Wars". Let's go to PlayGround and have a try.
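To make the flow concrete, here is a rough Python analogue in which each stage receives the output of the previous one. It is a sketch under assumptions, not how GraphQuery is implemented: select_tag is a hypothetical stand-in for css() that only matches by tag name, and run_pipeline is an illustrative helper.

```python
import xml.etree.ElementTree as ET
from functools import reduce

DOCUMENT = """<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en"> Star Wars </title>
</book>
</library>"""


def select_tag(tag):
    # Hypothetical stand-in for css("title"): first element with this tag name.
    return lambda root: root.find(f".//{tag}")


def text(node):
    # Concatenate all text inside the node, dropping the tags.
    return "".join(node.itertext())


def trim(value):
    # Remove whitespace from both ends.
    return value.strip()


def run_pipeline(document, stages):
    # Feed the parsed document through the stages, left to right.
    root = ET.fromstring(document)
    return reduce(lambda value, stage: stage(value), stages, root)


bookname = run_pipeline(DOCUMENT, [select_tag("title"), text, trim])
print(repr(bookname))  # 'Star Wars'
```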
A pipeline is more than just the functions shown in the above example: the large number of functions built into Pipeline makes GraphQuery compatible, flexible and powerful.
The following tutorial will show you how to use these functions.
css() accepts only one parameter, which is a CSS selector.
Sample:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars</title>
</book>
</library>
bookname `css("title")`
Result: "Star Wars"
Note: When your pipeline ends with css, the GraphQuery engine automatically calls the text() method to remove the HTML/XML tags in the node. If you want to keep the HTML/XML tags in the output, call the string() method at the end of the pipeline.
json() accepts only one parameter, which is a JSON path selector. If you are not familiar with the JSON path syntax, you can jump to JSON Path Syntax.
Sample:
{
"title": "Star Wars"
}
bookname `json("title")`
Result: "Star Wars"
xpath() accepts only one parameter, which is an XPath selector.
Sample:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars</title>
</book>
</library>
bookname `xpath("//title")`
Result: "Star Wars"
Note: When your pipeline ends with xpath, the GraphQuery engine automatically calls the text() method to remove the HTML/XML tags in the node. If you want to keep the HTML/XML tags in the output, call the string() method at the end of the pipeline.
regex() accepts only one parameter, which is a regular expression. Regular expressions in GraphQuery behave slightly differently from what you may be used to: when the expression contains multiple groups, only the value of the first group is returned.
Let's look at an example:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars</title>
</book>
</library>
{
isbnA `regex("<isbn>.*?</isbn>")`
isbnB `regex("<isbn>(.*?)</isbn>")`
titleA `regex("<title lang=\"(.*?)\">(.*?)</title>")`
titleB `regex("<title lang=\"(.*?)\">.*?</title>")`
}
Result:
{
"isbnA": "<isbn>0836217462</isbn>",
"isbnB": "0836217462",
"titleA": "en",
"titleB": "en"
}
Notice that the regular expression of titleA contains two groups (.*?), but only the result of the first group is returned, so it gives the same result as the titleB expression.
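The "first group only" rule can be approximated with Python's re module. This is a sketch of the behaviour described above, not GraphQuery's own regex engine; the helper name regex_stage is hypothetical, and returning an empty string when nothing matches is an assumption made here for illustration.

```python
import re

DOCUMENT = """<library>
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars</title>
</book>
</library>"""


def regex_stage(pattern: str, text: str) -> str:
    match = re.search(pattern, text)
    if match is None:
        return ""  # assumption: no match yields an empty string
    # With groups: return only the first one. Without groups: the whole match.
    return match.group(1) if match.groups() else match.group(0)


print(regex_stage(r"<isbn>.*?</isbn>", DOCUMENT))    # <isbn>0836217462</isbn>
print(regex_stage(r"<isbn>(.*?)</isbn>", DOCUMENT))  # 0836217462
print(regex_stage(r'<title lang="(.*?)">(.*?)</title>', DOCUMENT))  # en
```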
trim() does not accept any parameters.
Sample:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en"> Star Wars </title>
</book>
</library>
{
booknameA `css("title");text()`
booknameB `css("title");text();trim()`
}
Result:
{
"booknameA": " Star Wars ",
"booknameB": "Star Wars"
}
template() accepts only one parameter, which should be a template string. In the template, use {$fieldname} to reference the value of a field that has already been calculated, and use {$} to reference the value of the current field itself. For example:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars</title>
</book>
</library>
{
isbn `css("isbn");text()`
booknameA `css("title");text()`
booknameB `css("title");text();template("{$}")`
booknameC `css("title");text();template("[{$}]")`
booknameD `css("title");text();template("{$}[]")`
booknameE `css("title");text();template("wow! {$}")`
booknameF `css("title");text();template("{$} [{$isbn}]")`
booknameG `css("title");text();template("{$} {$isbn}")`
}
Result:
{
"isbn": "0836217462",
"booknameA": "Star Wars",
"booknameB": "Star Wars",
"booknameC": "[Star Wars]",
"booknameD": "Star Wars[]",
"booknameE": "wow! Star Wars",
"booknameF": "Star Wars [0836217462]",
"booknameG": "Star Wars 0836217462"
}
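The placeholder substitution can be sketched in a few lines of Python. This only mirrors the {$} and {$fieldname} rules described above and is not GraphQuery's implementation; render_template is a hypothetical helper, and replacing an unknown field name with an empty string is an assumption.

```python
import re


def render_template(template: str, current: str, fields: dict) -> str:
    def substitute(match):
        name = match.group(1)
        # "{$}" has an empty name and refers to the current value.
        if name == "":
            return current
        # "{$name}" refers to an already computed field (assumed "" if unknown).
        return fields.get(name, "")

    return re.sub(r"\{\$(\w*)\}", substitute, template)


fields = {"isbn": "0836217462"}
print(render_template("wow! {$}", "Star Wars", fields))        # wow! Star Wars
print(render_template("{$} [{$isbn}]", "Star Wars", fields))   # Star Wars [0836217462]
```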
attr() accepts only one parameter, which is the attribute name.
Sample:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars</title>
</book>
</library>
{
bookID `css("book");attr("id")`
titleLang `css("title");attr("lang")`
}
Result:
{
"bookID": "b0836217462",
"titleLang": "en"
}
eq() accepts only one parameter, which is the index of the element in the collection. The index starts from 0.
Sample:
<library>
<book>Star Wars I</book>
<book>Star Wars II</book>
<book>Star Wars III</book>
</library>
{
bookA `css("book");text()`
bookB `css("book");eq("0");text();`
bookC `css("book");eq("1");text();`
bookD `css("book");eq("2");text();`
bookE `css("book");eq("3");text();`
}
Result:
{
"bookA": "Star Wars IStar Wars IIStar Wars III",
"bookB": "Star Wars I",
"bookC": "Star Wars II",
"bookD": "Star Wars III",
"bookE": ""
}
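The indexing behaviour in this sample, including the empty result for an out-of-range index, can be imitated in Python. This is a sketch of the observable behaviour only; the helpers eq and text are illustrative and assume that a selection is a list of elements.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<library>
<book>Star Wars I</book>
<book>Star Wars II</book>
<book>Star Wars III</book>
</library>"""


def eq(nodes, index):
    # Keep only the element at the given 0-based index; nothing if out of range.
    return [nodes[index]] if 0 <= index < len(nodes) else []


def text(nodes):
    # Concatenate the inner text of every selected element.
    return "".join("".join(node.itertext()) for node in nodes)


books = ET.fromstring(DOCUMENT).findall(".//book")
print(text(books))         # Star Wars IStar Wars IIStar Wars III
print(text(eq(books, 1)))  # Star Wars II
print(repr(text(eq(books, 3))))  # ''
```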
string() does not accept any parameters. Because the text() method is called automatically when css() or xpath() appears at the end of the pipeline, the string() method can be used to suppress this automatic call.
Sample:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars</title>
</book>
</library>
{
isbnA `css("isbn")`
isbnB `css("isbn");string()`
isbnC `xpath("//isbn")`
isbnD `xpath("//isbn");string()`
}
Result:
{
"isbnA": "0836217462",
"isbnB": "<isbn>0836217462</isbn>",
"isbnC": "0836217462",
"isbnD": "<isbn>0836217462</isbn>"
}
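The difference between the two outputs corresponds to "inner text" versus "serialized node". A rough Python analogue using the standard library's ElementTree (an assumption for illustration; GraphQuery does not use it) looks like this:

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<library>
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars</title>
</book>
</library>"""

node = ET.fromstring(DOCUMENT).find(".//isbn")

# Analogue of text(): only the text inside the element.
as_text = "".join(node.itertext())

# Analogue of string(): the element with its tags.
# tostring() also emits the whitespace that follows the element, hence strip().
as_string = ET.tostring(node, encoding="unicode").strip()

print(as_text)    # 0836217462
print(as_string)  # <isbn>0836217462</isbn>
```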
text() does not accept any parameters.
Sample:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn> 0836217462 </isbn>
<title lang="en">Star Wars</title>
</book>
</library>
{
isbnA `css("isbn");trim()`
isbnB `css("isbn");text();trim()`
}
Result:
{
"isbnA": "<isbn> 0836217462 </isbn>",
"isbnB": "0836217462"
}
The link() method takes one parameter, which should be the name of a field whose result has already been calculated. You can reference the root node with __ROOT__.
Sample:
<library>
<script>
var template = {
"title": "Star Wars",
"isbn": "0836217462",
}
</script>
</library>
{
json `regex("template = ([\s\S]*?)</script>")`
title `link("json");json("title")`
isbn `link("json");json("isbn")`
}
Result:
{
"json": "{\n \"title\": \"Star Wars\",\n \"isbn\": \"0836217462\",\n }\n ",
"title": "Star Wars",
"isbn": "0836217462"
}
replace() receives two arguments: the first is the string to be replaced, and the second is the string to replace it with.
Sample:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Star Wars is OK!</title>
</book>
</library>
{
titleA `css("title");text()`
titleB `css("title");text();replace(" is OK!", "")`
}
Result:
{
"titleA": "Star Wars is OK!",
"titleB": "Star Wars"
}
absolute() converts a URL into an absolute URL. It receives only one argument, which it treats as the parent URL; the text value of the current node is treated as the sub-URL.
Sample:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<a href="/404.html">404 NOT FOUND</a>
</book>
</library>
{
urlA `css("a");attr("href")`
urlB `css("a");attr("href");absolute("https://google.com")`
}
Result:
{
"urlA": "/404.html",
"urlB": "https://google.com/404.html"
}
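For comparison, Python's standard library resolves a relative URL against a base URL in the same spirit. The sketch below uses urllib.parse.urljoin and is only an analogy; whether GraphQuery follows exactly the same resolution rules in every edge case is not stated in this document.

```python
from urllib.parse import urljoin

parent_url = "https://google.com"  # the argument passed to absolute()
sub_url = "/404.html"              # the value produced by attr("href")

url_b = urljoin(parent_url, sub_url)
print(url_b)  # https://google.com/404.html
```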
Children defines the structure of the child elements and the traversal method. It is defined after the pipeline, usually in the following forms:
- No Child
- Object
- Array
- Object Array
The following examples will familiarize you with these types. They all use the following HTML text as the query object:
<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
<isbn>0836217462</isbn>
<title lang="en">Being a Dog Is a Full-Time Job</title>
<quote>I'd dog paddle the deepest ocean.</quote>
<author id="CMS">
<?echo "go rocks"?>
<name>Charles M Schulz</name>
<born>1922-11-26</born>
<dead>2000-02-12</dead>
</author>
<character id="PP">
<name>Peppermint Patty</name>
<born>1966-08-22</born>
<qualification>bold, brash and tomboyish</qualification>
</character>
<character id="Snoopy">
<name>Snoopy</name>
<born>1950-10-04</born>
<qualification>extroverted beagle</qualification>
</character>
</book>
</library>
If you only want to get the text of a node, such as a book name, you only need the following GraphQuery:
title `css("title")`
The result is "Being a Dog Is a Full-Time Job". The title field has no children.
{
title `css("title")`
isbn `css("isbn")`
quote `css("quote")`
author `css("author")` {
name `css("name")`
born `css("born")`
dead `css("dead")`
}
}
Result is:
{
"author": {
"born": "1922-11-26",
"dead": "2000-02-12",
"name": "Charles M Schulz"
},
"isbn": "0836217462",
"quote": "I'd dog paddle the deepest ocean.",
"title": "Being a Dog Is a Full-Time Job"
}
The author node defines children of the Object type, containing three child nodes:
name `css("name")`
born `css("born")`
dead `css("dead")`
These three child nodes use the string of the author node as their document. The result of the author node is:
<author id="CMS">
<?echo "go rocks"?>
<name>Charles M Schulz</name>
<born>1922-11-26</born>
<dead>2000-02-12</dead>
</author>
Therefore, the name, born and dead child nodes of author only select content inside the author node.
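The scoping effect can be seen in a small Python sketch: the same selector gives different answers depending on whether it runs against the whole document or against the author node only. The document is shortened for brevity, first_text is a hypothetical helper, and ElementTree is used purely for illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<book>
<author id="CMS">
<name>Charles M Schulz</name>
<born>1922-11-26</born>
<dead>2000-02-12</dead>
</author>
<character id="PP">
<name>Peppermint Patty</name>
<born>1966-08-22</born>
</character>
</book>"""


def first_text(scope, tag):
    # Text of the first matching element inside the given scope, or "".
    node = scope.find(f".//{tag}")
    return "" if node is None else "".join(node.itertext())


root = ET.fromstring(DOCUMENT)

# Against the whole document, "name" matches the author and the character.
all_names = [node.text for node in root.iter("name")]

# Child fields are evaluated against the author node only.
author_scope = root.find(".//author")
author = {tag: first_text(author_scope, tag) for tag in ("name", "born", "dead")}

print(all_names)  # ['Charles M Schulz', 'Peppermint Patty']
print(author)
```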
If you just want to get all the names and return them in an array:
name `css("name")` [
content `text()`
]
Result is:
[
"Charles M Schulz",
"Peppermint Patty",
"Snoopy"
]
If you only want to get the character data, you need the following GraphQuery:
character `css("character")` [
{
name `css("name")`
born `css("born")`
dead `css("qualification")`
}
]
Result is:
[
{
"born": "1966-08-22",
"dead": "bold, brash and tomboyish",
"name": "Peppermint Patty"
},
{
"born": "1950-10-04",
"dead": "extroverted beagle",
"name": "Snoopy"
}
]
The child node of character is of the Object Array type. GraphQuery traverses the character nodes and runs the following GraphQuery query on each of them to produce the result:
{
name `css("name")`
born `css("born")`
dead `css("qualification")`
}
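The traversal can be pictured as a loop that evaluates the same child fields once per matched node. The Python sketch below is an analogy under assumptions (shortened document, hypothetical first_text helper, ElementTree instead of GraphQuery's engine); it keeps the sample's choice of filling the dead field from the qualification element.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<book>
<character id="PP">
<name>Peppermint Patty</name>
<born>1966-08-22</born>
<qualification>bold, brash and tomboyish</qualification>
</character>
<character id="Snoopy">
<name>Snoopy</name>
<born>1950-10-04</born>
<qualification>extroverted beagle</qualification>
</character>
</book>"""


def first_text(scope, tag):
    # Text of the first matching element inside the given scope, or "".
    node = scope.find(f".//{tag}")
    return "" if node is None else "".join(node.itertext())


root = ET.fromstring(DOCUMENT)

# One object per character node; each field is resolved inside that node.
characters = [
    {
        "name": first_text(node, "name"),
        "born": first_text(node, "born"),
        "dead": first_text(node, "qualification"),
    }
    for node in root.findall(".//character")
]

for character in characters:
    print(character)
```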