storyicon edited this page Nov 5, 2018 · 5 revisions

The syntax of GraphQuery consists of fieldname, pipeline and children. They define text extraction rules and data return formats:

  1. fieldname defines the name of the return field.
  2. pipeline defines the rules of text extraction and processing.
  3. children defines the contents and structure of child elements.

1. Fieldname

Fieldname defines the name of the field that returns the result. A fieldname may contain English letters, numbers and underscores.

When a fieldname begins and ends with two consecutive underscores, the field is not included in the output. This is usually used for temporary compute nodes.

For example:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars</title>
</book>
</library>
{
    title1 `css("title")`
    isbn `css("isbn")`
    __isbn__ `css("isbn")`
}

Result:

{
    "isbn": "0836217462",
    "title1": "Star Wars"
}

title1, isbn and __isbn__ are all fieldnames. Because __isbn__ is wrapped in two consecutive underscores, it is not exported.
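
The naming and hiding rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not GraphQuery's own code; the helper names is_valid_fieldname, is_hidden and visible are assumptions made for this sketch.

```python
import re

# Assumption: a fieldname is one or more English letters, digits or underscores.
FIELDNAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_fieldname(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return FIELDNAME.fullmatch(name) is not None


def is_hidden(name: str) -> bool:
    """Return True if the name is wrapped in two underscores on each side."""
    return len(name) > 4 and name.startswith("__") and name.endswith("__")


def visible(results: dict) -> dict:
    """Drop hidden fields, mimicking what the output step is described to do."""
    return {key: value for key, value in results.items() if not is_hidden(key)}


computed = {"title1": "Star Wars", "isbn": "0836217462", "__isbn__": "0836217462"}
print(visible(computed))
```

Only title1 and isbn remain in the printed dictionary, which matches the result shown above.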

2. Pipeline

A pipeline is a sequence of functions wrapped in backquotes and, as the examples in this page show, separated by semicolons. It defines the text extraction and processing rules for the current field. Function names in a pipeline are case-sensitive, and parameters must be wrapped in double quotes.
If a function parameter contains double quotes, escape them with the backslash \.
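
To make the quoting and escaping rules concrete, here is a small Python sketch that splits a pipeline string into function names and parameters. It is not GraphQuery's parser; the name parse_pipeline and both regular expressions are assumptions made for illustration.

```python
import re

# One call: a name, then parentheses holding zero or more double-quoted
# parameters, optionally followed by a semicolon.
CALL = re.compile(r'\s*([A-Za-z_]\w*)\s*\(((?:[^"\\)]|"(?:[^"\\]|\\.)*")*)\)\s*;?\s*')
# One double-quoted parameter; a backslash escapes the next character.
ARG = re.compile(r'"((?:[^"\\]|\\.)*)"')


def parse_pipeline(pipeline: str):
    """Split a pipeline into (function name, [parameters]) pairs."""
    steps, pos = [], 0
    while pos < len(pipeline):
        match = CALL.match(pipeline, pos)
        if match is None:
            raise ValueError(f"cannot parse pipeline at offset {pos}")
        name, raw_args = match.groups()
        # Only the escaped double quote is unescaped; other backslashes
        # (for example in regular expressions) are kept as written.
        args = [arg.replace('\\"', '"') for arg in ARG.findall(raw_args)]
        steps.append((name, args))
        pos = match.end()
    return steps


print(parse_pipeline('css("title");text();trim()'))
print(parse_pipeline(r'regex("<title lang=\"(.*?)\">")'))
```

The second call shows that an escaped double quote inside a parameter does not end the parameter.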

For example, take the following document and GraphQuery definition:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en"> Star Wars  </title>
</book>
</library>
bookname `css("title");text();trim()`

In the above definition, the pipeline of the field bookname is css("title");text();trim(). It tells the processor to handle the field bookname as follows:

  1. Use the CSS selector title to select the node <title lang="en"> Star Wars  </title> from the document and pass it to the next function.

  2. The text() function receives the <title lang="en"> Star Wars  </title> passed by the previous function and extracts the text it contains, " Star Wars  ".

  3. The trim() function receives " Star Wars  ", removes the whitespace at both ends of the text, and returns the result "Star Wars".
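
The three steps above can be mirrored in Python with the standard library. This sketch only imitates the described flow; it uses xml.etree.ElementTree in place of GraphQuery's CSS engine, and the ".//title" path stands in for the CSS selector title.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en"> Star Wars  </title>
</book>
</library>"""

root = ET.fromstring(DOCUMENT)

# Step 1: select the <title> node (stand-in for css("title")).
node = root.find(".//title")

# Step 2: extract the text inside the node (stand-in for text()).
raw_text = "".join(node.itertext())

# Step 3: remove whitespace at both ends (stand-in for trim()).
bookname = raw_text.strip()

print(repr(raw_text))  # ' Star Wars  '
print(bookname)        # Star Wars
```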

Therefore, the final return result is "Star Wars". Let's go to PlayGround and have a try.
The functions shown in the above example are only a few of those available. The large number of functions built into pipeline makes GraphQuery compatible, flexible and powerful.
The following tutorial will show you how to use these functions.

1. css | Use CSS selector to select elements

css() accepts exactly one parameter: a CSS selector.

Sample:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars</title>
</book>
</library>
bookname `css("title")`

Result: "Star Wars"

Note: When your pipeline ends with css, the GraphQuery engine automatically calls the text() method to remove the HTML/XML tags in the node. If you want the result to keep the HTML/XML tags, call the string() method at the end of the pipeline.

2. json | Use JSON selector to select elements

json() accepts exactly one parameter: a JSON path selector. If you are not familiar with the JSON path syntax, see JSON Path Syntax.

Sample:

{
    "title": "Star Wars"
}
bookname `json("title")`

Result: "Star Wars"
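
As a rough picture of what a JSON selector does, the following Python sketch looks up a key in a parsed JSON document. It is not GraphQuery's JSON path engine; the function json_path and its dotted-path handling are assumptions, and the real path syntax is the one described in JSON Path Syntax. The sample is written without a trailing comma because Python's json module rejects trailing commas.

```python
import json


def json_path(document: str, path: str):
    """Follow a dotted path of object keys (or list indexes) through JSON."""
    value = json.loads(document)
    for key in path.split("."):
        # Lists are indexed by number, objects by key.
        value = value[int(key)] if isinstance(value, list) else value[key]
    return value


DOCUMENT = '{"title": "Star Wars"}'
bookname = json_path(DOCUMENT, "title")
print(bookname)  # Star Wars
```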

3. xpath | Use XPath selector to select elements

xpath() accepts exactly one parameter: an XPath selector.

Sample:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars</title>
</book>
</library>
bookname `xpath("//title")`

Result: "Star Wars"

Note: When your pipeline ends with xpath, the GraphQuery engine automatically calls the text() method to remove the HTML/XML tags in the node. If you want the result to keep the HTML/XML tags, call the string() method at the end of the pipeline.

4. regex | Use regular expressions to select elements

regex() accepts exactly one parameter: a regular expression. Regular expressions in GraphQuery behave slightly differently from what you may be used to.

When the regular expression contains multiple groups, only the value of the first group is returned. When it contains no group, the whole match is returned.

Let's look at an example:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars</title>
</book>
</library>
{
    isbnA `regex("<isbn>.*?</isbn>")`
    isbnB `regex("<isbn>(.*?)</isbn>")`
    titleA `regex("<title lang=\"(.*?)\">(.*?)</title>")`
    titleB `regex("<title lang=\"(.*?)\">.*?</title>")`
}

Result:

{
    "isbnA": "<isbn>0836217462</isbn>",
    "isbnB": "0836217462",
    "titleA": "en",
    "titleB": "en"
}

Note that the regular expression of titleA contains two groups (.*?), but only the result of the first group is returned, so its result is the same as that of titleB.
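
The first-group behavior can be reproduced with Python's re module. This is a sketch of the rule as described, not GraphQuery's implementation; the helper name regex is chosen to match the pipeline function, and returning an empty string when nothing matches is an assumption.

```python
import re

DOCUMENT = """<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars</title>
</book>
</library>"""


def regex(pattern: str, text: str) -> str:
    """Return the first group if the pattern has groups, else the whole match."""
    match = re.search(pattern, text)
    if match is None:
        return ""  # assumption: no match yields an empty string
    return match.group(1) if match.groups() else match.group(0)


results = {
    "isbnA": regex(r"<isbn>.*?</isbn>", DOCUMENT),
    "isbnB": regex(r"<isbn>(.*?)</isbn>", DOCUMENT),
    "titleA": regex(r'<title lang="(.*?)">(.*?)</title>', DOCUMENT),
    "titleB": regex(r'<title lang="(.*?)">.*?</title>', DOCUMENT),
}
print(results)
```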

5. trim | Remove blanks from both ends of text

trim() does not accept any parameters.

Sample:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">  Star Wars     </title>
</book>
</library>
{
    booknameA `css("title");text()`
    booknameB `css("title");text();trim()`
}

Result:

{
    "booknameA": "  Star Wars     ",
    "booknameB": "Star Wars"
} 

6. template | Use Template Strings

template() accepts exactly one parameter: a template string. In the template, use {$fieldname} to reference the value of an already calculated field, and {$} to reference the value of the current field. For example:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars</title>
</book>
</library>
{
    isbn `css("isbn");text()`
    booknameA `css("title");text()`
    booknameB `css("title");text();template("{$}")`
    booknameC `css("title");text();template("[{$}]")`
    booknameD `css("title");text();template("{$}[]")`
    booknameE `css("title");text();template("wow! {$}")`
    booknameF `css("title");text();template("{$} [{$isbn}]")`
    booknameG `css("title");text();template("{$} {$isbn}")`
}

Result:

{
    "isbn": "0836217462",
    "booknameA": "Star Wars",
    "booknameB": "Star Wars",
    "booknameC": "[Star Wars]",
    "booknameD": "Star Wars[]",
    "booknameE": "wow! Star Wars",
    "booknameF": "Star Wars [0836217462]",
    "booknameG": "Star Wars 0836217462"
}
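
The placeholder substitution can be sketched in Python as follows. This only imitates the behavior shown in the example; the function signature template(pattern, current, fields) and the placeholder regular expression are assumptions made for illustration.

```python
import re

# {$} refers to the current value, {$name} to an already calculated field.
PLACEHOLDER = re.compile(r"\{\$(\w*)\}")


def template(pattern: str, current: str, fields: dict) -> str:
    """Fill {$} with the current value and {$name} with fields[name]."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return current if name == "" else fields[name]

    return PLACEHOLDER.sub(substitute, pattern)


fields = {"isbn": "0836217462"}
title = "Star Wars"

print(template("[{$}]", title, fields))         # [Star Wars]
print(template("wow! {$}", title, fields))      # wow! Star Wars
print(template("{$} [{$isbn}]", title, fields)) # Star Wars [0836217462]
```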

7. attr | Get current element properties

attr() accepts exactly one parameter: the attribute name.

Sample:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars</title>
</book>
</library>
{
    bookID `css("book");attr("id")`
    titleLang `css("title");attr("lang")`
}

Result:

{
    "bookID": "b0836217462",
    "titleLang": "en"
}

8. eq | Get an element in the element collection

eq() accepts exactly one parameter: the index of the element. Indexes start at 0. As bookE in the sample shows, an index outside the collection yields an empty string.

Sample:

<library>
    <book>Star Wars I</book>
    <book>Star Wars II</book>
    <book>Star Wars III</book>
</library>
{
    bookA `css("book");text()`
    bookB `css("book");eq("0");text();`
    bookC `css("book");eq("1");text();`
    bookD `css("book");eq("2");text();`
    bookE `css("book");eq("3");text();`
}

Result:

{
    "bookA": "Star Wars IStar Wars IIStar Wars III",
    "bookB": "Star Wars I",
    "bookC": "Star Wars II",
    "bookD": "Star Wars III",
    "bookE": ""
}
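
The indexing behavior in this sample, including the empty result for an out-of-range index, can be sketched in Python. This is an illustration only; xml.etree.ElementTree stands in for the CSS engine, and the helper eq takes the index as a string to mirror how the parameter is written in a pipeline.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<library>
    <book>Star Wars I</book>
    <book>Star Wars II</book>
    <book>Star Wars III</book>
</library>"""

# Stand-in for css("book"): all <book> nodes in document order.
books = [book.text for book in ET.fromstring(DOCUMENT).iter("book")]


def eq(items: list, index: str) -> str:
    """Return the item at the 0-based index, or "" when it is out of range."""
    position = int(index)
    return items[position] if 0 <= position < len(items) else ""


print("".join(books))  # Star Wars IStar Wars IIStar Wars III
print(eq(books, "1"))  # Star Wars II
print(repr(eq(books, "3")))  # ''
```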

9. string | Get the native string of the element

string() does not accept any parameters. Because the text() method is called automatically when css() or xpath() appears at the end of the pipeline, string() can be used to suppress this automatic call.

Sample:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars</title>
</book>
</library>
{
    isbnA `css("isbn")`
    isbnB `css("isbn");string()`
    isbnC `xpath("//isbn")`
    isbnD `xpath("//isbn");string()`
}

Result:

{
    "isbnA": "0836217462",
    "isbnB": "<isbn>0836217462</isbn>",
    "isbnC": "0836217462",
    "isbnD": "<isbn>0836217462</isbn>"
}
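
The difference between the text of a node and its native string can be seen with Python's standard library. This sketch is only an analogy for text() and string(); it uses xml.etree.ElementTree rather than GraphQuery, and the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars</title>
</book>
</library>"""

node = ET.fromstring(DOCUMENT).find(".//isbn")

# Analogy for text(): only the text inside the node, tags removed.
as_text = "".join(node.itertext())

# Analogy for string(): the node serialized with its tags. tostring() also
# emits the whitespace that follows the node, so it is stripped here.
as_string = ET.tostring(node, encoding="unicode").strip()

print(as_text)    # 0836217462
print(as_string)  # <isbn>0836217462</isbn>
```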

10. text | Get the inner text of the element

text() does not accept any parameters.

Sample:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>   0836217462  </isbn>
    <title lang="en">Star Wars</title>
</book>
</library>
{
    isbnA `css("isbn");trim()`
    isbnB `css("isbn");text();trim()`
}

Result:

{
    "isbnA": "<isbn>   0836217462  </isbn>",
    "isbnB": "0836217462"
}

11. link | Reference the text of field

link() accepts one parameter: the name of a field whose result has already been calculated. You can reference the root node with __ROOT__.

Sample:

<library>
    <script>
        var template = {
            "title": "Star Wars",
            "isbn": "0836217462",
        }
    </script>
</library>
{
    json `regex("template = ([\s\S]*?)</script>")`
    title `link("json");json("title")`
    isbn `link("json");json("isbn")`
}

Result:

{
    "json": "{\n            \"title\": \"Star Wars\",\n            \"isbn\": \"0836217462\",\n        }\n    ",
    "title": "Star Wars",
    "isbn": "0836217462"
}

12. replace | Replace the text in the element

replace() accepts two parameters: the first is the string to be replaced, and the second is the replacement string.

Sample:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Star Wars is OK!</title>
</book>
</library>
{
    titleA `css("title");text()`
    titleB `css("title");text();replace(" is OK!", "")`
}

Result:

{
    "titleA": "Star Wars is OK!",
    "titleB": "Star Wars"
}

13. absolute | Absolute URL

absolute() converts a relative URL into an absolute URL. It accepts exactly one parameter, which it treats as the parent URL, and it treats the text value of the current node as the relative URL.

Sample:

<library>
<!-- Great book. -->
<book id="b0836217462" available="true">
    <a href="/404.html">404 NOT FOUND</a>
</book>
</library>
{
    urlA `css("a");attr("href")`
    urlB `css("a");attr("href");absolute("https://google.com")`
}

Result:

{
    "urlA": "/404.html",
    "urlB": "https://google.com/404.html"
}
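
Resolving a relative URL against a parent URL is the same operation that Python offers as urllib.parse.urljoin. The sketch below uses it as a stand-in for absolute(); whether GraphQuery follows exactly the same resolution rules in every edge case is an assumption.

```python
from urllib.parse import urljoin


def absolute(parent_url: str, relative_url: str) -> str:
    """Resolve relative_url against parent_url."""
    return urljoin(parent_url, relative_url)


url_a = "/404.html"
url_b = absolute("https://google.com", url_a)
print(url_b)  # https://google.com/404.html
```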

3. Children

Children defines the structure of the child elements and how they are traversed. It is written after the pipeline, usually in one of the following forms:

  • No child
  • Object
  • Array
  • Object Array

The examples below illustrate these forms. They all use the following document as the query object:

    <library>
        <!-- Great book. -->
        <book id="b0836217462" available="true">
            <isbn>0836217462</isbn>
            <title lang="en">Being a Dog Is a Full-Time Job</title>
            <quote>I'd dog paddle the deepest ocean.</quote>
            <author id="CMS">
                <?echo "go rocks"?>
                    <name>Charles M Schulz</name>
                    <born>1922-11-26</born>
                    <dead>2000-02-12</dead>
            </author>
            <character id="PP">
                <name>Peppermint Patty</name>
                <born>1966-08-22</born>
                <qualification>bold, brash and tomboyish</qualification>
            </character>
            <character id="Snoopy">
                <name>Snoopy</name>
                <born>1950-10-04</born>
                <qualification>extroverted beagle</qualification>
            </character>
        </book>
    </library>

1. Get only one text

If you only want to get the text of a node, such as a book name, you only need the following GraphQuery:

title `css("title")`

The result is "Being a Dog Is a Full-Time Job". The title field has no children.

2. Get an object result

{
    title `css("title")`
    isbn `css("isbn")`
    quote `css("quote")`
    author `css("author")` {
        name `css("name")`
        born `css("born")`
        dead `css("dead")`
    }
}

Result is:

{
    "author": {
        "born": "1922-11-26",
        "dead": "2000-02-12",
        "name": "Charles M Schulz"
    },
    "isbn": "0836217462",
    "quote": "I'd dog paddle the deepest ocean.",
    "title": "Being a Dog Is a Full-Time Job"
}

The author node defines children of the Object type, which contain three child nodes:

name `css("name")`
born `css("born")`
dead `css("dead")`

These three child nodes use the string of the author node as their document. The result of the author node is:

<author id="CMS">
    <?echo "go rocks"?>
        <name>Charles M Schulz</name>
        <born>1922-11-26</born>
        <dead>2000-02-12</dead>
</author>

Therefore, name, born and dead, the child nodes of author, select content only from within the author node.
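
The scoping effect can be illustrated in Python: a search that starts from the author node sees only what is inside it. This is an analogy built on xml.etree.ElementTree, not GraphQuery itself, and the document is an abbreviated copy of the one used in this section.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the query object used in this section.
DOCUMENT = """<library>
<book id="b0836217462" available="true">
    <author id="CMS">
        <name>Charles M Schulz</name>
        <born>1922-11-26</born>
        <dead>2000-02-12</dead>
    </author>
    <character id="PP">
        <name>Peppermint Patty</name>
        <born>1966-08-22</born>
    </character>
</book>
</library>"""

root = ET.fromstring(DOCUMENT)
author_node = root.find(".//author")

# Searching from the root finds every <name>; searching from the author
# node finds only the author's own <name>.
names_in_document = [n.text for n in root.findall(".//name")]
names_in_author = [n.text for n in author_node.findall(".//name")]

# Child fields are evaluated against the author node only.
author = {tag: author_node.find(tag).text for tag in ("name", "born", "dead")}

print(names_in_document)
print(names_in_author)
print(author)
```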

3. Get an array result

If you just want to get all the names and return them in an array:

name `css("name")` [
    content `text()`
]

Result is:

[
    "Charles M Schulz",
    "Peppermint Patty",
    "Snoopy"
]
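
In Python terms, the Array form behaves like a loop over every selected node that collects one value per node. The sketch below is an analogy using xml.etree.ElementTree, with an abbreviated copy of the query object.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the query object used in this section.
DOCUMENT = """<library>
<book id="b0836217462" available="true">
    <author id="CMS">
        <name>Charles M Schulz</name>
    </author>
    <character id="PP">
        <name>Peppermint Patty</name>
    </character>
    <character id="Snoopy">
        <name>Snoopy</name>
    </character>
</book>
</library>"""

root = ET.fromstring(DOCUMENT)

# Visit each <name> node in document order and take its text,
# producing one array entry per node.
names = ["".join(node.itertext()) for node in root.iter("name")]
print(names)
```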

4. Get an object array result

If you only want to get the character data, you need the following GraphQuery:

character `css("character")` [
    {
        name `css("name")`
        born `css("born")`
        dead `css("qualification")`
    }
]

Result is:

[
    {
        "born": "1966-08-22",
        "dead": "bold, brash and tomboyish",
        "name": "Peppermint Patty"
    },
    {
        "born": "1950-10-04",
        "dead": "extroverted beagle",
        "name": "Snoopy"
    }
]

The children of character are of the Object Array type. GraphQuery traverses the character nodes and applies the following GraphQuery to each one to build the result:

{
    name `css("name")`
    born `css("born")`
    dead `css("qualification")`
}
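
The traversal can be pictured in Python as a loop that builds one object per character node. This is an analogy using xml.etree.ElementTree, not GraphQuery's engine; the document is an abbreviated copy of the query object, and the field named dead is filled from qualification exactly as in the query above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the query object used in this section.
DOCUMENT = """<library>
<book id="b0836217462" available="true">
    <character id="PP">
        <name>Peppermint Patty</name>
        <born>1966-08-22</born>
        <qualification>bold, brash and tomboyish</qualification>
    </character>
    <character id="Snoopy">
        <name>Snoopy</name>
        <born>1950-10-04</born>
        <qualification>extroverted beagle</qualification>
    </character>
</book>
</library>"""

root = ET.fromstring(DOCUMENT)

# For each <character> node, evaluate the three child fields against
# that node only and collect the resulting objects in a list.
characters = [
    {
        "name": node.find("name").text,
        "born": node.find("born").text,
        "dead": node.find("qualification").text,
    }
    for node in root.iter("character")
]
print(characters)
```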