Design Doc: Caching Properties of HTML Files

Joshua Marantz, January 10, 2012

While most HTML is not publicly cacheable, a majority of HTML has properties that are useful to remember and have available to us as we rewrite it. In general we cannot depend on such properties being correct 100% of the time, but if we use those properties for hints about optimization then stale values won’t impact correctness. In most cases we will benefit pages more than we will harm them.

We are already starting to see a few requirements for an HTML property cache, including:

Above-the-fold images
A manifest of referenced resources for link rel=subresource and/or SPDY push
Resources that need to be put into app-cache
Knowledge of which images are used at which resolutions
Meta tags, which we should copy to response headers even if parsing does not find them until after the first flush

Resilience to Changing HTML

We need to design for changes in web-sites.

News sites change periodically, say every few minutes
A counter or a display of the current time changes a site’s content on every request but should not affect its structure at all.
Many sites have drastically different structure if a user is logged in (e.g. Facebook). The DOM structure of such sites may appear bimodal to a proxy.
Some sites have similar structure but customized content for logged-in users (e.g. Amazon)
Sites may alter their appearance or structure depending on the locale from which they are viewed (e.g. countries with right-to-left alphabets)
Sites may occasionally be redesigned, changing DOM structure
Sites may present a different DOM structure to User-Agents they recognize as mobile. It is also common to use IE-directives to load a different set of CSS files.

Thus when we rewrite a page guided by cached HTML properties, we must never produce incorrect results when the version of the HTML being rewritten has different properties from the HTML used to populate the cache. Furthermore, if we consistently find that the HTML properties differ from what is cached, we should accumulate this knowledge in the cache so that we do not find ourselves constantly optimizing for a different version of the page than what we are serving.
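One way to keep stale hints from affecting correctness is to check each cached hint against the HTML actually being rewritten and ignore whatever no longer applies. The sketch below assumes a hypothetical hint made of above-the-fold image URLs; the function name and the filtering policy are illustrative and do not come from any existing API.

```python
from typing import Iterable, List


def usable_image_hints(cached_urls: Iterable[str],
                       urls_in_current_html: Iterable[str]) -> List[str]:
    """Returns the cached hints that still refer to images in this page.

    Hypothetical helper: hints for images that are absent from the HTML
    being rewritten are dropped, so a stale cache entry can only cost us
    an optimization and cannot change what the page references.
    """
    current = set(urls_in_current_html)
    return [url for url in cached_urls if url in current]
```

With this kind of guard, a redesigned page simply yields fewer usable hints until the cache catches up.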

The issue of mobile-specific layouts is probably best resolved by incorporating an "is_mobile" bit into the cache-key so we cache them separately. Beyond this I’m reluctant to use cache-key uniquification for locale or signed-in-state or other variables for fear of fragmenting the cache and cratering the hit-rate.
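As a minimal sketch of that key scheme, the function below folds only the device class into the key and deliberately leaves out locale and signed-in state. The `prop:` prefix and the key layout are assumptions made for illustration.

```python
def property_cache_key(url: str, is_mobile: bool) -> str:
    """Builds a property-cache key that varies only by URL and device class.

    The key format is hypothetical. Locale and login state are left out on
    purpose so that those variations share one entry instead of fragmenting
    the cache.
    """
    device = "mobile" if is_mobile else "desktop"
    return f"prop:{device}:{url}"
```

Every further variable added to the key multiplies the number of entries per page, which is the hit-rate concern described above.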

For all the other site changes, I think we should track the stability of the site with respect to the data we are storing in the cache. It’s OK if the site HTML changes on every request, but if, say, the set of above-the-fold images changes on every request, then we are likely to make poor optimization choices.
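Stability tracking of this kind could be sketched as a small record kept alongside each cached value, counting how often the value differs from the one seen on the previous request. The class name, the change-ratio rule, and the default thresholds below are all assumptions chosen for illustration, not the mechanism in the CL.

```python
from dataclasses import dataclass


@dataclass
class TrackedProperty:
    """A cached value plus counters describing how often it changes."""
    value: str = ""
    num_updates: int = 0
    num_changes: int = 0

    def update(self, new_value: str) -> None:
        # The first observation is not a change; later ones are compared
        # against the previously stored value.
        if self.num_updates > 0 and new_value != self.value:
            self.num_changes += 1
        self.value = new_value
        self.num_updates += 1

    def is_stable(self, max_change_ratio: float = 0.2,
                  min_updates: int = 3) -> bool:
        # Hypothetical policy: too few observations means "unknown",
        # which we treat as unstable so that we do not optimize yet.
        if self.num_updates < min_updates:
            return False
        transitions = max(1, self.num_updates - 1)
        return self.num_changes / transitions <= max_change_ratio
```

A rewriter would consult `is_stable()` before acting on the cached value, so a property that changes on every request is recorded but never used for optimization.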

Coherence

Properties don’t have TTLs, so unlike HTTP cache entries, property cache entries do not self-invalidate. This means the L1/L2 cache architecture we’ve built for HTTP and metadata caching suffers from coherence issues when used for properties that can change at any time: a value held in a local L1 cache could remain stale indefinitely after the shared entry is updated. Consequently, we can’t use a local cache to store these properties.

Minimizing Round Trips

From a system perspective, we want to minimize cache round trips, especially in the warm-cache case. We expect that property payloads will typically be small. Thus we should maintain a shared protobuf aggregating all the data we want to store in the cache for HTML pages. Note that data-stability may vary between different properties. We can accommodate this while minimizing RPCs by keeping distinct stability-metadata with each property. We will need to batch up updates to the multi-property bundles to minimize the Write overhead.
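The bundling idea can be illustrated with a small in-memory sketch: one cache read loads every property for a page, each property carries its own change counters, and all updates made during a request go out in a single write. The document proposes a protobuf for the payload; JSON and a plain dict stand in for the protobuf and the shared cache here, and the class and field names are hypothetical.

```python
import json
from typing import Dict, Optional


class PropertyBundle:
    """All properties for one page, stored as a single cache entry."""

    def __init__(self, store: Dict[str, str], key: str) -> None:
        self._store = store      # stand-in for the shared cache
        self._key = key
        self._props: Dict[str, dict] = {}
        self._dirty = False
        self.reads = 0           # counters stand in for RPC counts
        self.writes = 0

    def read(self) -> None:
        """One lookup fetches every property in the bundle."""
        self.reads += 1
        raw = self._store.get(self._key)
        self._props = json.loads(raw) if raw else {}

    def get(self, name: str) -> Optional[str]:
        entry = self._props.get(name)
        return entry["value"] if entry else None

    def set(self, name: str, value: str) -> None:
        """Updates a property in memory, with per-property stability data."""
        entry = self._props.setdefault(
            name, {"value": None, "updates": 0, "changes": 0})
        if entry["updates"] > 0 and entry["value"] != value:
            entry["changes"] += 1
        entry["value"] = value
        entry["updates"] += 1
        self._dirty = True

    def flush(self) -> None:
        """Writes the whole bundle once, and only if something changed."""
        if self._dirty:
            self._store[self._key] = json.dumps(self._props)
            self.writes += 1
            self._dirty = False
```

In this sketch a request that updates several properties still costs one read and one write, and a request that changes nothing costs no write at all.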

For large properties that are only of potential interest on a subset of requests, we can make multiple bundles, adding RPC overhead but reducing the size of the payloads for other requests. For example, for mobile UAs we may want to look up a low-res pre-rendered JPEG/WebP image of the page for a quick response to the browser to show while the real site loads. Note: I am not necessarily advocating this idea, but it’s an illustration of why we might want to keep some payloads in distinct cache entries. Alternatively, we can simply leave some properties with empty values for requests where those properties are not applicable.

Status

I have a CL proposing a new PropertyCache mechanism for determining stability of cached objects. There’s been some useful discussion on this CL, and it doesn’t yet reflect all the positions taken in this document, but it’s a start.
