This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Design Doc: Prefetch Proposal

Jeff Kaufman edited this page Jan 5, 2017 · 2 revisions

Prefetch Proposal

Michael Kleber, January 2011

This idea was initiated by Thomas Colthurst, Michael Kleber, and Mathieu Gagne. The idea is to use machine learning to determine which of the hrefs on a page the user is most likely to click, and to add <link rel=prefetch> elements for that target page and/or its subresources to the current page.

The learning machine would be fed by pairs of URLs generated on each page load, using the Referer (sic) and the URL of the current page. Each pair collected in traffic (Referer=A Url=B) would strengthen the idea that page A should be rewritten with <link rel=prefetch> links for B and its resources.
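As a minimal sketch of the pair-collection idea, the simplest possible "learner" is a frequency count of (Referer, Url) pairs. The proposal does not specify an algorithm, so everything here is an assumption for illustration: the class name PrefetchModel, the max_links and min_share parameters, and the thresholds are hypothetical and not part of mod_pagespeed.

```python
from collections import Counter, defaultdict
from html import escape


class PrefetchModel:
    """Hypothetical frequency-count model; stands in for the real learner."""

    def __init__(self):
        # referer -> Counter of URLs that were loaded with that referer
        self._counts = defaultdict(Counter)

    def observe(self, referer, url):
        """Record one page load with (Referer=referer, Url=url)."""
        # Ignore loads with no referer and self-navigation (reloads).
        if referer and referer != url:
            self._counts[referer][url] += 1

    def predict(self, page, max_links=2, min_share=0.2):
        """Return up to max_links targets that got at least min_share of
        the observed clicks from this page, most frequent first."""
        targets = self._counts.get(page)
        if not targets:
            return []
        total = sum(targets.values())
        return [url for url, n in targets.most_common(max_links)
                if n / total >= min_share]


def prefetch_tags(urls):
    """Render the <link rel=prefetch> elements to add to the page."""
    return ['<link rel="prefetch" href="%s">' % escape(u, quote=True)
            for u in urls]
```

A rewriter would call observe() on every request and predict() when rewriting page A. The min_share cutoff is one way to avoid spending bandwidth on links that are rarely followed; a real learner would presumably also need to age out old observations.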

The challenges of this project include:

  • Creating the machine learning infrastructure in open source and keeping it small enough to be loaded in mod_pagespeed.so without impacting load time for those who turn it off
  • Sharing the machine-learning DB across multi-process Apache architectures via shared memory, inter-process mutexes, the existing cache, the existing stats, or some other mechanism
  • Tuning the algorithm without direct access to server logs
  • Identifying user-agents so that we don't waste time predicting prefetches for browsers that will not benefit, or for connections (e.g. mobile) where bandwidth is expensive
  • Potentially introducing modest cacheability for HTML files (probably hash-based etags) so that browsers can pre-render

This last point deserves some more detail. mod_pagespeed currently forces caching and etags off for HTML files entirely. Because we cache-extend resources on the page to 1 year by signing rewritten URLs with a content hash, we would be in danger of violating the origin TTL if the HTML were cached longer than the origin resources. But I think with etags we are safe: if the origin TTL of a resource on the page expires between the prefetch-and-cache and the click on the link, the browser will have to revalidate the HTML with a conditional request (If-None-Match, since the validator is an etag rather than a modification date). If any of the origin resources has changed, the rewritten HTML will have a different hash and therefore a different etag, so we avoid serving expired content. If the resources have not changed (even if they have expired), the HTML will be unchanged; its hash and etag will be the same as well, and the browser only has to wait for one conditional-request->304 round trip to display the pre-rendered page.
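The etag argument can be sketched in a few lines. This is an illustration under stated assumptions, not how mod_pagespeed serves HTML: html_etag and respond are hypothetical helpers, the choice of SHA-256 truncated to 16 hex digits is arbitrary, and revalidation is assumed to use the standard HTTP If-None-Match request header.

```python
import hashlib


def html_etag(rewritten_html: bytes) -> str:
    """Hash-based etag for the rewritten HTML (hypothetical helper).

    Rewritten resource URLs embed a content hash, so a change in any
    origin resource changes the HTML bytes and therefore this etag.
    """
    return '"%s"' % hashlib.sha256(rewritten_html).hexdigest()[:16]


def respond(rewritten_html: bytes, if_none_match=None):
    """Return (status, headers, body) for a possibly conditional request."""
    etag = html_etag(rewritten_html)
    if if_none_match == etag:
        # Nothing on the page changed: one cheap round trip, no body.
        return 304, {"ETag": etag}, b""
    # First fetch, or some resource hash changed: send the fresh HTML.
    return 200, {"ETag": etag}, rewritten_html
```

The point of the design is that the server never has to track which resources a cached copy of the page depended on: recomputing the hash of the freshly rewritten HTML is enough to decide between 304 and 200.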
