Wednesday, November 05, 2008

DIV_SRC could cut news feeds' bandwidth by 90%

The Internet is taking over from printed newspapers, magazines and journals. It is also enabling ordinary people like me to publish their thoughts, hopes and ideas in blogs such as this one, almost free of charge. Major news feed sites such as BBC News, CNN and Slashdot.org have millions of readers worldwide, and the amount of bandwidth consumed by people browsing news feeds is large and growing fast.

News feeds change their front pages often, and for this reason they make them uncacheable, or if they can be cached then it's only for a minute or so. Pretty much every time a person browses a news feed, they get a complete copy of the page from one of the news feed's servers. But if you look at a news feed you will find that it is made up of a number of short stories, often with links to fuller versions of the story. As time goes by, new stories are added and the older ones get pushed down the page, eventually falling off the end into an archive. If you look at a given news feed twice in ten minutes then the chances are good that while the content may have moved around a bit, most of it will be unchanged. And while you probably wouldn't look at the same news feed that frequently, if you consider all the people who share your ISP, and hence your ISP's caching proxy server, the ISP may have to fetch a given news feed page in its entirety thousands of times a day, even if 90% of it is unchanged from one fetch to the next. This places a large and growing load on the news feed's servers and bandwidth, and on the ISP's upstream bandwidth.
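To put a rough number on that waste, the arithmetic can be sketched in a few lines of JavaScript. The function name redundantBytesPerDay and the figures in the usage note are assumptions chosen purely for illustration; they are not measurements from any real news feed or ISP.

```javascript
// Rough estimate of the bytes an ISP proxy re-fetches each day even though
// they have not changed since the previous fetch.
// All inputs are hypothetical illustration values, not measured data.
function redundantBytesPerDay(pageBytes, fetchesPerDay, unchangedFraction) {
  if (unchangedFraction < 0 || unchangedFraction > 1) {
    throw new RangeError("unchangedFraction must be between 0 and 1");
  }
  // Every fetch carries the whole page, of which unchangedFraction is
  // content the proxy has already seen.
  return Math.round(pageBytes * fetchesPerDay * unchangedFraction);
}
```

For example, assuming a 100,000-byte front page fetched 5,000 times a day with 90% of it unchanged between fetches, redundantBytesPerDay(100000, 5000, 0.9) comes to roughly 450 MB a day of repeated content for that one page at that one ISP.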

I have written an open source program called DIV_SRC; the package can be downloaded from Google Code. The tool allows large, composite HTML pages like news feeds to be broken into sections. The base page references these sections with <div src="URI"> tags (or any other tag type that the page author chooses). DIV_SRC.js is a client-side JavaScript program that finds all of the section references of this sort in the base page and fetches the referenced content using AJAX. A demo news feed shows that if news feeds used DIV_SRC then their main pages, which are necessarily dynamic, would shrink to about 5% of their current size. The individual stories that they carry could then be cached by browsers and caching proxy servers, reducing the news feeds' bandwidth bill by about 90% and giving better response times to users.
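To make the mechanism concrete, here is a minimal sketch of the idea. It is not the DIV_SRC.js source: the function names loadSections and xhrFetchText are hypothetical, and the sketch assumes a browser that supports querySelectorAll and XMLHttpRequest, and that the sections live on the same site as the base page.

```javascript
// Illustrative sketch only; not the actual DIV_SRC.js code.
// The base page is assumed to contain placeholders such as:
//   <div src="stories/example.html"></div>   (hypothetical path)

// Find every <tagName src="..."> under root, fetch the referenced URI and
// put the response into the placeholder. Returns the list of URIs requested.
function loadSections(root, tagName, fetchText) {
  var placeholders = root.querySelectorAll(tagName + "[src]");
  var uris = [];
  Array.prototype.forEach.call(placeholders, function (el) {
    var uri = el.getAttribute("src");
    uris.push(uri);
    fetchText(uri, function (html) {
      // The section is trusted because it comes from the same site.
      el.innerHTML = html;
    });
  });
  return uris;
}

// Plain asynchronous GET. Because it is an ordinary GET, the browser cache
// and any caching proxy on the path can answer it without a trip to the origin.
function xhrFetchText(uri, onLoaded) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", uri, true);
  xhr.onload = function () {
    if (xhr.status === 200) {
      onLoaded(xhr.responseText);
    }
    // On any other status the placeholder is simply left empty.
  };
  xhr.send();
}

// Only wire things up when running inside a browser.
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", function () {
    loadSections(document, "div", xhrFetchText);
  });
}
```

Passing the tag name and the fetch function in as parameters is a design choice made for this sketch: it mirrors the point that the tag type is up to the page author, and it lets the section-loading logic be exercised without a real network.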

No server-side coding is required to implement DIV_SRC: it can run off a vanilla web server that delivers nothing but the usual static files of HTML, images, style sheets and JavaScript. The DIV_SRC.js JavaScript code is compact, at less than 5 KB in its terse form.

DIV_SRC could also be used by blogging engines to reduce bandwidth consumption and to improve response times. Blog pages delivered from blog engines such as Blogspot.com come with the HTTP header Cache-control: max-age=0, which marks every copy as stale the moment it arrives and so in effect prevents browsers and caching proxy servers from reusing cached blog pages. This ensures that every viewer always gets to see the latest version of the page, which is good, but at a cost. If the blog is popular then browsers all over the Internet will be getting fresh copies from the blog server every time the blog is viewed. By using DIV_SRC, blogging engines could ensure that browser users continue to see the latest version of each blog, while the unchanged sections of the blog would not have to be transmitted from the blog server to the browser every time a given blog is viewed. Each blog entry would be stored as a separate static HTML file which could be cached by browsers and/or caching proxy servers anywhere in the Internet.
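The effect of that header can be illustrated with a deliberately simplified freshness check. The function names maxAgeSeconds and canServeFromCache are hypothetical, and the sketch looks only at the max-age, no-store and no-cache directives; a real HTTP cache also considers s-maxage, Expires, validators and more.

```javascript
// Simplified illustration of how a cache reads Cache-Control.
// Not a complete implementation of the HTTP caching rules.

// Return the max-age value in seconds, or null if the directive is absent.
function maxAgeSeconds(cacheControl) {
  var match = /(?:^|,)\s*max-age\s*=\s*"?(\d+)"?\s*(?:,|$)/i.exec(cacheControl);
  return match ? parseInt(match[1], 10) : null;
}

// True if a stored copy that is ageSeconds old may be reused without
// contacting the origin server first.
function canServeFromCache(cacheControl, ageSeconds) {
  if (/(?:^|,)\s*(?:no-store|no-cache)\s*(?:,|=|$)/i.test(cacheControl)) {
    return false;
  }
  var maxAge = maxAgeSeconds(cacheControl);
  // A copy is fresh only while its age is below max-age, so max-age=0
  // never allows reuse.
  return maxAge !== null && ageSeconds < maxAge;
}
```

Under this check, canServeFromCache("max-age=0", 0) is false, which is the situation described above for whole blog pages. A static story file served with, say, "public, max-age=86400" (an assumed value, not one taken from the project) could be reused for up to a day.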

There are some issues that would have to be addressed, like what happens if a browser without AJAX capabilities (such as a mobile phone browser) accesses a composite page. This question and several others are dealt with in the Problems page of the project's website.