I noticed that many websites, even Google and some banking sites, have poorly written HTML: attribute values left unquoted, or characters such as ampersands not escaped correctly in links. In other words, much of their markup would not validate.
I am curious about their reasons. HTML has simple rules, and it is mind-boggling that they don't seem to follow them. Or do they use programs that just spit out the code?
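To make the question concrete, here is a hypothetical snippet (not Google's actual markup) showing the shortcuts described above, and how small the per-tag savings are:

```python
# Hypothetical example: the same anchor written as strict,
# validating markup versus the minified style described above.
valid = '<a href="/search?q=foo&amp;hl=en" class="result">link</a>'

# Unquoted attribute values and a raw ampersand: this won't
# validate, but every mainstream browser renders it identically.
minified = '<a href=/search?q=foo&hl=en class=result>link</a>'

saved = len(valid) - len(minified)
print(saved)  # bytes saved on this single tag
```

A handful of bytes per tag sounds trivial, which is exactly why the answers below focus on scale.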
Most people have gotten the answer basically right: the rules are different when you serve a page a billion times a day. Bytes begin to matter, and the aggressive compression of Google's pages makes it clear they care about saving bandwidth.
A few points:
One, people are implying that Google's reasons for saving bandwidth are financial. Unlikely. Even a few terabytes a day saved on the Google search results page is a drop in the bucket compared to the sum of all their properties: YouTube, Blogger, Maps, Gmail, etc. Much more likely is that Google wants its search results page, in particular, to load as quickly as possible on as many devices as possible. Yes, bytes matter when the page is loaded a billion times a day, but bytes also matter when your user is on a satellite phone in the Sahara struggling to get 1 kbps.
Two, there is a difference between the codified standards of XHTML and the like, and the de facto standard of what has actually worked in every browser made since 1994. Here, Google's scale matters: where most web developers are happy to ignore any troublesome browser that accounts for less than 0.1% of their users, for Google that 0.1% is perhaps half a million people. They matter. So their search-results page ought to work on IE 5.5. This is why they still use tables for layout on many high-value pages: it's still the layout that "just works" on the greatest number of browsers.
As an exercise while an intern at Google, I wrote a perfectly compliant XHTML/CSS version of Google's search results page and showed it around. Eventually the question came up: why are we serving such hodge-podge HTML? Shouldn't we be leading the web dev community toward standards? The answer I got was pretty much the second point above. Google DOES follow a standard: not the wouldn't-it-be-nice standards of web utopia, but the this-has-to-work-absolutely-everywhere standard of reality.
Google has a good reason for writing bad HTML: every character they strip from the search page will probably save them gigabytes of bandwidth a day.
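To put rough numbers on that claim (the request volume and per-page savings below are illustrative assumptions, not published figures):

```python
# Back-of-the-envelope: a few bytes trimmed from a page served
# billions of times a day adds up quickly.
requests_per_day = 3_500_000_000   # assumed daily search volume
bytes_saved_per_page = 10          # e.g. dropping quotes on a handful of attributes

saved_per_day = requests_per_day * bytes_saved_per_page
print(saved_per_day / 1e9)  # ≈ 35 GB/day for just ten bytes per page
```

Even with conservative assumptions, single-digit bytes per page land in the gigabytes-per-day range at that scale.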
As discussed previously, Google does it for bandwidth reasons.
As for banks and other enterprisey websites, there could be multiple reasons:
- The CMS spits out invalid HTML.
- Dreamweaver. Enough said.
- They tend to use commercial UI components designed to work even on ancient browsers, so they err on the side of caution.
- Badly designed .NET user controls and JSTL tag libraries.
For sites like Google, having perfect code is not "that" important.
The total size of the page, however, is. A few bytes shaved off the HTML can mean hundreds of dollars in bandwidth.
So as long as they can be certain the page will render correctly, they won't hesitate to tweak their HTML.
Generally speaking, coding up a website is easy, so the barrier to entry is very low for inexperienced or non-programmers. That makes it easy to produce substandard pages, and the web is littered with them. Combine that with tools like Microsoft FrontPage, which make building a site even easier (and generating bad HTML easier still), and you've got a nasty situation.