How the Web works, in Amazing Detail!

31 Jan - by Alan - In Web & Internet
Astonished kitten

Visiting a web site seems so simple – click on a link in your web browser, and soon – within seconds we hope – an interesting web page reveals itself to us. What could be easier?

In fact, what happens ‘under the hood (or bonnet)’ is amazingly complex. It’s taken decades of computer and software development by many thousands of people to make it possible to follow that hyperlink (to give it its original technical name) to the information it represents. If we were to fully explain all that happens from the moment you click that link, we’d fill several books! Perhaps even an encyclopedia or two. You’d need to know: the difference between the Internet and the Web; how computers work; computer science; web design; programming; networks; electronics; physics; mathematics… (I’ve probably left something out!). This essay can only scratch the surface of all that, but I’ll give you plenty of links to more detail so that you can find out as much as you want about this wonderful technology that we take so much for granted!

A Quick Overview of How the Web Works

So that you can see the forest before you get lost in the trees, I’m going to give you a dramatically over-simplified overview that has several mysterious acronyms and probably new concepts. Don’t worry if it doesn’t make a whole lot of sense just yet – it will if you hang in there! Let’s start by supposing that you want to see this wonderful cat picture (it will open in a new window). What happens is:

  1. You clicked on a link: http://tuxar.uk/long-path/cat-picture.jpg (it’s actually longer than that but I want to keep it simple!)
  2. Your browser split the link into three pieces: the protocol (HTTP), the domain name (tuxar.uk) and the path (/long-path/cat-picture.jpg).
  3. Your browser used DNS to convert the user-friendly domain name (tuxar.uk) into my server’s IP address (104.28.25.51 today, but it might change).
  4. Your browser sent a connection request to my server’s IP address.
  5. Your browser sent my server an HTTP request asking for a copy of the image stored at /long-path/cat-picture.jpg
  6. My server found the requested image and returned it to your browser via an HTTP response.
  7. Your browser received and displayed the picture.
  8. Your browser dropped the connection to my server, terminating the session.
  9. You are admiring my cat, Nova.

HTTP = HyperText Transfer Protocol; DNS = Domain Name System. Next, we’ll break those steps down into much more detail!

Laptop computer
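If you like to see things in code, here is a rough sketch of steps 2 to 8 in Python. It is only an illustration, not what a real browser does internally. So that it runs without touching the Internet, it starts a tiny stand-in web server on your own machine; the names `CatHandler`, `fetch`, `demo` and `FAKE_IMAGE` are invented for this example, and the "image" is just a placeholder string of bytes.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

# Placeholder bytes standing in for the real cat picture (an assumption).
FAKE_IMAGE = b"pretend these bytes are cat-picture.jpg"


class CatHandler(BaseHTTPRequestHandler):
    """Stand-in for the web server that answers in step 6."""

    def do_GET(self):
        if self.path == "/long-path/cat-picture.jpg":
            self.send_response(200)
            self.send_header("Content-Type", "image/jpeg")
            self.send_header("Content-Length", str(len(FAKE_IMAGE)))
            self.end_headers()
            self.wfile.write(FAKE_IMAGE)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(url):
    # Step 2: split the link into protocol, domain name, port and path.
    parts = urlsplit(url)
    port = parts.port or 80
    # Step 3: turn the domain name into an IP address.
    results = socket.getaddrinfo(
        parts.hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    ip_address = results[0][4][0]
    # Step 4: the connection to that IP address is opened on first use.
    connection = http.client.HTTPConnection(ip_address, port, timeout=5)
    try:
        # Step 5: send the HTTP request for the path.
        connection.request("GET", parts.path, headers={"Host": parts.netloc})
        # Steps 6 and 7: receive the HTTP response and read the picture.
        response = connection.getresponse()
        return response.status, response.read()
    finally:
        # Step 8: drop the connection.
        connection.close()


def demo():
    server = HTTPServer(("127.0.0.1", 0), CatHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return fetch(f"http://localhost:{port}/long-path/cat-picture.jpg")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, body = demo()
    print(status, len(body), "bytes received")
```

Pointing `fetch` at a real address instead of the local stand-in should work the same way, as long as the site answers plain HTTP on the port given.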

To be reading this web page, you are using a web browser on some form of computer (desktop, notebook, tablet, smartphone, …). The browser is a software application.

The Internet vs. The World Wide Web

The computer needs a connection to the Internet, which may be wired or wireless. But note that the Internet and the World Wide Web are not the same thing! The Internet is like the road or highway system: it provides the ‘routes’ along which traffic can flow. The web is a particular kind of traffic on that highway system. Other kinds include email, file transfers, VoIP (internet phone calls), etc.

How does the Internet really work? This clip lets you ride shotgun with a packet of data—one of trillions involved in the trillions of Internet interactions that happen every second. Look deep beneath the surface of the most basic Internet transaction, and follow the packet as it flows from your fingertips, through circuits, wires, and cables, to a host server, and then back again, all in less than a second.

Client-server model

The World Wide Web (aka ‘web’) is based on a client-server model. This simply means that a ‘client’ such as your web browser sends a request over the Internet to a ‘server’ (specifically, a web server) for some information, such as a web page. If the server finds the page you asked for, it sends it back over the Internet. But just like sending letters or packages, you need an address to send your request to – a web address.

Web Addresses (URLs) vs. Internet Protocol (IP) Addresses

What’s a web address? You’ve probably seen one looking like this: Google.com or Microsoft.com or Linux.com or Tuxar.uk/linux. However, these are the user-friendly short forms. You might also have seen web addresses looking like this: http://example.com/ – your web browser sticks that ‘http://’ on the beginning to make it known further down the line that this is a web address.

You’ve probably also seen some stuff tacked on to the end, after the domain name, e.g. http://example.com:80/more/stuff. The ‘:80’ is the port number, and the last part is called the path; the web server (the software on the computer that has the web page you want) needs it to know what to send back to you. This completed web address is called a Uniform Resource Locator (URL), which is one kind of Uniform Resource Identifier (URI). The components are: protocol, domain name, port number, and path or query string, as shown in this table:

Protocol | Domain Name | Port Number | Path/Query String
http:// | example.com | :80 | /more/stuff
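To see that split done by a program, here is a small sketch using Python's standard `urllib.parse` module. The function name `url_components` and the `?q=keyword` query string tacked onto the example address are my own additions for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def url_components(url):
    """Break a URL into the pieces named in the table above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "domain": parts.hostname,
        "port": parts.port,  # None when the URL does not name a port
        "path": parts.path,
        "query": parse_qs(parts.query),
    }


print(url_components("http://example.com:80/more/stuff"))
print(url_components("http://example.com/more/stuff?q=keyword"))
```

Notice that the port comes back as `None` when the address leaves it out; the browser then falls back to the usual port for the protocol.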
IP address

However, computers prefer numbers. And just as you can telephone someone (do people say that now?) if you know their phone number, so you can ‘call up’ a computer if your computer knows the other computer’s internet number – better known as an IP address (for Internet Protocol). Because the Internet is a global network of computers, each computer connected directly to the Internet must have a unique address. The familiar (IPv4) Internet addresses are in the form nnn.nnn.nnn.nnn, where nnn is a number from 0 to 255. This is the human-readable form; the actual IP address is stored in binary, as a 32-bit number.
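Here is a quick sketch of the human-readable and binary forms side by side, using Python's standard `ipaddress` module. The helper name `describe_ipv4` is made up for this example; the address is the one my server had at the time of writing.

```python
import ipaddress


def describe_ipv4(dotted):
    """Show one IPv4 address in its dotted, integer and binary forms."""
    address = ipaddress.IPv4Address(dotted)
    as_integer = int(address)
    return {
        "dotted": str(address),
        "octets": list(address.packed),  # the four numbers, each 0 to 255
        "integer": as_integer,  # the same address as one 32-bit number
        "binary": format(as_integer, "032b"),
    }


info = describe_ipv4("104.28.25.51")
print(info["octets"])  # [104, 28, 25, 51]
print(info["binary"])  # 01101000000111000001100100110011
```

Each of the four numbers fits in 8 bits, which is why none of them can be bigger than 255.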

Domain Name System

The Internet has the equivalent of a phone directory, called the Domain Name System (DNS). So when you click a link, the computer has to find the IP address of the computer the link is referring to (Google, Microsoft, Linux, …). For now, we’ll call this the ‘DNS lookup’. It’s similar, at the highest level, to looking up a phone number.

Here’s a practical exercise to familiarise you with DNS lookup. You can run a program on your computer to look up a domain name: nslookup google.com (do this on your command line).
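A program can ask the same question. This is a minimal sketch that hands the name to your operating system's resolver through Python's `socket.getaddrinfo`; the function name `lookup` is my own. The example uses `localhost` so that it works without a network connection, and what comes back depends on how your machine is set up.

```python
import socket


def lookup(name):
    """Return the IPv4 addresses the system resolver reports for a name."""
    results = socket.getaddrinfo(
        name, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each result is (family, type, proto, canonname, (address, port)).
    return sorted({sockaddr[0] for *_, sockaddr in results})


print(lookup("localhost"))
# With a working Internet connection, lookup("google.com") does the same
# job as the nslookup exercise above.
```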

Protocols, Ports, & Packets

Now we need to know about protocols. A protocol is essentially a set of rules for sending data in or between computers. So if one computer is going to send a file to another, it might use the File Transfer Protocol (FTP). For web pages, we need the HyperText Transfer Protocol (HTTP). For email, there’s POP3 and SMTP.

Port Numbers: These are like telephone extensions. Programs that listen for messages from other computers are given ‘well-known’ port numbers, so that the receiving computer knows who it’s for. Web servers are usually on port 80. If it’s a secure transaction (protocol https) then the port number is usually 443. The port number gets put in after the domain name, like so: http://bing.com:80/ Searches usually have a path like this: http://www.bing.com/search?q=keyword (notice the ‘?’)
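As a small illustration of how a default port gets chosen, here is a sketch that uses the port written in the address if there is one and otherwise falls back to the well-known port for the protocol. The table `WELL_KNOWN_PORTS` and the function `effective_port` are names invented for this example, and the table lists only a handful of protocols.

```python
from urllib.parse import urlsplit

# A few well-known port numbers (far from a complete list).
WELL_KNOWN_PORTS = {"http": 80, "https": 443, "ftp": 21, "smtp": 25, "pop3": 110}


def effective_port(url):
    """Pick the port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port  # the address names a port explicitly
    return WELL_KNOWN_PORTS[parts.scheme]  # otherwise use the usual one


print(effective_port("http://bing.com:80/"))
print(effective_port("https://www.bing.com/search?q=keyword"))
```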

Your request now has to go down into several layers of software to reach the Internet. These are the layers of the TCP/IP Protocol stack. They look like this:

Protocol Layer | Description
Application Protocols | Protocols specific to applications such as HTTP (web), SMTP (e-mail), FTP (file transfer), etc.
Transmission Control Protocol | TCP directs packets to a specific application on a computer using a port number.
Internet Protocol | IP directs packets to a specific computer using an IP address.
Hardware | Converts binary packet data to network signals and back. (E.g. ethernet network card, modem for phone lines, etc.)
IP Packets

HTTP functions as a request-response protocol in the client-server computing model. A web browser, for example, may be the client and an application running on a computer hosting a web site may be the server. The client submits an HTTP request message to the server. The server, which provides resources such as HTML files and other content, or performs other functions on behalf of the client, returns a response message to the client. The response contains completion status information about the request and may also contain requested content in its message body.
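HTTP/1.1 messages are plain text, so it is easy to sketch what the request and response described above look like on the wire. The two helpers below, `build_request` and `parse_response`, are invented for this illustration and handle only the simplest cases; the sample response is made up rather than captured from a real server.

```python
def build_request(host, path):
    """Build the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, version
        f"Host: {host}",  # which web site on that server we want
        "Connection: close",
        "",  # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a raw HTTP response into status, reason, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, status_code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status_code), reason, headers, body


print(build_request("tuxar.uk", "/long-path/cat-picture.jpg").decode("ascii"))

sample = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello"
print(parse_response(sample))
```

The status code (200 here) is the completion status information; the bytes after the blank line are the message body.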

Your browser is the application in the top layer. The request will be sent in one or more packets, depending on its size. The packets may even go via different routes through the Internet, but each packet also contains a sequence number so that they can be re-assembled in the right order once they have all arrived at the destination. At that point, they go back up through the layers into the web server.
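Here is a toy sketch of that splitting and re-assembly. It numbers the packets 0, 1, 2, … to keep things simple (real TCP sequence numbers count bytes rather than packets), and the function names and the 16-byte packet size are my own choices for the example.

```python
import random


def split_into_packets(data, size):
    """Cut data into numbered chunks of at most `size` bytes."""
    return [
        (number, data[offset:offset + size])
        for number, offset in enumerate(range(0, len(data), size))
    ]


def reassemble(packets):
    """Put the chunks back in order, whatever order they arrived in."""
    return b"".join(chunk for _, chunk in sorted(packets))


message = b"GET /long-path/cat-picture.jpg HTTP/1.1\r\nHost: tuxar.uk\r\n\r\n"
packets = split_into_packets(message, 16)
random.shuffle(packets)  # pretend the packets took different routes
print(reassemble(packets) == message)  # True
```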

Internet layering

The TCP layer adds things like the source & destination port numbers; sequence number; etc. The IP layer adds things like source & destination IP addresses, etc. The final packet looks like this: [IP header][TCP header][application data]
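The wrapping of one layer inside another can be sketched with Python's `struct` module. Be warned that the header layouts below are drastically cut down and invented for this illustration: real IP and TCP headers carry many more fields. The client address 192.0.2.10 and source port 50000 are also made up.

```python
import socket
import struct

IP_FORMAT = "!4s4s"  # toy IP header: source IP, destination IP
TCP_FORMAT = "!HHI"  # toy TCP header: source port, destination port, sequence


def wrap(data, src_ip, dst_ip, src_port, dst_port, sequence):
    """Build [IP header][TCP header][application data]."""
    tcp_header = struct.pack(TCP_FORMAT, src_port, dst_port, sequence)
    ip_header = struct.pack(
        IP_FORMAT, socket.inet_aton(src_ip), socket.inet_aton(dst_ip)
    )
    return ip_header + tcp_header + data


def unwrap(packet):
    """Peel the toy headers off again, as the receiving computer would."""
    ip_size = struct.calcsize(IP_FORMAT)
    tcp_size = struct.calcsize(TCP_FORMAT)
    src_ip, dst_ip = struct.unpack(IP_FORMAT, packet[:ip_size])
    src_port, dst_port, sequence = struct.unpack(
        TCP_FORMAT, packet[ip_size:ip_size + tcp_size]
    )
    return {
        "src_ip": socket.inet_ntoa(src_ip),
        "dst_ip": socket.inet_ntoa(dst_ip),
        "src_port": src_port,
        "dst_port": dst_port,
        "sequence": sequence,
        "data": packet[ip_size + tcp_size:],
    }


packet = wrap(b"GET / HTTP/1.1", "192.0.2.10", "104.28.25.51", 50000, 80, 1)
print(unwrap(packet))
```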

Into The Internet!

So your request gets sent out from your computer, into the big scary Internet! Basically, the Internet is a lot of globally interconnected networks and their routers. The primary function of a router is to forward a packet toward its destination network, which it works out from the packet’s destination IP address. To do this, a router searches the routing information stored in its routing table.

A routing table is a data table, held in the router’s memory, that stores route information about directly connected and remote networks. The routing table contains network/next hop associations. These associations tell a router that a particular destination can be optimally reached by sending the packet to a specific router that represents the “next hop” on the way to the final destination. For networks that are directly connected to the router, the entry instead names the outgoing (exit) interface through which the destination can be reached.
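To make the idea concrete, here is a toy routing table and lookup. The networks, next hops and interface name are all invented, and the rule of preferring the most specific matching network (the longest prefix) is the usual convention rather than something spelled out above.

```python
import ipaddress

# Hypothetical routing table: (destination network, where to send the packet).
ROUTING_TABLE = [
    (ipaddress.ip_network("192.168.1.0/24"), "directly connected, interface eth0"),
    (ipaddress.ip_network("10.0.0.0/8"), "next hop 192.168.1.254"),
    (ipaddress.ip_network("10.20.0.0/16"), "next hop 192.168.1.253"),
    (ipaddress.ip_network("0.0.0.0/0"), "default route, next hop 192.168.1.1"),
]


def next_hop(destination):
    """Find the most specific table entry that contains the destination."""
    address = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTING_TABLE if address in net]
    network, hop = max(matches, key=lambda entry: entry[0].prefixlen)
    return hop


print(next_hop("10.20.3.4"))
print(next_hop("104.28.25.51"))
```

The 0.0.0.0/0 entry matches every address, so anything the router has no better entry for goes to the default route.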

Illuminating my Web Server with a LAMP

Eventually your request reaches my web host (Pair) in the USA (I’m in the UK but I lived in the USA when I chose Pair). There’s a collection of programs there called a LAMP stack, for Linux, Apache, MySQL, and PHP. Linux is the computer operating system, Apache is the web server, MySQL is a database, and PHP is the programming language of the application that has overall responsibility for processing your request.

My web server is Apache, one of the oldest and most widely used web server programs there is. A web server stores, processes and delivers web pages to clients. The communication between client and server takes place using the Hypertext Transfer Protocol (HTTP). Pages delivered are most frequently HTML documents, which may include images, style sheets and scripts in addition to text content.

LAMP Architecture

Apache passes the request on to a more specialised piece of software called WordPress (the PHP part of LAMP), which helps me to create, manage, and nicely present all my web pages. At the time of writing, WordPress powers about a fifth of all the websites there are – more than any other software of its kind (known as a Content Management System, or CMS).

WordPress looks into another piece of software called a database (specifically MySQL) to find all the stuff needed to build the text of that web page (this text, plus links to the pictures, style sheets, and JavaScript) and hands it back to the web server, which then sends it back to your computer.

Now to be clear, it doesn’t send all the stuff needed for the web page in one go; it puts in links to all the bits and pieces that are needed. The images, JavaScript, etc. will be collected later, possibly from other computers, or possibly not at all because your computer already has them.
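Here is a very loose sketch of the idea of building a page from a database. It is not how WordPress actually does it: WordPress is written in PHP and talks to MySQL, whereas this stand-in uses Python with an in-memory SQLite database, and the table layout, function names and file paths are all invented for the illustration.

```python
import sqlite3
from html import escape


def make_database():
    """Create a throwaway database holding one hypothetical post."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE posts (slug TEXT PRIMARY KEY, title TEXT, body TEXT, image TEXT)")
    db.execute(
        "INSERT INTO posts VALUES (?, ?, ?, ?)",
        ("how-the-web-works", "How the Web works",
         "Visiting a web site seems so simple.", "/images/picture.jpg"),
    )
    return db


def render_page(db, slug):
    """Look a post up and build the HTML text for it, or None if missing."""
    row = db.execute(
        "SELECT title, body, image FROM posts WHERE slug = ?", (slug,)
    ).fetchone()
    if row is None:
        return None
    title, body, image = row
    return (
        "<html><head>"
        f"<title>{escape(title)}</title>"
        '<link rel="stylesheet" href="/style.css">'
        "</head><body>"
        f"<p>{escape(body)}</p>"
        f'<img src="{escape(image)}">'
        "</body></html>"
    )


print(render_page(make_database(), "how-the-web-works"))
```

The page that comes out contains only the address of the picture and the style sheet, not their contents, which is why the browser has more fetching to do afterwards.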

Your Computer Gets my Web Page

HyperText Markup Language

The beginnings of your web page arrive in your computer’s web browser. It might look like this:

<!DOCTYPE html>
<html>
<head>
<title>Just a Simple Web Page</title>
</head>
<body>
<p>This is a picture</p>
<img src="/images/picture.jpg" />
<p>This is a <a href="http://Tuxar.uk/">link</a>.</p>
</body>
</html>

The words that appear in angle brackets are tags, which mark out HTML elements. The content of the title tag is what will appear in your browser’s title bar (at the top). The content of the p (paragraph) tag is text that will appear in your web page. The content of the src attribute is a link to a picture, which your browser now has to fetch. In this case it’s on the web site you got the page from, but if you’ve been to that web site recently, your browser may already have it in a cache (a store of recently fetched stuff from the web). Once it has the picture it can display your web page fully (it may even have started before it had everything needed). There’s also a link to a website, in the a (anchor) tag.
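As a rough illustration of what the browser does with those tags, here is a sketch that reads the sample page with Python's standard `html.parser` module and picks out the title, the picture it still has to fetch, and the link. The class name `ResourceFinder` and the page address `http://tuxar.uk/` used to complete the picture's address are assumptions made for the example; a real browser does far more than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE = """<head>
<title>Just a Simple Web Page</title>
</head>
<body>
<p>This is a picture</p>
<img src="/images/picture.jpg" />
<p>This is a <a href="http://Tuxar.uk/">link</a>.</p>
</body>"""


class ResourceFinder(HTMLParser):
    """Collect the title, the pictures to fetch, and the links on a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.to_fetch = []  # things needed to finish drawing the page
        self.links = []  # places the reader might click through to later
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and "src" in attributes:
            self.to_fetch.append(attributes["src"])
        elif tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


finder = ResourceFinder()
finder.feed(PAGE)
finder.close()
print(finder.title)
# The src is relative, so it is completed using the page's own address.
print([urljoin("http://tuxar.uk/", src) for src in finder.to_fetch])
print(finder.links)
```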

The HTML displayed above is trivial compared to the HTML in most web pages. For an idea of what it more typically looks like, right-click on this page and select View page source. In the early days of the web, we wrote HTML by hand! But our web pages were very simple; modern web pages are much more complex.

The picture on the left shows how long it takes to load elements for the Tuxar.uk homepage. We use several techniques to make it as fast as we can, such as by using a Content Distribution Network (CDN). This works by placing our content on several web servers around the world, so that it can be served from the closest web server (in the CDN) to you. It has a shorter distance to travel and fewer network hops to reach you, thus taking much less time – a couple of seconds instead of several if you’re on the other side of the world!

Cat Nova Richmond

And that’s how the web works!
