User:Millosh/Internet and Programming for Linguists/The Big Picture/Internet

To help you understand how the Internet works, we'll look at the things that happen when you do a typical Internet operation — pointing a browser at the front page of this document at its home on the Web at the Linux Documentation Project. This document is

http://www.tldp.org/HOWTO/Unix-and-Internet-Fundamentals-HOWTO/index.html

which means it lives in the file HOWTO/Unix-and-Internet-Fundamentals-HOWTO/index.html under the World Wide Web export directory of the host www.tldp.org.

Names and locations
The first thing your browser has to do is to establish a network connection to the machine where the document lives. To do that, it first has to find the network location of the host www.tldp.org (‘host’ is short for ‘host machine’ or ‘network host'; www.tldp.org is a typical hostname). The corresponding location is actually a number called an IP address (we'll explain the ‘IP’ part of this term later).

To do this, your browser queries a program called a name server. The name server may live on your machine, but it's more likely to run on a service machine that yours talks to. When you sign up with an ISP, part of your setup procedure will almost certainly involve telling your Internet software the IP address of a nameserver on the ISP's network.

The name servers on different machines talk to each other, exchanging and keeping up to date all the information needed to resolve hostnames (map them to IP addresses). Your nameserver may query three or four different sites across the network in the process of resolving www.tldp.org, but this usually happens very quickly (as in less than a second). We'll look at how nameservers detail in the next section.

The nameserver will tell your browser that www.tldp.org's IP address is 152.19.254.81; knowing this, your machine will be able to exchange bits with www.tldp.org directly.

The Domain Name System
The whole network of programs and databases that cooperates to translate hostnames to IP addresses is called ‘DNS’ (Domain Name System). When you see references to a ‘DNS server’, that means what we just called a nameserver. Now I'll explain how the overall system works.

Internet hostnames are composed of parts separated by dots. A domain is a collection of machines that share a common name suffix. Domains can live inside other domains. For example, the machine www.tldp.org lives in the .tldp.org subdomain of the .org domain.

Each domain is defined by an authoritative name server that knows the IP addresses of the other machines in the domain. The authoritative (or ‘primary') name server may have backups in case it goes down; if you see references to a secondary name server or (‘secondary DNS') it's talking about one of those. These secondaries typically refresh their information from their primaries every few hours, so a change made to the hostname-to-IP mapping on the primary will automatically be propagated.

Now here's the important part. The nameservers for a domain do not have to know the locations of all the machines in other domains (including their own subdomains); they only have to know the location of the nameservers. In our example, the authoritative name server for the .org domain knows the IP address of the nameserver for .tldp.org but not the address of all the other machines in .tldp.org.

The domains in the DNS system are arranged like a big inverted tree. At the top are the root servers. Everybody knows the IP addresses of the root servers; they're wired into your DNS software. The root servers know the IP addresses of the nameservers for the top-level domains like .com and .org, but not the addresses of machines inside those domains. Each top-level domain server knows where the nameservers for the domains directly beneath it are, and so forth.

DNS is carefully designed so that each machine can get away with the minimum amount of knowledge it needs to have about the shape of the tree, and local changes to subtrees can be made simply by changing one authoritative server's database of name-to-IP-address mappings.

When you query for the IP address of www.tldp.org, what actually happens is this: First, your nameserver asks a root server to tell it where it can find a nameserver for .org. Once it knows that, it then asks the .org server to tell it the IP address of a .tldp.org nameserver. Once it has that, it asks the .tldp.org nameserver to tell it the address of the host www.tldp.org.

Most of the time, your nameserver doesn't actually have to work that hard. Nameservers do a lot of cacheing; when yours resolves a hostname, it keeps the association with the resulting IP address around in memory for a while. This is why, when you surf to a new website, you'll usually only see a message from your browser about "Looking up" the host for the first page you fetch. Eventually the name-to-address mapping expires and your DNS has to re-query — this is important so you don't have invalid information hanging around forever when a hostname changes addresses. Your cached IP address for a site is also thrown out if the host is unreachable.

Packets and routers
What the browser wants to do is send a command to the Web server on www.tldp.org that looks like this:

GET /LDP/HOWTO/Fundamentals.html HTTP/1.0

Here's how that happens. The command is made into a packet, a block of bits like a telegram that is wrapped with three important things; the source address (the IP address of your machine), the destination address (152.19.254.81), and a service number or port number (80, in this case) that indicates that it's a World Wide Web request.

Your machine then ships the packet down the wire (your connection to your ISP, or local network) until it gets to a specialized machine called a router. The router has a map of the Internet in its memory — not always a complete one, but one that completely describes your network neighborhood and knows how to get to the routers for other neighborhoods on the Internet.

Your packet may pass through several routers on the way to its destination. Routers are smart. They watch how long it takes for other routers to acknowledge having received a packet. They also use that information to direct traffic over fast links. They use it to notice when another router (or a cable) have dropped off the network, and compensate if possible by finding another route.

There's an urban legend that the Internet was designed to survive nuclear war. This is not true, but the Internet's design is extremely good at getting reliable performance out of flaky hardware in an uncertain world. This is directly due to the fact that its intelligence is distributed through thousands of routers rather than concentrated in a few massive and vulnerable switches (like the phone network). This means that failures tend to be well localized and the network can route around them.

Once your packet gets to its destination machine, that machine uses the service number to feed the packet to the web server. The web server can tell where to reply to by looking at the command packet's source IP address. When the web server returns this document, it will be broken up into a number of packets. The size of the packets will vary according to the transmission media in the network and the type of service.

TCP and IP
To understand how multiple-packet transmissions are handled, you need to know that the Internet actually uses two protocols, stacked one on top of the other.

The lower level, IP (Internet Protocol), is responsible for labeling individual packets with the source address and destination address of two computers exchanging information over a network. For example, when you access http://www.tldp.org, the packets you send will have your computer's IP address, such as 192.168.1.101, and the IP address of the www.tldp.org computer, 152.2.210.81. These addresses work in much the same way that your home address works when someone sends you a letter. The post office can read the address and determine where you are and how best to route the letter to you, much like a router does for Internet traffic.

The upper level, TCP (Transmission Control Protocol), gives you reliability. When two machines negotiate a TCP connection (which they do using IP), the receiver knows to send acknowledgements of the packets it sees back to the sender. If the sender doesn't see an acknowledgement for a packet within some timeout period, it resends that packet. Furthermore, the sender gives each TCP packet a sequence number, which the receiver can use to reassemble packets in case they show up out of order. (This can easily happen if network links go up or down during a connection.)

TCP/IP packets also contain a checksum to enable detection of data corrupted by bad links. (The checksum is computed from the rest of the packet in such a way that if either the rest of the packet or the checksum is corrupted, redoing the computation and comparing is very likely to indicate an error.) So, from the point of view of anyone using TCP/IP and nameservers, it looks like a reliable way to pass streams of bytes between hostname/service-number pairs. People who write network protocols almost never have to think about all the packetizing, packet reassembly, error checking, checksumming, and retransmission that goes on below that level.

HTTP, an application protocol
Now let's get back to our example. Web browsers and servers speak an application protocol that runs on top of TCP/IP, using it simply as a way to pass strings of bytes back and forth. This protocol is called HTTP (Hyper-Text Transfer Protocol) and we've already seen one command in it — the GET shown above.

When the GET command goes to www.tldp.org's webserver with service number 80, it will be dispatched to a server daemon listening on port 80. Most Internet services are implemented by server daemons that do nothing but wait on ports, watching for and executing incoming commands.

If the design of the Internet has one overall rule, it's that all the parts should be as simple and human-accessible as possible. HTTP, and its relatives (like the Simple Mail Transfer Protocol, SMTP, that is used to move electronic mail between hosts) tend to use simple printable-text commands that end with a carriage-return/line feed.

This is marginally inefficient; in some circumstances you could get more speed by using a tightly-coded binary protocol. But experience has shown that the benefits of having commands be easy for human beings to describe and understand outweigh any marginal gain in efficiency that you might get at the cost of making things tricky and opaque.

Therefore, what the server daemon ships back to you via TCP/IP is also text. The beginning of the response will look something like this (a few headers have been suppressed):

HTTP/1.1 200 OK Date: Sat, 10 Oct 1998 18:43:35 GMT Server: Apache/1.2.6 Red Hat Last-Modified: Thu, 27 Aug 1998 17:55:15 GMT Content-Length: 2982 Content-Type: text/html

These headers will be followed by a blank line and the text of the web page (after which the connection is dropped). Your browser just displays that page. The headers tell it how (in particular, the Content-Type header tells it the returned data is really HTML).