IB/Group 4/Computer Science/Web Science

Option C in the IB Computer Science course.

Creating the Web
The terms the Internet, an internet, and the World Wide Web (otherwise referred to as the web) are commonly mixed up. However, each is quite different.

An internet simply refers to a set of interconnected networks. The Internet refers to the global computing network that uses standardized communication protocols, including IP addressing. In other words, the Internet is a wide-area network that spans the planet. The World Wide Web (the web) is the information space composed of the various web resources that can be accessed via the Internet. In other words, the World Wide Web is a service that runs on the Internet.

The analogy can be made that the Internet is a restaurant and the web is its most popular dish.

Growth of the Web
Generally, it can be characterized that the change in the web was a movement from personal sites to blogs, or publication to participation. It was a move from static pages to dynamic ones.

Early Forms of the Web
Sometimes referred to as "Web 1.0", the early stages of the web were made up of personal, static web pages hosted on ISP (internet service provider) web servers or on free web hosting services. Before the advent of dynamic programming languages such as Perl, PHP, and Python, typical design elements included online guestbooks instead of comment sections, and HTML forms that were mailto forms.

Web 1.0 is associated with the business model of Netscape - focusing on software creation, updates, and bug fixes and the distribution of such to end users.

Web 2.0
Web 2.0 refers to a web that emphasizes user participation and contribution on sites such as social media sites and blogs. It featured client-side technologies such as Ajax and JavaScript as well as dynamic programming languages. The focus on user interface, application software, and storage of files has been referred to as "network as a platform". Key features of Web 2.0 include:


 * Folksonomy - free classification of information (such as in tagging)
 * User Participation - site users are encouraged to add value/content to the site
 * Mass Participation - near-universal web access has led to a differentiation of concerns across the user base
 * SaaS (Software-as-a-Service)

In contrast to Web 1.0, Web 2.0 is associated with Google, which focused not on creating end-user software but providing a service based on existing data.

The Semantic Web
The Semantic Web extends the web through standards by the World Wide Web Consortium (W3C) that promote common data formats and common exchange protocols. For example, the Resource Description Framework (RDF) specification was promoted as a general method for the conceptual modelling of web resources using subject-predicate-object expressions (e.g. subject: "the table", predicate: "has the length of", object: "one meter").
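The subject-predicate-object idea can be sketched in a few lines of Python, modelling triples as plain tuples (a real application would use a library such as rdflib; the second triple here is a made-up example):

```python
# A minimal sketch of RDF-style subject-predicate-object triples,
# modelled as plain Python tuples rather than a real RDF store.
triples = [
    ("the table", "has the length of", "one meter"),
    ("the table", "is made of", "wood"),  # hypothetical extra triple
]

def objects_for(subject, predicate, store):
    """Return every object matching a subject-predicate pair."""
    return [o for s, p, o in store if s == subject and p == predicate]

print(objects_for("the table", "has the length of", triples))  # ['one meter']
```

Querying a triple store amounts to pattern-matching on these three positions, which is what makes the format machine-readable.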

Protocol and Addressing
Protocols are sets of rules for communication that ensure proper, compatible communication so that a certain process can take place successfully, e.g. TCP/IP. Protocols ensure the universality of the web. Standards, on the other hand, are sets of technical specifications that should be followed to allow for functionality, but do not necessarily have to be followed for a process to succeed, e.g. HTML. Without them, communicating would be like speaking a foreign language without knowing it.

e.g. without TCP, there would be no transport protocol and packets would be lost.

e.g. without HTML, there would be no standard markup language for displaying webpages, and different web browsers might not display all pages consistently.

Web Browser
A software tool for retrieving, presenting, and traversing information resources on the web.

TCP and IP together comprise a suite of protocols that carry out the basic functionality of the web.

Internet Protocol (IP)
IP is a network protocol that defines the routing of data packets to addresses. Every computer holds a unique IP address, and IP manages the process of getting data to its destination.
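What an IP address encodes can be inspected with Python's standard ipaddress module; the address below is from the reserved documentation range, used here purely as an example:

```python
import ipaddress

# Parse and inspect an example IPv4 address (192.0.2.x is a
# documentation-only range, so this is not a real host).
addr = ipaddress.ip_address("192.0.2.1")
net = ipaddress.ip_network("192.0.2.0/24")

print(addr.version)   # 4
print(addr in net)    # True: routers compare address prefixes like this
```

Membership in a network (`addr in net`) is the prefix comparison that routing decisions are based on.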

Transmission Control Protocol (TCP)
Information sent over the internet is broken into "packets" and sent through different routes to reach a destination. TCP creates data packets, puts them back together in the correct order, and checks that no packets were lost.
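The segment-and-reassemble idea can be sketched conceptually (this is not real TCP, just an illustration of numbering packets so an out-of-order arrival can be rebuilt):

```python
import random

# Conceptual sketch of TCP's job: number the packets, tolerate
# out-of-order arrival, and reassemble the original byte stream.
def segment(data: bytes, size: int):
    """Split data into (sequence_number, chunk) packets."""
    return [(i, data[i:i + size]) for i in range(0, len(data), size)]

def reassemble(packets):
    """Reorder packets by sequence number and rebuild the stream."""
    return b"".join(chunk for _, chunk in sorted(packets))

message = b"information sent over the internet"
packets = segment(message, 5)
random.shuffle(packets)  # packets may take different routes and arrive out of order
assert reassemble(packets) == message
```

Real TCP additionally acknowledges receipt and retransmits lost segments, which this sketch omits.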

File Transfer Protocol (FTP)
FTP is the protocol that provides the methods for sharing or copying files over a network. It is primarily used for uploading files to a web site, and certain download sites may use an FTP server. However, HTTP is more common for downloading. When using FTP, the URL will reflect this with an ftp:// scheme instead of http://.

Hypertext Transfer Protocol (HTTP)
HTTP is a specific set of internet protocols used to communicate between web servers and web browsers. HTTP is a text-based, stateless protocol: a new connection must be established for each new user request, and it communicates without knowledge of the underlying communications network.
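Because HTTP is text-based, a request is just lines of text. The sketch below builds a minimal GET request by hand (in practice a library such as http.client or requests would do this):

```python
# Build a minimal HTTP/1.1 GET request as plain text.
def build_get_request(host: str, path: str = "/") -> str:
    return (
        f"GET {path} HTTP/1.1\r\n"   # request line: method, path, version
        f"Host: {host}\r\n"          # required header in HTTP/1.1
        "Connection: close\r\n"      # ask the server to close after responding
        "\r\n"                       # blank line ends the headers
    )

request = build_get_request("example.com")
print(request.splitlines()[0])  # GET / HTTP/1.1
```

Sending exactly this text over a TCP socket to port 80 would yield a text response beginning with a status line such as `HTTP/1.1 200 OK`.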

Hypertext Transfer Protocol Secure (HTTPS)
As HTTP does not provide much security, HTTPS was developed; it adds encryption to a connection using Transport Layer Security (TLS) or, formerly, Secure Sockets Layer (SSL).

Uniform Resource Locator (URL)
A standard way of specifying the location of a webpage.

Uniform Resource Identifier (URI)

A means of identifying a specific resource on a website. URLs have typical characteristics that make them URIs.

A URI is a string that identifies a resource. A URL is a specific type of URI that provides the address of a web resource as well as the means to retrieve it. For example, a URL identifies a protocol for retrieval (such as http), a resource name or address (such as example.com), and often a specific file.
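These parts can be pulled out with Python's standard urllib.parse; the file name index.html below is a hypothetical example:

```python
from urllib.parse import urlparse

# Split a (hypothetical) URL into the parts described above:
# scheme (protocol), network location (address), and path (specific file).
parts = urlparse("http://example.com/index.html")

print(parts.scheme)   # http
print(parts.netloc)   # example.com
print(parts.path)     # /index.html
```

The scheme tells the browser which protocol to use, the netloc is resolved to an IP address via DNS, and the path names the resource on that server.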

Domain Name Server (DNS)
A Domain Name Server is a special type of server that relates a web address to an IP address, acting somewhat like a directory. DNS uses a hierarchical, decentralized naming system: queries pass from root DNS servers to top-level domain servers, and then to the authoritative DNS servers below each top-level domain.
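The hierarchy can be modelled as nested lookups; this is a toy sketch (real resolvers query root, TLD, and authoritative servers over the network, and all the data below is made up):

```python
# Toy model of hierarchical DNS resolution: root zone -> TLD zone
# -> authoritative zone -> record. All entries are invented examples.
HIERARCHY = {
    "com": {                        # top-level domain zone
        "example.com": {            # authoritative zone for the domain
            "www.example.com": "93.184.216.34",
        }
    }
}

def resolve(name: str) -> str:
    """Walk root -> TLD -> authoritative, like a recursive resolver."""
    tld = name.rsplit(".", 1)[-1]
    domain = ".".join(name.split(".")[-2:])
    return HIERARCHY[tld][domain][name]

print(resolve("www.example.com"))  # 93.184.216.34
```

Each level of the dictionary stands in for a different set of servers, which is why no single server needs to know every name on the Internet.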

Hypertext Mark-up Language (HTML)
HTML is the standard markup language used to make web pages. Characteristics:
 * Allows for embedded images/objects or scripts
 * HTML predefined tags structure the document
 * Tags are marked-up text strings, elements are "complete" tags with opening and closing tags, and attributes modify values of an element
 * Typically paired with CSS for style

Cascading Style Sheet (CSS)
CSS style sheets describe how HTML elements are displayed. One style sheet can control the layout of several web pages at once.

Extensible Mark-Up Language (XML)
XML is a markup specification language that defines rules for encoding documents (to store and transport data) in a format that is both human- and machine-readable. XML, as a metalanguage, supports the creation of custom tags (unlike HTML), using Document Type Definition (DTD) files to define the tags. XML files are data, not software.
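The human- and machine-readable point can be shown with Python's standard XML parser; the `<library>`/`<book>` tags below are made-up custom tags:

```python
import xml.etree.ElementTree as ET

# Custom tags (a hypothetical <book> record) parsed with the
# standard library: the same text is readable by people and programs.
document = """
<library>
  <book id="1">
    <title>Web Science</title>
  </book>
</library>
"""

root = ET.fromstring(document)
book = root.find("book")
print(book.get("id"))            # 1
print(book.find("title").text)   # Web Science
```

Unlike HTML, nothing predefines what `<book>` means; a DTD or schema can be supplied to validate which tags are allowed.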

Extensible Stylesheet Language Transformations (XSLT)
XSLT is a language for transforming XML documents into other XML documents or other formats such as HTML. It creates a new document based on the content of the existing one.

Javascript
JavaScript is a dynamic programming language widely utilized to create web resources. Characteristics include:
 * Client side
 * Supports object-oriented programming styles
 * Does not include input/output
 * Can be used to embed images or documents, create dynamic forms, animation, slideshows, and validation for forms
 * Also used in games and applications

Web Pages
The head contains the title, meta tags, and other metadata. Metadata describe the document itself or associate it with related resources such as scripts and style sheets. The body contains headings, paragraphs, and other content.

Title defines the title in the browser’s toolbar.

Meta tags are snippets of text that describe a page's content but don't appear on the page itself, only in the page's code. They help search engines find relevant websites.

Personal pages are pages created by individuals for personal content rather than for affiliation with an organization. They are usually informative or entertaining, containing information on topics such as personal hobbies or opinions.

Blogs or weblogs are a mechanism for publishing periodic articles on a website.

Search Engine Pages or Search Engine Results Pages (SERPs) display the results returned by a search engine for a query.

Forums or online discussion boards are usually organized by topic; people hold conversations there through posted messages. A forum typically has different user groups which define each user's roles and abilities.

Static web pages contain the same content on each load of the page, whereas dynamic web pages' content can change depending on user input. Static websites are faster and cheaper to develop, host, and maintain, but lack the functionality and easy updating that dynamic websites have. Dynamic web pages include e-commerce systems and discussion boards.

Dynamic web pages can use PHP, the ASP.NET framework, or Java Server Pages (JSP) scriptlets. JSP scriptlets are small pieces of executable code intertwined with HTML; JSP is server-side, whereas JavaScript is client-side. The ASP.NET framework can use single-page applications (SPA) or MVC (Model-View-Controller) models to generate dynamic web pages or applications, and hosts a variety of .NET languages, such as C# with Razor syntax. PHP is a server-side scripting language for web development and can be embedded into HTML code or used with templates or frameworks.

Server-side scripting runs on the server: a request must be sent and data returned. It is more secure for the client. Examples include PHP, JSP, and ASP.NET.
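The essence of server-side generation is that the server builds the HTML from data before sending it, so the client only ever sees the result. A minimal sketch in Python (the function name and page content are hypothetical; PHP, JSP, and ASP.NET do the equivalent):

```python
import html

# Server-side rendering sketch: turn user data into a finished HTML
# page on the server. Escaping the input is what keeps it safe.
def render_greeting(username: str) -> str:
    safe = html.escape(username)  # never trust user input verbatim
    return f"<html><body><h1>Hello, {safe}!</h1></body></html>"

page = render_greeting("<Ada>")
print(page)
```

Because the escaping happens on the server, a malicious username like `<script>` arrives at the browser as inert text rather than executable markup.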

Client-side scripting runs the script on the client's side. It can pose a security risk to the client, but is faster. Examples include JavaScript and JSON.

A connection string is a string that specifies information about a data source and the means of connecting to it. Connection strings are commonly used for database connections.
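As a small illustration, SQLite's "connection string" is simply the database location; `":memory:"` asks for a temporary in-memory database (the table and data below are made up):

```python
import sqlite3

# The string passed to connect() identifies the data source;
# ":memory:" means a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")
rows = conn.execute("SELECT name FROM users").fetchall()
print(rows)   # [('alice',)]
conn.close()
```

Connection strings for client-server databases typically also carry a host, port, user name, and password in one string.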

CGI is a standard way for web servers to interface with executable programs installed on a server that generate web pages dynamically.

Layers of the Web
The surface web is anything able to be found and accessed by search engines. The deep web includes web pages that cannot be found by search engines because they are protected by a need for authentication; they can usually only be accessed by already knowing the link or having the proper credentials. The dark web, on the other hand, can usually only be reached through Tor, as access requires encryption and anonymization.

Search Engines
A web search engine is a site that helps you find other websites through methods such as keyword searching and concept-based searching. It discovers pages by following the links between websites.

Searching Algorithms
Searching means looking at the queries that have been entered and scanning the index for matches. Factors taken into account when searching include: term frequency, zone indexes (placing different weights on, for example, the title versus the description), relevance feedback, and the vector space model (looking at the cosine similarity between the query and a document).
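The vector space model can be sketched in a few lines: documents become term-frequency vectors and are ranked by cosine similarity to the query. The vocabulary and document texts below are invented examples:

```python
import math

# Vector space model sketch: represent texts as term-frequency
# vectors over a fixed vocabulary, then compare by cosine similarity.
def tf_vector(text: str, vocabulary: list) -> list:
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["web", "science", "cooking"]
query = tf_vector("web science", vocab)
doc1 = tf_vector("web science and the web", vocab)
doc2 = tf_vector("cooking recipes", vocab)
print(cosine(query, doc1) > cosine(query, doc2))  # True
```

Real engines weight the counts (e.g. tf-idf) and use far larger vocabularies, but the ranking principle is the same.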

PageRank is a link analysis algorithm used by Google that assigns a numerical weighting, PR(E) (the PageRank of E), to each element of a set of hyperlinked pages. A hyperlink to a page counts as a vote of support for that page: importance by association. A page's rank is computed from the PageRank of each page linking to it, divided by the number of outgoing links on that linking page. Altogether, the PageRanks sum to 1; they form a probability distribution.

The Hyperlink-Induced Topic Search (HITS) algorithm is a link analysis algorithm that also rates web pages, using hubs and authorities. A good hub points to many pages; a good authority is a page linked to by many hubs. Each page is assigned two scores: its authority, which estimates the value of its content, and its hub value, which estimates the value of its links to other pages. HITS first generates a root set (the most relevant pages) through a text-based algorithm. A base set is then generated by augmenting the root set with the web pages that link to it or are linked from it. The base set and all the hyperlinks within it form a focused subgraph upon which HITS is performed.
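The hub/authority iteration can be sketched on a made-up base set; here page A links out widely (a hub) and page D is widely linked to (an authority):

```python
import math

# HITS sketch on an invented focused subgraph.
links = {"A": ["C", "D"], "B": ["D"], "C": [], "D": []}
hubs = {p: 1.0 for p in links}
auths = {p: 1.0 for p in links}

for _ in range(20):
    # authority score: sum of hub scores of the pages linking to you
    auths = {p: sum(hubs[q] for q in links if p in links[q]) for p in links}
    norm = math.sqrt(sum(a * a for a in auths.values()))
    auths = {p: a / norm for p, a in auths.items()}
    # hub score: sum of authority scores of the pages you link to
    hubs = {p: sum(auths[q] for q in links[p]) for p in links}
    norm = math.sqrt(sum(h * h for h in hubs.values()))
    hubs = {p: h / norm for p, h in hubs.items()}

print(max(auths, key=auths.get))  # D (best authority)
print(max(hubs, key=hubs.get))    # A (best hub)
```

The two scores reinforce each other each round, which is why the algorithm alternates the authority and hub updates.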

Web Crawlers
Web crawlers, also known as web spiders, are internet bots that systematically index websites by following links while collecting information about each site. A crawler also copies the site for the index.

A bot, also known as a web robot, is a software application that runs automated tasks or scripts over the Internet, usually repetitive tasks, and can do so at a high rate.

Web crawlers can be stopped from accessing a page with a robots.txt file, through the Robots Exclusion Protocol.
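Python ships a parser for this protocol; the rules and the crawler name below are made-up examples:

```python
from urllib import robotparser

# Robots Exclusion Protocol sketch: a polite crawler checks
# robots.txt before fetching. These rules are an invented example.
rules = """
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.com/private/data.html"))  # False
print(rp.can_fetch("MyCrawler", "http://example.com/public/page.html"))   # True
```

Note the protocol is advisory: it only stops crawlers that choose to honour it.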

Meta tags are used for indexing keywords, for retrieval (determining whether an index entry is relevant to a search query), and may sometimes be used for ranking. Google, for example, gives meta tags no weight in ranking.

Crawling is the process of exploring every linked page and returning a copy of that page. Several web crawlers, or multiple processes running in parallel, may be used at once to maximize the download rate. A crawler has to be careful not to download the same site more than once.

Indexing is a process where each page is analysed for words and then added to an index of websites. Indexing allows for speedy searching and provides high relevancy.


Factors that affect the value of a link to a target page include:
 * Trustworthiness of linking domain/hub
 * Popularity of linking page
 * Relevancy of content between source and target page
 * Anchor text used in link
 * Amount of links to the same page on source page
 * Amount of domains linking to target page
 * Relationship between source and target domains
 * Variations of anchor text in link to target page

Search Engine Optimization

 * Allow search engines to find your site
 * Have a link-worthy site
 * Identify key words, metadata
 * Ensure search-friendly architecture
 * Have quality content
 * Update content regularly

Black hat SEO uses aggressive strategies that exploit search engines rather than focusing on a human audience - short-term return. Techniques include:
 * Blog spamming
 * Link farms
 * Hidden text
 * Keyword stuffing
 * Parasite hosting
 * Cloaking

White hat techniques are "within" guidelines and considered ethical - long-term return. Techniques include:
 * Guest blogging
 * Link baiting
 * Quality content
 * Site optimization

Mobile Computing
Mobile computing is human-computer interaction during which the computer can be expected to be transported during normal usage (or otherwise is mobile). Most popular devices include the smart phone and the tablet.

Ubiquitous computing
Ubiquitous computing is the concept of computing made to appear anytime and anywhere: an overwhelming spread of computing (pervasive computing). It comes in different forms, e.g. laptops and tablets.

Peer-2-Peer Networks
Peer-to-peer (P2P) networks are ones in which each computer or node acts as both client and server, which allows resources to be shared in common by all within the network. Autonomy from central servers is achievable. An example of P2P is torrenting.

Grid Computing
Grid computing is the collection of computer resources from multiple locations to reach a common goal. It is distinguished from cluster computing in that grid computing assigns a specific role to each node. Grids can be used for software libraries, and provide a persistent, standards-based service infrastructure.

Ubiquitous computing is being perpetuated by mobile computing. The idea is spreading and manifesting.

P2P is more about assuring connectivity and a network of shared resources, while grid computing focuses more upon infrastructure. Both deal with the organization of resource sharing within virtual communities.

Ubiquitous computing is commonly characterized by multi-device interaction (as in P2P and grid computing), but the concepts are not synonymous.

The grid in grid computing links together resources (PCs, workstations, servers, storage elements) and provides the mechanisms needed to access them.

Interoperability and Open Standards
Interoperability is a property of a system to work with other products without any restrictions in access or implementation.

An open standard is a standard that is publicly available and has various rights to use associated with it.


Common distributed architectures include:
 * Peer-to-peer: architectures where there are no special machines that provide a service or manage the network resources. Instead, all responsibilities are uniformly divided among all machines, known as peers. Peers can serve both as clients and as servers.
 * Client–server: architectures where smart clients contact the server for data then format and display it to the users. Input at the client is committed back to the server when it represents a permanent change.
 * Three-tier: architectures that move the client intelligence to a middle tier so that stateless clients can be used. This simplifies application deployment. Most web applications are three-tier.

A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal; no single computer acts as a 'boss' or head, as all of them are on the same level.

Compression
Lossless compression recovers every single bit of the original data when decompressed (e.g. GIF).

Lossy compression eliminates redundant or less important information (e.g. JPEG).


Notes on decompression:
 * Full recovery of the original can only be achieved with lossless compression.
 * It is helpful if you do not have the original file.
 * It might not bring every bit back and some minor details might be missing.
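Lossless round-tripping can be demonstrated with Python's zlib module (the sample data is made up; redundant data compresses well):

```python
import zlib

# Lossless compression sketch: zlib shrinks repetitive data and
# decompression recovers every original bit.
original = b"the web the web the web the web" * 10
compressed = zlib.compress(original)

assert len(compressed) < len(original)          # smaller on redundant data
assert zlib.decompress(compressed) == original  # every bit recovered
print(len(original), len(compressed))
```

A lossy codec such as JPEG would instead discard detail the eye is unlikely to notice, so the round trip would not be bit-identical.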

The Evolving Web
Web 2.0 and the increase of dynamic web pages have allowed for user contribution to greatly proliferate and the widespread usage of social networking, blogging, and comment sections.

Cloud Computing
Cloud computing is the hosting of services on remote servers on the internet to store, manage, and process data, rather than on a local server or personal computer. Cloud computing shares resources more widely than the client-server architecture, which merely refers to the communication between client and server and the distribution of "responsibility".

Public cloud
 * Anyone can access it.
 * Maintenance and updating are not up to the company.

Private cloud
 * A hosted data center where the data is protected by a firewall.
 * A great option for companies that already have expensive data centers, as they can use their current infrastructure.
 * However, maintenance and updating are up to the company.

Hybrid approach
 * Uses both private and public clouds.

Effects of the use of cloud computing for organizations:
 * Less costly
 * Device and location independence
 * Maintenance is easier
 * Performance is easily monitored
 * Security raises new concerns, as data is stored off-site

Intellectual Property and Privacy
Creative Commons licenses give the freedom to share, adapt, and even commercially use information. Different licenses permit different redistributions; some may allow usage without crediting, but none allow presenting the work as one's own intellectual property. Privacy concerns include: what information is shared with visited sites, how that information is used, who that information is shared with, and whether that information is used to track users.

Identification: the process of comparing a data sample against all of the system's database of reference templates in order to establish the identity of the person trying to gain access to the system.

Authentication: a process in which the credentials provided are compared to those on file in a database of authorized users’ information on a local operating system or within an authentication server.

These components (privacy, identification, and authentication) enable safe and secure internet browsing.
Topics in the future of networking include:
 * Future Networks and Wireless Ad hoc Networks
 * Future Networks in Vehicular Ad Hoc Networks
 * 5G and Internet of Things (IoT)
 * Future Internet applications in IoT
 * Steps towards Future of Smart Grid Communications
 * Routing in Machine to Machine (M2M) and Future Networks
 * Fusion of Future Networking Technologies and Big Data / Fog Computing
 * Future Internet and 5G architectural designs
 * 5G advancements in VANETs (Vehicular Ad Hoc Network)
 * Mobile edge computing
 * Security and Privacy in future Networks
 * Networking Protocols for Future Networks
 * Data Forwarding in Future Networks
 * New Applications for Future Networks
 * Transport Layer advancements in Future Networks
 * Cloud based IoT architectures and use cases

Internet of Things (IoT)
IoT refers to the network of physical objects embedded with electronics and other needed technology to enable these objects to collect and exchange data.

New multinational online oligarchies or monopolies may occur that are not restricted by one country.
 * Innovation can drop if there is a monopoly. There is therefore danger of one social networking site, search engine, browser creating a monopoly limiting innovation.
 * Tim Berners-Lee describes today’s social networks as centralized silos, which hold all user information in one place.
 * Web browsers (Microsoft)
 * Cloud computing is dominated by Microsoft.
 * Facebook is dominating social networking.
 * ISPs may favor some content over others.
 * Mobile phone operators blocking competitor sites.
 * Censorship of content.

Net Neutrality
The principle that Internet Service Providers (ISPs) and governments should treat all data and resources on the Internet the same, without discrimination due to user, content, platform, or other characteristics.

Decentralized Web
The term 'Decentralized Web' is being used to refer to a series of technologies that replace or augment current communication protocols, networks, and services and distribute them in a way that is robust against single-actor control or censorship.