The World of Peer-to-Peer (P2P)/Building a P2P System

= Building a P2P System =

The Peer
The Peer and the user running it, is the corner stone of all P2P systems, without peers you will not be able to create the Network, this seems obvious but it is very common to disregard the users needs and focus on the final objective, the Network itself, kind of looking to a forest and not seeing the trees.

Building communities
There is a benefit in incorporating the social element into the software. Enabling people to come together not only to share content or services but around a common goal were cooperation for an optimal state should remain the final objective.


 * A P2P application is a type of social software

Social software is defined as any software that promotes and enables social collaboration. This is of course part of what defines a modern p2p system, where the gatekeeper is no more and participants are free to interact on their own terms. The p2p applications becomes an enabling tool in this reality.

The world can increasingly be defined as a network of networks, all things are interconnected at some level. As ubiquity of Internet becomes a reality it not only makes communications speed increase but also the volume of sources this creates a quality assurance problem.

Online collaboration, built around the shared goal of a functional peer to peer network will help not only to improve the system but to establish a relation of trust that is followed by the emergence of a reputation latter. Across the participants and in the network and software itself, this personalization will ultimately enable to extend this trust and reputation to the multitude of relations that are possible in a peer to peer network.

Social Networks (person-to-person)
As communities emerge and aggregate, they increase the addressability not only of content as before but between producers and consumers that will self organize as to establishing of super and subgroups (many-to-many) based on personal preferences.

Enabling meaningful exchange of information and promoting increased collaboration also increases disponibility of rare or obscure content and the importance of microcontent, due to the removal for extraneous information, since content can consistently be directed to the right audience (pulled, not pushed).

Using a social software or any type of centralized social network permits the easy extraction of metrics on user actions this has been highly explored by corporations, that have for some time been waging war not only to get a share of the data such interactions generate but also for control of such information (ie: Google and Microsoft). Microsoft has a research project dedicated especially to this subject (Enterprise Social Computing).


 * Opening the walled gardens

A user oriented GUI
As you start to project your P2P application the GUI is what the users will have to interact with to use your creation, you should attempt to define not only what OS you will support but within what framework you can design the application to be used from a WEB browser or select a portable framework so you can port it to other systems.

Overwhelming a user with options is always a bad solution and will only be enticing to highly experienced users, even if it done based on what you like, you should keep in mind that you aren't creating it for your own use.
 * Become the user

Aside from the technical decisions the functionally you offer should also be considered, the best approach is to be consistent and offer similar options to existing implementations, other applications or even how it is normally done on the environment/OS you are using. There are several guidelines you may opt to follow for instance Apple provides a guideline for OSX ( http://developer.apple.com/documentation/UserExperience/Conceptual/OSXHIGuidelines/ ).
 * Simplicity is the ultimate sophistication


 * Content is king


 * The difference is in the little details


 * Guide the user

P2P Networks Topology
The topology of a P2P network can be very diverse, it depends on the medium it is run (Hardware), the size of the network (LAN,WAN) or even on the software/protocol that can impose or enable a specific network organization to emerge.

Below we can see the most common used topologies (there can be mixed topologies or even layered on the same network), most if not all peer to peer networks are classified as overlay networks, since they are created over an already existing network with its own topology.



Ring topology - Mesh topology - Star topology - Total Mesh or Full Mesh - Line topology - Tree topology - Bus topology - Hybrid topology

A fully connected P2P network is not feasible when there is a large numbers of peers participating.
 * In network of several thousands or millions of participants (n = participants)
 * Were a peer would have to handle O(n2) overall connections, it wouldn't scale.

Since most P2P networks are also a type of overlay networks, the resulting topology of a P2P system depends also on the protocol, the infrastructure (medium/base network) or by the interaction of peers. This "variables" add more layers to the complexity of the normal networks, where the basic characteristics are bandwidth, latency and robustness, making most P2P networks into self-organizing overlay networks.

When performing studies of P2P networks the resulting topology (structural properties) is of primary importance, there are several papers on the optimization or characteristics of P2P networks.

The paper Effective networks for real-time distributed processing ( http://arxiv.org/abs/physics/0612134 ) by Gonzalo Travieso and Luciano da Fontoura Costa, seems to indicate that  uniformly random interconnectivity scheme, is specific Erdős-Rényi (ER) random network model with fixed number of edges, as being largely more efficient than the scale-free counterpart, the Barabási-Albert(BA) scale-free model.

Self-Organizing Systems (SOS)
P2P systems fall by default in this category, taking in account that most try to remove the centralization or Control and Command (C&C) structure that exists in most centralized systems/networks.

The study of self-organizing systems is relatively new, and it applies to a huge variety of systems or structures from organizations to natural occurring events). Mostly done with math and depending on models and simulations, its accuracy depends on the complexity of the structure, number of intervenients and initial options.

This book will not cover this aspect of the P2P systems in great detail but some references and structural characteristics do run in parallel with the SOS concepts. For more information on SOS check out the Self-Organizing Systems (SOS) FAQ for USENET Newsgroup comp.theory.self-org-sys (http://www.calresco.org/sos/sosfaq.htm).

Synergy One of the emerging characteristic of SOS is synergy, where members aggregate around a common goal each working to the benefit of all. This is also present in most P2P systems were peers work to optimally share resources and improve the network. Swarm The effect generated of participants in a SOS, to aggregate and do complementary work so to share knowledge and behavior in a way it improves coordination. This is observable in the natural world on flocks of birds or social insects. We will covert this aspect of P2P later on, but keep it in mind that it related to SOS.

Bootstrap
Most P2P systems don't have (or need) a central server, but need to know an entry point into the network. This is what is called bootstrapping the P2P application. It deals with the connectivity of peers, to be able find and connect other P2P peers (the network) even without having a concrete idea who and what is where.

This is not a new problem; it is shared across several network technologies, that avoid the central point of failure of requiring a central server to index all participants - a gatekeeper.

One solution, Zero Configuration Networking (zeroconf) (see http://www.zeroconf.org/), compromises a series of techniques that automatically (without manual operator intervention or special configuration servers) creates a usable Internet Protocol (IP) network. These techniques are often used to help bootstrap, configure or open a path across routers and firewalls.

Hybrid vs real-Peer systems
One of the main objectives of the P2P system is to make sure no single part of it critical to the collective objective. By introducing any type of centralization to a peer-to-peer Network one is creating points of failure, as some Peers will be more than others this can even lead to security or stability problems, as with the old server-client model, were a single user could crash the server and deny its use to others in the network.

Availability
Due to the instability of a P2P network, where nodes are always joining and leaving, some efforts must be made to guarantee not only that the network is available and enabling new peers to join, but that the resources shared continue to be recognized or at least indexed while temporarily not available.

Integrity
Due to the open nature of peer-to-peer networks, most are under constant attack by people with a variety of motives. Most attacks can be defeated or controlled by careful design of the peer-to-peer network and through the use of encryption. P2P network defense is in fact closely related to the "Byzantine Generals Problem". However, almost any network will fail when the majority of the peers are trying to damage it, and many protocols may be rendered impotent by far fewer numbers.

Clustering
Computer science defines a computer cluster in general terms as a group of tightly coupled computers that work together closely so that in many respects they can be viewed as though they are a single computer.

The components of a cluster are commonly, connected to each other through fast networks and usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.

As we have seen before, this concept if applied to distributed networks or WANs (in place of LANs), generates distributed computations, grids and other systems. All of those applications are part of the P2P concept.

We loosely define clusters as a physical, social or even economical/statistical event. That is is defined by the aggregation of entities due to sharing a property in common, that property may be a shared purpose or a characteristic, or any other communality.

As we look at the topologies generated by P2P networks we can observe that most protocols generate some kind of clustering around networks structures, resources and they can even emerge as a result of the status of network      conditions. Clustering is then an unsupervised learning problem, an automatic emerging event that results on the creation of ad hoc collection of unlabeled objects (data/items or events). For more information on clusters you may check A Tutorial on Clustering Algorithms ( http://home.dei.polimi.it//matteucc/Clustering/tutorial_html/ ).

A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters using the same set of characteristics.

This concept is very important on not only the form of P2P networks but has also implications on the social structure/relations that can be build upon the use of P2P applications.

Flashcrowds
Flashcrowd is a behavioral model in that participants will tend to aggregate/crowd around an event, in P2P terms this can be a scarce resource, for instance as we will see later BitTorrent promotes big files over small ones and new over old, this is a result of a connection to the network based on single items, the speeds on BitTorrent does depend uppermost on generating flashcrowds around files (more peers, more speed that will snow ball in more seeders).

This can also result in a DoS (Denial of Service) or flood. For instance, if a P2P system is poorly designed, attempting to connect a significant number of peers to the network may disturb the bootstrap method used.

You should educate users, more connections doesn't equate to more speed, at most it will result in more responses to queries but that may depend on how the protocol is structured, but enabling them will cost bandwidth. On the other side more peers will result on more resources to be shared that will also result in overlapping of shares and so better speeds, as a P2P network gets bigger the better it can provide for its users.

Optimizations
There are simple optimizations that should be done in any P2P Protocol that could bring a benefit to both peers and the network in general. These are based on system metrics or profile like IP (ISP or range), content, physical location, share history, ratios, searches and many other variables.

Most of the logic/characteristics of P2P networks and topologies (in a WAN environment) will result in aggregation of peers and so this clusters will share the same properties of Distributed Behavioral Models, like Flocks, Herds and Schools this results in an easier to study environment and to establish correlation about the peers relations and extrapolate ways to improve efficiency.


 * Alignment
 * To get information/characteristics from the local system/environment in order to optimize the peer "location" on the network by selecting and optimize the separation and cohesion functions to improve the local neighborhood.


 * Separation
 * To implement a way to avoid crowding with other peers based on unwanted alignment.


 * Cohesion
 * To select peers based on their own alignment.

P2P Networks Traffic
The P2P traffic detected on the Internet due to the nature of the protocols and topology used some times can only be done by estimation based on the perceived use (users on-line, number of downloads of a given implementation) or by doing point checks on the networks itself. It is even possible to access the information if the implementation on the protocol or application was done with this objective in mind, several implementations of Gnutella a for example have that option and Bearshare did even reports some of the users system parameters, like type of firewall etc...

Examples of services that provide such traffic information over P2P networks are for instance Cachelogic ( http://www.cachelogic.com/research/2005_slide16.php# ).

One of your most important considerations is how you project the way peers will communicate, even if we discard the use of central servers like SuperNodes and multiple distributed clients as Peer/Nodes there will be several questions to consider:
 * Communication


 * Will communication need to go across firewalls and proxy servers?
 * Is the network transmission speed important? Can it be configured by the user?
 * Will communications be synchronous or asynchronous?
 * Will it need/use only a single port? what port should we use?
 * What resources will you need to support? Is there size limits? Should compression be used ?
 * Will the data have to be encrypted?
 * etc...

One must carefully consider your project's specific goals and requirements, this will help you evaluate the use of toolkits and frameworks. Try not to reinvent the wheel if you can't came up with a better solution or have the time,capacity or disposition to. You can also use open standards (ie: use the HTTP protocol) to, but are also free to explore other approaches.

Today even small LANs will have at least a firewall and probably a router even if all components are under the user's control, some user will just lack the knowledge on how to set them properly. Another consideration is on the simple configuration requirements of the application, most users will have problems dealing with technical terms and dependencies, even new versions of the Windows OS will have a default enabled firewall, those may prove to be a unsurmountable barrier for users, thankfully there are some tools available to make life easier for all.

UPnP
The Universal Plug and Play (UPnP) architecture consists in a set of open standards and technologies promulgated by the UPnP Forum ( http://www.upnp.org/ ), with the goal of extending the Plug and Play concept to support networks and peer-to-peer discovery (automatic discovered over the network. wired or wireless), configuration and control and so enable that appliances, PCs, and services be able to connect transparently. Permitting any UPnP device to dynamically join a network, obtain its IP address and synchronize capabilities (learn from and inform other devices). It can also be seen in general terms as similar to a distributed Simple Network Management Protocol (SNMP).

As UPnP is offered in most modern routers, network devices and it is also supported by Microsoft since Windows XP. While supposedly aimed to address the problem for network programs users in accepting incoming connection from the Internet ("port forwarding" or "NAT traversal"), as it would remove the necessary step in configuring router to accept incoming connections and then route them to the LAN's local machine behind the router, something that is hard to explain and for the common user to understand.

All this makes it a necessity that a Windows P2P application should support this architecture programmatically so to avoid the requirements that users deal with the necessary changes when UPnP is enabled (by default is should be disable due to security risks). With the added necessity that if there is a firewall on the local machine that doesn't conform to UPnP, it must have its configuration changed so to enable the necessary TCP (port 2859) and UDP (port 1900) communications for UPnP.

Microsoft provides the UPnP Control Point API. It is available in Windows Me, CE .Net, XP and later in the system services “SSDP Discovery Service” (ssdpsrv) and “Universal Plug and Play Device Host” (upnphost) or by COM libraries. It can be used in C++ or Visual Basic applications or in scripts embedded in HTML pages.

Further information on the UPnP technology on Windows can be gathered on this sources:
 * Programming control point application using the UPnP Control Point API ( http://www.codeproject.com/KB/IP/upnplib.aspx ) by amatecki
 * Using UPnP for Programmatic Port Forwardings and NAT Traversal ( http://www.codeproject.com/KB/IP/PortForward.aspx ) by By Mike O'Neill


 * Freely usable implementation
 * CyberLink ( http://sourceforge.net/projects/clinkcc) for C++ is a development package for UPnP programmers. Using the package, you can create UPnP devices and control points easily. released under the BSD License.
 * MiniUPnP Project ( https://miniupnp.tuxfamily.org ) open source C implementation under a BSD compatible license.
 * GUPnP ( http://www.gupnp.org ), an object-oriented open source framework for creating UPnP devices and control points, written in C using GObject and libsoup. It provides the same set of features as libupnp, but shields the developer from most of UPnP's internals. Released under the GNU LGPL.
 * UPNPLib ( http://www.sbbi.net/site/upnp ) open source Java implementation under a Apache Software License.

Security Considerations
By using a P2P system users will broadcast their existence to others, this in contrast to a centralized service were they may interact with others but their anonymity can be protected.

Violating the security of a network can be a crime, for instance the 2008 case of the research project from University of Colorado and University of Washington, the researchers engaged in the motorization of users across the Tor anonymous proxy network and could have faced legal risks for the snooping.

This vulnerability of distributed communications can result in identity attacks (e.g. tracking down the users of the network and harassing or legally attacking them), DoS, Spamming, eavedropping and other threats or abuses. All this actions are generally targeted to a single user and some may even be automated, there are several actions the creator can take to make it more difficult but ultimately they can't be stopped and should be expected and dealt with, one of the first steps is to provide information to the user so they can locally implement hardware or software actions and even a social behavior to counteract this abuse.

DoS (denial of service), Spamming
Since each user is a "server" they are also prone to denial of service attacks (attacks that may, if optimized, make the network run very slowly or break completely), the result may depend on the attacker resources and how the decentralized is the P2P protocol on the other hand to be the target of spam (e.g. sending unsolicited information across the network- not necessarily as a denial of service attack) does only depend how visible and contactable you are, if for instance other users can send messages to you. Most P2P applications support some kind of chat system and this type of abuse is very hold on such system, they can address the problem but will complete solve it, what can lead to social engineering attacks were users can be lead to perform actions that will compromise them or their system, on this last point only giving information to users that enables them to be aware of the risk will work.

Software traffic control
Since most Network applications and in specific P2P tools are prone to be a source of security problems (they will bypass some of the default security measures from inside), when using or creating such a tool one must take care on granting the possibility or configuring the system to be as safe as possible.

Tools for security
There are several tools and options that can be used for this effect, be it configuring a firewall, adding a IP blocker or making sure some restrictions are turned on by default as you deploy your application.


 * PeerGuardian 2 ( http://phoenixlabs.org/pg2/ ) a OpenSource tool produced by Phoenix Labs’, consisting in a IP blocker for Windows OS that supports multiple lists, list editing, automatic updates, and blocking all of IPv4 (TCP, UDP, ICMP, etc)
 * PeerGuardian Lite ( http://phoenixlabs.org/pglite/ ) a version of the PeerGuardian 2 that is aimed at having a low system footprint.

Blocklists
Blocklists are text files containing the IP addresses of organizations opposed and actively working against file-sharing (such as the RIAA), any enterprise that mines the networks or attempts to use resources without participating in the actual sharing of files. It is basically a spam filter like the ones that exist for eMail systems.

Firewalls
As computers attempt to be more secure for the user, today's OS will provide by default some form of external communication restriction that will permit the user to define different levels of trust, this is called a Firewall. A Firewall may have hardware or software implementation and is configured to permit, deny, or proxy data through a computer network. Most recent OSs will come with a software implementation running, since a connection to the Internet are becoming common and the lack or even the default configuration of the Firewall can cause some difficulties to the use of P2P applications.

on Windows
Microsoft in the last OS releases has taken the option for the security of user but without his intervention to include and enabling by default a simple firewall solution since Windows XP SP2. This blocks any incoming (HTTP over port 80 or mail over ports 110 or 25), the classification of "unsolicited messages" is a bit over empathized since the messages can well be part of a P2P (or any other type of distributed) network. This can also have implications on blocking UPnP capabilities on the local machine.


 * ICF (Internet Connection Firewall) is included with Microsoft's Windows XP, Windows Server 2003, and Windows Vista operating systems.

NAT
NAT (network address translation)

NAT Traversal
Users behind NAT should be able to connect with each other, there are some solutions available that try to enable it.

STUN
STUN (Simple Traversal of UDP over NATs) is a network protocol which helps many types of software and hardware receive UDP data properly through home broadband routers that use NAT ).

Quoted from its standard document, RFC 3489:


 * "Simple Traversal of User Datagram Protocol (UDP) Through Network Address Translators (NATs) (STUN) is a lightweight protocol that allows applications to discover the presence and types of NATs and firewalls between them and the public Internet.


 * It also provides the ability for applications to determine the public IP address allocated to them by the NAT.


 * "STUN works with many existing NATs, and does not require any special behavior from them. As a result, it allows a wide variety of applications to work through existing NAT infrastructure."

As STUN RFC states this protocol is not a cure-all for the problems associated with NAT but it is particularly helpful for getting voice over IP working through home routers. VoIP signaling protocols like SIP use UDP packets for the transfer of sound data over the Internet, but these UDP packets often have trouble getting through NATs in home routers.

STUN is a client-server protocol. A VoIP phone or software package may include a STUN client, which will send a request to a STUN server. The server then reports back to the STUN client what the public IP address of the NAT router is, and what port was opened by the NAT to allow incoming traffic back in to the network.

The response also allows the STUN client to determine what type of NAT is in use, as different types of NATs handle incoming UDP packets differently. It will work with three of four main types: full cone NAT, restricted cone NAT, and port restricted cone NAT. It will not work with symmetric NAT (also known as bi-directional NAT) which is often found in the networks of large companies.



Port Forwarding
If a peer is behind a router or firewall (using NAT) it may be necessary to configure it by hand, as to allow the P2P programs to function properly, this can be a daunting task for those technically challenged and can even have an impact in the adoption of the P2P application. This can be resolved automatically using some solutions like UPnP.

Port forwarding or port mapping, as it is sometimes referred, is the act of forwarding a network port from one network node to another, virtually creating a path across the network, in the case at hand, from the Internet side that is connected to the router or firewall to a computer inside the LAN permuting an external user to reach a port on a private IP address (inside a LAN) from the outside. Otherwise the computer inside the LAN couldn't be accessible and P2P wouldn't be able to fully work (it could contact outside machines but it wouldn't be fully visible from the outside).

Due to the great variety of different implementations of GUIs that ultimately permits the same functionality, making the needed changes isn't a trivial process and one solution will not fit all instances.

The first fact that one should be aware of is that routers or the computer that is connected to the Internet will have a distinct IP addresses, one is provided by your ISP and is visible from the Internet the other will be visible only inside the LAN, this one can be assigned by the user or will default to a factory or Operative System factory setting.

To perform any changes to the configuration of the router you will need to know the IP that you can access its configuration page and login into it, if using a computer this same task can be done locally or remotely if the software allows it. Some minimum knowledge on how IPs and ports, even protocols (some configurations permit to distinguish between TCP and UDP packets) are used is required from the user to perform those tasks, as such this can become complicated to the normal user.

Unique ID
Numbers control your life. Over time anyone identified by multitude of numbers. Like a phone number, your credit card numbers, the driver's license number, the social security number, even zip codes or your car license plate. They way this unique numbers are created, attributed and verified is fascinating. The Unique ID site (http://www.highprogrammer.com/alan/numbers/) by Alan De Smet provides information on the general subject. To our particular subject what is relevant is the algorithms behind it all. How to establish a unique identifiers for users and resources.

Universally Unique Identifiers (UUID) / Globally Unique Identifier (GUID)
For a P2P protocol/application to be able to manage user identification, authentication, build a routing protocol, identify resources etc... there is a need for (a set of) Unique Identifiers. While a true peer to peer protocol doesn't intend to establish a centralized service, it can frequently make use of an already established such service. So for instance, Usenet makes use of domain names to create globally unique identifier for articles. If the protocol refrains from even make use of such a service then uniqueness can only be based on random numbers and mathematical probabilities.

The problem of generating unique IDs can be broken down as uniqueness over space and uniqueness over time which, when combined, aim to produce a globally unique sequence. This leads to a problem detected over some P2P networks using Open Protocols/Multiple vendors implementations, due to the use of different algorithms on the generation of the GUIDs the uniqueness over space is broken leading to sporadic collisions.

UUIDs are officially and specifically defined as part of the ISO-11578 standard other specifications also exist, like RFC 4122, [http://www.itu.int/ITU-T/studygroups/com17/oid.html ITU-T Rec. X.667].

Examples of uses


 * 1) Usenet article IDs.
 * 2) In Microsoft's Component Object Model (COM) morass, an object oriented programming model that incorporates MFC (Microsoft Foundation Classes), OLE (Object Linking Embedding), ActiveX, ActiveMovie and everything else Microsoft is hawking lately, a GUID is a 16 byte or 128 bit number used to uniquely identify objects, data formats, everything.
 * 3) The identifiers in the windows registry.
 * 4) The identifiers used in used in RPC (remote procedure calls).
 * 5) Within ActiveMovie, there are GUID's for video formats, corresponding to the FOURCC's or Four Character Codes used in Video for Windows. These are specified in the file uuids.h in the Active Movie Software Developer Kit (SDK). ActiveMovie needs to pass around GUID's that correspond to the FOURCC for the video in an AVI file.

Security There is a know fragility on UUIDs of version 1 (time and node based), as they broadcast the node's ID.

Software implementation
Programmers needing to implement UUID could take a look on these examples:


 * OSSP uuid ( http://www.ossp.org/pkg/lib/uuid/ ) is an API for ISO C, ISO C++, Perl and PHP and a corresponding CLI for the generation of DCE 1.1, ISO/IEC 11578:1996, and RFC4122 compliant Universally Unique Identifiers (UUIDs). It supports DCE 1.1 variant UUIDs of version 1 (time and node based), version 3 (name based, MD5), version 4 (random number based), and version 5 (name based, SHA-1). UUIDs are 128-bit numbers that are intended to have a high likelihood of uniqueness over space and time and are computationally difficult to guess. They are globally unique identifiers that can be locally generated without contacting a global registration authority. It is Open Sourced under the MIT/X Consortium License.

Hashes, Cryptography and Compression
Most P2P systems have to implement at least one algorithms for Hashing, D/Encryption and De/Compression, this section will try to provide some ideas of this actions in relation to the P2P subject as we will address this issues later in other sections.

Detailed information on the subject can be found on the Cryptography Wikibook ( http://en.wikibooks.org/wiki/Cryptography ).

One way of creating structured P2P networks is by maintaining a Distributed Hash Table (DHT), that will server as a distributed index of the resources on the network.

Another need for cryptography is in the protection of the integrity of the distributed resources themselves, to make them able to survive an attack most implementations of P2P some kind of Hash function (MD5, SHA1) and may even implement a Hash tree designed to detect corruption of the resource content as a hole or of the parts a user gets (for instance using Tiger Tree Hash).

Hash function
A hash function is a reproducible method of turning some kind of data into a (relatively) small number that may serve as a digital "fingerprint" of the data. The algorithm substitutes or transposes the data to create such fingerprints. The fingerprints are called hash sums, hash values, hash codes or simply hashes. When you referring to hash or hashes some attention must be given since it can also mean the hash functions.

A group of hashes (the result of applying an hash function to data) will sometimes be referred as a bucked or more properly a hash bucket. Most hash buckets if generated by a non colliding hash function will generate distinct hashes for a give data input, there are other characteristics that should be considered when selecting an hash function but it goes beyond the scope of this book, just remember to check if the hash function/algorithm you are implementing satisfies your requirements. Wikibooks has several books that covers hashes in some way, you can check more about the subject in Algorithm implementation or Cryptography



Hash sums are commonly used as indices into hash tables or hash files. Cryptographic hash functions are used for various purposes in information security applications.

A good hash function is essential for good hash table performance. A poor choice of a hash function is likely to lead to clustering, in which probability of keys mapping to the same hash bucket (i.e. a collision) is significantly greater than would be expected from a random function. A nonzero collision probability is inevitable in any hash implementation, but usually the number of operations required to resolve a collision scales linearly with the number of keys mapping to the same bucket, so excess collisions will degrade performance significantly. In addition, some hash functions are computationally expensive, so the amount of time (and, in some cases, memory) taken to compute the hash may be burdensome.
 * Choosing a good hash function

One of the first hash algorithms used to verify the integrity of files in p2p systems (and in file transfers in general) was the MD5 created in 1992 (see rfc1321 The MD5 Message-Digest Algorithm). But as all hash most algorithms after some time some weaknesses where found, this sequence was repeated with the SHA1 algorithm and there is a high probability that others will fallow. Selecting the right tool for the job isn't enough, the programmer must continuously examine how the security of his choice is holding on in regards to the requirements placed on the selected hash algorithm.

Simplicity and speed are readily measured objectively (by number of lines of code and CPU benchmarks, for example), but strength is a more slippery concept. Obviously, a cryptographic hash function such as SHA-1 (see Secure Hash Standard FIPS 180-1) would satisfy the relatively lax strength requirements needed for hash tables, but their slowness and complexity makes them unappealing. However, using cryptographic hash functions can protect against collision attacks when the hash table modulus and its factors can be kept secret from the attacker, or alternatively, by applying a secret salt. However, for these specialized cases, a universal hash function can be used instead of one static hash.

In the absence of a standard measure for hash function strength, the current state of the art is to employ a battery of statistical tests to measure whether the hash function can be readily distinguished from a random function. Arguably the most important test is to determine whether the hash function displays the avalanche effect, which essentially states that any single-bit change in the input key should affect on average half the bits in the output. Bret Mulvey advocates testing the strict avalanche condition in particular, which states that, for any single-bit change, each of the output bits should change with probability one-half, independent of the other bits in the key. Purely additive hash functions such as CRC fail this stronger condition miserably.

For additional information on Hashing:
 * Hashing Function Lounge it includes a summary of what algorithms have known security flaws.

Implementing a Hash algorithm
Most hash algorithms have an high degree of complexity and are designed for a specific target (hashing function) that may not apply, with the same level of guarantees, in other tasks. These algorithms (or raw descriptions of tem) are freely accessible, you can implement your own version or select to use an already existing and tested implementation (with a note for security concerns). Some examples of publicly available implementations are available at the Cryptography Wikibook.

There is a need for consistent results and repeatability, as you select and implement your hash algorithm, remember to run it over a batch of test vectors. If you are using a test framework this should be added to your tests.

Hash tree (Merkle trees)
In cryptography, hash trees (also known as Merkle trees, invented in 1979 by Ralph Merkle) are an extension of the simpler concept of hash list, which in turn is an extension of the old concept of hashing. It is a hash construct that exhibits desirable properties for verifying the integrity of files and file subranges in an incremental or out-of-order fashion.

Hash trees where the underlying hash function is Tiger ( http://www.cs.technion.ac.il/~biham/Reports/Tiger/ ) are often called Tiger trees or Tiger tree hashes.

The main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network are received undamaged and unaltered, and even to check that the other peers do not send adulterated blocks of data. This will optimize the use of the Network and permit to quickly exclude adulterated content in place of waiting for the download of the hole file to complete to check with a single hash, an partial or complete hash tree can be downloaded and the integrity of each branch can be checked immediately (since they consist in "hashed" blocks or leaves of the Hash tree), even though the whole tree/content is not available yet, making also possible for the downloading peer to upload blocks of an unfinished files.

Usually, a cryptographic hash function such as SHA-1, Whirlpool, or Tiger is used for the hashing. If the hash tree only needs to protect against unintentional damage, the much less secure checksums such as CRCs can be used.

In the top of a hash tree there is a top hash (or root hash or master hash). Before downloading a file on a P2P network, in most cases the top hash is acquired from a trusted source (a Peer or a central server that has elevated trust ratio). When the top hash is available, the hash tree can then be received from any source. The received hash tree is then checked against the trusted top hash, and if the hash tree is damaged or corrupted, another hash tree from another source will be tried until the program finds one that matches the top hash.

This requires several considerations:
 * 1) What is a trusted source for the root hash.
 * 2) A consistent implementation of the hashing algorithm (for example the size of the blocks to be transferred must be known and constant on every file transfer).

Tiger Tree Hash (TTH)
The Tiger tree hash is one of most useful form of hash tree on P2P Networks. Based on the cryptographic hash Tiger created in 1995 by Eli Biham and Ross Anderson (see http://www.cs.technion.ac.il/~biham/Reports/Tiger/ for the creator's info and C source example). The hash algorithm was designed with modern CPU in mind, in particular when dealing with working in 64-bit, it is one of the fastest and secure hashes on 32-bit machines.

The TTH uses a form of binary hash tree, usually has a data block size of 1024-bytes and uses the cryptographically secure Tiger hash.

Tiger hash is used because it's fast (and the tree requires the computations of a lot of hashes), with recent implementations and architectures, TTH is as fast as SHA1, with more optimization and the use of 64-bit processors, it will become faster, even though it generates larger hash values (192 bits vs. 160 for SHA1).

Tiger tree hashes are used in the Gnutella, Gnutella2, and Direct Connect and many other P2P file sharing protocols and in file sharing in general.

A step by step introduction to the uses of the TTH was available as part of the Tree Hash Exchange (THEX) format ( see page at the WEB Archive ).

Hash table
In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key, find the corresponding value. It works by transforming the key using a hash function into a hash, a number that is used to index into an array to locate the desired location ("bucket") where the values should be.



Hash tables support the efficient addition of new entries, and the time spent searching for the required data is independent of the number of items stored (i.e. O(1).)

In P2P system Hash tables are used locally on every client/server application to perform the routing of data or the local indexing of files, this concept is taken further as we try to use the same system in a distributed way, in that case distributed hash tables are used to solve the problem.

Distributed Hash Table (DHT)
The Distributed hash tables (DHTs) concept was made public in 2001 but very few did publicly-release robust implementations.

Protocols

 * Content addressable network (CAN)
 * Chord ( http://pdos.csail.mit.edu/chord/ ) - aims to build scalable, robust distributed systems using peer-to-peer ideas. It is completely decentralized and symmetric, and can find data using only log(N) messages, where N is the number of nodes in the system. Chord's lookup mechanism is provably robust in the face of frequent node failures and re-joins. A single research implementation is available in C but there are other implementations in C++, Java and Python.
 * Tulip
 * Tapestry
 * Pastry
 * Bamboo (http://bamboo-dht.org/) - based on Pastry, a re-engineering of the Pastry protocols written in Java and licensed under the BSD license.

Encryption
A part of the security of any P2P Network, encryption is needed to make make sure only the "allowed" parties have access to sensitive data. Examples are the encryption of the data on a server/client setup (even on P2P) were clients could share data without fear of it being accessed on the server (a mix of this is if the Network would in itself enable a distributed cache mechanism for transfers, server-less), encryption of transfers, to prevent man-in-the-middle attacks, or monitor of data (see FreeNet) and many other applications with the intent of protecting the privacy and enable an extended level of security to Networks.

There are several algorithm that can be used to implement encryption most used by P2P project include: BlowFish.

Resources (Content, other)
P2P applications can be used to share any type of digital assets in the form of packed information that can be uniquely identified. P2P applications are well know for sharing files, this raises several issues and possibilities.

Metadata
Metadata (meta-data, or sometimes meta-information) is "data about data". An item of metadata may describe an individual datum, or content item, or a collection of data including multiple content items and hierarchical levels, for example a database schema.

Data is the lowest level of abstraction for knowledge, information is the next, and finally, we have knowledge highest level among all three. Metadata consists on direct and indirect data that help define the knowledge about the target item.

The hierarchy of metadata descriptions can go on forever, but usually context or semantic understanding makes extensively detailed explanations unnecessary.

The role played by any particular datum depends on the context. For example, when considering the geography of London, "E83BJ" would be a datum and "Post Code" would be metadatum. But, when considering the data management of an automated system that manages geographical data, "Post Code" might be a datum and then "data item name" and "5 characters, starting with A – Z" would be metadata.

In any particular context, metadata characterizes the data it describes, not the entity described by that data. So, in relation to "E83BJ", the datum "is in London" is a further description of the place in the real world which has the post code "E83BJ", not of the code itself. Therefore, although it is providing information connected to "E83BJ" (telling us that this is the post code of a place in London), this would not normally be considered metadata, as it is describing "E83BJ" qua place in the real world and not qua data.


 * Difference between data and metadata

Usually it is not possible to distinguish between (plain) data and metadata because:


 * Something can be data and metadata at the same time. The headline of an article is both its title (metadata) and part of its text (data).
 * Data and metadata can change their roles. A poem, as such, would be regarded as data, but if there were a song that used it as lyrics, the whole poem could be attached to an audio file of the song as metadata. Thus, the labeling depends on the point of view.

These considerations apply no matter which of the above definitions is considered, except where explicit markup is used to denote what is data and what is metadata.


 * Hierarchies of metadata

When structured into a hierarchical arrangement, metadata is more properly called an ontology or schema. Both terms describe "what exists" for some purpose or to enable some action. For instance, the arrangement of subject headings in a library catalog serves not only as a guide to finding books on a particular subject in the stacks, but also as a guide to what subjects "exist" in the library's own ontology and how more specialized topics are related to or derived from the more general subject headings.

Metadata is frequently stored in a central location and used to help organizations standardize their data. This information is typically stored in a Metadata registry.


 * Schema examples

A good example on work being done on Metadata and how to make it accessible and useful can be examined in the W3C's old page Metadata Activity Statement (http://www.w3.org/Metadata/Activity.html) and the Resource Description Framework (RDF), that by the use of a declarative language, provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. The RDF has it's own and more up to data page at http://www.w3.org/RDF/.

Metadata registries

 * Free Services
 * MusicBrainz ( http://musicbrainz.org/ ) is a community music metadatabase (nonprofit service) that attempts to create a comprehensive music information site. You can use the MusicBrainz data either by browsing this web site, or you can access the data from a client program — for example, a CD player program can use MusicBrainz to identify CDs and provide information about the CD, about the artist or about related information. MusicBrainz is also supporting MusicIP's Open FingerprintTM Architecture, which identifies the sounds in an audio file, regardless of variations in the digital-file details. The community provides a REST styled XML based Web Service.

Files
Users of a P2P application may select files as the resource to be shared across the network. The simplest and secure method is to just share the content of files in a immutable way, most P2P applications cover this need, in any case other simultaneous models can be build around the file resource and the number of possible state of those files, being the most complex the implementation of a distributed filesystem.

Sharing Files
Sharing Files on a P2P network consists in managing (indexing, enable searches and transfer) two distinct resources, the local files and the remote files.

Local Files
If the a P2P application has support for file sharing. The most common method for sharing of files is to enable the user to select one or more directories of the local file system volumes. This includes providing, in the application preferences, the capability to select the path(s) to the files to be shared and to where to put the downloaded files.

It is also common practice to automatically add the download directory to the list of shared paths, or the downloaded files, so that the peer will immediately contribute to the replication of the downloaded resources.

Monitor for Changes
After having the resources determined by the user, it is up to the application to as it is running verify any deletion, renaming or write action in general to any of the shared files since this could constitute a change to the file status (content change) or any addition of files, or alterations to the shared resources in general. This would could invalidate previously generated hashes or indexes. This problem can be addressed by generating an hash for the file content and use the OS control over the file-system to monitor changes as they happen (for instance on Windows one could use the Win32 API ReadDirectoryChangesW).

Depending on the requirements of the implementation, it may be beneficial to use multiple content hashes, a stronger to be used across the network and s faster and lower quality just to monitor local changes.

Since the application is not expected to be running all the time, and able to consistently monitor any changes done to the resources one must take steps to maintain further integrity. Because if the application is closed it will have no way to continue to monitor for changes and it will be up to the OS to permit some kind of API to detect file changes, like the last write access timestamp. So the steps needed to guarantee integrity each time the application is run is:
 * Detected shared files and changes to them


 * 1) Verify that the resources (know files and shared paths) remain valid (they still exist and are unaltered).
 * 2) Update the local index to reflect what was found in the above step (remove any invalid directories or volumes from the shared list). One can even prompt the user for corrective actions, as some my be removable media (DVDs etc...)
 * 3) Start monitoring new changes.
 * 4) Check all the resources content for changes since the last time the application exited, this may mean that each file must be checked individually for the last write date (most modern OS permit this). Depending on how it is implemented the next to steps can be included so to reduce duplication of work.
 * 5) Verify that all previous indexed files are still present.
 * 6) Verify for the existence of new files.

There are at least two paths with an higher degree of usefulness for the P2P application. The path were the application resides, since it can be needed for updating or for any required security checks and even possibly to serve as the base to locate the application preferences (if not using another setup, like the system registry on the Windows OS) and/or the download directory, a write enabled directory, where all downloads will be put.
 * Local paths of significance

External Files
This are the files that aren't directly under the local user control.

File Filtering
Most P2P application do not enable selecting single files for sharing but do it in bulk by enabling the selection of specific directories, there is a need to provide a mechanism to exclude specific files from being shared, by the will of the user, the specific of the network or even to reduce the pollution or the filesharing system burden in handling unwanted files. Be it by size, extension or even the intrinsic content of the file having the ability to weed them out is useful.

Indexing
Indexing, the task of indexing resources, like shared file, has the objective of enabling the application to know what resources are available to be shared on the network.

Indexing occurs each time the application is run, since the resources can only be monitored for changes while the application runs and when a change to the shared resources preferences is performed. This also makes it a necessary to have during the application run-time a function that monitors changes to resources.

Most P2P application also support the exclusion or filtering of shared resources, for instance if a directory is selected to be shared but some of the content needs to be excluded.

Leechers
In Nature, cooperation is widespread but so too are leechers (cheats, mutants). In evolutionary terms, cheats should indeed prosper, since they don't contribute to the collective good but simply reap the benefits of others’ cooperative efforts, but they don't. Both compete for the same goal using different strategies, cooperation is the path of less cost to all (even for leechers on the long run), cooperation provides stability and previsibility and on the other hand if cheats are not kept in some sort of equilibrium they generate a degradation of the system that can lead to its global failure.

In computer science and especially on the Internet, being a leech or leecher refers to the practice of benefiting, usually deliberately, from others' information or effort but not offering anything in return, or only token offerings in an attempt to avoid being called a leech. They are universally derided.

The name derives from the leech, an animal which sucks blood and then tries to leave unnoticed. Other terms are used, such as freeloader, but leech is the most common.

Examples


 * On peer to peer networks, a leecher shares nothing (or very little of little worth) for upload. Many applications have options for dealing with leeches, such as uploading at reduced rates to those who share nothing, or simply not allowing uploads to them at all. Some file-sharing forums have an anti-leech policy to protect the download content, where it will require users to expend more energy or patience than most leechers are willing to before they can access the "download area".


 * Most BitTorrent sites refer to leeches as clients who are downloading a file, but can't seed it because they don't have a complete copy of it. They are by default configured to allow a certain client to download more when they upload more.


 * When on a shared network (Such as a school or office LAN), any deliberate overuse of bandwidth (To the point at which normal use of the network would be noticeably degraded) can be called leeching.


 * In online computer games (especially role-playing games), leeching refers to the practice of a player joining a group for the explicit purpose of gaining rewards without contributing anything to the efforts necessary to acquire those rewards. Sometimes this is allowed in an effort to power-level a player. Usually it is considered poor behavior to do this without permission from the group. In first person shooters the term refers to a person who benefits by having his team mates carry him to a win.


 * Direct linking is a form of bandwidth leeching that occurs when placing an unauthorized linked object, often an image, from one site in a web page belonging to a second site (the leech). This constitutes an unauthorized use of the host site's bandwidth and content.

In some cases, leeching is used synonymously with freeloading rather than being restricted to computer contexts.

Trust & Reputation
Due the volatility and modularity of peer to peer networks, peer trust (mentioned in the reference to building communities) and user participation is one of the few motivations or lubricant for online transactions of services or resources that falls ultimately to the creator. Having an average good reputation in participants will also boost the trust on the network and the systems built upon it.

File sharing
Traditionally, file transfers involve two computers, often designated as a client and a server and most operations are for the copying files from one machine to another.

Most WEB and FTP servers are punished for being popular. Since all uploading is done from one central place, a popular site needs more resources (CPU and bandwidth) to be able to cope. With the use of P2P, the clients automatically mirror the files they download, easing the publisher's burden.

One limitation of most P2P protocols is that they don't provide a complex file-system emulation or a user-right system (permissions), so complex file operations like NFS or FTP protocols provide are very rare, this also has a reason to be, since the networks is decentralized a system for the authentication of users is hard to implement and most are easy to break.

Another concept about P2P transfers is the use of bandwidth, things will not be linear, transfers will depend on the availability of the resource, the load of the seeding peers, size of the network and the local user connection and load on its bandwidth.

As seen before, Downloading files with a restrictive copyright, license or under a given country law, may increase the risk of being sued. Some of the files available on these networks may be copyrighted or protected under the law. You must be aware that there is a risk involved.

from multiple sources (segmented downloading, swarm)
Multiple source download, (segmented download, swarming download), can be a more efficient way of downloading files from many peers at once. The one single file is downloaded, in parallel, from several distinct sources or uploaders of the file. This can help a group of users with asymmetric connections, such as ADSL to provide a high total bandwidth to one downloader, and to handle peaks in download demand. All swarm transfers depends on at least the existence of one full complete copy of the wanted file (a possible evolution could include adding some sort of Reed–Solomon (RS) Algorithm into the mix) and it is mostly useful for large files (size depending on the available bandwidth, since the trade off includes a higher cost in CPU and extraneous data transfers to make the system work.

This technique can not magically solve the problem, in a group of users that has insufficient upload-bandwidth, with demand higher than supply. It can however very nicely handle peaks, and it can also to some degree let uploaders upload "more often" to better utilize their connection. However, naive implementations can often result in file corruption, as there is no way of knowing if all sources are actually uploading segments of the same file. This has led to most programs using segmented downloading using some sort of checksum or hash algorithm to ensure file integrity.

Preview while Downloading
Is may help to let users preview files before the downloading process finishes and as soon as possible, this will improve the quality of shares on the network increasing users confidence and reducing lost time and bandwidth.

Poisioning and Pollution
Examples include:
 * poisoning attacks (e.g. providing files whose contents are different from the description)
 * polluting attacks (e.g. inserting "bad" chunks/packets into an otherwise valid file on the network)
 * insertion of:
 * viruses to carried data (e.g. downloaded or carried files may be infected with viruses or other malware)
 * malware in the peer-to-peer network software itself (e.g. distributed software may contain spyware)

Rights Protection
Protecting the copyrights over content shouldn't be imposed on the public, even if most copyrights policies today have removed the need to state that a work has reserved rights, there shouldn't be any expectation that the general public would have to go out of way to protect the benefits of a minority even more if they aren't clearly stated. It is expected that the works in the public domain will today outnumber works that have valid copyrights.

Given the actual situation, any p2p application will have the moral duty to inform users of the issues risen due to this very confusing situation. Some solutions to ease the problem and help copyright holders to protect their works have over time came forward but no interest seems to have been raised by those that should care about the issue. This section will try to show the possibilities available.

A simple solution to this issue would be to clearly put the responsibility of identifying work under copyright to the right owners, to implement this solution a database needs to be freely available to p2p applications so content can be declared as copyrighted or a standard must be created to clearly identify those works. In response to this problem several solution were discussed and proposed but in July 2003, a solution by the name of BluFilter was put forward by the Kokopelli Network Inc. (co-founded by Alex Sauriol), that would permit to identify the copyright status, by comparing the waveforms that are stored within the mp3 (see kokopellinetworks.com @ web.archive.org), this approach seemed viable but no adoption of a similar technology seems to be openly used to inform users but rumors have appeared that similar approaches have been used as a base legal litigations.

Priority settings
Enabling the dynamic set of priorities on transfers will not only keep users happy but provide an easy way to boost transfers on highly sicked content increasing the speed of replication of the same on the network.

One can even go a step further and permit a by resource configuration, enabling the removal or configuration access rights to each resource, like on a file system or even enable a way to permit a market for free trading of data letting users set a specific ratio for that resource.

Bandwidth Scheduler
Managing local resources is important not only to the local user but to the global network. Managing and enabling control of the application use of bandwidth will serve as incentive for users to improve how they manage that resource (what and how it is being used) and if taken in consideration by the application as a dynamic resource it can have positive effects on the global network by reducing wasting.

Many of the actual P2P applications enable users such control of their bandwidth, but not only P2P benefits from this strategy, today with a significant part of most computers connected to the Internet, managing this scarce resource is of top most importance. One example is Microsoft's Background Intelligent Transfer Service (BITS) aimed at enabling system updates or even the MS IM service to transfer data whenever there is bandwidth which is not being used by other applications, of note is also the ability to use the BITS technology since it is exposed through Component Object Model (COM).