Legal framework of textual data processing for Machine Translation and Language Technology research and development activities/MT-MP:Data Protection

Introduction
In this section we first present some key concepts of data protection, then some data protection instruments that resemble Creative Commons, and finally the types of uses of data that do not require consent from the data subject. It is important to note that in the case of data protection, consent is required not from the data processor or controller but from the data subject. In practice, this means that it needs to be sought from a different person and possibly at a different time. Various schemes based on the concept of consent commons try to address this problem by seeking such permissions when the data controller performs the original data collection.

Personal Data
Personal data is any information relating to an identified or identifiable natural person (data subject): an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity (Data Protection Directive Art. 2(a)).

Data Controller
The person who, either alone or jointly or in common with other persons, determines the purposes for which and the manner in which any personal data are or are to be processed.

Data Processor
Any person, other than an employee of the Data Controller, who processes the data on behalf of the Data Controller.

Processing of Personal Data
Any operation or set of operations which is performed upon personal data, whether or not by automatic means, such as collection, recording, organization, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, blocking, erasure or destruction (Art. 2(b) of the Data Protection Directive).

Data Subject’s Consent
Any freely given specific and informed indication of his wishes by which the data subject signifies his agreement to personal data relating to him being processed.

Key Data Protection Principles
In accordance with Art. 6(1) of the Data Protection Directive, personal data must be:
 * 1) 	processed fairly and lawfully;
 * 2) 	collected for specified, explicit and legitimate purposes and not further processed in a way incompatible with those purposes. Further processing of data for historical, statistical or scientific purposes shall not be considered as incompatible provided that Member States provide appropriate safeguards;
 * 3) 	adequate, relevant and not excessive in relation to the purposes for which they are collected and/or further processed;
 * 4) 	accurate and, where necessary, kept up to date; every reasonable step must be taken to ensure that data which are inaccurate or incomplete, having regard to the purposes for which they were collected or for which they are further processed, are erased or rectified;
 * 5) 	kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the data were collected or for which they are further processed. Member States shall lay down appropriate safeguards for personal data stored for longer periods for historical, statistical or scientific use.

Sensitive Data
These are normally defined in the relevant national legislation. The UK list of sensitive data (Section 2 of the Data Protection Act 1998) is indicative of what they include. Sensitive personal data means personal data consisting of information as to:
 * 1) 	the racial or ethnic origin of the data subject,
 * 2) 	his political opinions,
 * 3) 	his religious beliefs or other beliefs of a similar nature,
 * 4) 	whether he is a member of a trade union (within the meaning of the Trade Union and Labour Relations (Consolidation) Act 1992),
 * 5) 	his physical or mental health or condition,
 * 6) 	his sexual life,
 * 7) 	the commission or alleged commission by him of any offence, or
 * 8) 	any proceedings for any offence committed or alleged to have been committed by him, the disposal of such proceedings or the sentence of any court in such proceedings.

Historical Evolution of Personal Data Protection and its implications for MT & MP
The terms data protection and privacy are often used interchangeably, particularly in the management literature [see for example (Kuner 2003)]. However, in the legal literature, and in the European context in particular, the two terms are considered very closely related but not identical. Privacy could be described as “a condition or state in which a person (or collective entity) is more or less inaccessible to others, either on the spatial, psychological or informational plane” (Bygrave 2002, p.23), whereas data protection is defined as “a set of measures (legal or non-legal) aimed at safeguarding persons from detriment resulting from the processing (computerized or manual) of information on them” (Bygrave 2002). Data protection may be regarded as a narrower concept than privacy in the sense that the former is closer to a right of “informational self-determination”, whereas the latter relates to the protection of an individual’s “personal space” (Kuner 2007). In this document, we focus specifically on the data protection regulation that involves computerized information processing.

Three aspects of Data Protection (DP) regulation are important in relation to MT & MP processing, as they are closely tied to the evolution of technology and the respective regulatory responses:
 * First, the processing of personal data is one of the activities most heavily influenced by changes in information and communication technologies (ICTs). Different stages in the evolution of ICTs have provoked different modes of processing and communication of personal data. From the mainframes of the 1970s that gave rise to the first DP regulations, to personal computing, the Internet and then web 2.0 technologies, data protection regulation has undergone a series of corresponding changes. As we will see in sections 3 and 5, the technological changes have been in the direction of greater decentralisation of processing, storage and communication of information, and the regulatory instruments used have accordingly become similarly decentralised.
 * Second, the data protection regulatory framework is relatively new, the first regulation of that type appearing in Germany in the 1970s. It belongs to a broader category of regulatory instruments whose objective is to protect the weaker party in a transaction. The individual is seen as being threatened by the processing of his or her information by the state or by private entities and hence needs to be protected through specific regulatory means. Though the objective of regulation is to protect the individual, the addressees of the regulation are the entities that process data. However, as the processing of personal data becomes more pervasive, how the relevant regulatory framework is to be structured becomes a question of both regulatory resources and strategy. The challenge is how to transform DP regulation so that it protects the individual without restricting the free flow of information necessary for the operation of social and economic activity in contemporary societies.
 * The third aspect of the DP regulatory framework is closely related to the second: the transactions related to personal data processing are highly standardised and normally regulated through formalised agreements. As in the case of Intellectual Property Rights regulations, End User Licence Agreements (EULAs) are deployed in order to give legal structure to the transfer of personal data from the data subject to the data processor and to their processing. These EULAs have three distinctive characteristics: first, they are standardised; second, they are unilaterally defined by the stronger transacting party; and finally, while they presuppose the consent of the individual whose data are collected and processed, this consent is problematic in a great many ways. The individual does not have the opportunity to negotiate the terms of the contract or to withdraw consent if circumstances change.

The first data protection law in Europe appeared in the 1970s in the German federal state (Land) of Hessen as a result of the increasing threats to personal data from computerised data processing (Kuner 2003, p. 13). At about the same time, in the UK, the Younger Committee on Privacy proposed ten guiding principles for the use of computers that manipulated personal data. These eventually led to the publication of a White Paper, the setting up of the Lindop Committee, which provided information about the setting up of a Data Protection Authority, and the compilation of Codes of Practice for different sectors of the business community (Carey et al. 2004, pp. 1-3). Similar considerations gave rise to the Council of Europe Convention of 1981, which in turn provided the impetus for the Data Protection Act 1984, the first UK data protection Act. Subsequently, Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data led to the passing of the Data Protection Act 1998, which came into force on 1 March 2000. Finally, the Directive on the Processing of Personal Data and the Protection of Privacy in the Electronic Communications Sector (2002/58/EC), introduced in 2002, applies “to the processing of personal data in connection with the provision of publicly available electronic communications services in public communications networks in the Community” and led to the introduction of the 2003 Privacy and Electronic Communications Regulations published by the UK Department of Trade and Industry. In all the aforementioned instruments, the relationship between technology and the normative content of the regulation is apparent.

From the 1970s mainframes that increased the risks of data processing by the government to the 2000s Internet-based technologies that multiplied the possibilities of data processing and transfer, data protection regulation has been closely linked to the evolution and development of information and communication technologies and could be described as a form of info-regulation. It is important to note that personal data regulation, much like Intellectual Property Rights regulation, is a pure form of info-regulation, as it revolves not merely around technology but around information and its processing, and it seems very likely to follow the development trends of information processing and communication.

The nature and characteristic features of information processing at different stages of its development directly influence both the content and the structure of data protection regulation. Some examples from the early data protection regulation in the UK and EU are indicative of this trend.

First, in terms of content, the White Paper introduced by the British government after the Younger Report explicitly stated that: “the time has come when those who use computers to handle personal information, however responsible they are can no longer remain the sole judges of whether their own systems adequately safeguard privacy” (paragraph 30). In this passage it appears that the self-regulatory capacities of the relevant information processors are not deemed adequate for the required level of data protection. The White Paper justified this position on the basis of specific features of computer technology relating to improvements in (a) data maintenance and retention, (b) ease of access to personal data, (c) ease of data transfer, (d) combination of data that would not otherwise be possible, and (e) storage, processing and transmission of data in ways not possible in the past. The technologies in place when these reports and white papers were issued imply a form of data processing conducted primarily by formal organisations of significant size, able to cover the costs of information computerisation. Such organisations would be the government itself and any commercial organisation of substantial size.

The fact that the government was deemed one of the primary sources of risk for personal data protection is a result not only of its historical role but also of the nature of the information systems introduced to governmental agencies at the time. This is expressed in the choice to introduce independent DP authorities entrusted with the responsibility of personal data regulation rather than making that a function of the state. The use of Codes of Practice as one of the first measures adopted for the regulation of personal data is also indicative of the ability, at least in principle, of the commercial sector to deal with this issue. At this stage the lay person is only the passive subject of data processing, which is fully consistent with the information technologies available at the time.

Finally, the interest in introducing DP regulation at a Europe-wide level is also indicative of the importance of the free flow of data within the framework of the then European Community and of the appreciation of such flow as one of the elements contributing to the creation of the single European market. The market as an implied regulatory modality thus appears as one of the silent regulatory forces that has, at least partially, shaped the way in which EU data protection regulation was to be further developed: creating an essentially single regulatory framework for all European States that belonged to the Community then and the Union later, so that seamless participation in economic activities was possible.

As technologies for processing and reproducing digital content became cheaper, the regulation changed in order to accommodate the proliferation of entities holding data (data users according to the UK Data Protection Act of 1984). These entities were required to register with the Data Protection Authority. As we move to the UK Data Protection Act of 1998, we see three broad trends: the individual rights awarded to the person whose data are processed (data subject) are increased, the scope of what constitutes data processing expands (now including manual processing), and there is more emphasis on data transfers and the regulation of data exports.

The maturing of the European common market and the phenomenon of information processing outsourcing that started growing in the 1990s have had a direct impact on the regulation of personal data exports and processing. The introduction of personal computing in the 1980s and the Internet in the 1990s fundamentally changed the organisational structure of the multinational corporation and made it possible to process data in places different from those where the services are offered. Different forms of “sourcing”, from outsourcing to off-shoring, have entailed the introduction of geographically dispersed corporate structures and accordingly influenced the way and locus of employees’ personal data processing. This becomes extremely important in the case of LR-based MT & MP, where the re-use of data is necessary and hence all types of personal data processing become instrumental for the operation of such projects. These developments have affected personal data regulation not so much in terms of its actual normative content, which has remained to a great extent the same, following the original data protection principles, as in terms of the structure and organisation of regulatory instruments. Decentralised information processing has led to decentralised and volatile organisational forms that required more flexible regulatory means. These trends are expressed in the following forms of data protection regulation: the extensive use of self-regulation, safe-harbor provisions, technical standards and contracts.

The US safe-harbor system allows personal data to be transferred under a presumption of adequacy to US-based companies that agree to be bound by the system. Interestingly, the safe-harbor basis may be found in a variety of documents ranging from the European Commission decision adopting the safe harbor system to a set of Frequently Asked Questions and the safe harbor principles. The US safe harbor system is characterized by a series of features that make it a very interesting regulatory species, as it provides a standard to be followed by the companies deciding to follow the system rather than a comprehensive set of a priori rules. First, the companies interested in following the system must ascertain whether they are eligible for participation in the safe-harbor scheme. Secondly, the companies must determine which dispute resolution and enforcement system they want to be subject to in relation to their safe harbor system participation. As Kuner notes, “This can broadly be either (1) a private, self-regulatory mechanism, such as membership in a self-regulatory group, or development of the company’s own privacy policies that comply with the safe harbor principles, or (2) a ‘legal’ or government mechanism, such when the company is subject to ‘statutory, regulatory, administrative or other body of law (or of rules) that effectively protects personal privacy’, or if it commits to work with the EU DPAs. The company must also send a written confirmation to the US department of Commerce (which can be done online) signed by a corporate officer stating its commitment to the safe harbor principles. Finally, the company must disclose its commitment to the safe harbor principles, such as by stating in its publicly available privacy policies that it participates in the safe harbor arrangement” (Kuner 2007, pp. 139-140).

Another example of the changing nature of regulation is Art. 26(2) of the Data Protection Directive 95/46/EC, which stipulates that transfer to countries outside the EU may be authorised when there are safeguards that may “result from appropriate contractual clauses”. The interesting aspect is not merely that the transfer of data is possible under contractual arrangements, but also that these may appear in two forms: (a) model contracts, which constitute standardised sets of clauses approved by the European Commission, and (b) ad hoc contracts, which have to be approved by or notified to the Data Protection Authority (DPA) of the Member State from which data are to be transferred. The choice of contractual arrangements that are to be approved either by the European Commission or by individual DPAs is an indication of a model of regulation that combines the centralised model of the independent authority with the more flexible model of contractual arrangements.

The advent of the Internet has introduced even greater possibilities for collecting data and hence for personal data violations. In that sense, there is greater emphasis on the security element: there is a need to control the flows of data, and there are more players that now need to be aware of their capacity as data processors and hence need to comply with personal data regulations. At the same time, the individual has a greater ability to monitor and, in principle, control the use of her data. In that sense, there is a need to shift more control to the individual, who seems to need more control over the way in which her data are used. This is expressed particularly in the various forms of consent required for the processing of data, especially as required in Directive 2002/58/EC on Privacy and Electronic Communications.

Web 2.0 and 3.0 provide a huge impetus for sharing, re-using and processing each other’s personal data. What this means in practice for data protection is that the concepts of personal data protection and consent need to be radically reconceptualised. Each person needs to be made aware of the value of her personal data, where they are stored and how they are processed, and regulation has to find a way to deal with this problem. The monitoring costs seem to be tremendous, and the concept of consent seems to break down in a world where anyone seems to be both processing and giving away personal data. In this context we move from a need for self-regulation in the industry to a need for mass micro auto-regulation. It is therefore crucial for the individual to appreciate the importance of her personal data and to be able to regulate it accordingly. For that purpose, it is no longer enough just to provide such means to the individual; it is also necessary to make sure that the individual is equipped with the knowledge to actually manage her personal data.

It is important to note that the different phases of development are not linear and mutually exclusive. Different sources of threat and accordingly different regulatory measures need to co-exist and co-develop and this is reflected in the co-existence of different regulatory means.

Standard Licensing Models for Personal Data
In this subsection we present a number of projects inspired by the Creative Commons idea that aim to support the re-use of personal data without requiring any additional permissions.

Consent Commons
The Consent Commons project takes the Creative Commons idea, that the licensor grants a permission on the work allowing everyone to use it under specific terms, and applies it to the case of data protection. It thus assumes that at the time of obtaining copyright permissions, that is, at the time of the Intellectual Property Rights (IPR) clearance, the privacy rights will also have to be cleared. The interesting part here is that these permissions will have to be built upon the CC licences. This means that the data subject will have to give permission for her data to be re-used as the chosen CC licences prescribe. The consent may cover not only personal data but also tissue, and in that sense it will have to be compliant with data protection law, the human tissue legislation and any other confidentiality agreements that may be in place.

The Consent Commons idea originates in the area of Open Educational Resources (OER) and was funded by the UK Joint Information Systems Committee. It has as its main objective the re-use of medical educational material and specifically clinical images and could potentially also include research.

There are a number of issues associated with the Consent Commons project:
 * The level of awareness of whether and how consent should be acquired is low. Very frequently data protection law does not actually require consent, and if medical practitioners can avoid the effort of obtaining consent, they probably will not use the Consent Commons approach.
 * The transition from providers of clinical services to the Higher Education (HE) sector is not always without friction. A patient may be happy for her data to be used in the context of a treatment or even research, but it is not clear whether she would also want them used for educational purposes, especially if the material is to be openly re-used without an obligation to notify the data subject of the use of her medical image. These are not just personal but sensitive data, and consent is essential.
 * There may be ownership of the IPR without consent clearance, or the other way around. This means that if the content is free to move, there need to be procedures ensuring either that the permissions are in place or that there is a clear way to get back to the IPR owner and the data subject to obtain the necessary permissions.
 * There are tracking issues with regard to how the content is to be used. Even when permission is given by the data subject, it will be necessary to make sure that consent as broad as one that allows re-use is legal and that there is some tracking of how and where the content is being used. This may lead to a need to retract permissions, which is not necessarily in tune with the current CC licences or the open content ethos and practice.
 * There is no clear guidance as to how consent is to be obtained, nor are there clear policies on how to do this without violating either the legal framework (data protection and the Human Tissue Act) or the codes of conduct, professional rules and codes of ethics that apply in different contexts.
 * There are no clear policies by the providers of clinical services or the people in medical education as to how consent is to be obtained and how the licensed and consented content is to be further re-used. This is reflected in the lack of processes and legal instruments that could further support their use.

Consent Commons offers a solution in the following respects:
 * it complements the CC licences by covering the data protection side of things
 * it provides a set of principles upon which processes and tools could then be built
 * it provides a three-layer structure like that of the CC licences
 * it could potentially provide for revocation of consent, though it is questionable whether this would match the CC structure and philosophy
 * it has different levels of release of personal data corresponding to different types of consent: (a) fully open access, (b) sharing between trusted partners, (c) open access where the specific users have to be approved, and (d) restricted access
 * some of these modes of Consent Commons are CC compliant and others are not.
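The four release levels listed above could be modelled along the following lines. This is only an illustrative sketch: the enum, the `may_access` function and the flags describing the requester are hypothetical names, and the mapping from each level to an access rule is an assumption made for the example, not part of the Consent Commons project itself.

```python
from enum import Enum


class ReleaseLevel(Enum):
    """Levels of release of personal data, as listed in the text."""
    FULLY_OPEN = "fully open access"
    TRUSTED_PARTNERS = "sharing between trusted partners"
    APPROVED_USERS = "open, but specific users have to be approved"
    RESTRICTED = "restricted access"


def may_access(level, *, is_trusted_partner=False, is_approved=False,
               is_named_grantee=False):
    """Return True if a requester may access material released at `level`.

    The rule attached to each level is an assumption for illustration only.
    """
    if level is ReleaseLevel.FULLY_OPEN:
        return True
    if level is ReleaseLevel.TRUSTED_PARTNERS:
        return is_trusted_partner
    if level is ReleaseLevel.APPROVED_USERS:
        return is_approved
    # RESTRICTED: only requesters explicitly granted access
    return is_named_grantee
```

Recording the level alongside each item, rather than inferring it later, would make it easier to check at re-use time whether the consent originally given still covers the intended use.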

Privacy Commons
Privacy Commons is an umbrella project aiming at hosting different projects with the objective of tackling the following problems:
 * the existing privacy policies of different data controllers are neither clear to nor understood by end users
 * the privacy policies are not machine readable, so they cannot be easily and quickly read, compared, tracked and understood
 * overall, the privacy policies are data-processor centric rather than data-subject centric

The solution proposed by the Privacy Commons project is to standardise and modularise privacy policies, so that they may then be easily expressed in icons and become machine readable. The Privacy Commons project is implemented by different companies that wish to adhere to such principles. An example of the icons used would include the following basic elements:
 * notifying the data-subject that her personal data are collected
 * if there is a specific type of information that is collected, this also has to be made clear (e.g. banking information)
 * if aggregate statistics are collected, this is also made clear to the data-subject
 * another very important element is whether the company will or will not disclose the data to third parties; in the current implementation of the Privacy Commons project it is not clear how the data are further disseminated or to whom
 * a separate icon signifies whether the data are to be sold or not; again, the entities to which they are to be sold are not made explicit
 * finally, another key element of the project is to indicate whether there is user, corporate or shared ownership over the data. This may relate to the way in which IPR are handled but also to how the personal data are to be shared. The interesting part here is that we see a transition from personal data to property rights. Even though it is not clear whether the individual has ownership over her personal data or not (and in most legal systems she does not have such ownership), she is informed as to the property status of such data. This will allow the individual to make an informed decision as to whether she would like her data to be processed or not.
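To make the idea of a modular, machine-readable policy more concrete, the basic elements listed above could be captured in a small record that software can serialise and compare. The sketch below is hypothetical: the class name, the field names and the `differences` helper are assumptions chosen for illustration and do not reproduce any format defined by the Privacy Commons project.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PrivacyPolicy:
    """One field per basic element listed in the text (names are illustrative)."""
    collects_personal_data: bool
    specific_data_types: tuple        # e.g. ("banking information",)
    collects_aggregate_statistics: bool
    discloses_to_third_parties: bool
    sells_data: bool
    ownership: str                    # "user", "corporate" or "shared"


def differences(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {name: (da[name], db[name]) for name in da if da[name] != db[name]}


site_a = PrivacyPolicy(True, ("banking information",), False, False, False, "user")
site_b = PrivacyPolicy(True, ("banking information",), False, True, True, "corporate")

# A machine-readable form that tools could exchange, track and compare
print(json.dumps(asdict(site_a), indent=2))
print(differences(site_a, site_b))
```

Because every policy is expressed with the same fields, two policies can be compared field by field, which is the kind of quick reading and comparison that free-text policies do not allow.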

Privacy Icons
Privacy Icons is a project specifically aimed at addressing the problem of privacy policies that lack clarity and do not focus on what end users want. The objective of the project is to produce icons that allow the data subject to understand how her personal data are to be processed and further disseminated. The Privacy Icons project is primarily addressed to organisations that wish to improve the communication of their data processing principles. These organisations are to adopt a range of icons that represent the basic features of their privacy policies. Such features may include:
 * whether the data are to be used only for a specific purpose or for all possible purposes, including others than those originally intended
 * whether the data will be bartered or sold
 * whether the data will be passed to advertisers or not
 * how long the data are to be kept (ranging from one month to indefinitely or only for the duration that is necessary for the intended processing to take place)
 * whether the data will be given to law enforcement only when a legal process is followed or irrespective of any such process.

Two basic issues related with the Privacy Icons project are the following:
 * how the project will fit all possible privacy policies: the answer is that, by following a “lego”-like approach, the data controller is able to express its policy in a way that is both accurate and understandable
 * the problem of the “evil” icons, that is, why a data controller would adopt icons showing that it processes data in a way that may be contrary to the interests of the data subject. Precisely because no data controller would ever accept that, it is assumed that a privacy icons model could identify and rate privacy policies. If a data controller does not accept to have its policy rated, the policy would be assigned the worst possible values by default. Of course, this is an approach that could be both misleading and infeasible, and hence it is unlikely to be adopted.
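The “lego”-like composition and the worst-by-default rule described above can be sketched as follows. The feature names, the values and the choice of which value counts as "worst" for the data subject are all assumptions made for this example; the project itself does not prescribe them here.

```python
# Worst-case value per feature, from the data subject's point of view.
# Feature names and values are hypothetical.
WORST = {
    "purpose": "any purpose",
    "sold_or_bartered": True,
    "passed_to_advertisers": True,
    "retention": "indefinitely",
    "law_enforcement": "regardless of legal process",
}


def icons_for(declared=None):
    """Compose the icon set for a policy.

    `declared` maps feature names to the values a data controller commits to.
    An unrated policy (declared is None) gets the worst value for every
    feature, and so does any feature the controller leaves undeclared.
    """
    if declared is None:
        return dict(WORST)
    unknown = set(declared) - set(WORST)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return {**WORST, **declared}


print(icons_for({"retention": "one month", "sold_or_bartered": False}))
```

The sketch also makes the objection raised above visible: an unrated policy is shown as if it were the worst possible one, which may misrepresent what the data controller actually does.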

Identity Commons
The Identity Commons is an umbrella project that includes a number of smaller projects, all aiming to create a user-centric identity infrastructure and to address the resulting social trust issues. Though it does not directly deal with privacy issues, it is a project that has a great impact on the protection of personal data and includes projects with a specific focus on privacy. The Identity Commons projects include:
 * Data portability
 * Higgins
 * ID-Legal
 * Identity Gang
 * Information Card Foundation
 * Internet Identity Workshop
 * Kids Online
 * OpenID Foundation
 * OSIS
 * Pamela
 * Project VRM

Conclusions
All these projects follow the same basic principles, which are:
 * Self-organization
 * Transparency
 * Inclusion
 * Empowerment
 * Collaboration
 * Openness
 * Dogfooding

In sum, the four aforementioned projects share the following characteristics:
 * they are user centric
 * they contain mechanisms for protecting both personal data and identity
 * they have an organisational perspective that focuses on the implementation of the project
 * they all use techno-legal mechanisms such as meta-data, tagging and APIs
 * they all assume that the content will flow over the public internet

Legal Basis for the Use of Personal Data
Personal Data could be processed on the following premises:
 * 1) 	that free, prior and informed consent has been obtained for a specific re-use purpose and for a specific duration
 * 2) 	that there is another legal basis for the processing

Consent may be obtained either through a standard release statement, such as those presented in section 3.2.2, which resemble the Creative Commons mechanism, or by asking for a specific permission for the processing of Personal Data. In the latter case, the request for such permission should state the purpose and nature of the processing, its duration, the data retention period and the degree to which personal data are to be further disseminated, either as such or through a value added service.
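The items that a specific permission request should state can be treated as a simple checklist. The following is a minimal sketch with hypothetical class and field names; it only checks that each item has been filled in and says nothing about whether the resulting consent is legally valid.

```python
from dataclasses import dataclass, fields


@dataclass
class ConsentRequest:
    """Items a specific permission request should state (names are illustrative)."""
    purpose: str = ""
    nature_of_processing: str = ""
    processing_duration: str = ""
    retention_period: str = ""
    dissemination: str = ""   # how far the data are further disseminated


def missing_items(request):
    """Return the names of the items that have been left empty."""
    return [f.name for f in fields(request)
            if not getattr(request, f.name).strip()]


draft = ConsentRequest(purpose="MT research", retention_period="5 years")
print(missing_items(draft))
```

A project could run such a check before a request is sent to data subjects, so that incomplete requests are caught early.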

The most common legal basis for LR-based MT & MP will be that of research. The conditions for such processing are:


 * that it is only for research purposes
 * that the data is anonymised
 * and that all necessary measures for the protection of the people involved are taken. This last condition is to be interpreted in different ways according to the case law in the relevant country; however, it should definitely include certain measures to prevent further dissemination of the personal data.

The problem with such a definition is that a large amount of LR-based research may end up serving commercial purposes that go beyond the original research purpose. It needs to be noted that this is not necessarily a problem if the commercial service does not make use of the personal data or does not further release them.

Even in cases where sensitive data are to be used, processing is possible provided that:
 * access is only provided at the site where the data are stored
 * only data that are necessary for the research purpose are extracted
 * data that may be used to identify living persons are not recorded
 * the data are retained in anonymised form for subsequent uses (e.g. publications)
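As an illustration of what not recording identifying data might look like for a text corpus, the sketch below masks two kinds of direct identifiers (e-mail addresses and phone-like digit sequences) in a segment before it is stored. The patterns and placeholder tokens are assumptions for the example. Pattern-based masking of this kind catches only some direct identifiers; it does not by itself amount to anonymisation in the legal sense, since names, addresses and indirect identifiers remain untouched.

```python
import re

# Illustrative patterns only; real corpora need far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(segment):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    segment = EMAIL.sub("<EMAIL>", segment)
    return PHONE.sub("<PHONE>", segment)


print(redact("Contact anna@example.org or +44 20 7946 0958."))
```

In a translation memory, the same masking would need to be applied to both the source and the target side of each segment pair, otherwise the identifier survives in one of the two languages.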

The problem with such conditions is that they are very difficult to implement in an MP/MT scenario. For this reason, and despite the obvious problems with such an approach, it is always advisable to ensure that the consent of the data subject is obtained when sensitive personal data are involved.