Legal framework of textual data processing for Machine Translation and Language Technology research and development activities/Public Sector Information

Introduction
Public Sector Information (PSI) is of particular relevance in the MP/MT context, since a large amount of the content used for such purposes is PSI. Any form of document held by Public Sector Bodies (PSBs) of Member States is deemed to be PSI (Art. 1(1) of the PSI 2013 Directive (2013/37/EU)).

According to Art. 2(3) of the PSI 2013 Directive, "document" means:
 * 1) 	any content whatever its medium (written on paper or stored in electronic form or as a sound, visual or audiovisual recording);
 * 2) 	any part of such content.

As a result, a vast amount of information released publicly on the Internet by different EU Member States is within the scope of the PSI Directive and as such is eligible for re-use by third parties for both commercial and non-commercial purposes. While the discussion regarding disclosure and making available of PSI is dominated by the question of PSI licensing, which is covered in section 4, it is also likely that a licence is neither necessary nor possible. This section covers the following main cases:
 * 1) 	PSI exempted from copyright protection
 * 2) 	PSI on Public Domain works (copyright has expired)

Works not granted copyright or exempted from copyright and similar rights protection
A licence will not be necessary in the absence of copyright or similar rights (including neighbouring rights, the sui generis protection of databases or other similar rights) for that specific subject matter.

There are two main cases of PSI as copyright exempted subject matter:

PSI as exempted subject matter per se
Some jurisdictions do not assert copyright in PSI, so no licence would be needed. The clarity of the regime varies across jurisdictions: in some, the status is clearly stated in the law; in others, the information can be provided by the National Copyright or Intellectual Property Office; and in yet others the PSI status is less than clear and has to be established on a case-by-case basis. Where the law is explicit, the relevant national copyright law refers to the types of works, and the uses of such works, that are non-copyrightable subject matter. In most jurisdictions these will be texts used to exercise administrative powers or to offer a public service. There are explicit references to judicial, legal and administrative texts, but the limits of the exemptions are to be defined on a national jurisdiction basis.

Content Provider’s Perspective: The content provider should at least use notices to indicate the copyright status of the work as Public Domain. The Creative Commons Public Domain Mark (PDM) could greatly contribute in that direction.

User’s Perspective: The user could either make an assessment with regard to the copyright status of the work or look for relevant notices. It is suggested that the user uses one of the Public Domain Calculators that are available to assist her in making the relevant decisions. If the work is not copyrighted, the user may use it in any way she wishes. Because legislative material is very likely to fall under this category in most civil law jurisdictions, it is important to assess the status of the work in the country of origin.

Tip: It could well be the case that the individual works are copyright-free, but their compilation is protected under copyright or the sui-generis database right. This is the case with most commercial legal databases. Hence, always pay attention to the source of the material: it is more likely to be Public Domain if it comes from an official or public web site.

Non-copyrightable subject matter released as PSI
The second case is different from the first case, as it is the nature of the PSI material rather than its use as PSI that renders it non-copyrightable.

These are mostly cases where the subject matter does not fulfil the requirements of originality and form needed to attract copyright protection, or where it consists of specific types of information, such as factual information, raw data or traditional knowledge. Such types of content constitute non-protectable subject matter and hence do not require any type of licensing, irrespective of whether they are PSI or not. However, it needs to be highlighted that once arranged in a systematic or methodical way, the resulting set of information may be protected under copyright (if the arrangement passes the criterion of originality) or the sui generis database right (if it constitutes a non-original compilation).

Content Provider’s Perspective: The content provider should identify such material and convey its legal status to the end user through some form of notice. However, because the sui generis database right wraps even mere information in a shell of a property right, non-copyrightable subject matter will be released as such in very few cases. An interesting case within the scope of the PSI 2013 Directive will be the release of traditional knowledge material from museums, archives or libraries, including oral history and songs, which may be of great interest for MP purposes. However, even such content may attract copyright through its packaging (sound recordings, transcriptions etc.).

User’s Perspective: The user will have to assess whether the material is copyrightable or not. This is often a difficult and ambiguous exercise, and it is generally suggested either to look for some sort of notice or to use the risk mitigation techniques presented in the relevant section.

Expiration of the Copyright Term (Public Domain Works)
A licence will also not be necessary when the copyright or similar rights term has expired. Works no longer protected by copyright because of the expiration of the economic rights term should be treated as public domain works and therefore should be freely re-used. The economic rights granted under the copyright regime typically expire 70 years after the death of the last co-author or 70 years after the publishing or recording, but rules may vary and the term of protection may be longer in special cases. Public Domain calculators are being developed to help assess whether a work will be in the Public Domain in a particular jurisdiction.
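The typical 70-year rule described above can be sketched as a very small calculation. This is only an illustration, not a substitute for a real Public Domain calculator: the function names are hypothetical, the default term is the typical case mentioned in the text, and the code assumes that the term is counted in full calendar years starting on 1 January of the year after the last co-author's death. National rules and special cases are not modelled.

```python
from datetime import date


def public_domain_from(last_author_death_year: int, term_years: int = 70) -> date:
    """Return the first day on which the work is assumed to be in the Public Domain.

    Assumption (not taken from the text): the term runs for `term_years` full
    calendar years counted from 1 January of the year after the death of the
    last surviving co-author.
    """
    return date(last_author_death_year + term_years + 1, 1, 1)


def is_public_domain(last_author_death_year: int, on: date, term_years: int = 70) -> bool:
    """Check whether the assumed term has expired on the given date."""
    return on >= public_domain_from(last_author_death_year, term_years)


if __name__ == "__main__":
    # Hypothetical work whose last co-author died in 1950.
    print(public_domain_from(1950))
    print(is_public_domain(1950, date(2020, 6, 1)))
```

The term is left as a parameter because, as noted above, the rules vary between jurisdictions and types of work; the result should be treated as an initial indication only.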

Content Provider’s Perspective: A good practice here is not to license content that is in the public domain. Content Providers often do not have a clear understanding of the copyright status of the information they release. It is strongly suggested to assess the duration of protection of the information to be released and to mark the works accordingly.

User’s Perspective: The lack of relevant documentation and harmonization in the term of protection of different works in different jurisdictions is likely to cause significant implementation issues with regard to the assessment of whether a particular work belongs to the public domain or not (due to term expiration reasons). The Public Domain Calculators may be useful with regard to an initial assessment of the copyright status of the work, but a risk mitigation strategy should always be applied.

Tip: The risk of infringement is reduced when the work is close to term expiration, the work is of low commercial value and the use is non-commercial.

Limitations and Exceptions
A different, but related, case is when the PSB needs to use copyright material to perform its public task or where a court requires access to copyrighted subject matter in order to issue a decision. These are cases where no permission or input licence is required for the PSB or court of justice to perform its mission or task, as the use will normally fall under the limitations and exceptions, fair dealing or fair use rule, and the material could hence be used without any additional permissions.

Content Provider’s Perspective: When such material is to be disclosed or made available for re-use, this cannot be done if it contains third party copyrighted material. While the exception will cover the use of the PSI, it will not necessarily cover its re-use. This is the reason why it is strongly suggested that PSBs mark PSI containing third party material with some sort of meta-data or notice regarding the third party material.

User’s Perspective: The net effect of the lack of a harmonized copyright limitations and exceptions regime across the EU is increasing uncertainty as to what falls within their scope. It is not clear whether the material, its use or the entity that performs it are such that they are considered as falling within the limitations and exceptions. The disparity between the fair use, fair dealing and limitations and exceptions systems further complicates the situation, making the request for a licence a safer option.

Use of Marks and Notices
It is highly recommended that, when PSI material is not covered by copyright or other similar rights or when it contains third party copyrighted material, the relevant marking is in place. This will increase legal certainty and allow the lowering of transaction costs.

This can be achieved in a standardized fashion by using the Creative Commons Public Domain Mark or by drafting an ad hoc notice.

Content Provider’s Perspective: Using a standardised tool such as the Public Domain Mark developed by Creative Commons provides a notice text that is accompanied by metadata, valid across jurisdictions and translated into many languages. According to Creative Commons, the Public Domain Mark “is intended for use with old works that are free of copyright restrictions around the world, or works that have been affirmatively placed in the worldwide public domain prior to the expiration of copyright by the rights’ holder.” The Public Domain Mark tool provides the ability to generate HTML code to inform the public (and search engines) of the public domain status of the work. The Public Domain Mark enables a person who wishes to mark the work as being in the public domain to include optional useful information, such as:


 * Name of the work, e.g. title of the dataset;
 * Name and URL of the author, e.g. the division or department releasing the PSI and the source page;
 * Identifying individual or organisation, in case this information differs from the above, e.g. a higher level of the PSB which should be contacted for further information.
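As a rough illustration of how the optional information listed above could be combined into a machine-readable notice, the following sketch builds a small HTML fragment. It does not reproduce the output of the Creative Commons tool: the function name, the parameter names and the layout of the fragment are hypothetical, and only the deed URL of the Public Domain Mark 1.0 refers to the actual Creative Commons resource.

```python
from html import escape

# Deed URL of the Creative Commons Public Domain Mark 1.0.
PDM_URL = "https://creativecommons.org/publicdomain/mark/1.0/"


def public_domain_notice(work_title, author_name, author_url, identifier_name=None):
    """Build a simple HTML notice (hypothetical layout, not the official CC output)."""
    parts = [
        # Name of the work, e.g. the title of the dataset.
        f"<p>This work, <cite>{escape(work_title)}</cite>, ",
        # Name and URL of the author, e.g. the department releasing the PSI.
        f'by <a href="{escape(author_url, quote=True)}">{escape(author_name)}</a>, ',
        # rel="license" is the standard HTML link type for pointing to a rights statement.
        f'is marked with the <a rel="license" href="{PDM_URL}">Public Domain Mark 1.0</a>.',
    ]
    if identifier_name:
        # Identifying individual or organisation, if different from the author.
        parts.append(f" Identified by {escape(identifier_name)}.")
    parts.append("</p>")
    return "".join(parts)


if __name__ == "__main__":
    # All values below are made-up examples.
    print(public_domain_notice(
        "Court decisions 2012",
        "Department of Justice",
        "https://example.org/justice",
        identifier_name="Ministry of Justice",
    ))
```

Escaping the supplied values keeps the fragment valid HTML even when a title contains characters such as `&` or `<`.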

User’s Perspective: It is always preferable to search for the material through search engines that allow the identification of the relevant licensing form or copyright status of the material, or to use relevant APIs or other technical means.

Licensing of Public Domain material released as PSI?
PSBs should refrain from using licences for PSI that is in the Public Domain. Such licences would create restrictions upon the use of works that are no longer protected by copyright or similar rights and can be freely used without any conditions. Furthermore, since no copyright exists in a Public Domain work, there is no legal basis to license it. The PSI 2013 Directive explicitly makes reference to the possibility of releasing material without any conditions, and Public Domain material clearly falls under that case.

In addition, it is not recommended to add a licence (and therefore restrictions where none should apply) to the digitised reproductions of analogue non-copyrightable data or Public Domain works. The mere act of digitisation is not a source of new rights, and keeping digitised versions in the Public Domain will guarantee they remain as free to use as the original work. Digital reproductions of works which are in the Public Domain must also belong to the Public Domain. Use of Public Domain works must not be limited by the addition of unnecessary licensing requirements. In some countries, the threshold for originality is low, and digitisation might open a claim to copyright, but it is not recommended to enforce that right.

Content Provider’s Perspective: Refrain from using any licence for PD material; instead use notices where applicable.

User’s Perspective: Ensure that the material used is indeed in the PD irrespective of the licensing scheme. Check if there is any additional form that may revert the resource to copyrighted material, e.g. book format protection, database right or digitization (depending on the jurisdiction).

Concluding Remarks and Recommendations for PSBs licensing
Overall, the material released by PSBs as PSI may be used on a number of occasions for MT & MP purposes without requiring additional permissions or even a licence, either because it belongs to the Public Domain or because it falls within relevant limitations and exceptions. When a licence is required, the normal copyright rules should apply as stipulated in section 3.1, where reference to PSI and the related licensing is also made. Directive 2013/37/EU (the New PSI Directive), and previously Directive 2003/98/EC, allows for the release of PSI for re-use under a licence or without conditions (art. 8). This practically means that a Member State may choose to release PSI for re-use without a licence if this PSI is:


 * 1) 	in the Public Domain (e.g. because the duration of the copyright has expired)
 * 2) 	exempt from copyright law.

The experience of the open licensing community, even outside the realms of public data regulation, favors maximum simplicity in the release of public data. Such simplicity is best served when any type of work is made reusable without any limitation, or with very few limitations. This helps to ensure licence compatibility and increases the re-use of the content by the industry and the civil society. In turn, this best serves the objectives of the Directive, i.e. growth/job creation and the objective of the Digital Agenda 2020 for greater transparency in the activity of the Public Administration.

In the open licensing community this is amply demonstrated by the recent statement issued by Creative Commons after the 2013 Global Summit, effectively supporting the development of positive user rights in copyright law, rather than relying on list-based exceptions or open licensing as sufficient solutions. The need for copyright reform is beyond the scope of this report. However, the fact that there is an urgent need for reducing uncertainty and complexity with regard to copyright limitations and exceptions (and that licensing is a patch rather than a fix to the problem) points in the direction of a legislative solution at the Member State level, something that the new PSI Directive makes possible and that a number of Member States are following. This approach, combined with an “openness by default” policy, may allow the maximum benefits from opening PSI while substantially reducing transaction and clearance costs for potential re-users. The following recommendations could strongly support the use of PSI-based LRs for MP/MT purposes:

Recommendation 1
1. 	The adoption of a legislative solution, instead of licensing, could further reduce transaction costs, i.e. openness by default of PSI in the form of a law, without the further requirement of issuing or adopting a specific licence:
 * introduce a positive, actionable user-creators’ right to PSI;
 * harmonise exemptions and exceptions to any kind of protection so that such right applies across the European single market;
 * in order to ensure interoperability, make sure that no conditions are attached to the re-use right other than the ones mentioned in the Directive, i.e. acknowledgment of source and acknowledgment of whether the document has been modified by the re-user in any way;
 * provide clear reports as to how the positive PSI right operates; these reports could be provided by the Public Sector Bodies competent for the implementation of the PSI Directive through circulars, and then to the re-users through notices;
 * make a registry for the information that is to be used under a closed/all rights reserved/re-use but not open licensing scheme. Ensure that such registry is regularly updated and provided as (linked) open data in the national data.gov (open data) portal. This should not introduce additional costs to the PSB: it could be a simple spreadsheet file (e.g. in csv format), it would be necessary anyway if the PSBs were to charge, and it could make use of the already existing infrastructure (data.gov site);
 * clearly indicate when personal data are included in PSI, ensure that any re-users are obliged to also use a personal data notice (indicating the original processing purpose) when such data still exist, and provide a report as to how to resolve the conflict between personal data protection and the right to re-use;
 * have a single institution responsible for collecting, coordinating and administering open licensing, even where it involves subject matter outside the scope of PSI (e.g. research data, broadcasters’ data etc). Ensure that there is harmonisation of open access policies in PSI, the cultural, educational, research and science sectors. This could be done at the Member State level through an intra-ministerial committee headed by the PSB responsible for the implementation of the PSI Directive. EU-wide coordination could be done through the introduction of a Working Group or standing committee to support the review of Art. 13(2) of the PSI 2013 Directive;
 * issue a technical report ensuring the legal documentation of the relevant data sets (e.g. under which regime/law they are made available).
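The registry of non-open material mentioned in the list above could be as simple as a CSV file. The following is a minimal sketch, assuming a hypothetical set of column names (the text does not prescribe any) and made-up example rows; it uses only the Python standard library.

```python
import csv
import io

# Hypothetical column names; the text only says the registry could be a simple CSV file.
FIELDS = ["dataset", "holder", "regime", "last_updated"]


def write_registry(rows, stream):
    """Write registry entries (dicts keyed by FIELDS) as CSV to a text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_registry(stream):
    """Read the registry back as a list of dicts, one per dataset."""
    return list(csv.DictReader(stream))


if __name__ == "__main__":
    # Made-up entries for illustration only.
    entries = [
        {"dataset": "Land register extracts", "holder": "Cadastre Office",
         "regime": "all rights reserved", "last_updated": "2014-01-15"},
        {"dataset": "Company filings", "holder": "Business Registry",
         "regime": "re-use against a charge", "last_updated": "2014-02-01"},
    ]
    buffer = io.StringIO()
    write_registry(entries, buffer)
    print(buffer.getvalue())
```

A flat file of this kind can be published on an existing open data portal and updated without dedicated infrastructure, which is the cost argument made above.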

Use reports for the following issues:
 * regarding notices to be placed on PSI. This could be done through smart notices, i.e. not legally binding notices that refer to specific legislative provisions, that are permanently stored in the same way as the Public Domain Mark, and added as meta-data or mark to the relevant PSI. In fact, if PSI is equated with Public Domain, then the CC PDM could be used, as in the case of Europeana.
 * how to mark that the PSI contains personal data and how to resolve the conflict between personal data protection and PSI re-use.
 * on how notices are to be retained by the re-users
 * on how to obtain consent by the data subject for re-use of her information at the point of collection
 * on obtaining maximum IPR from third parties

2. 	Use the licensing instrument with the least legal friction, i.e. the CC0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

3. 	Use standard open public licences, especially CC BY, which is the solution with the least legal friction if recommendations 1 and 2 cannot be followed.

National governments and public sector bodies (including Galleries, Libraries, Archives and Museums, as well as other public interest institutions) still prefer to use their own Open Government Licences (OGL), as this gives them more control over the wording of the licence and the licence update process. However, this poses increasing challenges, as it requires a dedicated team of experts for the creation and maintenance of the licence as well as continuous monitoring of updates of other standard public licences and extra care in the wording so that interoperability between different licences of the same type is achieved. While this is possible, it is extremely difficult to achieve for all six types of Creative Commons licences. Hence, it is advisable that governments or public sector bodies that insist on making their own licence create only an Attribution (by this we mean “Attribution-only”) licence.

Recommendation 2
Governments are strongly advised NOT to use:
 * NoDerivative licences, since they substantially erode and effectively annul the scope and ambit of the re-use of the material and hence the application of the PSI directive.
 * NonCommercial licences, since it is extremely difficult to achieve a common definition of NonCommercial both within their own jurisdiction and, even worse, in other jurisdictions.
 * ShareAlike licences, since they require that the derivative work will be released under the exact same licence, which makes legal interoperability extremely difficult.

It is also important to note that the only types of licences that conform to the Open Definition (http://opendefinition.org/okd/), i.e. a definition of how all types of knowledge could be disseminated and re-used with the minimum possible restrictions, are the following:
 * CC0 Public Domain Dedication or Public Domain Dedication and License (i.e. blanket waivers when construed as a licence)
 * Attribution licences
 * Attribution ShareAlike licences

However, as mentioned above, ShareAlike licences may be very problematic in terms of compatibility, as they require that data-sets are remixed using licences with the same terms and conditions. In addition, CC Zero may be the preferred tool for releasing public data, since it is the preferred tool for Europeana and produces the least possible interoperability frictions.

It is strongly advised that the European Commission does not create its own Open Data licences. The reason is that another licence would only add to the problem of “licence proliferation”, i.e. the problem of having multiple licences with similar terms but with potential incompatibilities that do not allow the seamless re-use of different data sets. For this reason, the suggested direction is towards a European licence standard rather than a standard European licence.

Recommendation 3
It is strongly advised that even when governments choose to produce their own version of an Attribution (or other open) licence, they should always try to apply it in the following fashion:
 * retain a clear versioning/date scheme, i.e. each licence should have a specific version and a date of that version. When changes are made, these should be made public and the version number should change.
 * store the licences at a permanent location and link the licensed material with the URI of the licence
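The two practices above (explicit version and date, and a permanent licence URI that the licensed material links to) can be sketched as a small data model. The class names, field names and example.org URIs below are hypothetical and serve only to illustrate the idea; they do not refer to any existing government licence.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LicenceVersion:
    name: str
    version: str   # changes whenever the licence text changes
    issued: date   # date of this specific version
    uri: str       # permanent location of this exact version


@dataclass(frozen=True)
class Dataset:
    title: str
    licence_uri: str  # link from the licensed material to the licence URI


def licence_for(dataset, licences):
    """Resolve the licence version a dataset points to, or None if it is unknown."""
    by_uri = {lic.uri: lic for lic in licences}
    return by_uri.get(dataset.licence_uri)


if __name__ == "__main__":
    # Made-up licence versions stored at (hypothetical) permanent URIs.
    versions = [
        LicenceVersion("Example Attribution Licence", "1.0", date(2013, 1, 1),
                       "https://example.org/licences/attribution/1.0"),
        LicenceVersion("Example Attribution Licence", "2.0", date(2014, 6, 1),
                       "https://example.org/licences/attribution/2.0"),
    ]
    data = Dataset("Budget figures", "https://example.org/licences/attribution/1.0")
    print(licence_for(data, versions))
```

Because each version keeps its own URI, material released under version 1.0 continues to point to the text it was actually released under after version 2.0 is published.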