Written by Andras Schwarcz
Text and data mining (TDM), the automated analysis of digital data in search of trends, correlations and patterns, is rapidly gaining prominence due to exponentially increasing amounts of digital data (‘Big Data’) and decreasing technology prices. TDM enables researchers to access and analyse material that was previously impossible to process. Is TDM the next step for libraries in providing information to their users? This question was the premise of an event organised by the European Parliament’s Library, which invited experts to explain their take on the role libraries can and should play in utilising this new research potential. The panel consisted of Catherine Stihler (S&D, United Kingdom), Kiera McNeice of the British Library, and Julien Roche, professor at Lille University, and was moderated by Joe Dunne, Director of the EP Library. Julia Reda (Greens/EFA, Germany) presented some opening remarks on the subject before the panel discussion began.
Now that the scientific information produced exceeds the amount humans can process, text and data mining is necessary to advance research. TDM has the potential to accelerate innovation by making research more efficient, reducing literature review time by 80 %, and discovering connections between apparently unrelated datasets. TDM may consequently help reinterpret existing knowledge. It can help the research community, and indirectly the general public, to benefit from a larger proportion of research data, most of which is generated using public funds.
There are some issues hindering the spread of TDM globally, and especially in Europe. The first to tackle are legal, such as the unclear and fragmented legal framework for copyright. Currently, researchers need the express consent of each data publisher to be able to mine for information, even if they have the right to access the data. This discourages researchers from covering a wide range of datasets, which is especially relevant in niche fields. Rules for copyright exceptions, digital rights management (DRM) and definitions of such crucial terms as ‘non-commercial use’ are unclear and vary across Member States. Publishers and data owners often set legal restrictions on data mining in their licensing contracts with libraries and research institutions, as well as technical restrictions on bulk downloading and crawling of their websites, which especially hinder APIs developed by researchers. Although many major publishers have developed and standardised their technology to allow easy access and bulk downloads, smaller publishers are not all as advanced, and do not grant easy and standardised access for TDM. There is also a lack of technical expertise, and of fora or expert networks where researchers and institutions such as libraries could access know-how.
The debate surrounding TDM in the EU is therefore currently focused on copyright, the subject of an initiative under discussion in the European Parliament. TDM needs a unified and clear legal framework on exceptions, so that researchers can use all the research data to which they have legal access for machine reading and data mining, even with tools they have developed themselves. Beyond the copyright issues, access to data, standards and interoperability need to be addressed at EU level. These are necessary for Europe to be able to compete with Asia and the USA. There is also a need for dialogue between data users and publishers to achieve a balance between profit for publishers and public access to knowledge. Much also remains to be done in educating researchers on the potential of TDM and in developing tools for TDM – areas where libraries could be at the forefront.
Libraries have a public service mission to help the public navigate the digital world and to make information and data publicly available in all forms. The role of the library in finding specific information is being extended to finding trends and connections in information. More specifically, libraries need to play a bridging role, connecting researchers and IT developers so that research questions are brought together with research tools. Libraries need to educate non-specialists, highlighting the potential of TDM. They need to act as proponents of data harmonisation by using open standards. They can serve as information hubs, influencers, content providers and contact points for researchers, and help develop TDM strategies for other institutions. Libraries have always played a leading role in archive digitalisation; now they have to provide machine-readable digital archives. For these reasons, librarians’ jobs will be transformed, with librarians becoming data scientists and machine-reading experts.
The panel concluded that the European Parliament Library certainly has potential in the TDM field, particularly in semantic research on the EP archives and repositories of open data, especially legal texts.