Text and data mining (TDM) is a technique for analysing large collections of text/data. It enables the identification of relationships, patterns and/or trends that otherwise cannot be readily detected. The diverse holdings of the ETH Library can be used for TDM under certain conditions, but it is important to take note of the instructions below.
Legal aspects
On 1 April 2020, the Federal Act of 9 October 1992 on Copyright and Neighbouring Rights (CopA, Bundesgesetz vom 9. Oktober 1992 über das Urheberrecht und verwandte Schutzrechte, URG) was amended and a new article (24d) introduced. This article allows the reproduction of works for TDM analysis (provided that access is lawful and complies with technical requirements) and the retention of the TDM corpus for archival and backup purposes. However, this right does not extend to the use of the copies in any other context (e.g., the distribution and publication of the works used).
For resources accessed through the library (institutional subscriptions), licensing agreements with the respective providers take precedence over the provisions of CopA. Negotiated licensing agreements may explicitly allow or prohibit TDM, or they may require a contract addendum or even a specific agreement between researcher and publisher. We strive to find solutions with our providers that allow TDM with the licensed resources as freely as possible.
Unauthorised TDM, especially by crawlers, may violate the license terms agreed upon between ETH Zurich and the providers, and may result in a loss of access for the entire university.
Permitted use
The resources licensed by the ETH Library are available exclusively to members of ETH Zurich for non-commercial scientific research.
Please contact us if you are planning a TDM project based on media licensed by the ETH Library. We will be happy to help handle the necessary clarifications with the provider and assist with the procurement of large amounts of data.
We recommend that you allow sufficient time for clarifications and data acquisition in your project schedule.
Licensed corpora, tools, APIs
Below you will find a selection of licensed scientific resources that can be used via an application programming interface (API) or designated evaluation tool, and corpora that are available specifically for TDM.
Digital Science
Dimensions Analytics API: The Dimensions Analytics API allows you to perform analytics on the Dimensions Analytics database, which contains metadata on publications, patents, records, clinical trials, and policy documents.
A limited number of licenses are available for members of ETH Zurich. Please contact eressourcen@library.ethz.ch.
Elsevier
- ScienceDirect, Scopus API: Elsevier allows researchers to search subscribed content on ScienceDirect for non-commercial purposes via the API. After registration, researchers receive an API key.
- Scopus data: The ETH Library has acquired the Scopus raw data. Members of ETH Zurich can be given access to these data upon request. Please contact eressourcen@library.ethz.ch.
Gale Historical Newspapers
The metadata and content of the following collections are available in XML format from Gale:
- The Times Digital Archive, 1785–2019*
- The Economist Historical Archive, 1843–2020*
- Nineteenth Century Collections Online (NCCO):
  - Science, Technology and Medicine, Part I (1780–1925). Interdisciplinary collection of digitised primary sources (journals, books, manuscripts) on the development of science in the 19th century. Title list
  - Science, Technology and Medicine, Part II. Extension of Science, Technology and Medicine, Part I. Title list
  - Mapping the World: Maps and Travel Literature. Collection of digitised primary sources (journals, books, manuscripts) on travel and discovery in the 19th century. Title list
  - Photography: The World Through the Lens. Collection of digitised primary sources (journals, books, manuscripts) on photography in the 19th century. Title list
- Eighteenth Century Collections Online (ECCO) Part II: New Edition: Medicine, Science and Technology. Interdisciplinary collection of digitised 18th-century primary sources in the fields of medicine, science and technology. Title list. ECCO and other 18th-century corpora are also freely available via the Text Creation Partnership.
For more information, or if you would like to use the Gale archives, contact eressourcen@library.ethz.ch.
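As a rough illustration of working with XML deliveries such as these, the following Python sketch pulls article identifiers, titles and body text out of a single XML document using the standard library. The element and attribute names (`issue`, `article`, `id`, `title`, `text`) and the sample content are hypothetical and chosen only for this example; they do not reflect Gale's actual schema, which you should check in the documentation supplied with the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record. The real Gale schema will differ.
SAMPLE = """<issue>
  <article id="a1">
    <title>Shipping News</title>
    <text>The steamer arrived at noon.</text>
  </article>
  <article id="a2">
    <title>Markets</title>
    <text>Wheat prices rose sharply.</text>
  </article>
</issue>"""


def extract_articles(xml_string):
    """Return a list of (id, title, text) tuples from one XML document."""
    root = ET.fromstring(xml_string)
    articles = []
    # iter() walks the whole tree, so nesting depth does not matter.
    for node in root.iter("article"):
        articles.append((
            node.get("id"),
            (node.findtext("title") or "").strip(),
            (node.findtext("text") or "").strip(),
        ))
    return articles


articles = extract_articles(SAMPLE)
for article_id, title, text in articles:
    print(article_id, title, len(text.split()), "words")
```

For large deliveries, a streaming parser such as `xml.etree.ElementTree.iterparse` may be preferable to loading each file into memory at once.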
Linguistic Data Consortium (LDC)
The LDC collects language and text corpora for linguistic research and development purposes and develops tools for their processing. Selected corpora are available to members of ETH Zurich. Register with your ETHZ email address and select ETH Zurich as your organisation.
Newspapers: Factiva, NexisUni
Factiva and NexisUni (daily newspaper databases) allow use for TDM, but only if the required documents are downloaded manually.
Both databases offer APIs for querying large amounts of data, but the ETH Library does not provide API access because of usage restrictions and high, volume-dependent prices. If you have funding available and would like to consider this option, we can assist you in obtaining the necessary information and with the licensing process. Contact us via eressourcen@library.ethz.ch.
ProQuest TDM Studio
The following data can be analysed and visualised with TDM Studio:
- The Wall Street Journal (1889–2002)
- Materials Science Collection, Materials Science Database, Engineering Collection, Engineering Database, Engineering Index
- Various freely accessible resources
To use the visualisation tool, please register with your ETH Zurich email address. If you have any problems, please contact eressourcen@library.ethz.ch.
For advanced analyses, a workbench is also available for researchers who wish to program with R or Python in Jupyter Notebook. If you are interested, please contact eressourcen@library.ethz.ch.
Swissdox@LiRI
Swissdox@LiRI: The ETH Library supports the cooperation between Swissdox and the Linguistic Research Infrastructure (LiRI) of the University of Zurich. A text corpus is available that consists of around 29 million press articles from print and online media, as well as transcripts and subtitle holdings from radio and TV broadcasts. It covers several decades and is updated daily with 5,000 to 6,000 new press articles, primarily from the German- and French-speaking parts of Switzerland. In addition to classical descriptive, inferential, exploratory or context-based data analysis, Swissdox@LiRI is also suitable as raw material for big data analyses and for training algorithms or neural networks.
Web of Science
- Web of Science Starter API (free plan): Query predefined metadata fields, limited to 50 requests/day, 50,000 documents/year. Registration on the Clarivate Developer Portal is required.
- Web of Science data: The ETH Library has acquired the Web of Science raw data. Members of ETH Zurich may be given access to this data upon request. Please contact eressourcen@library.ethz.ch.
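When scripting against a quota such as the free plan's 50 requests per day and 50,000 documents per year, it can help to track usage locally before sending a request. The sketch below is one possible approach in Python; the class name `RequestBudget` and its methods are hypothetical, and it assumes that the limits reset per calendar day and calendar year, which may not match how the provider actually counts. It makes no API calls and keeps its counters in memory only.

```python
from datetime import date


class RequestBudget:
    """Track requests and documents against a daily and a yearly cap."""

    def __init__(self, max_requests_per_day=50, max_documents_per_year=50_000):
        self.max_requests_per_day = max_requests_per_day
        self.max_documents_per_year = max_documents_per_year
        self._day = None
        self._year = None
        self._requests_today = 0
        self._documents_this_year = 0

    def _roll_over(self, today):
        # Assumption: counters reset when the calendar day or year changes.
        if today != self._day:
            self._day = today
            self._requests_today = 0
        if today.year != self._year:
            self._year = today.year
            self._documents_this_year = 0

    def can_request(self, expected_documents=0, today=None):
        """Return True if one more request would stay within both caps."""
        today = today or date.today()
        self._roll_over(today)
        within_daily = self._requests_today < self.max_requests_per_day
        within_yearly = (
            self._documents_this_year + expected_documents
            <= self.max_documents_per_year
        )
        return within_daily and within_yearly

    def record(self, documents_returned, today=None):
        """Register one completed request and the documents it returned."""
        today = today or date.today()
        self._roll_over(today)
        self._requests_today += 1
        self._documents_this_year += documents_returned


budget = RequestBudget()
if budget.can_request(expected_documents=10):
    # ... send the actual request here, then record what came back ...
    budget.record(documents_returned=10)
```

For a project that runs over several days, the counters would need to be saved to disk between runs; that is left out here to keep the sketch short.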
Freely available resources of the ETH Library
The ETH Library offers various APIs for direct access to its own resources. Contact: api@library.ethz.ch
Freely available corpora and tools
The following resources are freely available but may have limitations on download size or retrieval speed. Information can be found on the providers’ websites.
- arXiv
Free access to preprints in physics, mathematics, computer science, statistics, financial mathematics and biology
- BioMed Central
More than 300 open-access journals from BioMed Central, Chemistry Central and SpringerOpen in the fields of biology and medicine
- Chronicling America: Historic American Newspapers
Collection of digitised historic newspapers from the US from 1789 to 1924
- CrossRef TDM Tool
Free, cross-publisher service from CrossRef (including AIP, APA, APS, Elsevier, HighWire Press, Springer, Taylor & Francis, Walter de Gruyter, Wiley) for metadata retrieval. In addition to accessing OA content, some licensed content can also be obtained through this tool.
- Digital Public Library of America
Access to digitised cultural assets from US museums, libraries and archives
- Europeana
Digital library with digitised content on scientific and cultural heritage from more than 2,000 European institutions
- HathiTrust Digital Library
Digitised materials from over 120 academic institutions worldwide
- Internet Archive
Access to millions of open-access books and texts and over 26 years of Internet history with the Wayback Machine; tutorials and API list
- JSTOR
Data for Research: Extensive corpora can be compiled from the JSTOR Archive Collections and freely available content from the JSTOR and Portico services.
Constellate provides a text analysis platform that can be used to download metadata, full texts and N-grams, and to visualise data. In addition, Constellate offers a series of tutorials on using Python and natural language processing (NLP) for the digital humanities. A personal account is required, and access via the ETH Zurich network is needed to include non-free documents.
- The New York Times
Metadata and some full texts from The New York Times, 1851–present
- Public Library of Science (PLOS)
Access to the content of journals published by PLOS, an open-access scientific publisher
- PubMed Central: databases and text-mining tools
Various open-access mining tools to search PubMed Central, an archive of open-access content from the fields of biology and biomedicine
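Because freely available services may limit retrieval speed, a script that downloads many records can space out its requests. The following Python sketch shows one simple way to enforce a minimum interval between calls. The class name `Throttle` and the three-second interval are illustrative assumptions, not values prescribed by any provider listed above; the appropriate interval should be taken from each provider's own documentation.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to keep calls min_interval apart; return the delay."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record the moment at which this call is considered to have happened.
        self._last_call = now + delay
        return delay


# Usage: call wait() before each download. The interval is an assumed value.
throttle = Throttle(min_interval_seconds=3.0)
for record_id in ["example-1", "example-2"]:  # hypothetical identifiers
    throttle.wait()
    # ... fetch the record with the HTTP client of your choice ...
```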
There are a large number of freely available corpora and tools, and the list above is by no means complete. We are also happy to refer to collections of other libraries:
- List of freely available data sources and a collection of tools for data-based research from the University Library of Bern
- Collection of freely available APIs for computational research from the MIT Libraries
- Collection of freely available tools for qualitative data analysis, compiled by the Carnegie Mellon University Library