Corpora: searchable collections of text

Below are two options for obtaining corpora:

Downloading a prepared corpus

Click here to download a zipped folder containing a corpus of 23 articles from the New Phytologist journal (prepared and made available with the kind permission of the Journal).

Making your own corpus for research writing development

For the purposes of developing your research writing (manuscripts, theses, proposals), an effective corpus can be made from around 20 articles from well-regarded journals in your own field of research. We suggest the following steps for making your own corpus.

1. Talk the idea over with colleagues and senior members of your research group. Ask for the help of your supervisor or research leader to identify articles that meet the following criteria:

2. Obtain electronic copies of the articles and convert them to the required format: text only (sentences), saved as .txt files. Suggested steps are as follows.

    If possible, download full-text versions of the articles from sources available to you through your institution's library or similar. Delete everything that is not scientific text – i.e. delete authors' names and institutions; tables, figures and their titles/legends; reference lists and acknowledgements. Then save the remaining text as .txt files.

    If you can only obtain .pdf files, first check that they are not locked – locked files cannot be copied. If the file can be copied, use the copy tools in the program to copy only the text you want, and paste it into a new document, piece by piece. Take care not to copy the headers and footers on the pages. Do not copy authors' names, tables or figures, reference lists or acknowledgements. As you paste, add spaces between the copied text segments where necessary. Once you have copied all the relevant text, save it as a .txt file.

3. Save all the .txt files in a single folder on your computer, for use with a concordancing program to answer your own questions about how specific English words and phrases are used in your own research field.
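For readers comfortable with a little scripting, steps 2 and 3 can be sketched in Python. This is a minimal illustration, not a replacement for a dedicated concordancing program. The function names (clean_copied_text, save_article, concordance), the folder name my_corpus and the sample sentence are all hypothetical, chosen only for this example. The sketch assumes you have already copied the scientific text out of each article by hand, as described above.

```python
import re
import tempfile
from pathlib import Path


def clean_copied_text(raw: str) -> str:
    """Tidy text pasted from a PDF.

    Rejoins words hyphenated across a line break and collapses runs of
    whitespace into single spaces. Caution: a genuine hyphenated compound
    that happens to break at a line end (e.g. "well-" / "regarded") will
    also be joined, so check the result by eye.
    """
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def save_article(text: str, folder: Path, name: str) -> Path:
    """Save one cleaned article as <name>.txt inside the corpus folder."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.txt"
    path.write_text(clean_copied_text(text), encoding="utf-8")
    return path


def concordance(folder: Path, phrase: str, width: int = 40) -> list[str]:
    """Return one keyword-in-context line per match of `phrase`.

    Searches every .txt file in `folder`, ignoring case, and shows up to
    `width` characters of context on each side of the match.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for path in sorted(folder.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            lines.append(f"{path.name}: {left:>{width}}[{match.group(0)}]{right}")
    return lines


if __name__ == "__main__":
    # Demonstration with an invented sentence in a temporary folder.
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp) / "my_corpus"
        save_article("These results sug-\ngest that   growth was limited.",
                     corpus, "article01")
        for line in concordance(corpus, "suggest that"):
            print(line)
```

Searching for a phrase such as "suggest that" in this way lists each occurrence with its surrounding words, which is the kind of question about usage that the corpus is intended to answer.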