Document Type

Dataset

Rights

Available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence

Grant Number

13/RC/2106

Disciplines

Computer Sciences, Information Science, Linguistics

Abstract

This archive contains a collection of pseudo-corpora. These are text files that contain pseudo-sentences generated artificially from a random walk over the English WordNet taxonomy.

DOI

https://doi.org/10.21427/he55-6481

Methodology

The random walk algorithm produces a pseudo-sentence by picking a random node (synset) in WordNet, choosing a random word from that synset, then moving to a randomly chosen connected node and repeating the process. At every step the walk stops with a probability of 15%; it also stops when the current node has no connected node left to move to. When the walk stops, the words collected so far form one pseudo-sentence, and the process starts again for the next sentence.
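The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the corpora: the toy graph, its node names, and the function name `random_walk_sentence` are hypothetical stand-ins for WordNet, and the sketch assumes that only downward edges are followed.

```python
import random

# Hypothetical toy taxonomy standing in for WordNet. Each node (synset) has
# a list of words and a list of connected nodes (here: downward edges only).
TOY_GRAPH = {
    "notation": {"words": ["notation"], "next": ["musical_notation"]},
    "musical_notation": {"words": ["musical notation"], "next": ["measure", "tonality"]},
    "measure": {"words": ["measure", "bar"], "next": []},
    "tonality": {"words": ["tonality", "key"], "next": ["minor_mode"]},
    "minor_mode": {"words": ["minor mode", "minor scale"], "next": []},
}

def random_walk_sentence(graph, stop_prob=0.15, rng=random):
    """Build one pseudo-sentence from a random walk over `graph`."""
    node = rng.choice(sorted(graph))          # random starting node
    words = []
    while True:
        # Emit one randomly chosen word of the current node.
        words.append(rng.choice(graph[node]["words"]))
        connected = graph[node]["next"]
        # Stop at a dead end, or with probability `stop_prob` at each step.
        if not connected or rng.random() < stop_prob:
            break
        node = rng.choice(connected)          # move to a random connected node
    return " ".join(words)

if __name__ == "__main__":
    rng = random.Random(42)                   # fixed seed for repeatable output
    for _ in range(3):
        print(random_walk_sentence(TOY_GRAPH, rng=rng))
```

Passing a seeded `random.Random` instance keeps the output repeatable. Whether the original generator allows a walk to revisit nodes is not stated here, so the sketch does not model that detail.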

Each line in the generated file represents one pseudo-sentence, where words are delimited by spaces.
Example sentences:
measure musical notation tonality minor mode
Dutch-processed cocoa powder chocolate milk

The corpus files differ in size and in the parameters used to generate them.
The parameters are:
- size: the number of sentences (lines) in the corpus
- direction: the direction in which the random walk was allowed to move over WordNet while generating sentences (up, down, or both)
- minimal sentence length: the length of the shortest sentence in the corpus (in number of words)
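The size and minimal sentence length of an extracted corpus can be checked with a short script. The sketch below is an illustration only: the function name `corpus_stats` is hypothetical, and it assumes that sentence length is counted in whitespace-separated tokens, so a multi-word entry such as "musical notation" counts as two tokens. The counting rule used when the corpora were generated may differ.

```python
def corpus_stats(lines):
    """Return the number of non-empty lines and the shortest line length.

    Length is measured in whitespace-separated tokens (an assumption).
    """
    lengths = [len(line.split()) for line in lines if line.strip()]
    return {
        "size": len(lengths),
        "min_length": min(lengths) if lengths else 0,
    }

if __name__ == "__main__":
    # The two example sentences from the description above.
    sample = [
        "measure musical notation tonality minor mode",
        "Dutch-processed cocoa powder chocolate milk",
    ]
    print(corpus_stats(sample))
```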

README.md (3 kB)
Readme file containing a detailed description of the resource

Language

eng

File Format

.txt

Viewing Instructions

The corpora are compressed into a gzip archive. To view them, first extract the archive with a standard archive manager (e.g., 7-Zip or WinRAR). Once extracted, the .txt files can be opened in any plain-text editor, such as Notepad.
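The corpora can also be read programmatically without extracting them first. The sketch below uses Python's standard `gzip` module and assumes a single gzip-compressed text file with one pseudo-sentence per line. The function name `iter_sentences` is hypothetical, and if the download turns out to be a tar archive, the standard `tarfile` module would be needed instead.

```python
import gzip

def iter_sentences(path):
    """Yield each non-empty pseudo-sentence from a gzip-compressed text file."""
    # "rt" opens the compressed stream in text mode and decodes it as UTF-8.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            sentence = line.rstrip("\r\n")
            if sentence:
                yield sentence
```

For example, `for sentence in iter_sentences("corpus.txt.gz"): print(sentence)` would print one pseudo-sentence per line, where `corpus.txt.gz` is a placeholder file name.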

Funder

The creation of these resources was supported by the ADAPT Centre for Digital Content Technology (https://www.adaptcentre.ie), which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
