Statistics in Language
STATISTICAL STRUCTURE OF LANGUAGE
Languages can be analysed mathematically. Many aspects of grammar, vocabulary and writing display patterns that can be studied statistically. These statistical properties are largely independent of the individual speaker or writer: linguistic behaviour conforms to statistical expectations. For example, in English a "q" is almost always followed by a "u"; the few exceptions are mostly transliterations, typically from Arabic (e.g. Iraq, Qatar).

LETTER FREQUENCY
Letters of the alphabet show patterns of use in written language. In English the most commonly used letter is "e", followed by "t", "a", "o", "i", "n" and "s". The least used letters are "z", "q", "x" and "j". Some languages written in Latin script omit from their alphabet the letters they do not use, e.g. in Welsh "k", "q", "v", "x" and "z" are not considered part of the alphabet because they are not used in standard orthography.
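
As a rough illustration of how such a ranking can be produced, the sketch below counts letters in a text and reports each letter's share of the total. The function name letter_frequencies and the sample sentence are made up for this example; a real ranking needs a much larger corpus than one sentence.

```python
from collections import Counter


def letter_frequencies(text):
    """Return (letter, share) pairs, most frequent letter first.

    Only alphabetic characters are counted; case is ignored.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    # An empty text gives an empty Counter, so no division happens.
    return [(letter, count / total) for letter, count in counts.most_common()]


# Hypothetical sample text, far too short to give reliable figures.
sample = "The quick brown fox jumps over the lazy dog and then the dog sleeps"
for letter, share in letter_frequencies(sample)[:5]:
    print(f"{letter}: {share:.1%}")
```

Run over a large text, the top of this list would be expected to resemble the ordering given above.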

LETTER CLUSTERS
We can also study the frequency of letter clusters, e.g. pairs and triplets. When gathered from a large corpus of data (e.g. the Bible), these statistics can be used to find probable spelling or typing errors. If a letter is known to follow certain letters commonly and others almost never, then when it is found after an unexpected letter it can be flagged and checked. Often it will be an error, although it may simply be an unusual combination. For example, in English "h" often follows c (ch), g (gh), p (ph), r (rh), s (sh), t (th) and w (wh). Combinations with other consonants are rare. Thus a list of words where "h" follows letters other than c, g, p, r, s, t or w is likely to reveal unusual words and errors.
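
The "h" check described above can be sketched in a few lines. This is only an illustrative sketch: the function names bigram_counts and words_with_unusual_h, the default list of expected letters and the word list are assumptions for this example, not the method of any particular tool.

```python
from collections import Counter


def bigram_counts(words):
    """Count every pair of adjacent characters within each word."""
    counts = Counter()
    for word in words:
        w = word.lower()
        counts.update(a + b for a, b in zip(w, w[1:]))
    return counts


def words_with_unusual_h(words, expected_before_h="cgprstw"):
    """Return the words in which "h" follows a letter outside the expected set."""
    flagged = []
    for word in words:
        w = word.lower()
        if any(
            b == "h" and a.isalpha() and a not in expected_before_h
            for a, b in zip(w, w[1:])
        ):
            flagged.append(word)
    return flagged


# Hypothetical word list containing one typing error ("teh").
words = ["the", "which", "teh", "graph", "abhor"]
print(bigram_counts(words).most_common(3))
print(words_with_unusual_h(words))
```

The flagged list is a set of candidates for a human to review, not a list of confirmed errors: legitimate words such as "abhor" appear alongside genuine typos such as "teh".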

WORD USE
In many languages there is a large vocabulary but only a small number of words that are used very frequently. Typically, if you take any large text (e.g. the Bible), the 15 most frequently used words will account for about 25% of the text, the 50 most frequently used words for about 45%, and the 100 most frequently used words for about 60%.
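
One way to check figures like these against a particular text is to measure how much of the running text the N most frequent words cover. The sketch below assumes a simple tokenizer that only recognises unaccented a-z words (with internal apostrophes); the name top_word_coverage is invented for this example, and other scripts would need a different pattern.

```python
import re
from collections import Counter


def top_word_coverage(text, n):
    """Return the fraction of all word tokens covered by the n most frequent words."""
    # Assumption: words are runs of a-z, optionally joined by apostrophes.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    top_total = sum(count for _, count in counts.most_common(n))
    return top_total / len(words)


# Hypothetical usage; substitute the full text of a large document.
text = "the cat and the dog and the bird"
for n in (1, 2, 5):
    print(n, top_word_coverage(text, n))
```

Calling the function with n set to 15, 50 and 100 on a large text gives the three percentages to compare with the typical values quoted above.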

WORD LENGTH
In most languages there is a relationship between word length and frequency: shorter words are used more frequently than longer words. In English the most commonly used words include "the", "and", "I", "to", "in", "of", "a", "that", "on", "for", "was", "you" and "it". These are all short and monosyllabic. English uses a lot of short prepositions, but the relationship between word length and frequency can also be seen in languages with longer, polysyllabic words, including agglutinative languages.
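
A simple way to see this relationship in a given text is to compare the average length of its most frequent words with the average length of all its distinct words. The sketch below is one possible measure under that assumption; the function name mean_length_top_vs_vocabulary and the a-z tokenizer are choices made for this example.

```python
import re
from collections import Counter


def mean_length_top_vs_vocabulary(text, n):
    """Return (mean length of the n most frequent words,
    mean length of all distinct words) for the given text."""
    # Assumption: words are runs of a-z, optionally joined by apostrophes.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(words)
    if not counts:
        return (0.0, 0.0)
    top_words = [word for word, _ in counts.most_common(n)]
    mean_top = sum(len(word) for word in top_words) / len(top_words)
    mean_all = sum(len(word) for word in counts) / len(counts)
    return (mean_top, mean_all)


# Hypothetical usage; a large text gives a more meaningful comparison.
print(mean_length_top_vs_vocabulary("a a a a bb bb ccc dddddd", 1))
```

If shorter words really are used more often, the first number should be noticeably smaller than the second when the function is run on a large text.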

SOFTWARE
Paratext has checks which use these principles of linguistic statistical analysis to generate text analyses that can then be used to find probable errors.

