Jared Neumann
This is a repository for code and files associated with the Victorian Logic Project (VLP). The VLP aims to computationally model different aspects of the discipline of logic throughout the long nineteenth century. Currently, it focuses on the time span between 1826 and 1860, corresponding to the publications of Richard Whately's Elements of Logic and William Whewell's On the Philosophy of Discovery, respectively.
Aspects of the discipline to be studied include, but are not limited to: citation analysis, the distribution and evolution of topics over time, and semantic change in technical terms.
Since the project is in its early stages of development, the corpus is small. However, it is exhaustive (as far as I know) given its parameters. It comprises only the first editions of monographs explicitly on logic (save for two exceptions, which are treated as a special case: Whewell (1858) and (1860)). In the future, the range of media will be expanded to include journal and encyclopedia articles, and the time period will be extended in both directions.
The original, unprocessed contents of the corpus can be found in corpus_original.zip. It contains 44 text files, each corresponding to a unique monograph, totaling 3,755,518 tokens. They were all taken from the Internet Archive and were digitized by Google. This, together with the nature of nineteenth-century typefaces, means that the texts are fairly messy. A worthwhile task would be to improve the quality of the original corpus. One potential way to do this without improving the scans, OCR, etc., would be to segment the tokens against a dictionary, replacing words within a certain edit-distance threshold and joining adjacent tokens that form a full word when combined. I did not attempt this because I suspected it might introduce as many errors as it corrected, but there is probably a smart way to do it.
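As a rough illustration of that idea (not something used in the project), a dictionary-based cleanup pass might look like the sketch below. The word-list path and similarity cutoff are placeholders, and difflib's similarity ratio stands in for a true edit-distance check:

```python
# Hypothetical sketch of dictionary-based OCR cleanup: merge split fragments
# and correct near-miss tokens. Not part of the project's actual pipeline,
# and not optimized (matching against a full word list per token is slow).
from difflib import get_close_matches

def load_wordlist(path="/usr/share/dict/words"):  # any plain word list will do
    with open(path) as f:
        return {w.strip().lower() for w in f if w.strip()}

def repair_tokens(tokens, wordlist, cutoff=0.85):
    repaired, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in wordlist:
            repaired.append(tok)
        elif i + 1 < len(tokens) and tok + tokens[i + 1] in wordlist:
            # join fragments like "reas" + "oning" -> "reasoning"
            repaired.append(tok + tokens[i + 1])
            i += 1
        else:
            # fall back to the closest dictionary word above the cutoff, if any
            match = get_close_matches(tok, wordlist, n=1, cutoff=cutoff)
            repaired.append(match[0] if match else tok)
        i += 1
    return repaired
```

In practice this would need a period-appropriate word list and some care not to "correct" legitimate nineteenth-century spellings, which is exactly the risk noted above.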
The corpus was processed to remove most non-alphabetic characters and clean up unnecessary whitespace, etc., as described in the included bash script:
```sh
tr -d '[:punct:]' | tr -d '[:digit:]' | tr -d '[«»@#$%^*}{|£¢§]' | tr '[:upper:]' '[:lower:]' | tr '[\n\r]' ' ' | tr -s ' '
```
Punctuation was removed so that the corpus works better with the intended tools, which do not rely on annotations of any kind.
Then, a customized list of stopwords (based on the Mallet list for English) was applied using the included Python script. There is no doubt that this list could be improved, but it is already an improvement over the list included with Python's NLTK. The processed corpus, with the text cleaned and stopwords removed, can be found in corpus_processed.zip. It contains a total of 1,492,672 tokens. Here is a list of the top 15 tokens in the processed corpus, with frequencies per 10,000 words (a rough sketch of the filtering step follows the table):
Rank | Word | Freq. (per 10,000) |
---|---|---|
1 | logic | 45.00 |
2 | general | 38.70 |
3 | science | 37.89 |
4 | nature | 34.21 |
5 | form | 30.98 |
6 | term | 30.82 |
7 | man | 30.76 |
8 | subject | 30.65 |
9 | knowledge | 29.97 |
10 | true | 29.85 |
11 | mind | 29.41 |
12 | proposition | 27.68 |
13 | terms | 27.62 |
14 | reasoning | 27.60 |
15 | laws | 26.91 |
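For illustration only, the stopword-filtering step might look something like the sketch below; the file and folder names are placeholders, and this is not the included script:

```python
# Minimal sketch of the stopword-removal step. File and folder names are
# placeholders, not the paths used by the actual included script.
import glob
import os

with open("stopwords_custom.txt") as f:            # hypothetical customized Mallet-based list
    stopwords = {w.strip() for w in f if w.strip()}

os.makedirs("corpus_processed", exist_ok=True)
for path in glob.glob("corpus_cleaned/*.txt"):     # output of the tr pipeline above
    with open(path) as f:
        tokens = f.read().split()
    kept = [t for t in tokens if t not in stopwords]
    with open(os.path.join("corpus_processed", os.path.basename(path)), "w") as out:
        out.write(" ".join(kept))
```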
Semantic change is the first aspect to be explored. I am primarily interested in changes in the semantics of jargon or technical terms, on the assumption that conceptual changes in the discipline correlate with changes in its taxonomy.
There are many options for modeling semantic change available right now; several notable examples are surveyed in Tang (2018). For my purposes, I chose the SCAN model described in Frermann and Lapata (2016). SCAN is a Bayesian model of diachronic meaning change which its creators experimentally showed to perform on par with other systems available at the time of publication. It also offers a number of advantages for use by historians: it models word senses as probability distributions over words, and it can capture changes in the number of senses, changes in their relative prominence, and subtle changes within individual senses. SCAN models semantic change as 'smooth', based on time-sliced distributions over the vocabulary. Unfortunately, the current corpus is small and not every time slice is adequately represented, but that will change as the project develops, and there are still interesting things to learn with SCAN. Additionally, the source code is user-friendly and can be found in Lea Frermann's repository for it here.
SCAN requires three inputs from the user: (1) a target word list for words to be modeled, (2) a corpus of concordances for those words, and (3) a parameters file which gives the model the file paths and sets its variables.
Any number of target words could be chosen. Since I am interested in disciplinary change, I chose some prominent words that are integral to the Victorian taxonomy of logic, e.g., logic, reasoning, induction, syllogism, and inference. I would also have included deduction, but it is anomalously under-represented in the corpus, which suggests that something about the word's typesetting interfered with the OCR. Here are the ranks of these terms in the processed corpus:
Rank | Word | Freq. (per 10,000) |
---|---|---|
1 | logic | 45.00 |
14 | reasoning | 27.60 |
28 | syllogism | 20.90 |
53 | induction | 16.06 |
128 | inference | 9.29 |
... | ... | ... |
653 | deduction | 2.65 |
It is an interesting question whether there is an obvious distinction between these and other terms. One possible criterion is that the text gives an explicit definition of the term, but not all of the above satisfy it: especially in the case of reasoning and inference, which nonetheless seem to be jargon terms, an explicit definition is not always, or even typically, present. Another potential criterion might be the frequency of the terms relative to a more general, non-disciplinary corpus. I suspect this would yield many false positives, though, since even non-discipline-specific language can be more highly represented in the discipline without being strictly jargon, as the word frequencies listed in the previous section suggest.
Another question of interest is whether lemmatizing or stemming the terms (e.g., logic>al|ally, syllogi>sm|ze|stic|stically) gives better or worse results; I am inclined to think that jargon terms are better left distinct, as I have left them in this part of the project (for now).
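For a quick sense of what stemming would do to these variant forms, here is a small illustration using NLTK's Porter stemmer (the project itself leaves the terms unstemmed):

```python
# Illustrative only: show what a stemmer does to variant forms of a few
# technical terms. The project currently keeps the surface forms distinct.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["logic", "logical", "logically",
             "syllogism", "syllogize", "syllogistic", "syllogistically"]:
    print(f"{word} -> {stemmer.stem(word)}")
```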
SCAN requires a concordance file with instances of the target word. The context window is set by the user. To generate concordances, I used Laurence Anthony's AntConc (Version 3.5.8) and set the context to 5L and 5R. This generates a set of concordances in the following format (with an example):
Hit | Keyword in Context (KWIC) with 2L/2R | Origin File |
---|---|---|
... | ... | ... |
4873 | knowledge science logic propositions judgments | 1850_Field_Analogy.txt |
... | ... | ... |
These original concordances are contained in the conc_original.zip file. However, SCAN requires the following format, where the date of publication comes first, then a tab character, then the concordance:
Year | Delimiter | Keyword in Context (KWIC) with 2L/2R |
---|---|---|
... | ... | ... |
1850 | \t | knowledge science logic propositions judgments |
... | ... | ... |
The translation was done with a simple Python script contained in antconc_to_scan.py. Furthermore, as another methodological issue, it may be beneficial to avoid over-representing concordances from texts that are simply longer than others; one way to do this is to cap the number of concordances drawn from any given text. I did not do that for now, as I thought the issue might resolve itself given a larger corpus. The new concordances are found in the input folder.
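For readers without the repository at hand, a conversion of this kind might look roughly like the sketch below. This is not the actual antconc_to_scan.py; it assumes tab-separated AntConc rows of the form shown above and that each origin file name begins with the publication year (e.g., 1850_Field_Analogy.txt):

```python
# Rough sketch of an AntConc-to-SCAN conversion, not the actual antconc_to_scan.py.
# Assumes tab-separated rows of (hit number, KWIC text, origin file) and that the
# origin file name starts with the publication year, e.g. 1850_Field_Analogy.txt.
import csv
import sys

def convert(antconc_path, scan_path):
    with open(antconc_path, newline="") as src, open(scan_path, "w") as dst:
        for row in csv.reader(src, delimiter="\t"):
            if len(row) < 3:
                continue                      # skip malformed or header lines
            kwic, origin = row[1], row[2]
            year = origin.split("_")[0]       # "1850_Field_Analogy.txt" -> "1850"
            dst.write(f"{year}\t{kwic}\n")

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])
```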
For more on running the SCAN model and interpreting the results, as well as for motivation and context for the project, see the included report, Semantic Change in the Technical Terms of Victorian Logic.
Forthcoming.
Forthcoming.