Creating a Pharo application for macOS

Packaging your code as an application matters: it is the last step, the one that turns a bunch of packages into a complete application.

Thanks to the beautiful work in the PharoApplicationGenerator package, developed by Pablo Tesone, it is now possible to quickly generate an application and a .dmg installer from your Metacello baseline, so you can ship an executable for your application.

Now you can use a one-line command to build an installer for your own Pharo macOS application.

If you just want the sample command, this is it:

wget -O- get.pharo.org/64/130+vm | bash; ./pharo Pharo.image metacello install github://hernanmd/the-note-taker/src BaselineOfTheNoteTaker --groups=Release; ./pharo Pharo.image eval "NTCommandLineHandler generateApplication"; chmod 755 build/build.sh; cd build; ./build.sh; open NoteTaker.app

Let’s dive into each step of this one-liner. We will create an installer for a sample application for the Note Taker project:

Downloading Pharo

wget -O- get.pharo.org/64/130+vm | bash

This part downloads Pharo 13.0 (64-bit) and its Virtual Machine (VM). The output is piped directly to bash for execution.

Installing the Project

./pharo Pharo.image metacello install github://hernanmd/the-note-taker/src BaselineOfTheNoteTaker --groups=Release

We’re using Metacello (Pharo’s package manager) to install “The Note Taker” project from its GitHub repository, specifically the Release group. The Release group should include a package that is loaded only for the executable generation process. That package contains the NTCommandLineHandler class, which is used to generate a build shell script.

Generating the Application

./pharo Pharo.image eval "NTCommandLineHandler generateApplication"

This step invokes a custom command line handler to generate the application. It’s worth noting that NTCommandLineHandler is a subclass of CommandLineHandler, and CommandLineHandler is already provided in vanilla Pharo images. You should implement a very simple generateApplication class method in your own “NTCommandLineHandler”. Check an example implementation for macOS in the project’s GitHub repository (https://github.com/hernanmd/the-note-taker/blob/master/src/TheNoteTaker-Release/NTCommandLineHandler.class.st).

Here is the gist of the command line handler:

NTCommandLineHandler class >> generateApplication 

  (self environment at: #AppGeneratorOSXGenerator) new
    properties: {
      #AppName -> 'NoteTaker' .
      #InfoString -> 'A note taking application written in Pharo' .
      #BundleIdentifier -> 'org.pharo.notetaker' .
      #ShortVersion -> '1.0.0' .
      #DisplayName -> 'Note Taker' .
      #CommandLineHandler -> 'noteTaker' .
      #CompanyName -> 'INRIA' .
      #VMType -> 'Spur' } asDictionary;
    outputDirectory: FileLocator workingDirectory / 'build';
    generate.

NTCommandLineHandler >> activate

   AppGeneratorSupport errorHandler: AppGeneratorSDLMessageErrorHandler new.
   OSWindowDriver current startUp: true.

   OSPlatform current isMacOSX
    ifTrue: [ | main |
        main := CocoaMenu new.
        main    title: 'MainMenu'; "Only informative"
            addSubmenu: 'Application' with: [ :m |
                m
                    addItemWithTitle: 'Quit'
                    action: [ Smalltalk snapshot: false andQuit: true ]
                    shortcut: 'q';

                    addItemWithTitle: 'Restart'
                    action: [ NTSpApplication new start ]
                    shortcut: 's' ];
            addSubmenu: 'Help' with: [ :m |
                m
                    addItemWithTitle: 'Show Help'
                    action: [ self inform: 'Help' ]
                    shortcut: '' ].
        main setAsMainMenu.
        NTSpApplication new startFullScreen ]
    ifFalse: [ self inform: 'Not OSX' ].
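
The #CommandLineHandler property above names the handler that the generated application activates at startup. Here is a minimal sketch of how that name could be declared, assuming NTCommandLineHandler follows the usual CommandLineHandler convention of answering its name from a class-side commandName method. The description text is made up for illustration, and the project's actual methods may differ:

```smalltalk
NTCommandLineHandler class >> commandName
	"Assumption: this must match the #CommandLineHandler property given to the generator"
	^ 'noteTaker'

NTCommandLineHandler class >> description
	"Hypothetical help text shown when listing the available handlers"
	^ 'Open the Note Taker application'
```

With such a declaration in place, the handler should also be reachable by hand with ./pharo Pharo.image noteTaker, which is a convenient way to try the activate method before building the bundle.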

Preparing the Build Script

chmod 755 build/build.sh

This command makes the build script executable.

Building the Application

cd build; ./build.sh

We change to the build directory and execute the build script. It’s important to note that the build.sh script requires the create-dmg package, which you can install via Homebrew (brew install create-dmg). This dependency is crucial for creating the macOS disk image (DMG) file.

Opening the Application

open NoteTaker.app

Finally, we will open the newly created NoteTaker application.

Conclusion

This command line showcases the power and flexibility of Pharo’s development ecosystem. It demonstrates how we can automate the entire process from downloading Pharo to building and launching a desktop application, all in one command. In future posts, we will explore how to expand this process for GNU/Linux and Windows systems.

Next Pharo Sprints @ Lille

For those interested, we are organizing monthly Pharo sprints in Lille, at the Inria Research Center, every last Friday of the month! Tell us if you would like to come!

What’s a Pharo Sprint?

Imagine coding in something fun. And having cookies. Pair-programming with someone who does something completely different from you in their everyday job. And candies. Some coffee too.

Well, that’s the best I can describe a Pharo sprint :).

Sprints are usually organised internally in the Evref Inria team @ Lille, but we talked about opening them to anybody who would like to join! We can connect on Discord and do some remote sessions.

The goal? Could be fixing bugs. Could be improving some part of Pharo. Could be just learning something new. You choose.

When?

These are the upcoming sprint dates:

  • 29/09
  • 27/10
  • 24/11
  • 22/12

If you want to come, find the address below, and please keep us updated on Discord or by mail so we can organise it better!

INRIA – Avenue Halley,
Parc Scientifique de la Haute Borne,
Bat.A, Park Plaza
France

G

Pharo @ ESUG’23

The yearly European Smalltalk User Group conference happened last week in Lyon, France.
Unofficial numbers say it had more than 80 participants, and we met lots of people we hadn’t seen for years!
We are slowly recovering from COVID 🙂

Of course, you can find the full conference info and schedule here: https://esug.github.io/2023-Conference/agenda/agenda.html.
However, the goal of this post is to be short and give some highlights on the Pharo activity @ESUG, to keep you updated and steer some discussions.

The slides of all talks are already in the ESUG Archive, and videos will be online soon. I’ll update this post when that happens!

Besides the typical news around Pharo and the announcement of the Pharo 11 release (from earlier this year), we have seen a lot of work on tooling, AI, databases, web frameworks, infrastructure, and people using Pharo to build different kinds of (maybe even crazy in some cases) applications.

Infrastructure

Pharo 11 has improvements in the bytecode compiler, block closure optimizations, and Ephemeron support.
Git integration has been stable and working for years, and we keep up with the latest libgit releases.

Packaging Pharo desktop applications is being improved a lot.
This even opens the door to build fancy desktop applications using the new versions of Bloc and Roassal!

Profiling and tooling

In the last year, Pharo has seen lots of improvements in the debugger infrastructure and the refactoring engine.
There are also interesting advances in program instrumentation, used for example to build memory consumption profilers.
There is also a fuzzing framework under development to perform automatic detection of bugs!

AI and BioInformatics

Pharo keeps moving on the AI front with performance improvements, library extensions and bindings to external libraries!
Lapack integration, better dataframes, fancy algorithms and numerical libraries.
All these power toolkits like Bio Smalltalk, which has lots of tools to do exploratory analysis on DNA and other biological analyses.
Imagine all these with the power of live debugging and interactive visualizations!

Databases

Besides the fancy and stable GemStone object databases that can interact with Pharo, new things are appearing in this area.
Soil is a new object database written in Pharo, with fresh views on old concepts.
It’s even being used in production and apparently we can build sexy query systems on top!

Web Frameworks

The Pharo community keeps delivering high quality frameworks to help you build web applications.
PharoJS has been updated for the latest Pharo and the latest ECMAScript.
Also, the guys from Yesplan have integrated the Hotwire framework from Ruby-on-Rails into Seaside!

Applications in Pharo!

Of course, not only is Pharo changing, but it also allows people to build amazing applications with it.
We have seen amazing demos of custom desktop applications by the people from Thales, native UIs and command line tools by the people from SCHMIDT, and super fun simulators using Cormas.
Also, a big shout-out to the amazing Glamorous Toolkit, built on top of Pharo, which announced its v1.0 release during the conf!

Community and fun stuff

We also had slots to discuss the Google Summer of Code results with Pharo, where students did an amazing job on the Pharo IDE, Roassal, the charting libraries, the virtual machine…
Also, we had some interesting and fun community input on the current IDE thanks to the Yesplan people, who ran a survey and presented the results.
Domenico (Lucrezio), the crazy DJ that makes live coding music, showed his work on synthesizing music from Pharo, now with integrated visualizations!
Finally, there was an announcement on a new MOOC on Advanced Object-Oriented Design.

Conclusion

Of course, there were also lots of nice projects presented in the “Show us your projects” session, in the technology awards, and in the research workshop.
Put on top the coffee break discussions, the social events, and the nights out!

ESUG’23 was a success, and it was lots of fun.
Let’s hope ESUG’24 will be as fun as this one too!

See you around!
Guille

Writing benchmarks with SMark

Introduction

SMark is a benchmarking framework for Pharo written originally by Stefan Marr. It serves as an essential tool for Pharo developers, enabling them to benchmark their code, identify performance bottlenecks, and make informed decisions to enhance their software’s efficiency.

It is composed of four main components: a Suite (which represents the benchmarks), a Runner (responsible for executing benchmarks), a Reporter (which knows how and what data to report) and a Timer.

SMark behaves similarly to the SUnit testing framework, with setUp and tearDown methods. The runner can do whatever it needs to reach warmup; for instance, the SMarkCogRunner will make sure that all code is compiled before starting to measure. Both warmup and setUp/tearDown methods can be specified per benchmark.

Installation

You can install SMark in Pharo by evaluating the following expression:

Metacello new
    baseline: 'SMark';
    repository: 'github://guillep/SMark';
    load.

How to implement a benchmark

Create a subclass of SMarkSuite. Inheritance from SMarkSuite is not a requirement but a convenient way to use the multiple hooks which the Suite offers.

SMarkSuite subclass: #MyBenchmarkSuite
	instanceVariableNames: ''
	classVariableNames: ''
	package: 'MyProject'

Add a method named #bench<MyBenchmark>, like: #benchFactorial. This will be your real benchmark method. For example, a simple case repeating the benchmarking task would look like:

benchMyBenchmark

	self problemSize timesRepeat: [ … ]

and a benchmark which uses an index variable on each iteration, and the problem size:

benchMyBenchmark

	| i |
	i := self problemSize.
	[ i > 0 ] whileTrue: [
		" Some task using i "
		i := i - 1 ].
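
As a concrete instance of the first pattern, here is a minimal sketch of the benchFactorial benchmark named above. The workload (100 factorial) is an arbitrary choice made for illustration; only problemSize comes from the suite:

```smalltalk
MyBenchmarkSuite >> benchFactorial
	"Repeat the measured task problemSize times.
	100 factorial is an arbitrary workload chosen for this example."
	self problemSize timesRepeat: [ 100 factorial ]
```

Keeping the measured task inside the loop means the runtime scales with the problem size, so the same method can be reused for quick checks and for longer runs.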

Hooks

There are multiple optional hooks to implement in your benchmark suite:

Define the number of iterations on the class side:

MyBenchmarkSuite class>>defaultNumberOfIterations
	^ 50

Define the number of processes on the class side:

MyBenchmarkSuite class>>defaultNumberOfProcesses
	^ 8

Define the problem size on the class side:

MyBenchmarkSuite class>>defaultProblemSize
	^ 30

You can also override MyBenchmarkSuite>>setUp to set up the necessary environment for a benchmark. If a specific benchmark in your suite needs additional configuration, create a method named MyBenchmarkSuite>>setUpBench<yourBenchmarkName> to set up the custom configuration (for example, if you are benchmarking benchRegexDNA, you can create a setUpBenchRegexDNA method if necessary).

Overriding MyBenchmarkSuite>>processResult:withTimer: will allow you to access the timer after a benchmark execution. And do not forget to override MyBenchmarkSuite>>tearDown to clean up the environment after a benchmark.
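
Putting these hooks together, a suite with one benchmark-specific setup could look like the sketch below. The instance variable input and the benchmark benchSortNumbers are hypothetical names chosen for this example; the hook names follow the conventions described above:

```smalltalk
SMarkSuite subclass: #MyBenchmarkSuite
	instanceVariableNames: 'input'
	classVariableNames: ''
	package: 'MyProject'

MyBenchmarkSuite >> setUpBenchSortNumbers
	"Runs for benchSortNumbers only: prepare unsorted data outside the measured code"
	input := (1 to: self problemSize) asArray shuffled

MyBenchmarkSuite >> benchSortNumbers
	"The measured task: sort a copy, so the prepared input stays unsorted"
	input copy sort

MyBenchmarkSuite >> tearDown
	"Release the data once the benchmark is done"
	super tearDown.
	input := nil
```

Preparing the data in the setup hook keeps the cost of building the input out of the measured runtime.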

How to run benchmarks

To run a benchmark suite directly from Pharo, for example with 100 iterations:

MyBenchmarkSuite run: 100

However, that would only run the benchmark with the default settings in SMark. You can run more complex configurations of your suite using the Harness support, which is a convenience executor around the runner strategies and the reporter, for example:

SMarkHarness run: { 
	'SMarkHarness'. 
	'SMarkLoops.benchIntLoop' . 
	1 . "The number of iterations"
	1 . "The number of processes"
	5   "The problem size"
	}.
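
The same array layout works for your own suite. As a sketch, assuming a MyBenchmarkSuite class with a benchFactorial method exists in your image, this runs that single benchmark for 10 iterations in 1 process with a problem size of 30:

```smalltalk
SMarkHarness run: {
	'SMarkHarness' .
	'MyBenchmarkSuite.benchFactorial' .
	10 . "The number of iterations"
	1 .  "The number of processes"
	30   "The problem size"
	}.
```

The second element follows the SuiteClass.benchmarkSelector form used in the example above.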

Running built-in benchmarks

For instance, SMark ships with built-in benchmarks related to bioinformatics, which are often used to compare programming languages, libraries, and algorithms on large-scale data processing tasks. To run them, evaluate these harness expressions:

SMarkHarness run: { 'SMarkHarness'. 'BenchmarkGameSuite.benchKNucleotide'. 20 . 1 . 2 }.
SMarkHarness run: { 'SMarkHarness'. 'BenchmarkGameSuite.benchFasta'. 25 . 1 . 10 }.
SMarkHarness run: { 'SMarkHarness'. 'BenchmarkGameSuite.benchRegexDNA'. 3 . 1 . 10 }.

The K-Nucleotide benchmark involves counting the occurrences of all k-length substrings (k-mers) in a given DNA sequence. The benchmark will be run 20 times, using 1 process and a problem size of 2.

The FASTA benchmark is composed of 3 sub-benchmarks:

  • The first sub-benchmark writes DNA sequences to a special stream. The nucleotide sequences correspond to a specific repeated DNA sequence called “ALU”. The benchmark also sets up a problem size of 10 to instantiate a custom “repeat” stream: a stream whose “end” is configured to a limit and which automatically restarts its position when this end is reached as a result of receiving #next. This stream is set to a limit of 20 repetitions (2 * problemSize). Finally, the FASTA output is configured with a line length of 60 nucleotide positions (columns) before a new line.
  • The second sub-benchmark is configured with a limit of 30 repetitions (3 * problemSize) and performs additional calculations instead of just writing the sequence. It iterates over another custom repeat stream (a random stream, which uses a naïve linear congruential generator to calculate a random number for each #next message it receives, along with percentages: the cumulative probabilities used to select each nucleotide). It adds and stores the percentage of each ambiguity code (codes used in molecular biology to represent positions in a DNA or protein sequence where the exact nucleotide or amino acid is not known with certainty) in a sequence, to provide information about the composition of the sequence in terms of ambiguous and non-ambiguous positions.
  • The third and last sub-benchmark is configured with a limit of 50 repetitions, but uses preconfigured frequencies instead of performing the ambiguity code additions for each code.

Finally, the benchRegexDNA measures the performance of regular expression-based DNA sequence analysis on a given DNA sequence, including pattern matching and substitution operations for sequences with “degenerate” codes.

Running from command-line

To run the benchmark harness from the CLI:

./pharo -headless Pharo.image --no-default-preferences eval "SMarkHarness run: { 'SMarkHarness'. 'BenchmarkGameSuite.benchFasta'. 1 . 1 . 25000000 }."

And the default output (the console) should look like:

Runner Configuration:
  iterations: 1
  processes: 1
  problem size: 25000000
Report for: BenchmarkGameSuite
Benchmark Fasta
Fasta total: iterations=1 runtime: 14508ms

Some benchmarks are already pre-configured with convenience accessors and default values for the benchmarks game, so they are easier to run. However, by default they do not use the SMark reporting support, so runtime results are not written to the output. For example, the FASTA benchmark, which expects a 10 kb output file, can be run as:

./pharo -headless Pharo.image --no-default-preferences eval "BGFasta fasta"

which outputs the FASTA format (truncated here):

'>ONE Homo sapiens alu
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGA
TCACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACT
AAAAATACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTGTAATCCCAGCTACTCGGGAG
GCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCGGAGGTTGCAGTGAGCCGAGATCGCG
CCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAGGCCGGGCGCGGT
... 

Parametrized benchmarks

If you have complex combinations of benchmarks, for example combining multiple data sources with multiple readers and reading strategies, you can write your own benchmarkParameters method in your suite class to access the built-in support for a parametrized test matrix.

To see example code where this is used, check this implementation of JSON benchmarking using NeoJSON and SMark. You can evaluate the benchmark from Pharo with the following expression:

JSONSMarkSuite new benchReadJSON.

In this case, on a new execution, the benchmarks are executed by expanding the matrix of parameters. One important note is that results are collected after a warm up. This means that the measured executions happen in what is called the “steady state”, a state where the JIT compiler has already produced the code, so the measurements are representative.

You can check that a benchmark has started by looking at the terminal, where the output should reflect your parameters:

Runner Configuration:
  iterations: 1
  processes: 1
  problem size: 1

Customize reporting

If you want to send the benchmark output to another destination instead of the console, you can subclass ReBenchReporter with your own reporter class:

ReBenchReporter subclass: #MyBenchmarkReporter
	instanceVariableNames: ''
	classVariableNames: ''
	package: 'MyProject'

and override the method #reportResult:for:of: where you have access to an output stream and an Array of results:

reportResult: aResultsArray for: aCriterion of: benchmark
	"Report duration"
	aResultsArray size < 2 ifTrue: [
		aResultsArray average printOn: stream.
		stream << 'ms'; << $, ].
	"Report iterations count"	
	aResultsArray size printOn: stream.
	stream << $,.
	"Report image name"
	Smalltalk image version printOn: stream.
	stream << $,.
	"Report VM name"
	Smalltalk vm interpreterClass substrings first printOn: stream.
	stream flush.

In that case, the harness should also be linked to your reporter:

ReBenchHarness subclass: #MyBenchmarkHarness
	instanceVariableNames: ''
	classVariableNames: ''
	package: 'MyProject'

and the defaultReporter method on the class side should answer your reporter class. The method defaultOutputDestination should also be overridden to configure the output:

defaultOutputDestination
	"Answer a <Stream> used to output receiver's contents"

	^ self defaultOutputFilename asFileReference writeStream
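
A sketch of the remaining wiring, assuming the MyBenchmarkHarness and MyBenchmarkReporter classes defined above. The file name is a hypothetical choice, and defaultOutputFilename is assumed to be a helper you define yourself on the same class and side as defaultOutputDestination:

```smalltalk
MyBenchmarkHarness class >> defaultReporter
	"Answer the reporter class used by this harness"
	^ MyBenchmarkReporter

defaultOutputFilename
	"Hypothetical file name; a relative path resolves against the working directory"
	^ 'benchmark-results.csv'
```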

Conclusion

In conclusion, this article has provided an overview of the SMark benchmarking framework for Pharo. Continued exploration and utilization of SMark will lead to the creation of high-performance software in Pharo, ultimately benefiting the broader community. So, go forth, experiment, and strive for excellence in your Pharo projects with the insights gained from this tutorial.

Happy benchmarking!

Connecting the Dots: Class Names in Pharo Comments Revealed

In this article, I share the outcomes of our experiment aimed at identifying class names, that is, references to classes that exist in the Pharo environment, within the comments of all Pharo classes and methods. The experiment was conducted on a Pharo 11 image, and the identification process relied on regular expressions to detect UpperCamelCase tokens. These tokens were expected to conform to Pharo’s naming conventions for class names. Once they were detected, the next step involved identifying the corresponding class definition for each one of them.

Analysis over class comments

I initiated the analysis on the comments within classes. The environment comprises a total of 9816 classes. Out of these, 8928 classes include comments that do not start with ‘Please comment me’, which is the beginning of default Pharo comments. This indicates that developers have replaced the default comment with a custom one in about 91% of the classes.

Using the regular expression '\b[A-Z][a-zA-Z\d]+\b' asRegex in Pharo, I searched all classes that have custom comments to detect UpperCamelCase tokens like OrderedCollection. I found 3371 classes whose comments contain tokens matching this pattern, and a total of 7505 references to classes in comments, with duplicates removed within each comment. In other words, if the same token is found multiple times in the same comment, it is counted once, which also speeds up the lookup of classes in the system.
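
The detection step can be sketched in a few lines. This is a simplified illustration rather than the exact script used for the experiment: it takes a single class comment (OrderedCollection's, chosen arbitrarily), collects the distinct UpperCamelCase tokens, and keeps those that resolve to a class in the image:

```smalltalk
| regex comment tokens classNames |
regex := '\b[A-Z][a-zA-Z\d]+\b' asRegex.
comment := OrderedCollection comment.

"asSet removes duplicates, so a token repeated in the same comment counts once"
tokens := (regex matchesIn: comment) asSet.

"Keep only the tokens bound to a class in the global environment"
classNames := tokens select: [ :each |
	(Smalltalk globals at: each asSymbol ifAbsent: [ nil ]) isClass ].
classNames
```

Looking each token up in the globals is what separates true references from words that merely look like class names.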

With those results, I chose the 20 classes with the largest number of references in their comments and checked the recall and precision ratios. The table below summarizes 5 of them, to avoid overcrowding the blog post with all 20 classes:

Class name                  Recall ratio   Precision ratio
EncoderForSistaV1           1              0.72
IRBytecodeGenerator         1              0.86
ClyNavigationEnvironment    1              0.86
RSLocation                  1              0.25
SettingBrowser              1              0.73

Class names in class comments

On one hand, the recall ratio, also known as the true positive rate, measures the ability to identify all true positive instances. It is high when the number of missed results (false negatives) is low. On the other hand, the precision ratio measures the ability to make accurate positive predictions and avoid including irrelevant matches (false positives). It is high when the number of incorrect matches is low.

In the table provided above, the recall ratio for all classes is 1, indicating that the regular expression (regEx) successfully matched all instances of UpperCamelCase tokens in the comments. The only potential scenario where results might be missed is if class names are misspelled or if they are mentioned without adhering to the specified naming convention. As for the precision ratio, results were not as good as the recall ratio because certain keywords, intended for other purposes, were incorrectly identified as classes, leading to more false positives in the classification. Some of them are used in comments to start a sentence (e.g. Set the variable …) or a code block (e.g. the Smalltalk tag of a ``` Smalltalk ``` block).
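
As a worked example of the two ratios, consider a comment in which the regular expression matches 25 tokens, 18 of which are real class references, and misses none. These counts are hypothetical, chosen only so that the result lands on the 0.72 precision reported for EncoderForSistaV1; the actual counts behind the table are not given here:

```smalltalk
| truePositives falsePositives falseNegatives precision recall |
truePositives := 18.  "matched tokens that are real class references (hypothetical)"
falsePositives := 7.  "matched tokens that are not, e.g. 'Set' starting a sentence (hypothetical)"
falseNegatives := 0.  "class references the regular expression missed (hypothetical)"

precision := truePositives / (truePositives + falsePositives).
recall := truePositives / (truePositives + falseNegatives).

{ precision asFloat . recall asFloat }  "an Array with 0.72 and 1.0"
```

Every false positive adds to the denominator of the precision only, which is why the recall stays at 1 while the precision drops.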

Analysis over methods comments

Following the execution of the search algorithm within the environment, 4,062 classes were found to contain 30,340 commented methods, with a total of 40,422 comments. Upon closer examination, it was determined that 7.5% of these commented methods belong to 368 classes and contain 3,270 references to existing classes in the environment.

With those results, I chose the 20 classes with the largest number of references in their method comments and checked the recall and precision ratios. The table below summarizes 5 of them, to avoid overcrowding the blog post with all 20 classes:

Class name       Commented methods with references   Recall ratio   Precision ratio
RSLineBuilder    12                                  1              0.94
RSNormalizer     7                                   1              0.82
RSLabel          9                                   1              0.86
Color            26                                  1              0.95
LargeInteger     17                                  1              1

Class names in method comments

Boosting performance

Based on the previous research, I discovered that the accuracy of the results could be significantly improved by excluding certain tokens, such as Smalltalk, Set, True, False, Key, etc., which are known to produce false positive results. Additionally, enhancing the search algorithm by excluding code blocks from comments (i.e., everything between an opening ``` Smalltalk fence and its closing ```) would not only speed up the search but also ensure that only the intended class names in comments are returned, leading to an overall improvement in the precision ratio.
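
The first improvement, excluding known false positives, amounts to filtering the matches against a stop list. A minimal sketch, using the tokens listed above as the stop list and a made-up comment string:

```smalltalk
| regex stopList comment tokens |
regex := '\b[A-Z][a-zA-Z\d]+\b' asRegex.
stopList := #('Smalltalk' 'Set' 'True' 'False' 'Key') asSet.
comment := 'Set the flag to True, then add it to an OrderedCollection.'.

"Collect the distinct UpperCamelCase tokens, then drop the ones on the stop list"
tokens := (regex matchesIn: comment) asSet
	reject: [ :each | stopList includes: each ].
tokens  "a Set containing only 'OrderedCollection'"
```

The trade-off is that genuine references to Set or True are dropped as well, so the stop list gives up a little recall in exchange for precision.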

Our research has primarily concentrated on class names rather than method names. However, searching for method references in comments could prove to be more challenging due to two main reasons. Firstly, Pharo’s method naming convention follows camelCase, which increases the possibility of encountering false positive results when attempting to identify them within comments. This occurs because many tokens used in sentences to describe specific functionalities may coincide with method names, leading to potential confusion. Secondly, accurately linking tokens to their corresponding method definitions could also present difficulties, given that the same method might be defined in multiple classes. This situation adds complexity to the process of association and retrieval.

To address the problem, my initial idea involved utilizing AI to identify relevant tokens linked to classes and methods. I discussed this idea with AI experts, but they discouraged its usage for this particular project. They explained that a considerable amount of AI training data, tens of thousands of records, would be necessary to enable AI to accurately detect tokens in less than 10,000 methods and classes. Despite this challenge, the precision and recall ratios were already considered high, and implementing the previously suggested enhancements could further improve them.

Why do we need this?

Following this experiment, I believe there are good reasons why we might need this:

  1. Enhancing Code Documentation: Programs are commented to help other developers understand the purpose and the usage of each piece of code. When class names are clearly mentioned in comments, they provide additional context about the classes and their relationships, making it easier for developers to understand the codebase.
  2. Navigation: In Pharo, every class name that is enclosed between back-ticks, like `OrderedCollection`, within a comment becomes clickable in view mode, thus helping the developer navigate to the class definition. However, during our search, we noticed that many class names are not marked in this manner. So why not benefit from this search and enclose in back-ticks every token we found in comments that has a valid link to an existing class?
  3. Maintenance: Also during this search, I found some tokens that conform to the naming convention of classes in Pharo, but the classes they refer to no longer exist or are deprecated. I guess this could be solved by enhancing the class renaming functionality in Pharo to apply changes not only to references to these classes in source code, but also in comments, only for those enclosed between back-ticks. However, for this to work, a previous refactoring should be applied over comments to enclose class names correctly.

Takuzu, a new puzzle game using Bloc

Today I present to you Takuzu, a little puzzle game available for Pharo.

Takuzu is a game I proposed to implement during my internship, where I’m discovering Bloc, the low-level graphic framework for Pharo 11. The goal of this internship is to create little games with some basic UI using Bloc and, after “finishing” the first MineSweeper project (you can find it here: https://github.com/Enzo-Demeulenaere/MineSweeper), I started working on this new project.

Takuzu is the original Japanese name for this puzzle, which is also commonly called ‘Binary Sudoku’, as you fill a grid with 0s and 1s following only 3 rules:
> You can’t have more than 2 cells of the same value aligned.
> There must be as many 0 as 1 on each row and column.
> Rows and columns must be all different one to another.
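
These three rules translate directly into code. The sketch below checks them on the rows of a solved 4×4 grid written as plain arrays; it is an illustration only and does not use the classes of the actual Takuzu project (columns would be checked the same way after transposing the grid):

```smalltalk
| grid noTriple balanced allDifferent |
grid := #( #(0 1 1 0) #(1 0 0 1) #(0 0 1 1) #(1 1 0 0) ).

"Rule 1: no three consecutive cells with the same value"
noTriple := grid allSatisfy: [ :row |
	(1 to: row size - 2) noneSatisfy: [ :i |
		(row at: i) = (row at: i + 1) and: [ (row at: i) = (row at: i + 2) ] ] ].

"Rule 2: as many 0s as 1s"
balanced := grid allSatisfy: [ :row |
	(row occurrencesOf: 0) = (row occurrencesOf: 1) ].

"Rule 3: no two identical rows (arrays compare by content)"
allDifferent := grid asSet size = grid size.

noTriple & balanced & allDifferent  "true for this grid"
```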

I first discovered this game about a year ago while searching for mobile puzzle games, when I came across ‘0h h1’ on the Google Play Store (you can also play it here: https://0hh1.com). I immediately loved the logic behind it and wondered how I could implement it, and which algorithms seem simple yet hide some complex thinking.

For the UI, I decided to entirely follow the look of 0hh1 with simple red and blue cells on a dark gray background, nothing too creative. Then I had to use Bloc to draw the UI while also using Toplo to have a menu with some widgets, and this is what I’ve got so far:

You can discover this project by executing this snippet in Pharo 11:

Metacello new
    baseline: 'Takuzu';
    repository: 'github://Enzo-Demeulenaere/Takuzu:main';
    load: 'core'


Then launch the application by executing: Takuzu openWithMenuBar

When in the main menu, you can try some pre-loaded levels from size 4×4 to 12×12 in the ‘Levels’ menu, or choose the ‘Random’ menu to play with some randomly generated 4×4 and 6×6 levels (but the grids aren’t so good, as my generation algorithm isn’t perfect at all).

You can also find all the information about the project on its GitHub page: https://github.com/Enzo-Demeulenaere/Takuzu

Feel free to give me any feedback about this game, and enjoy discovering Takuzu!

Enzo Demeulenaere

Undeclared Variable Reparation, An Epic Journey In a Compiler – Part III

Part III –
More fun with OCUndeclaredVariableWarning

Foo is a nice class, but could it become nicer if we
added an instance variable?

Let us click on the Foo tab of the Calypso window. We
get some code that declares the class.

Object subclass: #Foo
    instanceVariableNames: ''
    classVariableNames: ''
    package: ''

It is a legal Pharo expression that creates (or updates) a subclass
(named Foo) of the Object class (which is almost the
root of the class hierarchy). Creating or updating the
class (accept with Ctrl-S) simply evaluates the expression.
This is neat.

Unrelated note: there is a small check box Fluid in the
bottom right corner that switches to the modern fluid class
syntax.

Object << #Foo
    slots: {};
    package: ''

It is just a different syntax for almost the same stuff. It is still
neat, although a little less neat because while the syntax is better, it
is no longer a sufficient expression to declare or update classes.

The thing is that to evaluate some random source code, we need to
compile it first. The evaluation of the class definition is done by
either
ClySystemEnvironment>>#defineNewClassFrom:notifying:startingFrom:
or
ClySystemEnvironment>>#defineNewFluidClassOrTraitFrom:notifying:startingFrom:
according to the fluid check box.

Both methods are really similar (and could be factorized), except
that one warns the developer in a comment about “for now, a super
ugly patch”, so we shall look at the other one.

But let us play with the compiler a little without looking at the
code yet.

Object subclass: #Foo
    instanceVariableNames: +''
    classVariableNames: ''
    package: ''

I added a syntax error since + is a binary operator and
the left operand is missing. We evaluate (accept with Ctrl-S), and we
get the following.

Object subclass: #Foo
    instanceVariableNames:  Variable or expression expected ->+''
    classVariableNames: ''
    package: ''

The -> way to present syntax errors is very old school
(and I hate it) but that is not the point here. The point is that the
compilation process behaved in a sane and expected way:

  • OpalCompiler>>#compile is somehow executed (cf.
    the previous section for the content of the method);
  • it tries to parse, but a syntax error is detected, so an exception
    SyntaxErrorNotification is signaled;
  • OpalCompiler>>#compile catches the exception;
  • notify:at:in: is sent to the requestor (that injects
    the error message in the source code);
  • the failBlock is executed that terminates the method
    ClySystemEnvironment>>#defineNewClassFrom:notifying:startingFrom:.

Let us try with something else:

Object subclass: #Foo
    instanceVariableNames: baz
    classVariableNames: ''
    package: ''

The source code is an expression, and in expressions we can use
variables. Except that here baz is an undeclared variable,
so what happens when we evaluate?

The code is updated to:

Object  Variable or expression expected ->subclass: #Foo
    instanceVariableNames: baz
    classVariableNames: ''
    package: ''

Wow. This is bad, and wrong, and bad again.

  • An error message is badly placed, wrong, and unrelated to
    baz.
  • There is no menu asking us what to do with the undeclared variable
    baz like we saw inside a method.
  • Since I work with a small screen, I also noticed an ephemeral
    notification popup box in the bottom left corner of the screen
    (poetically called a growl in Morphic) that stated the truthful
    information “Undeclared Variable in Class Definition”. Such
    growls are usually displayed by invoking the method
    Object>>#inform:.

ClySystemEnvironment>>#defineNewClassFrom:notifying:startingFrom:

Here is the source code of the method:

defineNewClassFrom: newClassDefinitionString notifying: aController startingFrom: oldClass

    "Precondition: newClassDefinitionString is not a fluid class"

    | newClass |
    [
    newClass := (self classCompilerFor: oldClass)
                    source: newClassDefinitionString;
                    requestor: aController;
                    failBlock: [ ^ nil ];
                    logged: true;
                    evaluate ]
        on: OCUndeclaredVariableWarning
        do: [ :ex | "we are only interested in class definitions"
            ex compilationContext noPattern ifFalse: [ ex pass ].
            "Undeclared Vars should not lead to the standard dialog to define them but instead should not accept"
            self inform: 'Undeclared Variable in Class Definition'.
            ^ nil ].

    ^ newClass isBehavior
          ifTrue: [ newClass ]
          ifFalse: [ nil ]

I won’t dive into all the details of this one. What is interesting is
the on: OCUndeclaredVariableWarning do: part that
intercepts the notification, thus preventing it from reaching the end of
the call stack, thus preventing it from executing its default action,
thus preventing it from displaying a menu, thus preventing the user from
repairing the code that has a semantic error.

Here we can see how it is possible to intercept the default error
reparation mechanism in case of undeclared variables in a specific
context where a temporary or an instance variable does not make much
sense.
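
To make the interception concrete, here is a minimal Playground sketch. It uses a plain Notification instead of OCUndeclaredVariableWarning so that it runs without involving the compiler; the variable names and messages are made up for the illustration.

```smalltalk
| unhandled handled |

"Nobody handles the notification: it reaches the end of the call stack,
 its default action runs, and signal: answers nil."
unhandled := Notification signal: 'nobody listens'.

"A handler is installed with on:do:. The notification is intercepted,
 its default action is never run, and the protected block answers
 the value given to return:."
handled := [ Notification signal: 'somebody listens' ]
    on: Notification
    do: [ :ex | ex return: #intercepted ].

{ unhandled. handled }.
```

Printing the last expression should show nil for the unhandled case and #intercepted for the handled one. The handler in the method above plays the same role: the default action of the warning (the menu) is never reached.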

What behavior do we get instead?

  • ex compilationContext noPattern ifFalse: [ ex pass ].
    if noPattern is false (double negation isn’t not bad) then
    process the notification anyway. Except that, here,
    noPattern isn’t unlikely to not be not true (nested
    negations are annoying, aren’t they?) because it is what distinguishes
    the compilation of an expression from the compilation of a method
    definition: a method definition starts with a name (and potential
    arguments) that is called the method pattern. But the first
    statement of the OpalCompiler>>#evaluate method that
    is called is to override noPattern with true, because one
    can only evaluate expressions, not method definitions.
  • self inform: 'Undeclared Variable in Class Definition'
    is responsible for the growl we get.

    You know what I hate? The -> error message
    insertions. You know what I hate more? Inconsistencies. Here, the source
    code error is reported as a (missable) popup with poor information,
    whereas all other errors are reported with ->.

Nevertheless, is the design legitimate? I’m not a fan of exceptions,
they make code comprehension harder and, in my humble opinion, should be
used with great reserve. Here there is also some breach of
encapsulation. But this is debatable. What is less debatable is that the
whole reparation of undeclared variables is bypassed completely,
including some legitimate needs.

For instance, we get no help if we try to evaluate
(Ctrl-S accept) Objectt subclass: #Bar, which
contains one t too many in the identifier of the superclass
Object. All we get is an unhelpful growl and the same
wrongful -> error message insertion. As a matter of
fact, as Pharo users, we can bypass the accept (Ctrl-S) behavior and
just evaluate the code in place by selecting all the text (Ctrl-A) then
Doing It (Ctrl-D). The DoIt simply evaluates the
selected source code without (too much) hacking. So there is no
intercepting OCUndeclaredVariableWarning for instance, and
we get the menu “Unknown variable: Objectt please correct, or
cancel” that presents Object (with the correct number of
t) in the list of choices. We can select it and we get a
successful class definition and a new available class
Bar.

But why does the code signal a missable popup instead of a classic
text error insertion? Maybe because the
OCUndeclaredVariableWarning that is caught might not come
from the class definition syntax. When a random string is evaluated, it
can do a lot of things, like signaling exceptions. Unfortunately, a
broad error handling mechanism like exceptions has no way to distinguish
exceptions that come from the analysis of the code (syntactic and
semantic error) from the ones that come from the proper evaluation.

And that happens frequently. If you remove an instance variable
(attribute) from a class definition, then accept the new definition, the
system will recompile all the methods of this class. Methods that use
the removed instance variable will also be compiled and signal
OCUndeclaredVariableWarning. That exception will be caught
and produce the growl “Undeclared Variable in Class Definition”. Note that
the message is misleading since the undeclared variable is not in the
class definition. So maybe it was not the original intention of the
growl and was just a random inconsistency.

Let us discuss the last statement of the method. It is a sanity
check. Because the initial class definition expression could have been
heavily edited by the programmer and replaced by anything else, the
method checks that the final result of the evaluation is a class-like
object. Otherwise, nil is returned.

ClyClassDefinitionEditorToolMorph>>#applyChangesAsClassDefinition

OK, we have an explanation for the absence of the menu, and an
explanation for the presence of the growl, but nothing here is related to
the wrong -> syntax error insertion. Where does this one
come from?

First, there is no syntax error in the expression (it is a lie!). The
compiler manages to signal an OCUndeclaredVariableWarning
notification (the growl is the proof!) launched by the
OCASTSemanticAnalyzer, meaning that the parser can produce
an AST and not find any syntax error.

So, what is the deal?

  • We are at
    ClySystemEnvironment>>#defineNewClassFrom:notifying:startingFrom:,
  • that is called by
    ClySystemEnvironment>>#compileANewClassFrom:notifying:startingFrom:,
  • that is called by
    ClyFullBrowserMorph>>#compileANewClassFrom:notifying:startingFrom:,
  • that is called by
    ClyClassDefinitionEditorToolMorph>>#applyChangesAsClassDefinition,
  • that goes like this:

applyChangesAsClassDefinition

    | newClass oldText |
    oldText := self pendingText copy.
    newClass := browser
                    compileANewClassFrom: self pendingText asString
                    notifying: textMorph
                    startingFrom: editingClass.

    "This was indeed a class, however, there is a syntax error somewhere"
    textMorph text = oldText ifFalse: [ ^ true ].

    newClass ifNil: [ ^ false ].

    editingClass == newClass ifFalse: [ self removeFromBrowser ].
    browser selectClass: newClass.
    ^ true

We see the invocation of the compilation (the
newClass := browser compileANewClassFrom: thing). Because
it failed, newClass is nil (instead of a class).

What follows is interesting:
textMorph text = oldText ifFalse: [ ^ true ]. This states
that if the text in the code editor was changed during the compilation,
then there was an error. Interesting and sooo wrooong on sooo many
leveeels:

  • Detecting a syntax error should not be done through string
    comparison.
  • The method should not assume that error reporting to the user
    changed the source code (even if it is the old school way and is the
    active tradition of the present and previous millennium). For instance,
    instead of an ugly ->, some code might prefer to
    display a growl (even if inconsistencies are bad, and I hate them, here
    the culprit is not the inconsistency).
  • A code change might be related to some code reparation that fixed an
    error, so exactly the opposite of an error.

ClyClassDefinitionEditorToolMorph>>#applyChanges

But wait, there is more. Because newClass is nil, the
method returns false to its caller, which is
ClyClassDefinitionEditorToolMorph>>#applyChanges,
defined by:

applyChanges

    | text |
    text := self pendingText copy.
    ^ self applyChangesAsClassDefinition or: [
          self pendingText: text.
          self applyChangesAsMethodDefinition ]

What the heck is that? I do not even understand what the objective
of this thing is! The defining class is
ClyClassDefinitionEditorToolMorph that is the widget whose
sole job as a text editor is to define new classes and to update
existing classes. And the Pharo way to do that is by evaluating
expressions that define or update classes. It seems to be an easy job
that even an unaware DoIt action can manage
successfully.

So what does this insane method do? It:

  • saves the source code’s content (to avoid code change in the editor
    due to syntax error or code reparation);
  • tries to evaluate as a class definition (an expression);
  • if the result is false (and it is, in our case), then
    tries to compile the original pristine source code as a method
    definition.

OK.

Let’s just do that. Remember that the class definition we try to
process is:

Object subclass: #Foo
    instanceVariableNames: baz
    classVariableNames: ''
    package: ''

Let us try to parse this source code as a method definition instead
of as an expression.

  • A method starts with a method pattern that can be many
    things, but for simple unary methods, they are simple and plain
    identifiers. Do we have a simple and plain identifier? Yes, the token
    Object (RBIdentifierToken).
  • A method then has a body, with statements, that usually start with
    an expression. What follows is the token subclass:
    (RBKeywordToken) which is not the beginning of any correct
    expression. But the parser wants an expression right now! So it
    reports the error
    Variable or expression expected ->subclass:.

It’s the beauty of computer science. Whatever insane behavior we
might witness, there is always a rational explanation.

A Final Experiment

Can we bypass the bypass of OCUndeclaredVariableWarning
with the undefined variable baz? Let’s try the
DoIt way, it worked wonders with the superfluous t of
Objectt some sections ago.

  • Select the full text (Ctrl-A);
  • Do It (Ctrl-D);
  • The menu “Unknown variable: baz please correct, or cancel”
    appears and proposes: a new temporary variable, a new instance variable
    or to cancel;
  • Choose the “temporary variable”;
  • A debugger window appears: “Instance of ClyTextEditingMode did
    not understand #textMorph”. What?
  • The error is caused by the line
    theTextString := self requestor textMorph editor paragraph text.
    of
    OCUndeclaredVariableWarning>>#declareTempAndPaste:.
    We discussed this (rather ugly) line in a previous section, stating that
    it is fragile. What a coincidence!

Conclusion and Perspective

During this exploration of the
OCUndeclaredVariableWarning we discovered a lot of classes
and methods with a very variable quality of code and design. Obviously,
the present article focuses on the discussable parts that could be
improved, because it forces us to understand why things are bad, and how
they could be improved. It is also fun to see concrete effects of how
things can go bad when software design is not as tidy as it should
be.

Pharo is a wonderful dynamically typed programming language with
great features, abstractions and powerful semantics. And with great
power comes great responsibility.

An example could be the requestor thing. Adding a dependency between
UI and the compiler work is enough to raise some eyebrows (independently
of the programming language or its paradigm). But in Pharo this
dependency appears as an unwritten API (orality-based API?) with some
inconsistent or fragile hacks: notify:at:in:,
Object>>#bindingOf:,
requestor respondsTo: #interactive,
requestor textMorph,
requestor class name = #RubEditingArea, etc. This also
causes subtle breakages when someone tries to fix things, breakages
that are often hard to catch because, for instance, they could be only
related to untested or rare UI interactions.

A lot of change is currently ongoing for Pharo 12 on the compiler. We
are in the early part of its development cycle, so it’s the best moment
to try large and disruptive hacks. At the time of publishing, most of
the design issues discussed here are already fixed. But they represent a
specific use case, and a lot of work is still needed. The full
meta-issue is available at
https://github.com/pharo-project/pharo/issues/12883.

Undeclared Variable Reparation, An Epic Journey In a Compiler – Part II

Part II – The Return Journey

Welcome to the next step in the compiler journey.

As a simple recap, we were compiling baz := 42 in a
method bar, except that baz is not declared.
We are currently in
OCSemanticWarning>>#defaultAction, the default action
of an uncaught Notification, that is ready to open a graphical menu by
calling openMenuIn:.

OCUndeclaredVariableWarning>>#openMenuIn:

The method is long; let us review it in small pieces.

openMenuIn: aBlock
    | alternatives labels actions lines caption choice name interval requestor |

A bunch of temporary variables.

    "Turn off suggestions when in RubSmalltalkCommentMode
    This is a workaround, the plan is to not do this as part of the exception"
    requestor := compilationContext requestor.
    ((requestor class name = #RubEditingArea) and: [
        requestor editingMode class name = #RubSmalltalkCommentMode])
                    ifTrue: [ ^UndeclaredVariable named: node name ].

These are some type checks. Type checks are usually bad. Those are
bad.

They prevent the menu thing if the requestor is a
RubEditingArea in a RubSmalltalkCommentMode
“mode”. Where is RubSmalltalkCommentMode used?

  • By ClyRichTextClassCommentEditorToolMorph, which seems
    never used in the system (dead class?)
  • By RubEditingArea>>#beForSmalltalkComment, that
    is called only by FileList, a basic file explorer, but it’s
    not clear when or why the compiler is called by the file explorer, nor
    why a “workaround” is needed here (quite deep) in the compiler
    especially since it’s the only workaround of this type in the whole
    source code for RubEditingArea or
    RubSmalltalkCommentMode.

It could be just dead code, so less problematic: a dead workaround is
less technical debt than a live one. It also illustrates why type
checks can be bad; it reverses the responsibility:
RubEditingArea and RubSmalltalkCommentMode
here are not involved at all in the workaround, so code evolution
related to one of these two classes might likely miss the present
hack.

Moreover, such a workaround is fragile. The compiler should not care
about specific clients, and especially not care about their names, and
should behave equitably. E.g. imagine renaming classes or using
subclasses of the blacklisted ones, they could likely pass the check and
cause really subtle bugs.

Anyway, let us continue with the method:

    interval := node sourceInterval.
    name := node name.
    alternatives := self possibleVariablesFor: name.
    labels := OrderedCollection new.
    actions := OrderedCollection new.
    lines := OrderedCollection new.

All those are the initialization of the temporary variables.

OCUndeclaredVariableWarning>>#possibleVariablesFor:
provides a list of existing names usable as a replacement (sorted from
the best match to the worst match). See
String>>#correctAgainst:continuedFrom: for the
details and the scoring system.

    name first isLowercase
        ifTrue: [
            labels add: 'Declare new temporary variable'.
            actions add: [ self declareTempAndPaste: name ].
            labels add: 'Declare new instance variable'.
            actions add: [ self declareInstVar: name ] ]

The two first items of the menu. If the name looks like a temporary
or an instance variable, because it starts with a lowercase letter, then
maybe the programmer wants a new temporary or instance variable?

Note that there are two parallel lists, one of the labels (shown in
the menu) and the other of actions (here blocks), which are evaluated if
the user chooses the corresponding label.
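
The parallel-list idea can be sketched in a Playground as follows. The labels and the blocks are made up for the illustration, and choice stands for the index that the menu would answer.

```smalltalk
| labels actions choice |
labels := OrderedCollection new.
actions := OrderedCollection new.

"Each label is added together with its action, so both lists
 share the same indexes."
labels add: 'Say hello'.
actions add: [ 'hello' ].
labels add: 'Say goodbye'.
actions add: [ 'goodbye' ].

"Pretend the user picked the second item of the menu."
choice := 2.
{ labels at: choice. (actions at: choice) value }.
```

The design is simple, but nothing keeps the two lists in sync: forgetting one add: silently shifts every following action.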

In the scenario, the option
Declare new temporary variable is selected, so
self declareTempAndPaste: name is eventually called. We
detail it in the next section. For now, we continue to read the
method.

        ifFalse: [
            labels add: 'Leave variable undeclared'.
            actions add: [ self declareUndefined ].
            lines add: labels size.
            labels add: 'Define new class'.
            actions
                add: [
                    [ self defineClass: name ]
                        on: Abort
                        do: [ self openMenuIn: aBlock ] ].
            labels add: 'Declare new global'.
            actions add: [ self declareGlobal ].
            compilationContext requestor isScripting ifFalse:
                [labels add: 'Declare new class variable'.
                actions add: [ self declareClassVar ]].
            labels add: 'Define new trait'.
            actions
                add: [
                    [ self defineTrait: name ]
                        on: Abort
                        do: [ self openMenuIn: aBlock ] ] ].

Names that start with an uppercase letter look like global
variables, and that includes all the named classes, so the proposed
items in the menu are different. A first curiosity, there is the choice
to “leave variable undeclared” that is absent in the previous code
snippet. Another curiosity, defining a new class (or a new trait) opens
a new window, but if, for some reason, the entity creation fails or is
canceled, then a recursive call is used to open the same menu again.

    lines add: labels size.
    alternatives
        do: [ :each |
            labels add: each.
            actions
                add: [
                    ^self substituteVariable: each atInterval: interval ] ].
    lines add: labels size.
    labels add: 'Cancel'.
    caption := 'Unknown variable: ' , name , ' please correct, or cancel:'.

We have the addition of the possible variables (computed at the
beginning of the method), a cancel item, and the window title. The
last two lines are the fun ones.

    choice := aBlock value: labels value: lines value: caption.
    ^choice ifNotNil: [ self resume: (actions at: choice ifAbsent: [ compilationContext failBlock value ]) value ]

  • aBlock is the parameter of the method, it was more
    than 50 lines ago, so we almost forgot about it. It is always
    [:labels :lines :caption | UIManager default chooseFrom: labels lines: lines title: caption]
    that just calls the UI and returns the index number of the selected item
    starting at 1 (or 0 if the cancel button is used).
  • The selector value:value:value: is used to evaluate
    the block with 3 supplied arguments (it is a Pharo thing, do not
    judge).
  • ^choice ifNotNil: ... returns nil if the choice is
    nil (unlikely according to the API of chooseFrom, but
    better safe than sorry). In the scenario, the first choice is selected
    (declare new temporary variable). Therefore choice is 1, which is not
    nil, so we look at the ifNotNil: part.
  • self resume: causes the signal to finish
    its execution with the given value. Hopefully, a Variable
    object to bind to baz (look back at the section
    OCASTSemanticAnalyzer>>#undeclaredVariable: if you
    need to see the original signal method invocation). Here,
    the call to resume feels superfluous as the result of the
    current method is used as the result of defaultAction that
    is used as the value of the automatic resume call performed
    on Notification objects (see
    UndefinedObject>>#handleSignal:).
  • actions at: choice returns the action (the block)
    associated to the corresponding choice number. Ordered collections in
    Pharo are 1-based; therefore 1 is the first block action. Here, the
    block [ self declareTempAndPaste: name ].
  • ifAbsent: is for what to do when there is no
    corresponding action for the given choice. This happens when the user
    chooses the cancel button (no action for 0) or chooses the cancel item
    (no action for 3 in our scenario).
  • compilationContext failBlock value is therefore
    executed on a “cancel”. It evaluates the failBlock that, in the
    scenario, comes from the ClassDescription>>#compile
    method and contains [ ^ nil ] (a non-local return).
    Evaluating this failBlock causes the unwinding of many methods in the
    call stack (something around 30 or 40 frames) and the return of the
    ClassDescription>>#compile method with nil.

    Note that there is a potential weakness here: if the failBlock does
    not perform a non-local return, then the result of the block evaluation
    is used as the return of openMenuIn: and eventually used as
    a Variable object to bind baz to. Callers of
    the compiler might forget to do that and just provide
    [nil], for instance (without a ^).

  • value evaluates the action block (since it exists in
    the list), that has the responsibility to provide a
    Variable instance.
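
The weakness about a failBlock without a non-local return can be reproduced outside the compiler. The sketch below is only a Playground approximation of the last line of openMenuIn:, with a hypothetical failBlock and a symbol standing in for a real Variable object.

```smalltalk
| actions failBlock choice resumedValue |
actions := OrderedCollection new.
actions add: [ #aVariableStandIn ].    "a real action would answer a Variable"

"A careless failBlock: no ^, so it just answers nil and execution goes on."
failBlock := [ nil ].

"0 is what the menu answers when the cancel button is used."
choice := 0.
resumedValue := (actions at: choice ifAbsent: [ failBlock value ]) value.

"nil understands value and answers itself, so nothing fails here:
 nil is what would be handed over to resume:."
resumedValue isNil.
```

With [ ^ nil ] instead, the ifAbsent: block would never answer: the non-local return would leave the method that created the failBlock, and the rest of the line would not be executed.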

OCUndeclaredVariableWarning>>#declareTempAndPaste:

We selected “declare new temporary variable” in the menu, thus
executing this method. We’ll cover this large method (35 lines) piece by
piece.

declareTempAndPaste: name
    | insertion delta theTextString characterBeforeMark tempsMark newMethodNode |

Some temporary variables.

    "Below we are getting the text that is actually seen in the morph.
     This is rather ugly. Maybe there is a better way to do this."
    theTextString := self requestor textMorph editor paragraph text.

Indeed, this is rather ugly. This leads to many
questions:

  • Why is the text (source code) of the method bar
    needed?
  • Why does it assume that the requestor has a textMorph
    method?
  • Why ask for something so deep? Demeter is likely rolling over in its
    grave (it’s a joke on the Law of Demeter. Demeter is not dead
    and is not even a person, it was a project named after the Greek goddess
    of Agriculture).
  • Why? We are still in a (deep) part of the compiler,
    self should have a better way to get the source code
    currently compiled.

    "We parse again the method displayed in the morph.
     The variable methodNode has the first version of the method,
     without temporary declarations. "
    newMethodNode := RBParser parseMethod: theTextString.

Let us take a breath.

We are doing a semantic analysis on an already parsed source code of
a method bar trying to get a variable to bind to
baz. And we parse the full source code again? Don’t we have
it? Just call self node methodNode or something?

The hint might be “without temporary declarations” from the
comment. Does that mean we do not trust the current AST to be genuine?
Why? Maybe the previous interactive code error reparation changed the
current AST? Is this actually true in some possible scenarios? Is this
just leftover code?

Let us just continue… we must continue…

    "We check if there is a declaration of temporary variables"
    tempsMark :=  newMethodNode body  rightBar ifNil: [ self methodNode body start ].

It’s getting warm here, isn’t it?

  • newMethodNode body rightBar gets the position (an
    integer) of the closing | character of the temporary
    variable declaration syntax, or nil if there is no temporary variable
    declared (like in the current scenario). An AST is useful for this task
    since it knows which part of the source code is really a block of
    temporary variable declarations.
  • self methodNode body start is the position (an integer)
    of the beginning of the main body of the method. That position is
    therefore used when there are no declarations of temporary
    variables.

    characterBeforeMark := theTextString at: tempsMark-1 ifAbsent: [$ ].

This gets the character before the closing | or before the
main body. The ifAbsent case might only occur if the source code
is empty, and the compiler would not let us progress this far because an empty
method is a syntax error: a method name is minimally needed (the name
(selector) and parameters are called the “method pattern” in Pharo
parlance). But better safe than sorry.

    (theTextString at: tempsMark) = $| ifTrue:  [
        "Paste it before the second vertical bar"
        insertion := name, ' '.

        characterBeforeMark isSeparator ifFalse: [ insertion := ' ', insertion].
        delta := 0.
    ] ifFalse: [

Some temporary variables are declared, and we want to add the new
variable after the last one. The code mainly manages spacing to avoid
concatenating the new variable and a previous one, or injecting
superfluous spaces.

In our scenario, there is no temporary variable (yet), so the
ifFalse: part interests us more.

        "No bars - insert some with CR, tab"
        insertion := '| ' , name , ' |',String cr.
        delta := 2. "the bar and CR"
        characterBeforeMark = Character tab ifTrue: [
            insertion := insertion , String tab.
            delta := delta + 1. "the tab" ]
        ].

Here we prepare the text to insert in the source code and compute a
delta thing, we’ll discuss that later. The code tries to
care about preserving the indentation, if any.

    tempsMark := tempsMark +
        (self substituteWord: insertion
            wordInterval: (tempsMark to: tempsMark-1)
            offset: 0) - delta.

Err… it’s getting cold here, isn’t it?

self substituteWord: insertion wordInterval: (tempsMark to: tempsMark-1) offset: 0
asks to insert the new string in the source code (because the interval
is empty, it is an insertion and not a replacement).
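
The empty-interval trick is not specific to the compiler. As a rough analogy (the requestor API is not used here), the standard copyReplaceFrom:to:with: of strings behaves the same way when the end of the range is just before its start. The source string and the mark position are made up for the illustration.

```smalltalk
| source mark interval patched |
source := 'bar baz := 42'.
mark := 5.    "position of the first character of the body"

"An interval that stops just before it starts contains nothing."
interval := mark to: mark - 1.

"Replacing an empty range removes no character: it is a pure insertion."
patched := source
    copyReplaceFrom: mark
    to: mark - 1
    with: '| baz | '.

{ interval isEmpty. patched }.
```

The interval should be reported as empty, and patched should read 'bar | baz | baz := 42': the declaration is inserted right before position 5 and nothing is overwritten.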

How does
OCUndeclaredVariableWarning>>#substituteWord:wordInterval:offset:
do that? By simply calling #correctFrom:to:with: on the
requestor and doing some math, then doing more math to update
tempsMark.

    " we can not guess at this point where the tempvar should be stored,
    tempvars vs. tempvector therefore -> reparse"
    (ReparseAfterSourceEditing new newSource: self requestor text) signal

And it is the end of the method. I think it’s getting humid here,
isn’t it?

The code is altered. We have no idea what really happened, there is a
new source code in town. The full AST might need to be rebuilt as there
are new potential AST nodes. The semantic analysis might need to be
redone, as the new temporary variable might conflict with other
variables declared further in the code. So at this point, it seems
better to just call it a day and run the compilation again.

It’s the point of the ReparseAfterSourceEditing class
that is a subclass of Notification (we are now experts in
notifications and not afraid of them anymore!).

There are still some questions about the behavior of the program and
some of its design decisions:

  • The math thing about offset, delta, and
    tempsMark update is completely unused. Possible leftover of
    previously removed code.
  • Where does the signal on ReparseAfterSourceEditing
    go?
  • Why should the new source code be passed around in the notification?
    We already updated it in the requestor.
  • Why does the ugly (indeed)
    self requestor textMorph editor paragraph text at the
    beginning of the method exist, since apparently
    self requestor text gives the same damn source code (while
    not ideal, it is still better)?
  • What happens when the notification is resumed? The point of a
    notification is to be resumable. Here it clearly appears that such an
    endeavor is not supposed to happen.
  • Why so much coupling?
  • Why so little cohesion?
  • And more specifically, why is it the job of
    OCUndeclaredVariableWarning to perform this menu and string
    based code reparation and hijack the requestor as if they were friends
    in some abusive relationship? Shouldn’t a notification be just a means
    of sending some kind of signal to the previous method in the call
    stack?

There are a lot of symptoms of schizophrenia in the responsibilities
here.

OpalCompiler>>#parse

Where does ReparseAfterSourceEditing go? I
(intentionally) skipped some steps between
OpalCompiler>>#compile and
OCASTSemanticAnalyzer>>#undeclaredVariable:.

Here is the source code of
OpalCompiler>>#parse:

parse
    | parser |
    [
        parser := self createParser.
        ast := self semanticScope parseASTBy: parser.

        ast methodNode compilationContext: self compilationContext.
        self callParsePlugins.
        self doSemanticAnalysis ]
    on: ReparseAfterSourceEditing do: [:notification |
            self source: notification newSource.
            notification retry ].

    ^ ast

The job of this method (as explained in a previous section) is to do
the frontend part of the compilation and produce a fully annotated and
analyzed AST of the method so that (virtual) machine code can be
generated. The content of the method is mostly straightforward.

What is interesting is the on:do: method call used for
exception (and thus, notification) handling. When a
ReparseAfterSourceEditing is intercepted, we update the
source code and run the protected block from the beginning again (see
Exception>>#retry).
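
The retry mechanism can be tried in isolation. In this Playground sketch, a counter stands in for the source code that gets better at each attempt, and a plain Error stands in for ReparseAfterSourceEditing; both are assumptions made for the illustration.

```smalltalk
| attempts result |
attempts := 0.
result := [
    attempts := attempts + 1.
    "The first two attempts fail, the third one goes through."
    attempts < 3 ifTrue: [ Error signal: 'not good enough yet' ].
    'succeeded after ' , attempts printString , ' attempts' ]
        on: Error
        do: [ :ex | ex retry ].
result.
```

Each retry unwinds the stack down to the on:do: and evaluates the protected block again from its first statement, which is why parse can simply update its source in the handler before trying again.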

Some potential exits of this loop are:

  • The source code is good enough and no further
    ReparseAfterSourceEditing is signaled.
  • The source code is bad enough that either the failBlock
    is invoked, or another unrecoverable exception occurs. Remember
    SyntaxErrorNotification in
    OpalCompiler>>#compile, for instance.
  • Someone is tired enough and terminates the process.

The first alternative is what happens in our scenario:

  • A declaration of the temporary variable baz is added to
    the source code.
  • The content editor window is updated to reflect that.
  • The new source code is fully parsed and correctly analyzed and a
    legitimate CompiledMethod is produced.

What a happy ending!

Undeclared Variable Reparation, An Epic Journey In a Compiler – Part I

In this series of posts, I present how the current implementation of
Pharo handles compilation errors on undeclared variables and the
interactive reparation to fix them. Targeted readers are people
interested in compilers or object-oriented programming. Non-Pharo
developers are welcome since knowledge of the language or the developing
environment is not required. Some parts of Pharo are explained when
needed in the article.

We illustrate with a small and specific corner case of the code
edition and compilation subsystems of Pharo. It shows how complex
software has to deal with complex situations, requirements, usage and
history. And why design choices matter.

Disclaimer, some parts of the presented code can be qualified as
“awesome”, where “awe” still means “terror”. Maybe I should rename the
article “The Code of Cthulhu” or something, but I’m often bad at
names.

The first and the second parts are a deep-down journey. We start from the GUI and go down (go up?) in the call stack, with very few shortcuts or branching. Explanation, comments, and discussion are done during the visit.

Note also that the presented code is the one of Pharo 11 and that most issues should be solved (or being worked on) for Pharo 12. The meta-issue that tracks my work in progress is available at https://github.com/pharo-project/pharo/issues/12883 (warning, it contains spoilers).

Special thanks go to Hugo Leblanc for his thorough review.

Undeclared Variables

Compiling a method in Calypso (the current class browser), in
StDebugger (the current debugger) or in any place that accepts the
editing and installation of methods is an everyday task of Pharo
developers, and most of the time an every-minute task. It’s something
Pharoers do naturally without thinking much about it (possibly to
preserve their own sanity).

One specific picturesque experience is having a menu window pop up
when trying to compile code that contains an undefined variable. The
presented menu contains various options depending on the variable name
and the context: new temporary variable (Pharo name for “local
variable”), new instance variable (Pharo name for “attribute” or
“field”), new class if the name starts with an uppercase letter and some
proposals of existing variables (local, global or other) with a similar
name in case of an obvious typing error. Selecting one or the other of
these options updates the code in the editor and resumes the compilation
(or pops up a similar menu if some other undefined variable
remains).

Note that in Pharo, variables can also remain undeclared, for a lot
of good reasons, but it is a story for another day.

Let us illustrate with a single concrete scenario used in this
article’s first parts. You are in a Calypso editor, on the instance
side, on a class Foo trying to implement a new method
bar.

bar
    baz := 42

The method might not be finished yet and baz is not even
declared, but let’s install it with a classic Ctrl-S
(accept). We get the menu window “Unknown variable: baz please
correct, or cancel:” with some choices:

  • “Declare new temporary variable”;
  • “Declare new instance variable”;
  • “Cancel”;
  • and also an additional “Cancel” button.

We select the first option (temporary variable) and the code is
automatically repaired as

bar
    | baz |
    baz := 42

The method is also compiled, installed in the class Foo
and fully usable.

Note: the | thing is the Pharo syntax to declare
temporary variables (i.e. local variables).

Part I – Falling Down the Rabbit Hole

Let’s try to understand what just happened. Is the whole thing
(black) magic or simple object-oriented (black) design?

This first post goes down from the compilation request to the menu. The
next post will be about code repair.

We have the Calypso window and its nested text editor component. I
skip the complex graphical UI sequence of calls — there are some
observer design patterns and even a sub-process forked (Pharo processes
are, in fact, green threads) — and for the sake of simplicity and
without loss of generality, I start the story at
ClyMethodCodeEditorToolMorph>>#applyChanges.

ClyMethodCodeEditorToolMorph>>#applyChanges

Note: ClyMethodCodeEditorToolMorph>>#applyChanges
means the method applyChanges of the class
ClyMethodCodeEditorToolMorph, where Cly stands
for Calypso, the name of the tool, and Morph
is the name of the low-level graphical toolkit currently used by Pharo.
So, basically, the current receiver of the method (self) is
a graphical window.

I do not show the full code of the method. The interesting statement
is:

selector := methodClass
    compile: self pendingText
    classified: editingMethod protocol
    notifying: textMorph.

that is a message send (method invocation) of the selector (method
name) compile:classified:notifying: because, in Pharo, and
in most other Smalltalk dialects, arguments can be syntactically placed
inside the name of the method to invoke.
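As a small illustration of such keyword selectors outside the compiler,
here is a playground sketch. It uses the standard Dictionary class, which
is not part of the scenario and is only chosen because it is familiar:

```smalltalk
| dict |
dict := Dictionary new.

"One message send: receiver dict, selector at:put:, arguments #answer and 42"
dict at: #answer put: 42.

"Another keyword message, this time with a single argument"
dict at: #answer.                          "42"

"A selector is a plain Symbol; each colon marks the place of one argument"
#at:put: numArgs.                          "2"
#compile:classified:notifying: numArgs.    "3"
```

Read this way, compile:classified:notifying: is a single method name that
takes three arguments, one after each colon.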

The method asks the class to compile and install a new method.
Receiver and arguments are:

  • methodClass is here the class Foo (an instance
    of Foo class, a subclass of ClassDescription, which
    implements the called method
    compile:classified:notifying:).
  • self pendingText is the full source code (an instance
    of the Text class).
  • editingMethod protocol is the selected protocol (group
    of methods) in which to put the new method. It is nil here, so the
    method might remain unclassified, not a big deal.
  • textMorph is the graphical component (widget) that
    corresponds to the part of the tool that contains the source code
    editor. Here, we have an instance of RubScrolledTextMorph
    that is the common morph widget to represent an editable text area.

Now, why would the compiler need to know about some internal UI
component? Well, we shall see.

ClassDescription>>#compile

ClassDescription>>#compile:classified:notifying:
eventually calls
ClassDescription>>#compile:classified:withStamp:notifying:logSource:
that adds two new parameters:

  • changeStamp that is the current time and date (as a
    String, not a DateAndTime)
  • logSource a Boolean flag set to true.

The important statement of this method is:

    method := self compiler
        source: text;
        requestor: requestor;
        failBlock:  [ ^nil ];
        compile.

Where

  • self compiler returns a new compiler instance, already
    configured to compile a method of the class Foo and with
    the default environment (Smalltalk globals, the big
    dictionary of global variables and constants of the system that,
    especially, contains all the class names and their associated class
    objects).
  • text is the source code of the method to compile.
  • requestor is the RubScrolledTextMorph
    instance (the UI component).
  • [ ^nil ] is the on-error block, which the
    compiler (or one of its minions) might use in case of a fatal error.
    Note: passing blocks (somewhat equivalent to lambdas in other languages)
    is a popular Pharo way to deal with error management. Here, evaluating
    the block might unwind many methods in the call stack and force the
    method
    ClassDescription>>#compile:classified:withStamp:notifying:logSource:
    to return nil because ^ means
    “return” (this one is called a “non-local return” in Pharo
    parlance).
  • finally, compile that starts the real compilation
    work.
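The fail-block idiom described above is common in the standard library
too. Here is a minimal playground sketch with two well-known collection
messages (unrelated to the compiler, chosen only for illustration), where
the last argument is a block evaluated only when things go wrong:

```smalltalk
| dict missing found |
dict := Dictionary new.
dict at: #foo put: 1.

"The ifAbsent: block is evaluated only because the key #bar is missing"
missing := dict at: #bar ifAbsent: [ nil ].
missing.   "nil"

"The ifNone: block is ignored here, since an element satisfies the test"
found := #(1 2 3) detect: [ :each | each > 1 ] ifNone: [ 0 ].
found.     "2"
```

Inside a method, such a block may also contain a ^ to leave the whole
method at once, which is what [ ^nil ] does in the statement above.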

OpalCompiler>>#compile

The Pharo compiler class is named OpalCompiler and the
invoked method is simply OpalCompiler>>#compile. Here
is the full body of the method:

compile
    ^[
        self parse.
        self semanticScope compileMethodFromASTBy: self
    ] on: SyntaxErrorNotification do: [ :exception |
            self compilationContext requestor
                ifNotNil: [
                        self compilationContext requestor
                            notify: exception errorMessage , ' ->'
                            at: exception location
                            in: exception errorCode.
                    ^ self compilationContext failBlock value ]
                ifNil: [ exception pass ]]

Wow. It looks scarier than it is.

  • ^[ aaaa ] on: SyntaxErrorNotification do: [ :exception | bbbb ]
    means return (^) the result of aaaa but, if an
    exception SyntaxErrorNotification occurs, return the result
    of bbbb (where exception is the exception
    object; : and | are simply the Pharo syntax
    for block parameters). Exceptions are another popular Pharo way to deal
    with error management.

    Note: the name SyntaxErrorNotification hints that this
    exception is special; it is a Notification. We discuss them
    in a few sections. The management of syntax errors in Pharo also
    deserves its own story (involving adventures, characters and plot
    development).

  • The job of self parse is simple; it calls the
    parser, does the semantic analysis and tries to produce a valid
    annotated AST of the given source code, or might fail trying if there is
    a syntax or a semantic error in the provided code.
  • self semanticScope compileMethodFromASTBy: self is
    more straightforward than the statement suggests. It transforms the AST
    into Pharo bytecode (maybe a story for another day) and produces the
    result of the compilation as an instance of CompiledMethod.
    CompiledMethod is a very important class, as its instances
    are natively executable by the Pharo Virtual Machine.
  • self compilationContext requestor ifNotNil: is a
    simple if that checks (when a
    SyntaxErrorNotification occurs, since we are in the
    do: block of the exception syntax) if the requestor is not
    nil. Here the requestor is the
    RubScrolledTextMorph object, so not nil. The method
    RubScrolledTextMorph>>#notify:at:in: is called and is
    used to present the error to the user.
  • Then self compilationContext failBlock value invokes
    the failBlock (it is [ ^nil ] from the
    previous section) that terminates the method invocation.
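To see the on:do: construct in isolation, here is a minimal playground
sketch. It uses ZeroDivide, a standard exception that has nothing to do
with the compiler and is only picked because it is easy to trigger:

```smalltalk
| result |

"1 / 0 signals ZeroDivide; the handler block provides the value of the whole expression"
result := [ 1 / 0 ] on: ZeroDivide do: [ :exception | 0 ].
result.   "0"

"When no exception is signaled, the value of the protected block is returned"
result := [ 6 / 3 ] on: ZeroDivide do: [ :exception | 0 ].
result.   "2"
```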

Here, we get part of the answer to our design question: The compiler
has the responsibility to explicitly call the text editor (if any) to
present an error message. It might not be the best design decision,
since it is difficult to argue that the compiler’s responsibility is to
notify UI components in case of errors. Especially here since there are
two levels of error management: an exception and a fail block that could
have been used by Calypso to manage errors and decide for
itself how to report them to the user.

We can also notice the string '->' that is
systematically concatenated at the end of the error message associated
with the caught exception. Why? Because Calypso, for historical reasons,
presents the error message as an insertion directly in the text area in
the editor, in front of the location of the error. For instance, the
syntax error in the code 1 + + 3 (we assume the 2 was
fumbled) appears as
1 + Variable or expression expected ->+ 3 in the
editor.

It’s a second bad design decision, as not only was the compiler
responsible for calling the editor, but it also made some presentation
decisions. In fact, the alternative code editor component, provided in
the Spec2-Code package, strips the ->
string before presenting the error in its own and less intrusive way.
See SpCodeInteractionModel>>#notify:at:in:.

OCASTSemanticAnalyzer>>#undeclaredVariable:

Now we enter the classical compilation frontend work: scanning
(lexical analysis, done by RBScanner), parsing (syntactic
analysis, done by RBParser) and finally the semantic
analysis (done by OCASTSemanticAnalyzer, the Opal Compiler
AST Semantic Analyzer).

Our input, the source code of the bar method, is quite
simple and everything is fine, except that, during the semantic
analysis, the variable name baz is analyzed by
OCASTSemanticAnalyzer>>#visitAssignmentNode: (as a
nice compiler, it processes its AST with visitors), that calls
OCASTSemanticAnalyzer>>#resolveVariableNode: but
which cannot resolve baz thus calls
OCASTSemanticAnalyzer>>#undeclaredVariable: whose
responsibility is to deal with the situation of undeclared
variables.

Note: resolving variables can be a complex task because, in Pharo,
methods and expressions can be used in various contexts with, sometimes,
particular rules. For instance, the playground (workspace) has some
specific variables lazily declared; and the debugger has to deal with
methods currently executed, thus runtime contexts (frames) that require
a non-trivial binding process. Under the hood, the requestor can also be
involved in such symbol resolution. However, I chose to skip this
complexity in this article.

Here is the source code of
OCASTSemanticAnalyzer>>#undeclaredVariable:

undeclaredVariable: variableNode
    compilationContext optionSkipSemanticWarnings
        ifTrue: [ ^UndeclaredVariable named: variableNode name asSymbol ].
    ^ OCUndeclaredVariableWarning new
        node: variableNode;
        compilationContext: compilationContext;
        signal

If we are in the specific mode optionSkipSemanticWarnings,
the name is simply resolved as a special undeclared variable. Since it’s not the
case currently, I won’t give more detail (yet).

What follows is more interesting.

OCUndeclaredVariableWarning is a subclass of
Notification, a basic class of the kernel of the Pharo
language that is a subclass of Exception (the same kind of
exception we discussed in the previous section). Exceptions in Pharo
work more or less like what you get in many other programming languages.
You catch them with the on:do: method of blocks (that we
have already explained) and throw them with the signal
method.

What is noticeable here is the ^ (a return) in
front of the signaling of the exception. Notification is a
special kind of Exception that has the ability to be
resumed. Once resumed, the execution of the program continues after the
signal message send. The second special feature of
Notification is that when unhandled (no on:do:
catches it and the notification “goes through” the whole call stack)
then signal has no particular effect and just returns
nil. This is explicit in the method
Notification>>#defaultAction:

defaultAction
    "No action is taken. The value nil is returned as the value of
    the message that signaled the exception."

    ^nil

In summary, Notification instances are just
notifications; if nothing cares, then signal has no
effect.
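Both features can be observed in a playground with a bare Notification.
This is a minimal sketch and does not involve the compiler classes:

```smalltalk
| unhandled resumed |

"Unhandled: signal just returns nil and the execution continues"
unhandled := Notification new signal.
unhandled.   "nil"

"Handled and resumed: signal returns the value given to resume:,
 and the computation after the signal (+ 1) still takes place"
resumed := [ Notification new signal + 1 ]
    on: Notification
    do: [ :notification | notification resume: 41 ].
resumed.     "42"
```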

Let’s go back to
OCASTSemanticAnalyzer>>#undeclaredVariable:. A
notification OCUndeclaredVariableWarning is signaled, and
if some method in the call stack cares and catches the notification, it
can choose to do something and possibly resume the execution with a
Variable object that shall be used to bind
baz.

Is this design decision sound? Let’s discuss this.

There are some drawbacks in the use of such notifications. First, the
link between the signaler
(OCASTSemanticAnalyzer>>#undeclaredVariable:) and the
potential catchers is indirect in the code: it is circumstantial.
Second, a given catcher might unwarily catch a notification it did not
expect (from another compiler, for instance), especially with
Notification because they are silent by default. But the
advantage is that some grandparent callers have more latitude to set up
the kind of execution environment it requires and deal with potential
notifications. We shall explore this possibility later.

An alternative design could be callback based: give the compiler some
objects to call when such decisions have to be made. It could be a block
(lambda) or, for instance, the requestor since we already have one. This
design has the advantage of making the subordination relationship more
obvious in the code, but it might require more management (to store and
pass objects around).

Part of another approach could be to have a set of alternative
behaviors in the compiler that can be activated or configured by the
client (with boolean flags, for instance). This offers a certain control
to the client (that sets up the configuration) and gives the
responsibility of implementing them to the compiler. The drawbacks are
that the effect of flags is limited and that the space of available
configuration combinations can become large, with possible complex
interactions or conflicts.

Another approach could be to silently use a placeholder for the
variable baz (let’s call it
UndeclaredVariable), then continue the compilation and
produce a CompiledMethod instance as the result of the
compile method. The caller is then free to inspect this
CompiledMethod instance, detect the presence of undeclared
variables, then choose to act. The obvious issue is that maybe the
compilation (including byte code generation) was just done for nothing,
wasting precious CPU time and Watts. The advantage is that the compiler
is simpler (no need to try to repair or even report errors) and that the
caller can easily manage multiple error conditions at the same time,
whereas the two other approaches basically force the caller to solve
each error situation one by one.

Readers might look again at the
optionSkipSemanticWarnings at the beginning of the method
and realize that it feels like these two last alternatives are
implemented here. UndeclaredVariable instances are a real thing and,
for instance, are used when source code is analysed for highlighting.
They are also used in two other cases:
package loading (because cycles in dependencies are hard) and code
invalidation (because you can always remove classes or instance
variables).

OCUndeclaredVariableWarning>>#defaultAction

So, since baz is not declared,
OCASTSemanticAnalyzer signals an
OCUndeclaredVariableWarning hoping that something can catch
it with the task to provide a Variable object to be bound
to the name baz.

But in the scenario, the notification is not caught by anyone. Is
nil associated with baz? This is not what we
need, nor what
OCASTSemanticAnalyzer>>#resolveVariableNode: needs, by the
way.

The answer is in
OCUndeclaredVariableWarning>>#defaultAction (see code
below) which overrides the default
Notification>>#defaultAction that is shown in the
previous section.

defaultAction
    | className selector |
    className := self methodClass name.
    selector := self methodNode selector.

    NewUndeclaredWarning signal: node name in: (selector
        ifNotNil: [className, '>>', selector]
            ifNil: ['<unknown>']).

    ^super defaultAction ifNil: [ self declareUndefined ]

The first part just creates a system notification. You can see them
in the Transcript (basically the system console of Pharo),
or in the standard output in command line mode (search them in the build
log produced by Jenkins, they are numerous, thus hard to miss).

The second part delegates to the superclass and, if the superclass
does not care, falls back to
OCUndeclaredVariableWarning>>#declareUndefined that
is:

declareUndefined
    ^UndeclaredVariable registeredWithName: node name

So we get an UndeclaredVariable object, which shall make
OCASTSemanticAnalyzer happy since it is a very acceptable
thing to bind baz to.

OCSemanticWarning>>#defaultAction

The superclass of OCUndeclaredVariableWarning is
OCSemanticWarning. What does it offer?

defaultAction

    compilationContext interactive ifFalse: [ ^nil ].
    ^self openMenuIn:
        [:labels :lines :caption |
        UIManager default chooseFrom: labels lines: lines title: caption]

  • compilationContext interactive is true if
    there is a requestor and it is interactive, false otherwise.
    Our requestor is still the instance of RubScrolledTextMorph
    and is interactive, so we continue.
  • UIManager>>#chooseFrom:lines:title: is a standard
    UI abstract method to pop up a selection window according to the current
    system UI (here MorphicUIManager), or launch a
    command-line menu when in command line mode, or even produce a warning
    and select the default when in non-interactive mode (asking for things
    in non-interactive mode deserves a warning).

What is openMenuIn:? There are 3 implementations:

  • OCSemanticWarning>>#openMenuIn: (the method
    introduction), that just calls self subclassResponsibility.
    This is the Pharo way to declare the method abstract (it signals an
    error if executed).
  • OCShadowVariableWarning>>#openMenuIn: (a subclass
    that is not part of the scenario), that just calls
    self error: 'should not be called', which also just signals an
    error.
  • OCUndeclaredVariableWarning>>#openMenuIn:, a
    large Pharo method of 55 lines that is discussed in the next
    section.

What uses openMenuIn:? There are 2 senders:

  • OCSemanticWarning>>#defaultAction
    (obviously),
  • OCUndeclaredVariableWarning>>#openMenuIn:. A
    recursive call? We shall see.

This leads to some more questions:

  • Is it reasonable that the compiler cares about the interactiveness
    of the requestor? Note that it could have been a recent addition since
    most requestors are not aware of that part of the API. See
    CompilationContext>>#interactive that uses the
    questionable message respondsTo:.
  • Why such polymorphism if there is only one effective implementation?
    Code leftover? Future-proofing?
  • Why pass a block as an argument if no other sender exists? It seems
    superfluous.
  • Is it the responsibility of a Notification object to
    call UI with a menu?

In the next post, we will present the menu, do the reparation and try
to get out of here (the compiler is far away in the call stack) to
finish the compilation successfully.

Sentiment Analysis in Pharo using a real data set

You are a movie reviewer, and a colleague has just sent you a set of files with hundreds of reviews whose sentiments must be determined, for example by classifying them as positive or negative. You read that machine learning can help here by processing massive amounts of data with a classifier. But computers are not good with textual data, so all these reviews need to be converted into a machine-friendly format (hint: vectors of numbers). How do we go from hundreds of text files to an object which can predict new inputs? Meet TF-IDF + Naive Bayes: an algorithm which penalizes words that appear frequently in most of the texts, and a machine learning classifier which has proven to be useful for natural language processing.

The whole idea of TF-IDF is to measure the importance of words in a collection of documents (a so-called “corpus”). So if we can just “teach” the machine which words are important for sentiment analysis, then we can classify the sentiments in your colleague’s reviews. Teaching means that something was learned before: this is our dataset, which was enriched with knowledge. Fortunately, people have already annotated the sentiments of IMDB reviews to help with our task.

You would probably also like to do other high-level analyses of text, like hot topic detection, or any quantitative analysis (meaning: one which can be ranked). Although there is no all-in-one recipe, chances are that there is a standardized workflow for you, which could include: lowercasing words; removing stop words, punctuation, abbreviations, apostrophes and single characters; stemming; or term recognition.

So the basic idea is to go from text to vectors (with TF-IDF) so it can be applied to a predictor algorithm. Later, in a second part, we will use Naïve Bayes as classifier and, of course, you can try to generalize to other types of algorithms like SVM or Neural Networks.
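To make the TF-IDF idea concrete, here is a toy sketch in plain Pharo collections. It does not use the AI packages: the two tiny “documents” are made up, and the weighting is the textbook variant (raw count times ln(N / df)), which is an assumption and may differ from what a library computes.

```smalltalk
| docs df idf tfidf |
"Two hypothetical, already tokenized documents"
docs := #( #('good' 'movie' 'good') #('bad' 'movie') ).

"Document frequency: in how many documents does each word appear?"
df := Bag new.
docs do: [ :doc | df addAll: doc asSet ].

"Inverse document frequency: ln(number of documents / document frequency)"
idf := [ :word | (docs size / (df occurrencesOf: word)) ln ].

"TF-IDF of a word in a document: raw count in the document times IDF"
tfidf := [ :word :doc | (doc occurrencesOf: word) * (idf value: word) ].

tfidf value: 'movie' value: docs second.   "0.0, 'movie' appears in every document"
tfidf value: 'good' value: docs first.     "2 * (2 ln), roughly 1.386"
```

A word present in every document gets a weight of zero, while a word concentrated in one document stands out. Each document can then be represented by the vector of the weights of its words.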

Dataset

We are going to use the IMDB Large Movie Review Dataset with 50,000 reviews where 1 review = 1 file. They are divided into two folders: one for training (25k) and another one for testing (25k). Additionally, both the training and testing sets are sub-divided into positive (12.5k) and negative (12.5k) annotated reviews. The reviews are rated between 1 and 10 stars. A review is considered positive if it has 7 stars or more, and negative if it has 4 stars or fewer; there are no reviews with 5 or 6 stars.

The IMDB dataset is commonly used in a Natural Language Processing (NLP) task named “binary sentiment classification”. Summarizing: it is used when you want to build something to distinguish between two types or classes of “sentiments”, positive or negative. You could also expand the classification into as many classes as there are star ratings. In this case you could consider classifying using up to 8 classes.

To start working with the dataset, download and uncompress the files to the Pharo image directory (which is where your .image file is located) as follows:

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz

Now you should have a folder named aclImdb/ with all the files ready to analyze.

Setup

You can now launch Pharo. For this article we use Pharo 10, but the process should work for Pharo 11 too:

./pharo-ui Pharo.image &

Let’s first install the AI packages in Pharo:

EpMonitor disableDuring: [ 
  Metacello new
    baseline: 'AIPharo';
    repository: 'github://pharo-ai/ai/src';
    onWarningLog;
    load ]

Data Exploration

To bring some context, we could say that in the Data Science pipeline there are some typical steps for classification tasks. They can be grouped into 3 big stages: Data Engineering (Exploration, Wrangling, Cleansing, Preparation), Machine Learning (Model Learning, Model Validation) and Operations (Model Deployment, Data Visualization).

Now let’s begin the stage commonly named “Data wrangling”. This is what popular libraries like pandas do. A first step here is data exploration and data sourcing. The uncompressed dataset has the following directory structure:

aclImdb/
    test/
        neg/
        pos/
    train/
        neg/
        pos/

With the following expression, open a Pharo Inspector on the result of the train reviews (highlight and evaluate with Cmd + I or Ctrl + I):

('aclImdb/train/' asFileReference childrenMatching: 'neg;pos')
  collect: [ : revFileDir | revFileDir children collect: #contents ].

And it looks like this:

You have two main containers there. One contains the negative reviews (very funny to read indeed), and the other one the positive ones. Hold on to this information for later.

Sourcing the annotations

Classification tasks include some kind of annotation somewhere, which you can use as the “label” to train a model. Hopefully, your raw data includes a column with it. In this case the stars (i.e. the classes) are in the file name of each review (which has the pattern reviewID_reviewStarRating.txt), so if you want to enrich your classifier with more classes, you could check whether the star rating in the file name is 7 or more, or 4 or less. We will adapt our previous expression to add a sentiment polarity of 1 if the review is positive and 0 if it is negative. We do not need to check the star rating in the file name for that: this information is already available in the directory name, so we adapt our script to associate the polarity with each review:

| reviews |
reviews := (#('train' 'test') collect: [ : setName | 
  (('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos') 
    collect: [ : revFileDir | 
	| polarity |
        polarity := (revFileDir basename endsWith: 'neg') 
                           ifTrue: [ 0 ] ifFalse: [ 1 ].			
	revFileDir children 
	     collectDisplayingProgress: [ : file | file contents -> polarity ] ] ]).
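If you do want the star rating as a finer-grained class, it can be pulled out of the file name with plain String messages. This is a minimal sketch: the file name below is hypothetical and only follows the reviewID_reviewStarRating.txt pattern described above.

```smalltalk
| fileName rating |
"Hypothetical file name following the reviewID_reviewStarRating.txt pattern"
fileName := '127_9.txt'.

"Drop the extension, then keep what follows the last underscore"
rating := ((fileName copyUpToLast: $.) copyAfterLast: $_) asInteger.
rating.   "9"
```

In the script above, the same expression could be applied to file basename inside the innermost block, to associate each review with its rating instead of the binary polarity.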

In a real project, now could be a good time to create a ReviewsCollector class, and create a basic protocol for loading and reading reviews. You could also consider using a DataFrame internally instead of “plain” Collections, especially if you want to augment each review in the dataset with features to be calculated. Here we will concentrate on the raw workflow rather than building an object model.

Note: A Pharo/Smalltalk session typically involves evaluating expressions directly in the Inspector evaluator. You can copy & paste scripts from this post and re-evaluate the whole workflow from the start each time (if you have enough time), but I encourage you to use the Inspector, which is more in line with Exploratory Data Analysis (EDA). At the end of your working session, you can save the image, or just build a script for reproducibility. In this post we will also checkpoint each step for better reproducibility, using the built-in Pharo serializer.

Duplicates removal

To start cleaning the dataset, one of the first tasks we could do is to check if there are duplicates, and remove them from our dataset. We use the message #asSet to remove duplicates:

| reviews dedupReviews |
reviews := (#('train' 'test') collect: [ : setName | 
  (('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos') 
    collect: [ : revFileDir | 
	| polarity |
        polarity := (revFileDir basename endsWith: 'neg') 
                           ifTrue: [ 0 ] ifFalse: [ 1 ].			
	revFileDir children 
	     collectDisplayingProgress: [ : file | file contents -> polarity ] ] ]).
dedupReviews := reviews deepFlatten asSet.
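Why this works on review -> polarity pairs can be checked on a toy collection. The three associations below are made up for illustration: two of them are equal (same key and same value), so the Set keeps only one.

```smalltalk
| toyReviews dedup |
"Made-up reviews; the first and the last are exact duplicates"
toyReviews := { 'great film' -> 1. 'awful film' -> 0. 'great film' -> 1 }.

dedup := toyReviews asSet.
dedup size.   "2"
```

Note that only exact duplicates are removed this way: two reviews that differ by a single character are still considered different.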

Special artifacts removal

After manual inspection we can see that our dataset contains artifacts, such as HTML tags. This means the data was scraped from HTML web pages, and such tags would not be detected by our word tokenizer, which can recognize separators and special characters but not HTML tags. You could discover tags by exploring with the Pharo Inspector (Cmd + I or Ctrl + I) with a script like this:

| reviews dedupReviews |
reviews := (#('train' 'test') collect: [ : setName | 
  (('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos') 
    collect: [ : revFileDir | 
	| polarity |
        polarity := (revFileDir basename endsWith: 'neg') 
                           ifTrue: [ 0 ] ifFalse: [ 1 ].			
	revFileDir children 
	     collectDisplayingProgress: [ : file | file contents -> polarity ] ] ]).
dedupReviews := reviews deepFlatten asSet.
dedupReviews anySatisfy: [ : assoc | 
	| reviewText |
	reviewText := assoc key.
	(reviewText findTokens: ' ') anySatisfy: [ : word | word beginsWith: '<br' ] ]

So, if we pick a random review, our idea is to go from:

Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.

to

Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.

And we can do it with a simple expression which splits the whole sentence String by the HTML BR pattern and then joins the split substrings:

| reviews dedupReviews cleanedReviews |
reviews := (#('train' 'test') collect: [ : setName | 
	(('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos') 
		collect: [ : revFileDir | 
			| polarity |
		    polarity := (revFileDir basename endsWith: 'neg') ifTrue: [ 0 ] ifFalse: [ 1 ].			
			revFileDir children 
		     collectDisplayingProgress: [ : file | file contents -> polarity ] ] ]).
dedupReviews := reviews deepFlatten asSet.
cleanedReviews := dedupReviews collectDisplayingProgress: [ : docAssoc | 
	(docAssoc key findBetweenSubstrings: #('<br />')) joinUsing: '' ].

So #findBetweenSubstrings: can detect multiple patterns and tokenize the receiver, and then we join the tokens again to get rid of noise patterns. Of course you can adapt and play with the expression to suit your own needs. I feel it is a good starting point and it avoids nasty regular expressions. Other nonsense text artifacts you might want to check are: ‘\n’, EOL, ‘^M’, ‘\r’.
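For a single known pattern, a possible alternative (not the approach used in this post) is the standard copyReplaceAll:with: message, which replaces every occurrence in one go. A minimal sketch on a shortened, made-up excerpt:

```smalltalk
| review cleaned |
"Shortened excerpt with two consecutive line-break tags"
review := 'a wider audience.<br /><br />Branagh steals the film'.

"Replace every occurrence of the tag by the empty string"
cleaned := review copyReplaceAll: '<br />' with: ''.
cleaned.   "'a wider audience.Branagh steals the film'"
```

Replacing the tag with a space instead of the empty string would avoid gluing the last word of a sentence to the first word of the next one.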

To generalize for other artifacts, use the #removeSpecialArtifacts: method.

| reviews dedupReviews cleanedReviews tokenizedReviews |
reviews := (#('train' 'test') collect: [ : setName | 
 (('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos') 
   collect: [ : revFileDir | 
     | polarity |
     polarity := (revFileDir basename endsWith: 'neg') 
        ifTrue: [ 0 ] ifFalse: [ 1 ].		
     revFileDir children 
	collectDisplayingProgress: [ : file | file contents -> polarity ] ] ]).
dedupReviews := reviews deepFlatten asSet.
cleanedReviews := dedupReviews collectDisplayingProgress: #removeSpecialArtifacts.

"This serialization step is optional and could take some time to complete"
FLSerializer 
	serialize: cleanedReviews 
	toFileNamed: 'acImdb_49582_nodups_noartfcts.fuel'.

cleanedReviews
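To resume from such a checkpoint in a later session, the file can be read back with the Fuel materializer. This is a sketch assuming the class-side convenience messages of the Fuel version shipped with Pharo 10; the tiny collection and the file name are stand-ins so that the snippet can run without the dataset.

```smalltalk
| sample restored |
"Stand-in for cleanedReviews, so the round trip is quick to try"
sample := { 'great film' -> 1. 'awful film' -> 0 }.

"Write the checkpoint, then read it back as one would in a new session"
FLSerializer serialize: sample toFileNamed: 'sample_checkpoint.fuel'.
restored := FLMaterializer materializeFromFileNamed: 'sample_checkpoint.fuel'.

restored = sample.   "true"
```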

Notice that you cannot clean such artifacts directly with a (typical) tokenizer, because tokenization involves detection of punctuation characters: if you apply tokenization first, you could lose common (written) language expressions which include punctuation, for example a smiley 🙂

Punctuation, Special characters (Tokenization)

The next logical step is to transform each of the cleaned reviews Collection (which is composed of Strings “rows”, where a row = a document), into sequences of words, a process called whitespace tokenization, so they only contain words without “noise”.

Things become very interesting when it comes to the analysis of special characters and punctuation. From a naive point of view, just removing all separators would be simple, clean and enough. But language systems are much more complicated, especially when you bring into the analysis variables such as idiom, alphabet types, or even noise. For example, if you are doing a finer semantic (linguistic) analysis then punctuation could be significant, because the target language affects the meaning of a sentence.
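To get a feeling for the naive end of the spectrum, here is a sketch of pure whitespace splitting with standard String messages only (no AI package involved; the sentence is made up):

```smalltalk
| text tokens |
text := 'A talented cast, on good form :)'.

"Lowercase, then split on whitespace only"
tokens := text asLowercase substrings.
tokens.   "'a' 'talented' 'cast,' 'on' 'good' 'form' ':)'"
```

The comma stays glued to 'cast,' while the smiley survives as a token of its own: splitting on whitespace alone neither cleans punctuation nor destroys it, which is why a punctuation-aware tokenizer is a separate decision.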

Removal of punctuation and special characters is done by sending the #tokenize message to any Collection of String. We can see it in action by evaluating the following expression:

| reviews dedupReviews cleanedReviews wordTokenizer tokenizedReviews |
reviews := (#('train' 'test') collect: [ : setName | 
 (('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos') 
   collect: [ : revFileDir | 
     | polarity |
     polarity := (revFileDir basename endsWith: 'neg') 
        ifTrue: [ 0 ] ifFalse: [ 1 ].
     revFileDir children 
       collectDisplayingProgress: [ : file | file contents -> polarity ] ] ]).
dedupReviews := reviews deepFlatten asSet.
wordTokenizer := AIWordTokenizer specialArtifacts.
cleanedReviews := dedupReviews collectDisplayingProgress: [ : docAssoc | 
	(docAssoc key removeSpecialArtifacts: wordTokenizer) -> docAssoc value ].
tokenizedReviews := cleanedReviews collectDisplayingProgress: [ : docAssoc | 
	docAssoc key tokenize -> docAssoc value ].

"This serialization step is optional and could take some time to complete"
FLSerializer 
	serialize: tokenizedReviews 
	toFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized.fuel'.

tokenizedReviews
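
If you serialized the result, a later session can load it back instead of recomputing everything. The following is a minimal round-trip sketch on a tiny made-up collection. It assumes the Fuel version in your image provides the class-side convenience messages (#serialize:toFileNamed:, used above, and its counterpart #materializeFromFileNamed:); the file name is hypothetical.

```smalltalk
| sample restored |
"Hypothetical stand-in for tokenizedReviews: (tokens -> polarity) Associations"
sample := { #('great' 'movie') -> 1 . #('boring' 'plot') -> 0 }.
"Write the object graph to disk"
FLSerializer serialize: sample toFileNamed: 'sample_reviews.fuel'.
"Read it back, for example in a fresh image"
restored := FLMaterializer materializeFromFileNamed: 'sample_reviews.fuel'.
"The materialized copy should compare equal to the original"
restored = sample
```

Newer Fuel releases also offer an instance-based API, so check the Fuel documentation for your Pharo version if these messages are not understood.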

Stopwords

Words such as “the”, “of”, “a”, etc. could be removed in two ways: by hand (using premade stopword lists) or by the automagical (statistical) use of TF-IDF. Before choosing, note that there are excellent, differing opinions on the pros and cons of removing stop words. TL;DR: Removing stopwords with TF-IDF depends on the context and the goal of your task. We can check whether the TF-IDF algorithm “automatically” ranks low the very frequent terms that appear in many documents.
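
To get an intuition for why TF-IDF ranks such terms low, here is a minimal sketch in plain Pharo. It uses one common variant of the inverse document frequency, idf(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term t. Real implementations often add smoothing, and the three tiny documents are made up for illustration.

```smalltalk
| docs idf |
"Hypothetical corpus: three already tokenized documents"
docs := { #('the' 'movie' 'was' 'great') .
          #('the' 'plot' 'was' 'weak') .
          #('the' 'cast' 'shines') }.
"idf(t) = ln(N / df(t)); fails with ZeroDivide if the term is in no document"
idf := [ : term | 
	| df |
	df := (docs select: [ : doc | doc includes: term ]) size.
	(docs size / df) ln ].
idf value: 'the'.   "appears in 3 of 3 documents: ln 1 = 0.0"
idf value: 'plot'.  "appears in 1 of 3 documents: ln 3, about 1.0986"
```

Since the TF-IDF score multiplies the term frequency by this factor, a word that appears in every document gets a score of zero no matter how often it occurs.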

If you decide to go with stopword removal, the stopwords package in Pharo provides multiple premade stopword lists. We can use a default list of stopwords, but you can use another one if you prefer.

AIStopwords forEnglish.
AIStopwords forSpanish.

To explore other lists:

AIStopwords listSummary.
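
Conceptually, removing stopwords by hand is just a filter over the tokens. The following minimal sketch uses only plain Pharo collections; the stopword list is a tiny hypothetical one, not what AIStopwords answers, and the sentence is made up.

```smalltalk
| stopwords tokens filtered |
"Hypothetical, tiny stopword list; a Set makes the #includes: test fast"
stopwords := #('the' 'of' 'a' 'was') asSet.
"Whitespace tokenization of a made-up sentence"
tokens := 'the plot of the movie was a mess' substrings.
"Keep only the tokens that are not stopwords"
filtered := tokens reject: [ : each | stopwords includes: each ].
filtered
```

Here only 'plot', 'movie' and 'mess' remain. Note that the comparison is case sensitive, so you may want to lowercase the tokens first.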

So far, our script with stopword removal looks like this:

| reviews dedupReviews cleanedReviews wordTokenizer tokenizedReviews |
reviews := (#('train' 'test') collect: [ : setName | 
 (('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos') 
   collect: [ : revFileDir | 
     | polarity |
     polarity := (revFileDir basename endsWith: 'neg') 
        ifTrue: [ 0 ] ifFalse: [ 1 ].		
     revFileDir children 
	collectDisplayingProgress: [ : file | file contents -> polarity ] ] ]).
dedupReviews := reviews deepFlatten asSet.
wordTokenizer := AIWordTokenizer specialArtifacts.
cleanedReviews := dedupReviews collectDisplayingProgress: [ : docAssoc | 
	(docAssoc key removeSpecialArtifacts: wordTokenizer) -> docAssoc value ].
tokenizedReviews := cleanedReviews collectDisplayingProgress: [ : docAssoc | 
	docAssoc key tokenizeWithoutStopwords -> docAssoc value ].

"This serialization step is optional and could take some time to complete"
FLSerializer 
	serialize: tokenizedReviews 
	toFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized.fuel'.

tokenizedReviews

To skip stopword removal, just replace #tokenizeWithoutStopwords with #tokenize.

This concludes the first part, which covered reading and cleaning data for a classification task. In the next article we will see how to classify these reviews with a classifier.