Expert Systems with Applications 32 (2007) 24–37
www.elsevier.com/locate/eswa

American sign language (ASL) recognition based on Hough transform and neural networks

Qutaishat Munib *, Moussa Habeeb, Bayan Takruri, Hiba Abed Al-Malik
Department of Computer Information System, King Abdullah II School for Information Technology,
University of Jordan, University Street, Amman 11942, Jordan

Abstract

The work presented in this paper aims to develop a system for the automatic translation of static gestures of alphabets and signs in American sign language. In doing so, we have used the Hough transform and a neural network that is trained to recognize signs. Our system does not rely on using any gloves or visual markings to achieve the recognition task. Instead, it deals with images of bare hands, which allows the user to interact with the system in a natural way. An image is processed and converted to a feature vector that will be compared with the feature vectors of a training set of signs. The extracted features are not affected by the rotation, scaling or translation of the gesture within the image, which makes the system more flexible.
The system was implemented and tested using a data set of 300 samples of hand sign images, 15 images for each sign. Experiments revealed that our system was able to recognize selected ASL signs with an accuracy of 92.3%.
© 2005 Elsevier Ltd. All rights reserved.

Keywords: American sign language; Neural network; Hough transform; Canny edge detection; Sobel edge detection; Feature extraction

1. Introduction

Sign language is the fundamental communication method among people who suffer from hearing impairments. In order for an ordinary person to communicate with deaf people, a translator is usually needed to translate sign language into natural language and vice versa (International Bibliography of Sign Language, 2005; International Journal of Language & Communication Disorders, 2005).
As a primary component of many sign languages, and in particular American Sign Language (ASL), hand gestures and the finger-spelling language play an important role in deaf learning and communication. Therefore, sign language can be considered as a collection of gestures, movements, postures, and facial expressions corresponding to letters and words in natural languages.
A gesture is defined as a dynamic movement, such as waving hi, hello or good-bye. Simple gestures are made in two ways (Sturman & Zeltzer, 1994; Watson, 1993). The first way involves a simple or complex posture and a change in the position or orientation of the hand, such as making a pinching posture and changing the hand's position. The second way entails moving the fingers in some way with no change in the position and orientation of the hand, for example, moving the index and middle finger back and forth to urge someone to move closer. A complex gesture is one that includes finger, wrist or hand movement (i.e. changes in both position and orientation). There are two types of gesture interaction: communicative gestures work as a symbolic language (which is our focus in this research) and manipulative gestures provide multi-dimensional control. Moreover, we can divide gestures into static gestures (hand postures) and dynamic gestures (Cutler & Turk, 1998; Hong et al., 2000). Indeed, the hand motion conveys as much meaning as its posture does.

* Corresponding author. E-mail address: [email protected] (Q. Munib).


A static sign is determined by a certain configuration of the hand, while a dynamic gesture is a moving gesture determined by a sequence of hand movements and configurations. Dynamic gestures are sometimes accompanied by body and facial expressions.
The aim of sign language recognition is to provide an easy, efficient and accurate mechanism to transform sign language into text or speech. With the help of computerized digital image processing (Gonzalez, Woods, & Eddins, 2004) and neural network techniques (Haykin, 1999), a system that can recognize the alphabet flow can recognize and interpret ASL words and phrases. A gesture recognition system has four main components: gesture modeling, gesture analysis, gesture recognition and gesture-based application systems.

1.1. American sign language

American Sign Language (ASL) (International Bibliography of Sign Language, 2005; National Institute on Deafness & Other Communication Disorders, 2005) is a complete language that employs signs made with the hands and other gestures, including facial expressions and postures of the body. ASL also has its own grammar that is different from that of other languages such as English and Swedish. ASL consists of approximately 6000 gestures of common words, with finger spelling used to communicate unclear words or proper nouns. Finger spelling uses one hand and 26 gestures to communicate the 26 letters of the alphabet. The 26 letters of the ASL alphabet are shown in Fig. 1.

1.2. Related work

Attempts to automatically recognize sign language began to appear in the literature in the 90s. Research on hand gestures can be classified into two categories. The first category relies on electromechanical devices that are used to measure the different gesture parameters such as the hand's position, angle, and the location of the fingertips. Systems that use such devices are usually called glove-based systems (e.g. the work of Grimes (1983) at AT&T Bell Labs, who developed the ''Digital Data Entry Glove''). The major problem with such systems is that they force the signer to wear cumbersome and inconvenient devices. As a result, the way in which the user interacts with the system becomes complicated and less natural.

Fig. 1. The American sign language finger spelling alphabet.



The second category exploits machine vision and image processing techniques to create visual based hand gesture recognition systems. Visual-based gesture recognition systems are further divided into two categories. The first one relies on using specially designed gloves with visual markers, called ''visual-based gesture with glove–markers (VBGwGM)'', that help in determining hand postures (Dorner & Hagen, 1994; Fels & Hinton, 1993; Starner, 1995). A summary of selected research efforts is listed in Table 1.
But using gloves and markers does not provide the naturalness required in human–computer interaction systems. Besides, if colored gloves are used, the processing complexity is increased.
As an alternative, the second kind of visual based hand gesture recognition system can be called ''pure visual-based gesture (PVBG)'' (i.e. visual-based gesture without glove–markers). This type tries to achieve the ultimate convenience and naturalness by using images of bare hands to recognize gestures.
Among many factors, five important factors must be considered for the successful development of a vision-based solution to collecting data for hand posture and gesture recognition (Ong & Ranganath, 2005; Starner, 1995; Sturman & Zeltzer, 1994; Watson, 1993):

• The placement and number of cameras used.
• The visibility of the object (hand) to the camera, for simpler extraction of hand data/features.
• The extraction of features from the stream or streams of raw image data.
• The ability of the recognition algorithms to work with the extracted features.
• The efficiency and effectiveness of the selected algorithms to provide maximum accuracy and robustness.

A number of recognition techniques are available, and in some cases they can be applied to both types of vision-based solutions (i.e. VBGwGM and PVBG). In general these recognition techniques can be categorized into three broad categories:

A. Feature extraction, statistics, and models. This category can be divided into six sub-categories:
   1. Template matching (e.g. research work of Darrell & Pentland, 1993; Newby, 1993; Sturman, 1992; Watson, 1993; Zimmerman, Lanier, Blanchard, Bryson, & Harvill, 1987).
   2. Feature extraction and analysis (e.g. research work of Rubine, 1991; Sturman, 1992; Wexelblat, 1994, 1995).
   3. Active shape models ''smart snakes'' (e.g. research work of Heap & Samaria, 1995).
   4. Principal component analysis (e.g. research work of Birk, Moeslund, & Madsen, 1997; Martin & James, 1997; Takahashi & Kishino, 1991).
   5. Linear fingertip models (e.g. research work of Davis & Shah, 1993; Rangarajan & Shah, 1991).
   6. Causal analysis (e.g. research work of Brand & Irfan, 1995).
B. Learning algorithms. This category can be divided into three sub-categories:
   1. Neural networks (e.g. research work of Banarse, 1993; Fels, 1994; Fukushima, 1989; Murakami & Taguchi, 1991).
   2. Hidden Markov Models (e.g. research work of Charniak, 1993; Liang & Ouhyoung, 1998; Nam & Wohn, 1996; Starner, 1995).
   3. Instance-based learning (research work of Kadous, 1995; also see Aha, Dennis, & Marc, 1991).
C. Miscellaneous techniques. This category can be divided into three sub-categories:
   1. The linguistic approach (e.g. research work of Hand, Sexton, & Mullan, 1994).
   2. Appearance-based motion analysis (e.g. research work of Davis & Shah, 1993).
   3. Spatio-temporal vector analysis (e.g. research work of Quek, 1994).

Regardless of the approach used (i.e. VBGwGM or PVBG), many researchers have been trying to introduce hand gestures to the Human–Computer Interaction field.

Table 1
A summary of gloves used (Research: Gloves used)

Dorner and Hagen (1994): a cotton glove, with various areas of it painted different colors to enable tracking (i.e. gloves with rings of colors around each joint)
Starner (1995): two colored gloves, an orange glove on the left hand and a yellow glove on the right hand
Fels and Hinton (1993) and Fels (1994): VPL DataGlove Mark II with a Polhemus tracker as input devices; wearing a glove that the user moves in certain ways, users would learn to generate vocal sounds (Glove-Talk)
Takahashi and Kishino (1991): VPL DataGlove
Murakami and Taguchi (1991): VPL DataGlove
Kramer and Leifer (1990): CyberGlove
Vamplew (1996): a single CyberGlove with position tracking
Rung-Huei and Ouhyoung (1996): DataGlove as input devices
Kadous (1996): Power gloves
Grobel and Assan (1996): colored gloves
Wexelblat (1994, 1995): CyberGlove on each hand

Charayaphan and Marble (1992) investigated a way of using image processing to understand American Sign Language (ASL). Their system can correctly recognize 27 of the 31 ASL symbols. Fels and Hinton (1993) developed a system using a VPL DataGlove Mark II with a Polhemus tracker as input devices. In their system, the neural network method was employed for classifying hand gestures. Another system using neural networks, developed by Banarse (1993), was vision-based and recognized hand postures using a neocognitron network, a neural network based on the spatial recognition system of the visual cortex of the brain. Heap and Samaria (1995) extended the active shape models, or ''smart snakes'', technique to recognize hand postures and gestures using computer vision. In their system, they apply an active shape model and a point distribution model for tracking a human hand. Starner and Pentland (1995) used a view-based approach with a single camera to extract two-dimensional features as input to HMMs. The correct rate was 91% in recognizing sentences comprising 40 signs. Kadous (1996) demonstrated a system based on power gloves to recognize a set of 95 isolated Auslan signs with 80% accuracy, with an emphasis on computationally inexpensive methods. Grobel and Assan (1996) used HMMs to recognize isolated signs with 91.3% accuracy out of a 262-sign vocabulary. They extracted the features from video recordings of signers wearing colored gloves. Vogler and Metaxas (1997) used computer vision methods to extract the three-dimensional parameters of a signer's arm motions, and coupled the computer vision methods with HMMs to recognize continuous American sign language sentences with a vocabulary of 53 signs. They modeled context-dependent HMMs to alleviate the effects of movement epenthesis. An accuracy of 89.9% was observed. Yoshinori, Kang-Hyun, Nobutaka, and Yoshiaki (1998) used colored gloves and have shown that using solid colored gloves allows faster hand feature extraction than simply wearing no gloves at all. Liang and Ouhyoung (1998) used HMMs for continuous recognition of Taiwan sign language with a vocabulary of between 71 and 250 signs, with a DataGlove as the input device. However, their system required that gestures performed by the signer be slow in order to detect the word boundary. Yang and Ahuja (1999) investigated dynamic gesture recognition: they utilized skin colour detection and affine transforms of the skin regions in motion to detect the motion trajectory of ASL signs. Using a time delayed neural network, they recognised 40 ASL gestures with a success rate of around 96%. But their technique potentially has a high computational cost when false skin regions are detected. A local feature extraction technique was employed to detect hand shapes in sign language recognition by Imagawa, Matsuo, Taniguchi, Arita, and Igi (2000). They used an appearance-based eigen method to detect hand shapes. Using a clustering technique, they generated clusters of hand shapes on an eigenspace, achieving around 93% recognition of 160 words. Bowden and Sarhadi (2002) developed a non-linear model of shape and motion for tracking finger-spelt American sign language. Their approach is based on one-state transitions of the English language which are projected into shape space for tracking and model prediction using an HMM-like approach. Symeonidis (2000) used orientation histograms to recognize static hand gestures, specifically a subset of American Sign Language (ASL). A pattern recognition system used a transform that converts an image into a feature vector, which will then be compared with the feature vectors of a training set of gestures. The system was implemented with a perceptron network. The main problem with this technique is how good a differentiation one can achieve. This of course is dependent upon the images, but it comes down to the algorithm as well. It may be enhanced using other image processing techniques like edge detection. For further information and hot topics on this issue, a modern and excellent survey can be found in Ong and Ranganath (2005).

2. System design and implementation

2.1. System overview

Our system is designed to visually recognize all static signs of the American Sign Language (ASL): all signs of the ASL alphabet, single digit numbers used in ASL (e.g. 3, 5, 7) and a sample of words (e.g. love, meet, more), using bare hands. The users/signers are not required to wear any gloves or to use any devices to interact with the system. However, different signers vary in their hand shape, size, body size, operation habits and so on, which brings about more difficulties in recognition. Therefore, we realized the necessity of investigating signer-independent sign language recognition, to improve the system robustness and practicability in the future by using a Hidden Markov Model (HMM) (Seymore, McCallum, & Rosenfeld, 1999). The combination of the powerful Hough transformation with excellent image processing and neural network capabilities has led to the successful development of the ASL recognition system using MATLAB (Gonzalez et al., 2004). Our method relies on presenting the gesture as a feature vector that is translation, scale and rotation invariant.
The system has two phases: the feature extraction phase and the classification phase, as shown in Fig. 2. Images were prepared using Portable Document Format (PDF) form, so the system deals with images that have a uniform background (PDF Reference, 2004). The feature extraction applies image processing techniques which involve using algorithms to detect and isolate various desired portions of the digitized sign. During this phase, each colored image is resized and then converted from RGB to a grayscale one. This is followed by an edge detection technique, the so-called Canny edge detection (Canny, 1986).

The goal of edge detection is to mark the points in an image (the sign image) at which the intensity changes sharply. Sharp changes in image properties usually reflect important events and changes in world properties. The Canny operator (Canny, 1986) was originally designed to be an optimal edge detector (according to particular criteria; there are other detectors around, such as the Sobel and Roberts cross operators, that also claim to be optimal with respect to slightly different criteria). In general, most of these operators take as input a gray scale image and produce as output an image showing the positions of tracked intensity discontinuities.
The next important step is the application of the Hough transform (Hough, 1962). The Hough transform is a feature extraction technique used in digital image processing. The classical transform identifies lines in the image, but it has been extended to identifying positions of arbitrary shapes. The transform universally used today was invented by Richard Duda and Peter Hart in 1972 (Duda & Hart, 1972), who called it a ''generalized Hough transform'' after the related 1962 patent of Paul Hough.
In the classification stage, a 3-layer, feed-forward back propagation neural network (Haykin, 1999) is constructed. It consists of (160) inputs, (214 * 3) neurons for the first hidden layer, (214 * 2) neurons for the second hidden layer, and (214 * 1) output neurons in this classification network.

Fig. 2. System overview.

2.2. Feature extraction phase

2.2.1. Resize the image
Images of signs were resized to 80 by 64. By default, ''imresize'' uses nearest neighbor interpolation to determine the values of pixels in the output image, but other interpolation methods can be specified. We use the 'bicubic' method because, if the specified output size is smaller than the size of the input image, ''imresize'' applies a lowpass filter before interpolation to reduce aliasing; the default filter size is 11-by-11.

2.2.2. Convert from RGB to Grayscale
To alleviate the problem of the different lighting conditions of the captured signs, and the non-linearity of the HSV color space ''(Hue, Saturation, Value), also called HSB (Hue, Saturation, Brightness)'', we eliminate the hue and saturation information while retaining the luminance. The RGB color space (Red, Green and Blue, which are considered the primary colors of the visible light spectrum) is converted, through a gray scale image, to a binary image.
Binary images are images whose pixels have only two possible intensity values. They are normally displayed as black and white. Numerically, the two values are often 0 for black, and either 1 or 255 for white. Binary images are often produced by thresholding a grayscale or color image, in order to separate an object in the image from the background. The color of the object (usually white) is referred to as the foreground color. The rest (usually black) is referred to as the background color. However, depending on the image which is to be thresholded, this polarity might be inverted, in which case the object is displayed with 0 and the background with a non-zero value. This conversion resulted in sharp and clear details for the image (as shown in Fig. 3).
It is obvious, as shown in Fig. 3, that converting the RGB color space to the HSV color space and then to a binary image produced images that lack many features of the sign.

Fig. 3. Transformation of RGB color image into a binary image.
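The paper does not publish its preprocessing code, so the following is only a minimal MATLAB sketch of the steps described in Sections 2.2.1 and 2.2.2 (resize with bicubic interpolation, grayscale conversion, binarisation). The file name and the use of graythresh/im2bw are assumptions, not the authors' implementation.

    % Minimal sketch of Sections 2.2.1-2.2.2 (assumed reconstruction, not the authors' code).
    rgb   = imread('sign_A.jpg');                 % hypothetical input image of a sign
    small = imresize(rgb, [80 64], 'bicubic');    % resize to 80-by-64; bicubic applies a lowpass filter first
    gray  = rgb2gray(small);                      % drop hue/saturation information, keep luminance
    bw    = im2bw(gray, graythresh(gray));        % threshold to a binary (black/white) image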

2.2.3. Edge detection

The Hough transform is used to identify the parameters of a curve that best fits a set of given edge points. Edge detection is obtained from a feature detector using the Canny edge detector. We used the Canny edge detection technique because it provides the optimal edge detection solution. Here, in the proposed system, an 'optimal' edge detector means:

• good detection—the algorithm should mark as many real edges in the image as possible,
• good localisation—edges marked should be as close as possible to the edge in the real image,
• minimal response—a given edge in the image should only be marked once, and where possible, image noise should not create false edges.

The Canny edge detector results in a better edge detection compared to the Sobel edge detector, as shown in Table 2. The output of the edge detector defines 'where' features are in an image, whereas the Hough transform determines both 'what' the features are and 'how many' of them exist in the image.
As shown in Table 2, the Sobel method was not an appropriate method to use, because it hides many important details needed to make a representative feature vector. The Canny method is better, but in some cases it produces more detail than needed. To solve this problem, we decided to use a threshold of (0.25) after testing different threshold values and observing their effect on the overall recognition system. Table 3 shows some samples of signs with and without the threshold.

Table 2. Choosing the appropriate edge detection method: original image, Sobel output and Canny output for the signs 'A', 'M' and 'U' (images omitted).

Table 3. Canny edge detector with/without threshold (0.25) for the signs 'A', 'M' and 'U' (images omitted).
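As a rough illustration of this step, the comparison above can be reproduced with the Image Processing Toolbox's edge function. The variable gray refers to the resized grayscale image from the earlier sketch, and passing 0.25 directly as the Canny threshold argument is our assumption about how the value was applied.

    % Sketch of the edge-detection step (Section 2.2.3); an assumed reading, not the authors' code.
    edges_sobel = edge(gray, 'sobel');         % Sobel: hides fine detail (cf. Table 2)
    edges_canny = edge(gray, 'canny');         % Canny with default thresholds: extra, unneeded detail
    edges_used  = edge(gray, 'canny', 0.25);   % Canny with the 0.25 threshold chosen experimentally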

2.2.4. Hough transform

We have used the radon function, which represents an image as a collection of projections along various directions. We have applied radon on two different ranges of theta (θ): (0 to 360) and (−180 to 180) degrees. The range (0 to 360) led to misclassification in the classification phase, while the range (−180 to 180) was better; therefore, it has been selected.
Our algorithm is based on studying the distribution of variation for each line layer by computing the mean (average) and standard deviation along the range (−180 to 180) for each radius. The feature vector that results from applying this algorithm is called the 'Theta Radius Distribution Matrix'. This feature vector is the input for the classification phase.
The main advantage of using the Hough transform technique is that it is tolerant to gaps in feature boundary descriptions and it is relatively unaffected by image noise.
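A minimal MATLAB sketch of how we read this feature extraction step follows. The exact layout and scaling of the final 160-element feature vector are not fully specified in the paper, so summarising each radius by its mean and standard deviation over θ is an assumption.

    % Sketch of the Radon-based 'Theta Radius Distribution Matrix' (assumed reconstruction).
    theta = -180:179;                        % projection angles in degrees, range (-180 to 180)
    R = radon(edges_used, theta);            % rows correspond to radii, columns to angles
    features = [mean(R, 2); std(R, 0, 2)];   % per-radius mean and standard deviation across angles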
For simplicity, let us introduce the Hough transform for straight lines, which is in fact a special case of the Radon transform. For a function f(x, y) on a 2-D Euclidean plane, the transform is defined as

R(\rho, \theta) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y)\,\delta(\rho - x\cos\theta - y\sin\theta)\,dx\,dy

where δ is the Dirac delta function. The δ term forces integration of f along the line specified by ρ and θ, and is equivalent to the Hough transform for binary images.
For shapes other than straight lines, the Radon transform is expressed by replacing the argument of δ by a function which forces integration of the image along the shape (that is, its boundary).
For the proposed system to handle sign language recognition, the following parameters for a generalized sign can be defined: a = {y, s, θ}, where y = (x_r, y_r) is a reference origin for the shape in the image space, θ is the shape orientation, and s = (s_x, s_y) describes two orthogonal scale factors along the x- and y-axis respectively. For the sake of simplicity we shall discuss the specific (but significant) case where s is a scalar, and the parameter space has four dimensions.
For simplicity, consider the sign ''O'' taken from a side angle (shown in Fig. 4), which represents a circle with fixed radius r0; since the radius is fixed, we do not need the scale parameter, and due to the symmetry the orientation parameter is redundant as well, so the accumulator is congruent to the image.
But this is not the case if we consider arbitrary shape recognition, as shown in Fig. 5. Let the tangent line at the point x be directed by angle φ (note that this angle is measured with respect to some fixed direction in the image space); we agree on taking the angle in the range (0, π], and to some precision due to our computational constraints. Let y be a point chosen arbitrarily as the reference point for the shape—it is convenient, though not necessary, to choose a point close to the centroid of the shape—and let r = y − x be the displacement vector.
Note that the shape boundary may contain other points at which the gradient (tangent) has the same direction—e.g., x′ as shown in Fig. 5—and the corresponding displacement vector is different, r′ ≠ r. We shall store the displacement vectors as a function mapping each possible φ to a set of displacement vectors. The resulting data structure is called the ''Theta Radius Distribution Matrix''. Its format is illustrated by the table below, where y stands for the reference point chosen for the shape, B for the shape boundary, and φ(x) denotes the gradient (tangent) direction at x.

  i        φ_i       R_{φ_i}
  1        Δφ        {r : y − r = x, x ∈ B, φ(x) = Δφ}
  2        2Δφ       {r : y − r = x, x ∈ B, φ(x) = 2Δφ}
  ...      ...       ...
  π/Δφ     π         {r : y − r = x, x ∈ B, φ(x) = π}

Fig. 4. Simplification of ‘‘O’’ sign.

Fig. 5. Coordination in image space.



One may choose to measure directions from −π/2 to π/2 instead, or in any other range of the same magnitude—the table rows will just be cyclically shifted.
The reference point inside the image sign can be chosen according to the following heuristic:

y = (y_1, y_2) = \frac{1}{|B|}\left(\sum_{x \in B} x_1,\ \sum_{x \in B} x_2\right)

that is, the reference is the mean of the pattern points. This will keep r relatively small (not much larger than necessary), consequently reducing the error.
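To make the data structure concrete, here is a conceptual MATLAB sketch of building such a displacement table from a binary boundary image (e.g. the Canny output edges_used). It is only our illustration of the description above; the system's actual feature vector is the Radon-based one sketched in Section 2.2.4, and the gradient-based angle estimate and bin width are assumptions.

    % Conceptual sketch of the displacement table described above (not the authors' code).
    [rows, cols] = find(edges_used);           % boundary pixel coordinates (row = y, col = x)
    yref = [mean(cols), mean(rows)];           % reference point: centroid of the pattern points
    [gx, gy] = gradient(double(edges_used));   % rough gradient estimate at every pixel
    dphi  = pi/32;                             % angular bin width (assumed)
    nbins = round(pi/dphi);
    Rtab  = cell(nbins, 1);                    % one set of displacement vectors per angle bin
    for k = 1:numel(cols)
        phi = mod(atan2(gy(rows(k), cols(k)), gx(rows(k), cols(k))), pi);  % direction folded into [0, pi)
        b   = min(nbins, floor(phi/dphi) + 1);
        Rtab{b} = [Rtab{b}; yref - [cols(k), rows(k)]];   % displacement r = y - x
    end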
2.3. Classification phase

The classification neural network is shown in Fig. 6; the neural network has (200) instances as its input vector, and 214 output neurons in the output layer.

2.3.1. Network architecture
When working with neural networks it is hard to predict what the result will be. Sometimes practice represents the best solution. Decisions in this field are very difficult; you have to examine different architectures and decide according to their results.
Therefore, after several experiments, it was decided that the proposed system should be based on supervised learning, in which the learning rule is provided with a set of examples (the training set).

2.3.2. Target architecture
The size of our target matrix is 214 by 200. Fig. 7 depicts a part of the target, where each row specifies the class to which an instance belongs.

2.3.3. Creating the network
Creating a network object is accomplished by training a feed-forward back propagation network, using the MATLAB built-in function (newff). It requires four inputs and returns the network object. The first input is an (R, 2) matrix of minimum and maximum values for each of the R elements of the input vector. The second input is an array containing the sizes of each layer; the third input is a cell array containing the names of the transfer functions used in each layer. The final input contains the name of the training function used. The following command was used to create a three-layer network:

net1 = newff(input_range, [214*3 214*2 214], {'logsig' 'logsig' 'logsig'}, 'traingdx')

There are (214 * 3) neurons in the first hidden layer, (214 * 2) neurons in the second hidden layer, and (214 * 1) neurons in the output layer. For all three layers the Log-Sigmoid ('logsig') transfer function was used.
After executing the above command, a network object is created, the weights and biases of the network are initialized, and the network is ready for training.

2.3.4. Training the network
The training process starts with a set of examples of proper network behavior: network inputs and target outputs. During training, the weights and biases of the network are iteratively adjusted to minimize the network performance function (net.performFcn). The default performance function for feed-forward networks is the mean square error (MSE), which is defined as the average squared error between the network outputs and the target outputs.
The training function that was used is traingdx. It works accurately with noisy patterns in training and increases the network accuracy on unseen samples; that is why we chose it in the final implementation. The number of epochs was 10,000 and the goal was 0.0001.

Fig. 6. Classification network.

Fig. 7. Part of target matrix for the training set.
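The following MATLAB sketch pulls Sections 2.3.2–2.3.4 together under stated assumptions: P is the matrix of training feature vectors (one column per sample), labels holds each sample's class index, the target matrix is approximated as one-hot columns (our reading of Fig. 7), and the older newff/train interface of the Neural Network Toolbox is used.

    % Sketch of network creation and training (Sections 2.3.2-2.3.4); an assumed
    % reconstruction, not the published code. P: features x samples, labels: 1 x samples.
    T = zeros(214, size(P, 2));                         % target matrix (e.g. 214 by 200 for training)
    T(sub2ind(size(T), labels, 1:size(P, 2))) = 1;      % mark the target row of each sample
    input_range = [min(P, [], 2) max(P, [], 2)];        % (R,2) min/max of every input element
    net1 = newff(input_range, [214*3 214*2 214], {'logsig' 'logsig' 'logsig'}, 'traingdx');
    net1.trainParam.epochs = 10000;                     % stopping criteria reported in the paper
    net1.trainParam.goal   = 1e-4;
    net1 = train(net1, P, T);                           % gradient descent with momentum and adaptive learning rate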



2.3.5. Testing and simulating the network
The system was tested with (100) untrained images (five images for each sign), previously unseen, in the testing phase. The MATLAB built-in function (sim) simulates a network: (sim) takes the network input and the network object, then returns the network outputs. More than one network was trained and simulated. For user convenience and simplicity, we have created a GUI that helps in presenting the results. Examples are shown in Figs. 8 and 9, respectively.

Fig. 8. Example on GUI simulating sign ‘L’.

Fig. 9. Example on GUI simulating sign ‘meet’.
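A minimal sketch of this testing step is given below. Here extract_features stands for the preprocessing and Radon steps sketched earlier (an assumed helper, not a toolbox function), and reporting the most active output neuron as the predicted class is our interpretation of how the outputs are read.

    % Sketch of Section 2.3.5 (assumed reconstruction): classify one unseen sign image.
    x = extract_features(imread('test_sign.jpg'));   % hypothetical helper: resize, binarise, Canny, radon
    y = sim(net1, x);                                % simulate the trained network on the feature vector
    [score, class_index] = max(y);                   % winning output neuron = predicted class
    fprintf('Predicted class %d (output %.2f)\n', class_index, score);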



3. Experiments results and analysis

In this section, we evaluate the performance of our recognition system by testing its ability to classify signs for both the training and testing sets of data. The effect of the number of inputs to the neural network is considered. In addition, we discuss some problems in the performance of some signs due to the similarities between them.

3.1. Data set

The data set used for training and testing the recognition system consists of grayscale images for all of the 20 signs used in the experiment; these 20 signs are shown in Fig. 10. Also, 15 samples of each sign were taken from 15 different volunteers. For each sign, 10 of the 15 samples were used for training, while the remaining five samples were used for testing. The samples were taken from different distances with a digital camera, and with different orientations.

Fig. 10. ASL signs used in the proposed system.

Fig. 11. Training chart for six samples for each sign.

This way, we were able to obtain a data set with cases that have different sizes and orientations, so we can examine the capabilities of our feature extraction scheme.

3.2. Recognition rate

We evaluate the performance of the system based on its ability to correctly classify samples into their corresponding classes. The recognition rate is defined as the ratio of the number of correctly classified samples to the total number of samples, i.e.

\text{Recognition rate} = \frac{\text{number of correctly classified signs}}{\text{total number of signs}} \times 100\%
Through the experiments on the proposed system, we first trained our system on six samples of each sign and with no threshold for the Canny edge detector. The training chart for this network is shown in Fig. 11. The testing results, for both training and testing data, are shown in Table 4.
Although the overall system has a good performance, it cannot be guaranteed, because the performance (recognition rate) on the tested data is very low (51%) and the amount of training data is too small and does not cover different reasonable orientations.
In order to obtain a more satisfactory result, we trained the network on eight samples of each sign and with a (0.15) threshold for the Canny edge detector. The training chart for this network is shown in Fig. 12. The testing results for both training and testing data are shown in Table 5.
As shown, these results were much better than the previous ones. Another experiment, which was the most satisfactory, was the network trained on 10 samples of each sign and with a (0.25) threshold for the Canny edge detector. The training chart for this network is shown in Fig. 13. The testing results for both training and testing data are shown in Table 6, and the detailed results for this last network are shown in Table 7.

Table 4. Results of training on six samples per sign and without a Canny threshold
Data       No. of samples   Recognized samples   Recognition rate (%)
Training   120              120                  100.0
Testing    100               51                   51.00
Total      220              171                   77.72

Table 5. Results of training on eight samples per sign and with a (0.15) Canny threshold
Data       No. of samples   Recognized samples   Recognition rate (%)
Training   160              158                   98.75
Testing    100               80                   80.00
Total      260              238                   91.53

3.3. Hardware and software

The system was implemented in MATLAB version 6.5. The recognition training and tests were run on a standard PC (1.8 GHz AMD processor, 128 MB of RAM) running under Windows 2000.

Fig. 12. Training chart for a network trained on eight samples for each sign and with (0.15) Canny threshold.

Fig. 13. Training chart for a network trained on 10 samples for each sign and with (0.25) Canny threshold.

Table 6. Results of training on 10 samples per sign and with a (0.25) Canny threshold
Data       No. of samples   Recognized samples   Recognition rate (%)
Training   200              197                   98.50
Testing    100               80                   80.00
Total      300              277                   92.33

Table 7. Results obtained when training on 10 samples per sign with a Canny threshold of 0.25
Sign   Recognized samples   Misclassified samples   Recognition rate (%)
A      15                   0                       100
B      14                   1                        93.3
D      15                   0                       100
E      14                   1                        93.3
F      15                   0                       100
I      12                   3                        80.0
K      15                   0                       100
L      13                   2                        86.7
M      14                   1                        93.3
R      13                   2                        86.7
U      13                   2                        86.7
V      14                   1                        93.3
W      14                   1                        93.3
Y      11                   4                        73.3
3      15                   0                       100
5      15                   0                       100
7      13                   2                        86.7
Love   12                   3                        80.0
More   15                   0                       100
Meet   15                   0                       100
Total  277                  23                       92.33

4. Conclusions and future work

In this project, we developed a system for the recognition of a subset of American Sign Language. The system has two phases: the feature extraction phase and the classification phase. The work was accomplished by training on a set of input data (feature vectors). Without the need for any gloves, an image of the sign is taken by a camera. After preprocessing, the feature extraction phase depends on the Hough transformation, which is tolerant to gaps in feature boundary descriptions and relatively unaffected by image noise. The resulting vectors are fed to the neural network.
The proposed system proved to be robust against changes in gesture, position, size and direction. This is because the feature extraction method used proved to be translation, scale, and rotation invariant. The proposed system was able to reach a recognition rate of about 98.5% for training data and 80% for testing data.

4.1. Future work

• The work presented in this project dealt with static signs of ASL only. Extending the system to be able to deal with dynamic signs is an attractive point for future work.
• The system deals with images that have a uniform background. Removing this limitation will make the system more flexible and suitable for real life applications, as there is no control over the environment (i.e. the background).
• It is important to consider increasing the data set size, so we can have a more accurate and higher performance system.
• Training the network on other types of images.

• Looking for possible changes in the environment by designing a new system that works in a real-time environment.

References

Aha, D. W., Dennis, K., & Marc, A. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.
Banarse, D. S. (1993). Hand posture recognition with the neocognitron network. Master's thesis, School of Electronic Engineering and Computer Systems, University College of North Wales, Bangor.
Birk, H., Moeslund, T. B., & Madsen, C. B. (1997). Real-time recognition of hand alphabet gestures using principal component analysis. In Proceedings of the 10th Scandinavian conference on image analysis.
Bowden, R., & Sarhadi, M. (2002). A non-linear model of shape and motion for tracking finger spelt American sign language. Image and Vision Computing, 20(9–10), 597–607.
Brand, M., & Irfan, E. (1995). Causal analysis for visual gesture understanding. MIT Media Laboratory Perceptual Computing Section Technical Report No. 327.
Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), November 1986.
Charayaphan, C., & Marble, A. (1992). Image processing system for interpreting motion in American sign language. Journal of Biomedical Engineering, 14, 419–425.
Charniak, E. (1993). Statistical language learning. Cambridge: MIT Press.
Cutler, R., & Turk, M. (1998). View-based interpretation of real-time optical flow for gesture recognition. IEEE International Conference on Automatic Face and Gesture Recognition.
Darrell, T., & Pentland, A. (1993). Recognition of space–time gestures using a distributed representation. MIT Media Laboratory Vision and Modeling Group Technical Report No. 197.
Davis, J., & Shah, M. (1993). Gesture recognition. Technical Report CS-TR-93-11, Department of Computer Science, University of Central Florida.
Dorner, B., & Hagen, E. (1994). Towards an American sign language interface. Artificial Intelligence Review, 8(2–3), 235–253.
Duda, R. O., & Hart, P. E. (1972). Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15, 11–15.
Fels, S. (1994). Glove-TalkII: mapping hand gestures to speech using neural networks, an approach to building adaptive interfaces. PhD thesis, Computer Science Department, University of Toronto.
Fels, S., & Hinton, G. (1993). GloveTalk: a neural network interface between a DataGlove and a speech synthesizer. IEEE Transactions on Neural Networks, 4, 2–8.
Fukushima, K. (1989). Analysis of the process of visual pattern recognition by the neocognitron. Neural Networks, 2, 413–420.
Gonzalez, R. C., Woods, R. E., & Eddins, S. L. (2004). Digital image processing using MATLAB. Prentice Hall.
Grimes, G. (1983). Digital data entry glove interface device. Patent 4,414,537, AT&T Bell Labs.
Grobel, K., & Assan, M. (1996). Isolated sign language recognition using hidden Markov models. In Proceedings of the international conference on systems, man and cybernetics, pp. 162–167.
Hand, C., Sexton, I., & Mullan, M. (1994). A linguistic approach to the recognition of hand gestures. In Proceedings of the designing future interaction conference, University of Warwick, UK.
Haykin, S. (1999). Neural networks: a comprehensive foundation. Prentice Hall.
Heap, A. J., & Samaria, F. (1995). Real-time hand tracking and gesture recognition using smart snakes. In Proceedings of interface to real and virtual worlds, Montpellier.
Hong, P. et al. (2000). Gesture modeling and recognition using finite state machines. IEEE international conference on automatic face and gesture recognition, pp. 410–415.
Hough, P. (1962). Method and means for recognizing complex patterns. US Patent 3,069,654.
Imagawa, K., Matsuo, H., Taniguchi, R., Arita, D., & Igi, S. (2000). Recognition of local features for camera-based sign language recognition system. In International conference on pattern recognition (ICPR), pp. 4849–4853.
International Bibliography of Sign Language (2005). http://www.sign-lang.uni-hamburg.de/bibweb/F-Journals.html.
International Journal of Language & Communication Disorders (2005). Available from http://www.newcastle.edu.au/renwick/ROL/Jnlcontents/000mgmgf.htm.
Kadous, W. (1995). GRASP: recognition of Australian sign language using instrumented gloves. Bachelor's thesis, University of New South Wales.
Kadous, W. (1996). Machine recognition of Auslan signs using PowerGlove: towards large-lexicon recognition of sign language. In Proceedings of the workshop on the integration of gesture in language and speech, Wilmington, DE, pp. 165–174.
Kramer, J., & Leifer, L. (1990). A ''Talking Glove'' for nonverbal deaf individuals. Technical Report CDR TR 1990 0312, Centre For Design Research, Stanford University.
Liang, R. H., & Ouhyoung, M. (1998). A real-time continuous gesture recognition system for sign language. In Proceedings of the third international conference on automatic face and gesture recognition, Nara, Japan, pp. 558–565.
Martin, J., & James, L. C. (1997). An appearance-based approach to gesture recognition. In Proceedings of the ninth international conference on image analysis and processing, pp. 340–347.
Murakami, K., & Taguchi, H. (1991). Gesture recognition using recurrent neural networks. In Proceedings of CHI'91 human factors in computing systems, pp. 237–242.
Nam, Y., & Wohn, K. Y. (1996). Recognition of space–time hand-gestures using hidden Markov model. In Proceedings of the ACM symposium on virtual reality software and technology '96 (pp. 51–58). ACM Press.
National Institute on Deafness and Other Communication Disorders (2005). Available from http://www.nidcd.nih.gov/health/hearing/asl.asp.
Newby, G. (1993). Gesture recognition using statistical similarity. In Proceedings of virtual reality and persons with disabilities.
Ong, S., & Ranganath, S. (2005). Automatic sign language analysis: a survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6).
PDF Reference (2004). Addison-Wesley, 5th ed.
Quek, F. (1994). Toward a vision-based gesture interface. In Proceedings of the ACM symposium on virtual reality software and technology '94 (pp. 17–31). ACM Press.
Rangarajan, K., & Shah, M. (1991). Establishing motion correspondence. CVGIP: Image Understanding, 54, 56–73.
Rubine, D. (1991). Specifying gestures by example. In Proceedings of SIGGRAPH'91 (pp. 329–337). ACM Press.
Rung-Huei, L., & Ouhyoung, M. (1996). A sign language recognition system using hidden Markov model and context sensitive search. In Proceedings of the ACM symposium on virtual reality software and technology '96 (pp. 59–66). ACM Press.
Seymore, K., McCallum, A., & Rosenfeld, R. (1999). Learning hidden Markov model structure for information extraction. AAAI 99 workshop on machine learning for information extraction.
Starner, T. (1995). Visual recognition of American sign language using hidden Markov models. Master's thesis, Massachusetts Institute of Technology.
Starner, T., & Pentland, A. (1995). Visual recognition of American sign language using hidden Markov models. International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, pp. 189–194.
Sturman, D. (1992). Whole-hand input. Ph.D. dissertation, Massachusetts Institute of Technology.
Sturman, D., & Zeltzer, D. (1994). A survey of glove-based input. IEEE Computer Graphics and Applications, 14(1), 30–39.

Symeonidis, K. (2000). Hand gesture recognition using neural networks. Master's thesis, University of Surrey.
Takahashi, T., & Kishino, F. (1991). Hand gesture coding based on experiments using a hand gesture interface device. SIGCHI Bulletin, 23(2), 67–73.
Vamplew, P. (1996). Recognition of sign language using neural networks. PhD thesis, Department of Computer Science, University of Tasmania.
Vogler, C., & Metaxas, D. (1997). Adapting hidden Markov models for ASL recognition by using three-dimensional computer vision methods. In Proceedings of the IEEE international conference on systems, man and cybernetics, Orlando, pp. 156–161.
Watson, R. (1993). A survey of gesture recognition techniques. Technical Report TCD-CS-93-11, Department of Computer Science, Trinity College Dublin.
Wexelblat, A. (1994). A feature-based approach to continuous-gesture analysis. Master's thesis, Massachusetts Institute of Technology.
Wexelblat, A. (1995). An approach to natural gesture in virtual environments. ACM Transactions on Computer–Human Interaction, 2(3), 179–200.
Yang, M., & Ahuja, N. (1999). Recognizing hand gestures using motion trajectories. In IEEE international conference on computer vision and pattern recognition (CVPR), pp. 466–472.
Yoshinori, K., Tomoyuki, I., Kang-Hyun, J., Nobutaka, S., & Yoshiaki, S. (1998). Vision-based human interface system: selectively recognizing intentional hand gestures. In Proceedings of the IASTED international conference on computer graphics and imaging, pp. 219–223.
Zimmerman, T., Lanier, J., Blanchard, C., Bryson, S., & Harvill, Y. (1987). A hand gesture interface device. In Proceedings of CHI + GI'87 human factors in computing systems and graphics interface (pp. 189–192). ACM Press.
