Skip to content

Commit bac518c

Browse files
authored
Merge pull request #36 from neil-lindquist/arff-reader
Arff reader
2 parents c4af329 + 4264151 commit bac518c

File tree

13 files changed

+396
-78
lines changed

13 files changed

+396
-78
lines changed

.travis.yml

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@ env:
1313
- PATH=$HOME/.roswell/bin:$PATH
1414
- ROSWELL_INSTALL_DIR=$HOME/.roswell
1515
matrix:
16-
# - LISP="sbcl-bin --dynamic-space-size 2560"
16+
- LISP=sbcl-bin
1717
- LISP=ccl-bin
1818
# allow_failures:
1919
# - LISP=ccl-bin
@@ -40,6 +40,10 @@ cache:
4040
- $HOME/.config/common-lisp
4141

4242
script:
43-
- ros -s prove -e '(ql:quickload :clml.test)(in-package :clml.test)
44-
(unless (run-all-tests)
45-
(uiop:quit 1))'
43+
- ros dynamic-space-size=2560 -e '(ql:quickload :clml.test)(in-package :clml.test)
44+
(let ((results (run-all-tests)))
45+
(print-failures results)
46+
(print-errors results)
47+
(when (or (lisp-unit:failed-tests results)
48+
(lisp-unit:error-tests results))
49+
(uiop:quit 1)))'

README.md

Lines changed: 17 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -40,7 +40,7 @@
4040
# CL Machine-Learning [![Build Status](https://travis-ci.org/mmaul/clml.svg?branch=master)](https://travis-ci.org/mmaul/clml)
4141

4242
CL Machine-Learning is high performance and large scale statistical
43-
machine learning package written in Common Lisp developed at
43+
machine learning package written in Common Lisp developed at
4444
[MSI](http://cl-www.msi.co.jp).
4545

4646
This repository contains is a authorized fork of the original CLML with the following goals in mind:
@@ -178,7 +178,7 @@ This library is organized as a hierarchical tree of systems.
178178
- clml.graph.graph-utils
179179
- clml.graph.read-graph
180180
- clml.graph.shortest-path
181-
- clml.hjs
181+
- clml.hjs
182182
- clml.hjs.k-means
183183
- clml.hjs.read-data
184184
- clml.hjs.vars
@@ -247,24 +247,24 @@ be done before loading the systems.
247247
Here is a quick demonstration:
248248

249249
CL-USER (ql:quickload :clml)
250-
250+
251251
CL-USER (clml.text.utilities:calculate-levenshtein-similarity "Howdy" "doody")
252252
0.6
253-
CL-USER
254-
CL-USER (setf *syobu* (clml.hjs.read-data:read-data-from-file
253+
CL-USER
254+
CL-USER (setf *syobu* (clml.hjs.read-data:read-data-from-file
255255
(clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/syobu.csv")
256256
:type :csv :csv-type-spec '(string integer integer integer integer)))
257-
258-
257+
258+
259259
#<HJS.LEARN.READ-DATA:UNSPECIALIZED-DATASET >
260260
DIMENSIONS: 種類 | がく長 | がく幅 | 花びら長 | 花びら幅
261261
TYPES: UNKNOWN | UNKNOWN | UNKNOWN | UNKNOWN | UNKNOWN
262262
NUMBER OF DIMENSIONS: 5
263263
DATA POINTS: 150 POINTS
264-
264+
265265
CL-USER (setf *tree* (clml.decision-tree.decision-tree:make-decision-tree *syobu* "種類"))
266-
267-
266+
267+
268268
(((("花びら長" . 30)
269269
(("花びら幅" . 18) ("花びら幅" . 23) ("花びら幅" . 20) ("花びら幅" . 19) ("花びら幅" . 25)
270270
("花びら幅" . 24) ("花びら幅" . 21) ("花びら幅" . 14) ("花びら幅" . 15) ("花びら幅" . 22)
@@ -278,7 +278,7 @@ Here is a quick demonstration:
278278
("花びら幅" . 21) ("花びら幅" . 14) ("花びら幅" . 15) ("花びら幅" . 22) ("花びら幅" . 16)
279279
("花びら幅" . 17) ("花びら幅" . 13) ("花びら幅" . 11) ("花びら幅" . 12) ("花びら幅" . 10)
280280
...
281-
281+
282282
)))
283283
CL-USER
284284
CL-USER (clml.decision-tree.decision-tree:print-decision-tree *tree*)
@@ -323,7 +323,7 @@ about the tests previously ran.
323323
(print-failures myrun) ; prints details of test failures
324324

325325
Individual tests can be ran by the run-tests form. Individual test
326-
being dests defined with the form define-test.
326+
being dests defined with the form define-test.
327327

328328
(run-tests '(matrix-vecs-conversion-test matrix-transpose-test))
329329

@@ -354,13 +354,13 @@ CLML for packages matching the following prefix patterns:
354354

355355
Documentation is in the form of Org files where one Org file per package is placed in
356356
**clml/docs/api**. A package index file containing Org INCLUDE
357-
directives that include
357+
directives that include
358358
Org files generated by the form **generate-clml-api-docs** are placed
359359
in **clml/docs/api/index.org**.
360360

361361
The CLML users manual includes the generated API documentation file index.org,
362362
HTML documentation can then be generated by opening the clml-manual.org file in Emacs
363-
and entering the Org mode export mode with `C-c C-e` and selecting file export with `h h`
363+
and entering the Org mode export mode with `C-c C-e` and selecting file export with `h h`
364364

365365
The README.md file is generated by the org-mode export function.
366366
Which can be done by opening the README.org file in emacs and
@@ -370,7 +370,7 @@ selecting the markdown export option as shown below.
370370
M-x org-md-export-as-markdown
371371
C-x-C-w README.md
372372

373-
The CMLM manual and API documentation can be exported to the desired
373+
The CMLM manual and API documentation can be exported to the desired
374374
format by opening the docs/clml-manual.org and using the org-mode
375375
export `C-c C-e` cord.
376376

@@ -386,8 +386,8 @@ export `C-c C-e` cord.
386386
## Contributing
387387

388388
All contributions are welcome. If the contribution is to resolve and
389-
problem with CLML, please open an issue in the github repository
390-
accompanied by a pull request.
389+
problem with CLML, please open an issue in the github repository
390+
accompanied by a pull request.
391391

392392
If you would like to contribute new functionality, again open an issue
393393
at the clml github repository, describe the proposed functionality and

hjs/src/read-data.lisp

Lines changed: 62 additions & 46 deletions
Original file line numberDiff line numberDiff line change
@@ -51,8 +51,8 @@
5151
:accessor dimension-metadata
5252
:type list
5353
:initform '()
54-
:documentation "An alist that stores some useful information, e.g. the hashtable (for equality) for a category dimension."
55-
)))
54+
:documentation "An alist that stores some useful information, e.g. the hashtable (for equality) for a category dimension.")))
55+
5656

5757
(define-condition dimension-unknown-type-error (simple-error)
5858
((dimension :initarg :dimension)))
@@ -492,59 +492,75 @@
492492
(csv-delimiter #\,)
493493
(missing-value-check t)
494494
missing-values-list)
495-
"Reads CSV data from file. The normal convention is first line is column name.
496-
However if CSV-HEADER-P is a list of strings then CSV-HEADER-P specifies the column names"
497-
(assert (member type '(:sexp :csv)))
495+
"Reads an unspecialized dataset from file.
496+
- If TYPE is :SEXP or NIL (the default), a list is read from a s-expression
497+
file. The first element is a list of column names and the rest of the elements
498+
are the data. EXTERNAL-FORMAT defaults to :DEFAULT for opening the file.
499+
- If TYPE is :CSV, the data is read as csv. The normal convention is first
500+
line is column name. However if CSV-HEADER-P is a list of strings then
501+
CSV-HEADER-P specifies the column names. EXTERNAL-FORMAT defaults to shift-jis."
502+
(when (and (not external-format-p)
503+
(eql type :csv))
504+
(setf external-format clml.utility.csv::*csv-default-external-format*))
505+
(with-open-file (stream filename :external-format external-format)
506+
#+allegro (setf (excl:eol-convention f) :anynl-dos)
507+
(read-data-from-stream stream
508+
:type type
509+
:csv-type-spec csv-type-spec
510+
:csv-header-p csv-header-p
511+
:csv-delimiter csv-delimiter
512+
:missing-value-check missing-value-check
513+
:missing-values-list missing-values-list)))
498514

499-
(ecase type
500-
((:sexp nil)
501-
(let (tmp)
502-
(with-open-file (f filename :external-format external-format)
503-
(with-standard-io-syntax
504-
(let ((*read-eval* nil)
505-
(*read-default-float-format* 'double-float))
506-
(setf tmp (read f)))))
507-
508-
(make-unspecialized-dataset
509-
(first tmp)
510-
(map 'vector
511-
(lambda (p)
512-
(coerce p 'vector))
513-
(rest tmp))
514-
:missing-value-check missing-value-check
515-
:missing-values-list missing-values-list)))
516-
(:csv
517-
(multiple-value-bind (data header)
518-
(clml.utility.csv:read-csv-file filename :header csv-header-p :type-spec csv-type-spec :delimiter csv-delimiter
519-
:external-format (if external-format-p external-format
520-
#+allegro :932
521-
#+ccl :Windows-31j
522-
#+sbcl :sjis
523-
#+lispworks :sjis
524-
))
525-
(make-unspecialized-dataset (coerce header 'list) data
526-
527-
:missing-value-check missing-value-check
528-
:missing-values-list missing-values-list)))))
529515

530516
;;;; read and process data
531517
;;@ function-type: string -> unspecialized-dataset
532518
(defun read-data-from-stream (stream &key
519+
(type :csv)
533520
csv-type-spec
534521
(csv-header-p t)
535522
(csv-delimiter #\,)
536523
(missing-value-check t)
537-
missing-values-list)
538-
"Reads CSV data from a stream. The normal convention is first line is column name.
539-
However if CSV-HEADER-P is a list of strings then CSV-HEADER-P specifies the column names"
540-
(multiple-value-bind (data header)
541-
(clml.utility.csv:read-csv-stream stream :header csv-header-p :type-spec csv-type-spec :delimiter csv-delimiter)
542-
(make-unspecialized-dataset (coerce header 'list) data
543-
544-
:missing-value-check missing-value-check
545-
:missing-values-list missing-values-list))
546-
)
547-
524+
(missing-values-list nil missing-values-list-p))
525+
"Reads an unspecialized dataset from file.
526+
- If TYPE is :SEXP or NIL, a list is read from a s-expression file.
527+
The first element is a list of column names and the rest of the elements are the
528+
data.
529+
- If TYPE is :CSV (the default), the data is read as csv. The normal
530+
convention is first line is column name. However if CSV-HEADER-P is a list
531+
of strings then CSV-HEADER-P specifies the column names.
532+
- If TYPE is :ARFF, the data is read as arff. By default, NIL and \"?\"
533+
are used as missing values with arff format."
534+
(assert (member type '(:sexp :csv :arff)))
535+
(ecase type
536+
((:sexp nil)
537+
(let ((tmp (with-standard-io-syntax
538+
(let ((*read-eval* nil)
539+
(*read-default-float-format* 'double-float))
540+
(read stream)))))
541+
(make-unspecialized-dataset
542+
(first tmp)
543+
(map 'vector
544+
(lambda (p)
545+
(coerce p 'vector))
546+
(rest tmp))
547+
:missing-value-check missing-value-check
548+
:missing-values-list missing-values-list)))
549+
(:csv
550+
(multiple-value-bind (data header)
551+
(clml.utility.csv:read-csv-stream stream :header csv-header-p :type-spec csv-type-spec :delimiter csv-delimiter)
552+
(make-unspecialized-dataset (coerce header 'list) data
553+
:missing-value-check missing-value-check
554+
:missing-values-list missing-values-list)))
555+
(:arff
556+
;arff uses ? as a missing value symbol, so use that unless told otherwise
557+
(unless missing-values-list-p
558+
(setf missing-values-list '(nil "?")))
559+
(multiple-value-bind (data header)
560+
(clml.utility.arff:read-arff-stream stream)
561+
(make-unspecialized-dataset (coerce header 'list) data
562+
:missing-value-check missing-value-check
563+
:missing-values-list missing-values-list)))))
548564

549565
;;; function-type: unspecialized-dataset -> specialized-dataset
550566

test/clml.test.asd

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,6 +33,7 @@
3333
:components ((:file "package")
3434
(:file "test-groups")
3535
(:file "test-utils")
36+
(:file "test-arff")
3637
(:file "test-stat")
3738
(:file "test-decision-tree")
3839
(:file "test-random-forest")

test/src/package.lisp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,7 @@
4444
:clml.association-rule
4545
:clml.som
4646
:clml.pca
47+
:clml.utility.arff
4748
)
4849
(:shadow :make-polynomial-kernel :make-svm-learner :make-rbf-kernel :make-svm-validation
4950
:svm-validation :load-svm-learner :make-svm-kernel :make-linear-kernel
@@ -88,4 +89,3 @@
8889
*pca-tests*
8990
*hjs-tests*)
9091
)
91-

test/src/test-arff.lisp

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
2+
(in-package clml.test)
3+
4+
(define-test parse-arff
5+
(let* ((raw-data "% 1. Title: Iris Plants Database
6+
%
7+
% 2. Sources:
8+
% (a) Creator: R.A. Fisher
9+
% (b) Donor: Michael Marshall (MARSHALL%[email protected])
10+
% (c) Date: July, 1988
11+
12+
@RELATION iris
13+
14+
@ATTRIBUTE sepallength numeric
15+
@ATTRIBUTE sepalwidth REAL
16+
@ATTRIBUTE petallength reaL
17+
@ATTRIBUTE petalwidth Real
18+
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
19+
20+
@DATA
21+
4.8,3.0,1.4,0.3,Iris-setosa
22+
5.1,3.8,1.6,0.2,Iris-setosa
23+
4.6,3.2,1.4,0.2,Iris-setosa
24+
5.3,3.7,1.5,0.2,Iris-setosa
25+
5.0,3.3,1.4,0.2,Iris-setosa
26+
5.7,3.0,4.2,1.2,Iris-versicolor
27+
5.7,2.9,4.2,1.3,Iris-versicolor
28+
6.2,2.9,4.3,1.3,Iris-versicolor
29+
5.1,2.5,3.0,1.1,Iris-versicolor
30+
5.7,2.8,4.1,1.3,Iris-versicolor
31+
6.7,3.0,5.2,2.3,Iris-virginica
32+
6.3,2.5,5.0,1.9,Iris-virginica
33+
6.5,3.0,5.2,2.0,Iris-virginica
34+
6.2,3.4,5.4,2.3,Iris-virginica
35+
5.9,3.0,5.1,1.8,Iris-virginica
36+
%")
37+
(parsed-data (read-arff-stream (make-string-input-stream raw-data))))
38+
(assert-equal 15 (length parsed-data))
39+
(loop for row across parsed-data
40+
do (assert-equal 5 (length row))
41+
do (assert-true (floatp (aref row 0)))
42+
do (assert-true (floatp (aref row 1)))
43+
do (assert-true (floatp (aref row 2)))
44+
do (assert-true (floatp (aref row 3)))
45+
do (assert-true (stringp (aref row 4))))))

test/src/test-groups.lisp

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -75,14 +75,17 @@
7575
; test-spline error
7676
))
7777

78+
(defparameter *util-tests* '(parse-arff))
79+
7880
(defparameter *all-tests* (append *statistics-tests* *decision-tree-tests* *clustering-tests*
7981
*time-series-tests* *svm-tests* *classifiers-tests*
8082
; *association-rule-tests* error result key is cons not string
8183
*som-tests* *text-tests*
8284
; *pca-tests* takes too long
8385
*hjs-tests*
8486
*nonparametric-tests*
85-
*nearest-search*))
87+
*nearest-search*
88+
*util-tests*))
8689
#| #'test-sample-assoc #'test-sample-cluster-validation
8790
#'test-decision-tree #'test-sample-expl-smthing
8891
#'test-hc #'test-sample-k-means

test/src/test-k-means.lisp

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44
(define-test test-sample-k-means
55
(let (dataset centroids result)
66
(assert
7-
(setf dataset (read-data-from-file (clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/pos.sexp") :external-format #+allegro :932 #+sbcl :sjis #+ccl :Windows-31 #+lispworks :sjisj)))
7+
(setf dataset (read-data-from-file (clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/pos.sexp") :external-format #+allegro :932 #+sbcl :sjis #+ccl :Windows-31j #+lispworks :sjisj)))
88
(assert
99
(setf dataset
1010
(pick-and-specialize-data dataset :range '(2 3) :data-types '(:numeric :numeric))))
@@ -20,4 +20,3 @@
2020
do (assert-eql (c-size cluster) (length pts))
2121
(loop for pt across pts
2222
do (assert-true (find pt (mapcar #'p-pos (c-points cluster)) :test #'point-equal))))))
23-

test/src/test-k-nn.lisp

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -48,11 +48,11 @@
4848
(assert-points-equal
4949
(map 'vector (lambda (pts) (map 'clml.hjs.meta:dvec
5050
(lambda (val m)
51-
#+sbcl ; sbcl bug
51+
#+(or sbcl ccl)
5252
(if (zerop m)
5353
0d0
5454
(/ val m))
55-
#-sbcl
55+
#-(or sbcl ccl)
5656
(handler-case (/ val m)
5757
(division-by-zero (c) (declare (ignore c)) 0d0)))
5858
(subseq pts 1)
@@ -78,5 +78,3 @@
7878
"116" "80" "128" "188" "97" "167" "197" "196" "196" "196")
7979
(map 'list (lambda (vec) (svref vec 0)) (dataset-points result))
8080
:test #'string=)))))
81-
82-

test/src/test-ts-burst-detection.lisp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@
3535
(:INDEX 2 :START 450.0 :END 570.0) (:INDEX 3 :START 565.0 :END 570.0)
3636
(:INDEX 1 :START 710.0 :END 780.0)))
3737
(assert-equality
38-
#'equal
38+
#'string-equal
3939
(let ((str (make-array 0 :element-type 'character
4040
:adjustable t :fill-pointer t)))
4141
(with-output-to-string (s str)

0 commit comments

Comments
 (0)