Skip to content

supermlorg/superml-java

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SuperML Logo

By SuperML.org with SuperML.dev

SuperML Java

Build Status Maven Central Performance Tests Version

A comprehensive, modular Java Machine Learning Framework inspired by scikit-learn, developed by the SuperML community.

Featured On

SuperML Java listed on SuperML

Overview

SuperML Java 3.1.2 is a sophisticated 21-module machine learning library for Java that delivers enterprise-grade performance with 400K+ predictions/second and 21/21 modules compiling successfully. Now published on Maven Central. The framework provides:

  • 🎯 Supervised Learning: 15+ algorithms including Logistic Regression, Linear Regression, Ridge, Lasso, Decision Trees, Random Forest, XGBoost with lightning-fast training
  • 🤖 Transformer Models: Complete implementation with Encoder-Only (BERT), Decoder-Only (GPT), and Full Transformer architectures
  • 🔍 Unsupervised Learning: K-Means clustering with k-means++ initialization and advanced convergence criteria
  • ⚙️ Data Preprocessing: Feature scaling, normalization, encoding, and comprehensive transformation utilities
  • 🔧 Model Selection: Cross-validation, hyperparameter tuning (Grid/Random Search), and automated optimization
  • 🚀 Pipeline System: Seamless chaining of preprocessing and models like scikit-learn
  • 🤖 AutoML Framework: Automated algorithm selection and hyperparameter optimization with ensemble methods
  • 📊 Dual-Mode Visualization: Professional XChart GUI with ASCII terminal fallback
  • 🌐 Kaggle Integration: One-line training on any Kaggle dataset with automated workflows
  • ⚡ Inference Engine: High-performance model serving with microsecond predictions, caching, and monitoring
  • 📈 Comprehensive Metrics: Complete evaluation suite for classification, regression, and clustering
  • 💾 Model Persistence: Save/load models with automatic statistics capture and version management
  • 🔄 PMML Export: Complete PMML 4.4 support for cross-platform model deployment
  • 🔄 Cross-Platform Export: ONNX and PMML support for enterprise deployment
  • 📱 Drift Detection: Real-time model and data drift monitoring with automated alerts
  • 📚 Professional Logging: Configurable Logback/SLF4J logging framework
  • 🏭 Production Ready: Enterprise-grade error handling, validation, and concurrent processing

🚀 Quick Start

Basic Classification with Visualization

import org.superml.datasets.Datasets;
import org.superml.linear_model.LogisticRegression;
import org.superml.pipeline.Pipeline;
import org.superml.preprocessing.StandardScaler;
import org.superml.model_selection.ModelSelection;
import org.superml.visualization.VisualizationFactory;

// Load data and create pipeline
Datasets.Dataset dataset = Datasets.loadIris();
Pipeline pipeline = new Pipeline()
    .addStep("scaler", new StandardScaler())
    .addStep("classifier", new LogisticRegression());

// Train and evaluate
ModelSelection.TrainTestSplit split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42);
pipeline.fit(split.XTrain, split.yTrain);
double[] predictions = pipeline.predict(split.XTest);

// Professional visualization (GUI + ASCII fallback)
VisualizationFactory.createDualModeConfusionMatrix(split.yTest, predictions, 
    new String[]{"Setosa", "Versicolor", "Virginica"}).display();

AutoML - One Line Training

import org.superml.autotrainer.AutoTrainer;

// Automated algorithm selection and optimization
AutoTrainer.AutoMLResult result = AutoTrainer.autoML(dataset.X, dataset.y, "classification");
System.out.println("Best Algorithm: " + result.getBestAlgorithm());
System.out.println("Best Score: " + result.getBestScore());

🌐 Kaggle Integration

Train on any Kaggle dataset with one line:

import org.superml.datasets.KaggleTrainingManager;
import org.superml.datasets.KaggleIntegration.KaggleCredentials;

KaggleCredentials credentials = KaggleCredentials.fromDefaultLocation();
KaggleTrainingManager trainer = new KaggleTrainingManager(credentials);

// Configure training with model saving
KaggleTrainingManager.TrainingConfig config = new KaggleTrainingManager.TrainingConfig()
    .setSaveModels(true)
    .setModelsDirectory("kaggle_models")
    .setAlgorithms("logistic", "ridge")
    .setGridSearch(true);

List<KaggleTrainingManager.TrainingResult> results = trainer.trainOnDataset("titanic", "titanic", "survived", config);
System.out.println("Best model: " + results.get(0).algorithm);
System.out.println("Model saved to: " + results.get(0).modelFilePath);

⚡ Performance Highlights

SuperML Java 3.1.2 delivers enterprise-grade performance across all 21 modules:

🏗️ Build & Deployment

  • 21/21 modules compile successfully (100% build success rate)
  • ~4 minute full framework build time
  • 🧪 172+ tests pass across all modules with comprehensive coverage
  • 📦 Published to Maven Central at org.superml:superml-core:3.1.2

🚀 Runtime Performance

  • 400,000+ predictions/second - XGBoost batch inference
  • 🔥 35,714 predictions/second - Production pipeline throughput
  • ⚙️ ~6.88 microseconds - Single prediction latency
  • 🧠 Real-time neural networks - MLP/CNN/RNN with epoch-by-epoch training

🎯 Algorithm Benchmarks

  • XGBoost: Lightning-fast training (2.5 seconds) with early stopping & hyperparameter optimization
  • Neural Networks: Full training cycles with comprehensive loss tracking (46 tests passed)
  • Random Forest: Superior accuracy (89%+) with parallel tree construction
  • Linear Models: Millisecond training times with L1/L2 regularization (34 tests passed)

🌟 Advanced Capabilities

  • 🎲 Cross-Validation: Robust 5-fold CV with parallel execution
  • 🔍 AutoML: Automated hyperparameter tuning with grid/random search
  • 📊 Kaggle Integration: Complete competition workflows from data to submission
  • 💾 Model Persistence: High-speed serialization with automatic statistics capture
  • 📈 Production Monitoring: Real-time drift detection and performance tracking

All benchmarks verified on comprehensive test suite with synthetic and real-world datasets.

💾 Model Persistence

Save and load trained models with automatic training statistics capture:

import org.superml.persistence.ModelPersistence;
import org.superml.persistence.ModelManager;

// Train a model
LogisticRegression model = new LogisticRegression().setMaxIter(1000);
model.fit(X_train, y_train);

// Save with automatic performance evaluation and statistics
ModelPersistence.saveWithStats(model, "my_model", 
                               "Production iris classifier", 
                               X_test, y_test);

// Load model with type safety
LogisticRegression loadedModel = ModelPersistence.load("my_model", LogisticRegression.class);
double[] predictions = loadedModel.predict(X_test);

// The framework automatically captures:
// - Performance metrics (accuracy, precision, recall, F1)
// - Dataset statistics and hyperparameters
// - System information and timestamps

// Manage multiple models with automatic statistics
ModelManager manager = new ModelManager("models");
String savedPath = manager.saveModel(model, "iris");
List<String> allModels = manager.listModels();

🎯 Features

Algorithms (12+ Implementations)

  • Linear Models (6 algorithms):

    • Logistic Regression with automatic multiclass support and L1/L2 regularization
    • Linear Regression with normal equation and closed-form solution
    • Ridge Regression with L2 regularization
    • Lasso Regression with L1 regularization and coordinate descent
    • SGD Classifier/Regressor with stochastic optimization
    • Advanced regularization and convergence strategies
  • Tree-Based Models (5 algorithms):

    • Decision Tree with CART implementation (classification & regression)
    • Random Forest with bootstrap aggregating and parallel training
    • Gradient Boosting with early stopping and validation monitoring
    • Advanced ensemble methods with feature importance
    • Optimized splitting criteria and pruning strategies
  • Clustering (1 algorithm):

    • K-Means with k-means++ initialization, multiple restarts, and convergence monitoring

Data Processing & Pipeline

  • Advanced Preprocessing: StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder
  • Data Management: CSV loading, synthetic data generation, built-in datasets (Iris, Wine, etc.)
  • Pipeline System: Seamless chaining of preprocessing steps and models
  • Feature Engineering: Comprehensive transformation utilities

Model Selection & AutoML

  • Hyperparameter Optimization: Grid Search and Random Search with parallel execution
  • Cross-Validation: K-fold validation with comprehensive metrics and statistical analysis
  • AutoML Framework: Automated algorithm selection, hyperparameter tuning, and ensemble building
  • Parameter Spaces: Discrete, continuous, and integer parameter configurations

Visualization & Monitoring

  • Dual-Mode Visualization: Professional XChart GUI with ASCII terminal fallback
  • Interactive Charts: Confusion matrices, scatter plots, cluster visualizations
  • Performance Monitoring: Real-time inference metrics and model performance tracking
  • Drift Detection: Automated data and model drift monitoring with statistical tests

Production & Enterprise

  • High-Performance Inference: Microsecond predictions with intelligent caching and batch processing
  • Model Persistence: Save/load models with automatic training statistics and metadata capture
  • Cross-Platform Export: ONNX and PMML support for enterprise deployment
  • Kaggle Integration: Direct dataset download and automated competition workflows
  • Professional Logging: Structured logging with Logback and SLF4J
  • Thread Safety: Concurrent prediction capabilities after model training

📦 Installation

Maven Dependency (Complete Framework)

<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-bundle-all</artifactId>
    <version>3.1.2</version>
</dependency>

Modular Installation (Select Components)

<!-- Core + Linear Models (Minimal) -->
<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-core</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-linear-models</artifactId>
    <version>3.1.2</version>
</dependency>

<!-- Add Tree Models -->
<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-tree-models</artifactId>
    <version>3.1.2</version>
</dependency>

<!-- Add Visualization -->
<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-visualization</artifactId>
    <version>3.1.2</version>
</dependency>

<!-- Add AutoML -->
<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-autotrainer</artifactId>
    <version>3.1.2</version>
</dependency>

Build from Source

git clone https://github.com/superml/superml-java.git
mvn clean install

💻 Usage

Basic Classification

import org.superml.datasets.Datasets;
import org.superml.linear_model.LogisticRegression;
import org.superml.tree.RandomForest;
import org.superml.tree.GradientBoosting;
import org.superml.metrics.Metrics;
import org.superml.model_selection.ModelSelection;
import org.superml.preprocessing.StandardScaler;

// Load dataset
Datasets.Dataset dataset = Datasets.loadIris();
ModelSelection.TrainTestSplit split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42);

// Preprocessing
StandardScaler scaler = new StandardScaler();
double[][] XTrainScaled = scaler.fitTransform(split.XTrain);
double[][] XTestScaled = scaler.transform(split.XTest);

// Train multiple models
LogisticRegression lr = new LogisticRegression().setMaxIterations(1000);
RandomForest rf = new RandomForest().setNEstimators(100);
GradientBoosting gb = new GradientBoosting().setNEstimators(100).setLearningRate(0.1);

lr.fit(XTrainScaled, split.yTrain);
rf.fit(XTrainScaled, split.yTrain);
gb.fit(XTrainScaled, split.yTrain);

// Compare performance
double lrAccuracy = Metrics.accuracy(split.yTest, lr.predict(XTestScaled));
double rfAccuracy = Metrics.accuracy(split.yTest, rf.predict(XTestScaled));
double gbAccuracy = Metrics.accuracy(split.yTest, gb.predict(XTestScaled));

System.out.printf("Logistic Regression: %.3f\n", lrAccuracy);
System.out.printf("Random Forest: %.3f\n", rfAccuracy);
System.out.printf("Gradient Boosting: %.3f\n", gbAccuracy);

Cross-Validation and Model Evaluation

import org.superml.model_selection.CrossValidation;

// Basic cross-validation
LogisticRegression classifier = new LogisticRegression();
CrossValidation.CrossValidationResults results = 
    CrossValidation.crossValidate(classifier, X, y);

System.out.println("Accuracy: " + results.getMeanScore("accuracy") + 
                   " ± " + results.getStdScore("accuracy"));

// Custom cross-validation configuration
CrossValidation.CrossValidationConfig config = 
    new CrossValidation.CrossValidationConfig()
        .setFolds(10)
        .setShuffle(true)
        .setRandomSeed(42L)
        .setMetrics("accuracy", "precision", "recall", "f1");

CrossValidation.CrossValidationResults detailedResults = 
    CrossValidation.crossValidate(classifier, X, y, config);

// Regression cross-validation
Ridge regressor = new Ridge();
CrossValidation.CrossValidationResults regressionResults = 
    CrossValidation.crossValidateRegression(regressor, X, y, 
        new CrossValidation.CrossValidationConfig());

Advanced Hyperparameter Tuning

import org.superml.model_selection.HyperparameterTuning;

// Grid Search for Classification
HyperparameterTuning.TuningResults gridResults = HyperparameterTuning.gridSearch(
    new LogisticRegression(),
    X, y,
    HyperparameterTuning.ParameterSpec.discrete("learningRate", 0.01, 0.1, 0.5),
    HyperparameterTuning.ParameterSpec.discrete("maxIter", 500, 1000, 1500)
);

System.out.println("Best parameters: " + gridResults.getBestParameters());
System.out.println("Best score: " + gridResults.getBestScore());

// Grid Search for Regression
HyperparameterTuning.TuningResults regressionGrid = 
    HyperparameterTuning.gridSearchRegressor(
        new Ridge(),
        X, y,
        HyperparameterTuning.ParameterSpec.discrete("alpha", 0.1, 1.0, 10.0),
        HyperparameterTuning.ParameterSpec.continuous("tolerance", 1e-6, 1e-3, 5)
    );

// Random Search with Custom Configuration
HyperparameterTuning.TuningConfig advancedConfig = 
    new HyperparameterTuning.TuningConfig()
        .setScoringMetric("f1")
        .setCvFolds(5)
        .setParallel(true)
        .setVerbose(true)
        .setRandomSeed(123L);

HyperparameterTuning.TuningResults randomResults = 
    HyperparameterTuning.RandomSearch.search(
        new LogisticRegression(),
        X, y,
        Arrays.asList(
            HyperparameterTuning.ParameterSpec.discrete("learningRate", 0.001, 0.01, 0.1, 0.5),
            HyperparameterTuning.ParameterSpec.integer("maxIter", 100, 2000)
        ),
        advancedConfig
    );

// Parameter Specifications
// Discrete values
HyperparameterTuning.ParameterSpec.discrete("param", "A", "B", "C");

// Continuous range with specified steps
HyperparameterTuning.ParameterSpec.continuous("learning_rate", 0.001, 0.1, 10);

// Integer range
HyperparameterTuning.ParameterSpec.integer("max_depth", 1, 20);

Model Persistence and Management

import org.superml.persistence.ModelPersistence;
import org.superml.persistence.ModelManager;

// Train and save a pipeline
Pipeline pipeline = new Pipeline()
    .addStep("scaler", new StandardScaler())
    .addStep("classifier", new LogisticRegression());

pipeline.fit(X_train, y_train);

// Save with rich metadata
Map<String, Object> metadata = Map.of(
    "accuracy", Metrics.accuracy(y_test, pipeline.predict(X_test)),
    "features", X_train[0].length,
    "samples", X_train.length,
    "created_by", "SuperML_Demo"
);

ModelPersistence.save(pipeline, "production_model", "Main classification pipeline", metadata);

// Later, load and use the model
Pipeline loadedPipeline = ModelPersistence.load("production_model", Pipeline.class);
double[] predictions = loadedPipeline.predict(X_new);

// Model management
ModelManager manager = new ModelManager("models");
List<ModelManager.ModelInfo> models = manager.getModelsInfo();
for (ModelManager.ModelInfo info : models) {
    System.out.println(info); // Shows class, size, save time, etc.
}

🚀 Inference Layer

Deploy models in production with high-performance inference capabilities:

import org.superml.inference.InferenceEngine;
import org.superml.inference.BatchInferenceProcessor;

// Create inference engine and load model
InferenceEngine engine = new InferenceEngine();
engine.loadModel("classifier", "models/trained_model.superml");

// Single prediction
double prediction = engine.predict("classifier", features);

// Batch prediction with monitoring
double[] batchPredictions = engine.predict("classifier", batchFeatures);

// Asynchronous inference
CompletableFuture<Double> future = engine.predictAsync("classifier", features);

// Performance metrics
InferenceMetrics metrics = engine.getMetrics("classifier");
System.out.println("Throughput: " + metrics.getThroughputSamplesPerSecond() + " samples/sec");

// Batch processing for large datasets
BatchInferenceProcessor processor = new BatchInferenceProcessor(engine);
BatchResult result = processor.processCSV("input.csv", "output.csv", "classifier");

📚 Documentation

🤝 Contributing

We welcome contributions to SuperML Java! Please see our Contributing Guide for details.

Ways to Contribute

  • Code: Implement new algorithms, improve performance, fix bugs
  • Documentation: Improve guides, add examples, write tutorials
  • Testing: Add test cases, improve coverage, performance testing
  • Community: Help others, report issues, suggest features

Development Setup

git clone https://github.com/superml/superml-java.git
mvn clean compile
mvn test

Code Coverage

SuperML Java includes comprehensive code coverage analysis using JaCoCo:

# Run tests and generate coverage report
mvn clean test jacoco:report

# Use the provided coverage script for detailed analysis
./coverage.sh --summary    # Show coverage summary
./coverage.sh --open       # Open HTML report in browser

Coverage Reports:

  • HTML Report: target/site/jacoco/index.html (visual coverage report)
  • Coverage Summary: Use ./coverage.sh --summary for quick overview
  • Detailed Analysis: See docs/CODE_COVERAGE_REPORT.md

Current Status:

  • -> Multiclass Classification: 85%+ coverage (LogisticRegression, SoftmaxRegression)
  • ⚠️ Tree Algorithms: 0% coverage (new v2.0 features needing tests)
  • ⚠️ Linear Models: 0% coverage (LinearRegression, Ridge, Lasso need tests)

🌟 Community & Support

🏆 Attribution

SuperML Java is developed and maintained by the SuperML Community:

This project is inspired by scikit-learn and aims to bring the same ease of use and comprehensive functionality to the Java ecosystem.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary

  • -> Commercial use - Use in commercial projects
  • -> Modification - Modify and distribute
  • -> Distribution - Distribute original or modified
  • -> Private use - Use for private projects
  • License and copyright notice - Include in all copies
  • Liability - No warranty provided
  • Trademark use - SuperML trademarks not included

🎯 Project Status

Build Status Java Version License Version

Current Version: 1.0-SNAPSHOT
Stability: Beta - Core features complete, API may change
Java Compatibility: Java 11+
Dependencies: Minimal - only essential libraries


Made with ❤️ by the SuperML Community | Visit superML.dev for more projects

About

Modular machine learning framework foreign java for ML model training

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages