12. UDTF (parameter-mix)
Mappers are launched according to Hadoop's InputSplitSize setting (map-only).
select
  feature,
  CAST(avg(weight) as FLOAT) as weight
from (
  select
    TrainLogisticSgdUDTF(features, label, ..) as (feature, weight)
  from train
) t
group by feature;
How do we make this an iterative parameter mix???
The old model would have to be passed to the UDTF,
and passing it along with every row is not really an option...
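As a bridge to the next slide, here is a minimal sketch of materializing this result so that a later iteration has an "old model" to join against; the table name model1 is an assumption for illustration, not from the original deck:

-- A minimal sketch, assuming the table name model1 (not in the original deck):
-- persist the averaged weights of the first map-only pass so that the next
-- iteration can reuse them as the old model.
create table model1 as
select
  feature,
  CAST(avg(weight) as FLOAT) as weight
from (
  select
    TrainLogisticSgdUDTF(features, label, ..) as (feature, weight)
  from train
) t
group by feature;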
13. UDTF (iterative parameter mix)
create table model1sgditor2 as
select
  feature,
  CAST(avg(weight) as FLOAT) as weight
from (
  select
    TrainLogisticIterUDTF(t.features, w.wlist, t.label, ..) as (feature, weight)
  from
    training t join feature_weight w on (t.rowid = w.rowid)
) t
group by feature;
What we need here is to hand the UDTF, for each row, the old model for that row's features (the equivalent of Map<feature, weight>) together with the label. Since an Array<weight> corresponding to the row's Array<feature> is enough, we build it as a table and pass it in via an inner join.
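For reference, a rough sketch of how the feature_weight table used above could be prepared; the names (model1 holding the previous iteration's weights, training with a rowid and an array-typed features column) are assumptions for illustration, not part of the original deck:

-- Build, per training row, the list of old weights for that row's features,
-- so that the join on rowid above can hand the old model to the UDTF.
create table feature_weight as
select
  e.rowid,
  collect_list(coalesce(m.weight, 0.0)) as wlist  -- 0.0 for features unseen so far
from (
  select t.rowid, f.feature
  from training t
  lateral view explode(t.features) f as feature
) e
left outer join model1 m on (e.feature = m.feature)
group by e.rowid;

In practice wlist has to stay aligned with the order of each row's features array (e.g., by carrying posexplode positions and sorting) so that the UDTF can pair up features and weights positionally.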
14. An example flow in the Pig version
training_raw = load '$TARGET' as (clicks: int, impression: int, displayid: int, adid: int, advertiserid: int, depth: int, position: int, queryid: int, keywordid: int, titleid: int, descriptionid: int, userid: int, gender: int, age: int);
training_bin = foreach training_raw generate flatten(predictor.ctr.BinSplit(clicks, impression)), displayid, adid, advertiserid, depth, position, queryid, keywordid, titleid, descriptionid, userid, gender, age;
training_smp = sample training_bin 0.1;
training_rnd = foreach training_smp generate (int)(RANDOM() * 100) as dataid, TOTUPLE(*) as training;
training_dat = group training_rnd by dataid;
-- Weak learners: one model is trained per random split
model = foreach training_dat generate predictor.ctr.TrainLinear(training_rnd.training.training_smp);
store model into '$MODEL';

model = load '$MODEL' as (mdl: map[]);
model_lmt = limit model 10;
testing_raw = load '$TARGET' as (dataid: int, displayid: int, adid: int, advertiserid: int, depth: int, position: int, queryid: int, keywordid: int, titleid: int, descriptionid: int, userid: int, gender: int, age: int);
testing_with_model = cross model_lmt, testing_raw;
result = foreach testing_with_model generate dataid, predictor.ctr.Pred(mdl, displayid, adid, advertiserid, depth, position, queryid, keywordid, titleid, descriptionid, userid, gender, age) as ctr;
result_grp = group result by dataid;
-- Ensemble learning: the weak learners' per-row predictions are combined
result_ens = foreach result_grp generate group as dataid, predictor.ctr.Ensemble(result.ctr);
result_ens_ord = order result_ens by dataid;
result_fin = foreach result_ens_ord generate $1;
store result_fin into '$RESULT';
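The flow above is a bagging-style setup: the sampled training data is split into roughly 100 random groups by dataid, a weak linear model is trained per group, and at prediction time every model scores every test row before the per-row scores are combined. Assuming the per-model scores were written to a table predictions(rowid, modelid, ctr) (hypothetical names, not from the deck), the final ensembling step in HiveQL could be as simple as averaging:

-- Hypothetical sketch: combine the weak learners' scores per test row by averaging.
select
  rowid,
  avg(ctr) as ensemble_ctr
from predictions
group by rowid;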