Bayes Independence Test: Comparing Performance with HSIC

3. Problem

Problem: from (x_1, y_1), ..., (x_n, y_n), test whether X ⊥⊥ Y.

Mutual information:

I(X, Y) := \sum_x \sum_y P_{XY}(x, y) \log \frac{P_{XY}(x, y)}{P_X(x) P_Y(y)}

I(X, Y) = 0 ⇐⇒ X ⊥⊥ Y

Hilbert-Schmidt independence criterion (HSIC): a nonlinear generalization of the correlation coefficient.
- Correlation(X, Y) = 0 ⇐ X ⊥⊥ Y, but the converse does not hold.
- HSIC(X, Y) = 0 ⇐⇒ X ⊥⊥ Y

Independence test (whether X ⊥⊥ Y or not): estimate I(X, Y) and HSIC(X, Y) from (x_1, y_1), ..., (x_n, y_n).
4. Estimating mutual information (maximum likelihood)

X, Y: discrete.

I_n(x^n, y^n) := \sum_x \sum_y \hat{P}_n(x, y) \log \frac{\hat{P}_n(x, y)}{\hat{P}_n(x) \hat{P}_n(y)}

- \hat{P}_n(x, y): relative frequency of (X, Y) = (x, y) in (x_1, y_1), ..., (x_n, y_n)
- \hat{P}_n(x): relative frequency of X = x in x_1, ..., x_n
- \hat{P}_n(y): relative frequency of Y = y in y_1, ..., y_n

I_n(x^n, y^n) → I(X, Y)  (n → ∞)

Problems:
- Even when X ⊥⊥ Y, with probability 1, I_n(x^n, y^n) > 0 occurs infinitely often.
- It is unclear how to construct the independence test, i.e., how to choose the thresholds {ε_n} in I_n(x^n, y^n) < ε_n ⇐⇒ X ⊥⊥ Y.
- It is unclear how this generalizes when X, Y are continuous.
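To make the plug-in estimator concrete, here is a minimal Python sketch (not from the slides; the function name and toy data are ours):

```python
# Plug-in (maximum-likelihood) estimate I_n of mutual information,
# computed from relative frequencies of x, y, and (x, y).
import numpy as np
from collections import Counter

def plugin_mi(x, y):
    n = len(x)
    p_xy, p_x, p_y = Counter(zip(x, y)), Counter(x), Counter(y)
    i_n = 0.0
    for (a, b), c in p_xy.items():
        pxy = c / n
        i_n += pxy * np.log(pxy / ((p_x[a] / n) * (p_y[b] / n)))
    return i_n

# Even for independent X, Y the estimate is typically strictly positive:
rng = np.random.default_rng(0)
print(plugin_mi(rng.integers(0, 2, 100), rng.integers(0, 2, 100)))
```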
5. Proposed Bayes estimation of mutual information: discrete

Lempel-Ziv algorithm (lzh, gzip, etc.):
1. x^n = (x_1, ..., x_n) is compressed into z^m = (z_1, ..., z_m) ∈ {0, 1}^m.
2. Regardless of P_X, the compression rate m/n converges to the entropy H(X).
3. \sum 2^{-m} ≤ 1 (Kraft's inequality).

Setting Q^n_X(x^n) := 2^{-m}, the quantity m = -\log Q^n_X(x^n) is the length after compression.

Define Q^n_Y(y^n) and Q^n_{XY}(x^n, y^n) in the same way, and let p be the prior probability of X ⊥⊥ Y. Then

J_n(x^n, y^n) := \frac{1}{n} \log \frac{(1 - p) Q^n_{XY}(x^n, y^n)}{p \, Q^n_X(x^n) Q^n_Y(y^n)}
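As a rough illustration of building J_n from codelengths (our sketch, not the slides' implementation), one can substitute zlib's compressed length in bits for the Lempel-Ziv codelength m = -log_2 Q(·); header overhead makes this crude for small n, so it only illustrates the formula:

```python
# J_n from compressed lengths, with zlib as a stand-in for the LZ codelength.
# Works in bits: Q := 2^(-m), so log2 Q = -m.
import zlib
import numpy as np

def codelen_bits(symbols):
    # assumes small non-negative integer symbols (each fits in one byte)
    return 8 * len(zlib.compress(bytes(int(s) for s in symbols), 9))

def j_n(x, y, p=0.5):
    n = len(x)
    m_x, m_y = codelen_bits(x), codelen_bits(y)
    m_xy = codelen_bits([2 * a + b for a, b in zip(x, y)])  # binary pair -> one symbol
    return (np.log2((1 - p) / p) - m_xy + m_x + m_y) / n

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 2000)
y = np.where(rng.random(2000) < 0.9, x, 1 - x)   # Y strongly depends on X
print(j_n(x, y))                                 # clearly positive for dependent data
print(j_n(x, rng.integers(0, 2, 2000)))          # close to 0 for independent data
```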
6. MDL (minimum description length) principle

An information criterion (Rissanen, 1976): from the examples, select the model that minimizes the total of
- the description length of the model, and
- the description length of the examples given the model.

MDL(X ⊥⊥ Y) := -\log p - \frac{1}{n} \log Q^n_X(x^n) - \frac{1}{n} \log Q^n_Y(y^n)

MDL(X ̸⊥⊥ Y) := -\log(1 - p) - \frac{1}{n} \log Q^n_{XY}(x^n, y^n)

Consistency: as n → ∞, the model minimizing MDL coincides with the true model with probability 1.
7. Proposed Bayes estimation of mutual information: discrete (continued)

The consistency of MDL implies the consistency of the independence test:

J_n(x^n, y^n) ≤ 0 ⇐⇒ MDL(X ⊥⊥ Y) ≤ MDL(X ̸⊥⊥ Y)

With α := |X|, β := |Y|,

J_n(x^n, y^n) ≈ I_n(x^n, y^n) - \frac{(α - 1)(β - 1)}{2n} \log n

J_n(x^n, y^n) ≤ 0 ⇐⇒ I_n(x^n, y^n) ≤ ε_n := \frac{(α - 1)(β - 1)}{2n} \log n

J_n(x^n, y^n) → I(X, Y)  (n → ∞)

- Computation is O(n).
- Suzuki 2012 assumed p = 1/2.
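A minimal Python sketch of the resulting discrete test (ours, not the slides' code): accept X ⊥⊥ Y iff I_n ≤ ε_n, which with p = 1/2 corresponds to J_n ≤ 0.

```python
# Discrete Bayes independence test via the threshold
# eps_n = (alpha - 1)(beta - 1) log(n) / (2n) on the plug-in estimate I_n.
import numpy as np
from collections import Counter

def bayes_independence_test(x, y):
    n = len(x)
    alpha, beta = len(set(x)), len(set(y))
    p_xy, p_x, p_y = Counter(zip(x, y)), Counter(x), Counter(y)
    i_n = sum((c / n) * np.log((c / n) / ((p_x[a] / n) * (p_y[b] / n)))
              for (a, b), c in p_xy.items())
    eps_n = (alpha - 1) * (beta - 1) * np.log(n) / (2 * n)
    return i_n <= eps_n          # True: accept X ⊥⊥ Y

rng = np.random.default_rng(2)
x = rng.integers(0, 3, 100)
print(bayes_independence_test(x, rng.integers(0, 3, 100)))  # independent: usually True
print(bayes_independence_test(x, (x + 1) % 3))              # dependent: False
```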
8. Universality: discrete

For any P_X,

\frac{m}{n} = -\frac{1}{n} \log Q^n_X(x^n) → H(X)

By the i.i.d. assumption and the strong law of large numbers, for any P_X,

-\frac{1}{n} \log P^n_X(x^n) = -\frac{1}{n} \sum_{i=1}^{n} \log P_X(x_i) → E[-\log P_X(X)] = H(X)

Therefore, for any P_X,

\frac{1}{n} \log \frac{P^n_X(x^n)}{Q^n_X(x^n)} → 0
9. Universality: continuous

Under regularity conditions, for any density function f_X, there exists g^n_X with

\int_{-\infty}^{\infty} g^n_X(x^n) \, dx^n ≤ 1

such that

\frac{1}{n} \log \frac{f^n_X(x^n)}{g^n_X(x^n)} → 0

(Ryabko 2009).

Suzuki 2013 extends this:
- the regularity conditions are removed,
- it holds for two or more variables,
- it holds for random variables that are neither discrete nor continuous.
10. Construction of g^n_X

Quantization at level k: x^n = (x_1, ..., x_n) → (a_1^{(k)}, ..., a_n^{(k)}).

Each level k contributes the term Q^n_k(a_1^{(k)}, ..., a_n^{(k)}) / (λ(a_1^{(k)}) ··· λ(a_n^{(k)})) (the slide shows levels 1, 2, ..., k, ... in a diagram), and these terms are mixed with weights w_1, ..., w_k, ...:

g^n_X(x^n) = w_1 × \frac{Q^n_1(a_1^{(1)}, ..., a_n^{(1)})}{λ(a_1^{(1)}) ··· λ(a_n^{(1)})} + ··· + w_k × \frac{Q^n_k(a_1^{(k)}, ..., a_n^{(k)})}{λ(a_1^{(k)}) ··· λ(a_n^{(k)})} + ···
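As an illustration, here is a simplified Python sketch of this kind of histogram-weighted mixture on [0, 1) (our simplification, not the exact construction in the slides): level k uses 2^k equal bins, Q_k is a Krichevsky-Trofimov estimate of the quantized sequence, λ is the bin width 2^{-k}, and the weights are taken as w_k = 2^{-k}.

```python
# Histogram-weighted generalized density g^n_X on [0, 1) (simplified sketch).
import numpy as np

def log_kt(seq, num_bins):
    """log of the Krichevsky-Trofimov sequential probability of a symbol sequence."""
    counts = np.zeros(num_bins)
    logp = 0.0
    for s in seq:
        logp += np.log((counts[s] + 0.5) / (counts.sum() + num_bins / 2))
        counts[s] += 1
    return logp

def log_g(x, max_level=8):
    """log g^n_X(x^n) = log sum_k w_k * Q_k(a^(k)) / prod_i lambda(a_i^(k))."""
    n = len(x)
    terms = []
    for k in range(1, max_level + 1):
        bins = 2 ** k
        a = np.minimum((x * bins).astype(int), bins - 1)    # quantize to level k
        log_qk = log_kt(a, bins)
        log_lam = n * np.log(1.0 / bins)                    # product of bin widths
        terms.append(-k * np.log(2) + log_qk - log_lam)     # log w_k + log(Q_k / lambda)
    return np.logaddexp.reduce(terms)

x = np.random.default_rng(3).random(200)   # i.i.d. uniform on [0, 1)
print(log_g(x) / 200)                      # per-sample log density, near 0 for uniform data
```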
11. Proposed Bayes estimation of mutual information: general case

Bayes estimator of the mutual information:

J_n(x^n, y^n) := \frac{1}{n} \log \frac{(1 - p) g^n_{XY}(x^n, y^n)}{p \, g^n_X(x^n) g^n_Y(y^n)}

(g^n is not an ordinary density function, so the discrete case can be included.)

This can be seen as a generalization of the MDL principle:

MDL(X ⊥⊥ Y) := -\log p - \frac{1}{n} \log g^n_X(x^n) - \frac{1}{n} \log g^n_Y(y^n)

MDL(X ̸⊥⊥ Y) := -\log(1 - p) - \frac{1}{n} \log g^n_{XY}(x^n, y^n)

Conjecture (consistency): as n → ∞, the model minimizing MDL coincides with the true model with probability 1.
12. J_n(x^n, y^n) → I(X, Y)  (n → ∞)

Proof: since x^n, y^n are i.i.d., by the strong law of large numbers, for any f_{XY},

\frac{1}{n} \log \frac{f^n_{XY}(x^n, y^n)}{f^n_X(x^n) f^n_Y(y^n)} = \frac{1}{n} \sum_{i=1}^{n} \log \frac{f_{XY}(x_i, y_i)}{f_X(x_i) f_Y(y_i)} → E\left[\log \frac{f_{XY}(X, Y)}{f_X(X) f_Y(Y)}\right] = I(X, Y)

Then

J_n(x^n, y^n) - I(X, Y)
= -\frac{1}{n} \log \frac{f^n_{XY}(x^n, y^n)}{g^n_{XY}(x^n, y^n)} + \frac{1}{n} \log \frac{f^n_X(x^n)}{g^n_X(x^n)} + \frac{1}{n} \log \frac{f^n_Y(y^n)}{g^n_Y(y^n)} + \frac{1}{n} \log \frac{f^n_{XY}(x^n, y^n)}{f^n_X(x^n) f^n_Y(y^n)} - I(X, Y) + \frac{1}{n} \log \frac{1 - p}{p}
→ 0

The first three terms tend to 0 by the universality of g^n (previous slides), the fourth tends to I(X, Y) by the law of large numbers above, and the last term tends to 0.
13. HSIC

A nonlinear generalization of the covariance cov(X, Y).

Random variable | Range | RKHS             | Kernel
X               | X     | F, basis {f_i}   | k : X × X → ℝ
Y               | Y     | G, basis {g_j}   | l : Y × Y → ℝ

HSIC(P_{XY}, F, G) = \sum_{i,j} cov(f_i(X), g_j(Y))^2

When k is universal, HSIC(P_{XY}, F, G) = 0 ⇐⇒ X ⊥⊥ Y.

Example: the Gaussian kernel k(x, y) = \exp\{-(x - y)^2 / 2\} is universal.
14. Problems in applying HSIC

Estimator of HSIC(P_{XY}, F, G): with K = (k(x_i, x_j)), L = (l(y_i, y_j)), H = (δ_{ij} - 1/n),

\widehat{HSIC}(x^n, y^n, F, G) = \frac{1}{(n - 1)^2} \mathrm{tr}(KHLH)

Problems:
- There is no proof that, as n → ∞, \widehat{HSIC}(x^n, y^n, F, G) → HSIC(P_{XY}, F, G) with probability 1.
- For a test of H_0: X ⊥⊥ Y at significance level α, it is difficult to set the acceptance region {ε_n} in {x^n, y^n | \widehat{HSIC}(x^n, y^n, F, G) ≤ ε_n}.
- The computation is O(n^3) (still O(n^2) even when approximated by an incomplete Cholesky decomposition).
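For reference, a minimal Python sketch of this empirical HSIC with the Gaussian kernel above (ours; it uses the O(n^3) trace computation directly):

```python
# Empirical HSIC: tr(K H L H) / (n - 1)^2 with Gaussian kernels, H = I - (1/n) 11^T.
import numpy as np

def gaussian_gram(z, sigma=1.0):
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    n = len(x)
    K = gaussian_gram(np.asarray(x, float), sigma)
    L = gaussian_gram(np.asarray(y, float), sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2   # O(n^3), as noted above

rng = np.random.default_rng(4)
x = rng.normal(size=100)
print(hsic(x, rng.normal(size=100)))                 # near 0 (independent)
print(hsic(x, x + 0.1 * rng.normal(size=100)))       # clearly larger (dependent)
```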
15. Experiments

1. X uniform on {0, 1}; Y = X with probability p and Y = 1 - X with probability 1 - p (shown as a binary channel diagram in the slide).
   I(X, Y) = HSIC(X, Y) = 0 ⇐⇒ p = 1/2 ⇐⇒ X ⊥⊥ Y

2. (X, Y) ~ N(0, Σ), Σ = [[1, ρ], [ρ, 1]], -1 < ρ < 1.
   I(X, Y) = HSIC(X, Y) = 0 ⇐⇒ ρ = 0 ⇐⇒ X ⊥⊥ Y

3. P(X = 0) = P(X = 1) = 1/2, Y ~ N(aX, 1), a ≥ 0.
   I(X, Y) = HSIC(X, Y) = 0 ⇐⇒ a = 0 ⇐⇒ X ⊥⊥ Y
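Data generators for the three settings above, as a short Python sketch (ours; function and parameter names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def setting1(n, p):
    """X uniform on {0, 1}; Y = X with probability p, flipped otherwise."""
    x = rng.integers(0, 2, n)
    flip = rng.random(n) >= p
    return x, np.where(flip, 1 - x, x)

def setting2(n, rho):
    """(X, Y) bivariate normal, unit variances, correlation rho."""
    cov = [[1.0, rho], [rho, 1.0]]
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return xy[:, 0], xy[:, 1]

def setting3(n, a):
    """X uniform on {0, 1}; Y ~ N(aX, 1)."""
    x = rng.integers(0, 2, n)
    return x, rng.normal(a * x, 1.0)

x, y = setting1(100, 0.5)   # p = 1/2: X and Y independent
```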
16. Experiment 1

Error rates at n = 100 (rows: true p, and the estimate counted as an error):

True p   Error event   Proposed   HSIC, threshold (×10^-4)
                                   4       8       12      16      20
0.5      p̂ ≠ 0.5       0.084      0.306   0.135   0.077   0.043   0.022
0.4      p̂ = 0.5       0.758      0.507   0.694   0.787   0.860   0.908
0.3      p̂ = 0.5       0.333      0.139   0.251   0.396   0.505   0.610
0.2      p̂ = 0.5       0.048      0.018   0.035   0.083   0.135   0.201
0.1      p̂ = 0.5       0.001      0.000   0.001   0.005   0.010   0.021