Matrix Calculus
Notation
- j is the square root of -1
- XR and XI are the real and imaginary parts of X = XR + jXI
- (XY)R = XRYR - XIYI
- (XY)I = XRYI + XIYR
- XC = XR - jXI is the complex conjugate of X
- XH = (XC)T = (XT)C is the Hermitian transpose of X
- X: denotes the long column vector formed by concatenating the columns of X (see vectorization)
- A ⊗ B = KRON(A,B), the Kronecker product
- A • B, the Hadamard or elementwise product
- A ÷ B, the elementwise quotient
- matrices and vectors A, B, C do not depend on X
- In = I[n#n], the n#n identity matrix
- Tm,n = TVEC(m,n) is the vectorized transpose matrix, i.e. XT: = Tm,nX: for X[m#n]
- ∂Y/∂X and ∂Y/∂XC are partial derivatives with XC and X respectively held constant (note that XH = (XC)T)
- ∂Y/∂XR and ∂Y/∂XI are partial derivatives with XI and XR respectively held constant
In the main part of this page we express results in terms of differentials rather than derivatives for two reasons: they avoid notational disagreements and they cope easily with the complex case. In most cases, however, the differentials have been written in the form dY: = dY/dX dX: so that the corresponding derivative may be extracted easily.
Derivatives with respect to a real matrix
If X is p#q and Y is m#n, then dY: = dY/dX dX: where the derivative dY/dX is a large mn#pq matrix. If X and/or Y are column vectors or scalars, then the vectorization operator : has no effect and may be omitted. dY/dX is also called the Jacobian Matrix of Y: with respect to X: and det(dY/dX) is the corresponding Jacobian. The Jacobian occurs when changing variables in an integration: Integral(f(Y)dY:) = Integral(f(Y(X)) det(dY/dX) dX:).
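As a numerical sanity check (not part of the original manual), the sketch below verifies dY: = dY/dX dX: for Y = X2, whose Jacobian In ⊗ X + XT ⊗ In appears in the quadratic-products section. vec() stacks columns, matching the ":" operator used throughout this page.

```python
import numpy as np

# Verify dY: = dY/dX dX: for Y = X @ X, where dY = X dX + dX X exactly
# to first order, so vec(dY) = (I ⊗ X + X^T ⊗ I) vec(dX).
rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, n))
E = rng.standard_normal((n, n))          # an arbitrary perturbation dX

vec = lambda M: M.flatten(order="F")     # column-major vectorization (":")
J = np.kron(np.eye(n), X) + np.kron(X.T, np.eye(n))   # dY/dX, n^2 x n^2

dY = X @ E + E @ X                       # exact first-order differential
ok = np.allclose(J @ vec(E), vec(dY))
```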
Although they do not generalise so well, other authors use alternative notations for the cases when X and Y are both vectors or when one is a scalar. In particular:
- dy/dx is sometimes written as a column vector rather than a row vector
- dy/dx is sometimes transposed from the above definition or else is sometimes written dy/dxT to emphasise the correspondence between the columns of the derivative and those of xT
- dY/dx and dy/dX are often written as matrices rather than, as here, a column vector and row vector respectively. The matrix form may be converted to the form used here by appending : or :T respectively.
Derivatives with respect to a complex matrix
If X is complex then dY: = dY/dX dX: holds in general only if Y(X) is an analytic function. This normally implies that Y(X) does not depend explicitly on XC or XH.
Even for non-analytic functions we can treat X and XC (with XH = (XC)T) as distinct variables and write uniquely dY: = ∂Y/∂X dX: + ∂Y/∂XC dXC:, provided that Y is analytic with respect to X and XC individually (or equivalently with respect to XR and XI individually). ∂Y/∂X is the Generalized Complex Derivative and ∂Y/∂XC is the Complex Conjugate Derivative [R.4, R.9]; their properties are studied in Wirtinger calculus.
We define the generalized derivatives in terms of partial derivatives with respect to XR and XI:
- ∂Y/∂X = ½ (∂Y/∂XR - j ∂Y/∂XI)
- ∂Y/∂XC = (∂YC/∂X)C = ½ (∂Y/∂XR + j ∂Y/∂XI)
We have the following relationships for both analytic and non-analytic functions Y(X):
- The following are equivalent ways of saying that Y(X) is analytic:
  - Y(X) is an analytic function of X
  - dY: = ∂Y/∂X dX:
  - ∂Y/∂XC = 0 for all X
  - ∂Y/∂XR + j ∂Y/∂XI = 0 for all X (these are the Cauchy-Riemann equations)
- dY: = ∂Y/∂X dX: + ∂Y/∂XC dXC:
- ∂Y/∂XR = ∂Y/∂X + ∂Y/∂XC
- ∂Y/∂XI = j (∂Y/∂X - ∂Y/∂XC)
- ∂Y/∂XC = (∂YC/∂X)C
- Chain rule: If Z is a function of Y which is itself a function of X, then ∂Z/∂X = ∂Z/∂Y ∂Y/∂X. This is the same as for real derivatives.
- Real-valued: If Y(X) is real for all complex X, then
  - ∂Y/∂XC = (∂Y/∂X)C
  - dY: = 2(∂Y/∂X dX:)R
- If Y(X) is real for all complex X and W(X) is analytic and if W(X) = Y(X) for all real-valued X, then ∂W/∂X = 2(∂Y/∂X)R for all real X
  - Example: If C = CH, y(x) = xHCx and w(x) = xTCx, then ∂y/∂x = xHC and ∂w/∂x = 2xTCR
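The Wirtinger definition above can be checked numerically. This sketch (an illustration, not from the manual) compares ∂y/∂x = xHC for y(x) = xHCx, C Hermitian, against ½(∂y/∂xR - j ∂y/∂xI) computed by central differences.

```python
import numpy as np

# Check the generalized complex derivative of y(x) = x^H C x (C = C^H):
# the closed form x^H C should match 1/2 (dy/dxR - j dy/dxI) numerically.
rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
C = A + A.conj().T                       # C = C^H, so y is real-valued
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)

y = lambda v: (v.conj() @ C @ v).real
h = 1e-6
num = np.zeros(n, dtype=complex)
for k in range(n):
    e = np.zeros(n); e[k] = h
    dR = (y(x + e) - y(x - e)) / (2 * h)            # dy/dxR[k]
    dI = (y(x + 1j * e) - y(x - 1j * e)) / (2 * h)  # dy/dxI[k]
    num[k] = 0.5 * (dR - 1j * dI)
ok = np.allclose(num, x.conj() @ C, atol=1e-5)
```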
Suppose f(X) is a scalar real function of a complex matrix (or vector) X, and G(X) is a complex-valued matrix (or vector or scalar) function of X. To minimize f(X) subject to G(X) = 0, we use complex Lagrange multipliers and minimize f(X) + tr(KHG(X)) + tr(KTG(X)C) subject to G(X) = 0. Hence we solve ∂f/∂X + ∂tr(KHG)/∂X + ∂tr(KTGC)/∂X = 0T subject to G(X) = 0. If g(X) is a vector, this becomes ∂f/∂X + kH∂g/∂X + kT∂gC/∂X = 0T. If g(X) is a scalar, this becomes ∂f/∂X + kC∂g/∂X + k∂gC/∂X = 0T.
- Example: If f(x) = xHSx where S = SH and g(x) = aHx - 1, then ∂f/∂x + kC∂g/∂x + k∂gC/∂x = xHS + kaH + 0T = 0T, which implies Sx + kCa = 0, from which x = -kCS-1a. Substituting this into the constraint, g(x) = aHx - 1 = 0, gives -kCaHS-1a = 1, from which k = -(aHS-1a)-1. Substituting this back into the expression for x gives x = S-1a(aHS-1a)-1.
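A quick numerical confirmation of the worked example (a sketch, not part of the manual): with S Hermitian positive definite, x = S-1a(aHS-1a)-1 should satisfy the constraint aHx = 1 and do no worse than any other feasible point.

```python
import numpy as np

# Confirm the Lagrange-multiplier example: x* = S^{-1} a (a^H S^{-1} a)^{-1}
# is feasible (a^H x = 1) and minimizes x^H S x over the constraint set.
rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = B @ B.conj().T + n * np.eye(n)       # S = S^H > 0
a = rng.standard_normal(n) + 1j * rng.standard_normal(n)

Sinv_a = np.linalg.solve(S, a)
x_opt = Sinv_a / (a.conj() @ Sinv_a)
f = lambda v: (v.conj() @ S @ v).real

feasible = np.isclose(a.conj() @ x_opt, 1)
# any feasible competitor x_opt + d with a^H d = 0 must not do better
d = rng.standard_normal(n) + 1j * rng.standard_normal(n)
d -= a * (a.conj() @ d) / (a.conj() @ a)          # project onto a^H d = 0
worse = f(x_opt + 0.1 * d) >= f(x_opt) - 1e-12
```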
If f(X) is a real function of a complex matrix (or vector) X, then ∂f/∂XC = (∂f/∂X)C and we can define the complex-valued column vector grad(f(X)) = 2(∂f/∂X)H = (∂f/∂XR + j ∂f/∂XI)T as the Complex Gradient Vector [R.9] with the properties listed below. If we use <-> to represent the vector mapping associated with the Complex-to-Real isomorphism, and X[m#n]: <-> y[2mn] where y is real, then grad(f(X)) <-> grad(f(y)), where the latter is the conventional grad function from vector calculus.
- grad(f(X)) is zero at an extreme value of f.
- grad(f(X)) points in the direction of steepest slope of f(X).
- The magnitude of the steepest slope is equal to |grad(f(X))|. Specifically, if g(X) = grad(f(X)), then lima->0 a-1(f(X + ag(X)) - f(X)) = |g(X)|2.
- grad(f(X)) is normal to the surface f(X) = constant, which means that it can be used for gradient ascent/descent algorithms.
- If f(X) = yHy, then grad(f(X)) = 2(∂y/∂X)Hy + 2(∂y/∂XC)TyC
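The definition grad(f) = 2(∂f/∂x)H = (∂f/∂xR + j ∂f/∂xI)T can be tested numerically. This sketch (an illustration under the assumption f(x) = xHSx with S Hermitian, so grad(f) = 2Sx) compares the closed form with finite differences.

```python
import numpy as np

# Check the Complex Gradient Vector for f(x) = x^H S x, S = S^H:
# grad f = 2 (df/dx)^H = 2 S x should equal df/dxR + j df/dxI.
rng = np.random.default_rng(3)
n = 3
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = B + B.conj().T                       # S = S^H
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
f = lambda v: (v.conj() @ S @ v).real

h = 1e-6
g_num = np.zeros(n, dtype=complex)
for k in range(n):
    e = np.zeros(n); e[k] = h
    g_num[k] = ((f(x + e) - f(x - e))
                + 1j * (f(x + 1j * e) - f(x - 1j * e))) / (2 * h)
ok = np.allclose(g_num, 2 * S @ x, atol=1e-5)
```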
Basic Properties
- We may write the following differentials unambiguously without parentheses:
  - Transpose: dYT = d(YT) = (dY)T
  - Hermitian Transpose: dYH = d(YH) = (dY)H
  - Conjugate: dYC = d(YC) = (dY)C
- Linearity: d(Y + Z) = dY + dZ
- Chain Rule: If Z is a function of Y which is itself a function of X, then for both the normal and the generalized complex derivative: dZ: = dZ/dY dY: = dZ/dY dY/dX dX:
- Product Rule: d(YZ) = Y dZ + dY Z
  - d(YZ): = (I ⊗ Y) dZ: + (ZT ⊗ I) dY: = ((I ⊗ Y) dZ/dX + (ZT ⊗ I) dY/dX) dX:
- Hadamard Product: d(Y • Z) = Y • dZ + dY • Z
- Kronecker Product: d(Y ⊗ Z) = Y ⊗ dZ + dY ⊗ Z
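The vectorized product rule rests on two standard vec/Kronecker identities; this sketch (not part of the manual) checks them directly for arbitrary matrices.

```python
import numpy as np

# Check vec(Y dZ) = (I ⊗ Y) dZ: and vec(dY Z) = (Z^T ⊗ I) dY:,
# which together give the vectorized product rule d(YZ): above.
rng = np.random.default_rng(4)
m, p, q = 3, 4, 5
Y  = rng.standard_normal((m, p))
Z  = rng.standard_normal((p, q))
dY = rng.standard_normal((m, p))
dZ = rng.standard_normal((p, q))
vec = lambda M: M.flatten(order="F")

lhs = vec(Y @ dZ + dY @ Z)
rhs = np.kron(np.eye(q), Y) @ vec(dZ) + np.kron(Z.T, np.eye(m)) @ vec(dY)
ok = np.allclose(lhs, rhs)
```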
Differentials of Linear Functions
- d(Ax) = d(xTAT): = A dx
- d(xTa) = d(aTx) = aT dx
- d(bxTa) = baT dx
- d(AXB): = (A dX B): = (BT ⊗ A) dX:
- d(aTXb) = (b ⊗ a)T dX: = (abT):T dX:
- d(aTXa) = d(aTXTa) = (a ⊗ a)T dX: = (aaT):T dX:
- [X[m#n]] d(AX): = (In ⊗ A) dX:
- [X[m#n]] d(XB): = (dX B): = (BT ⊗ Im) dX:
- [x[n]] d(xbT): = (dx bT): = (b ⊗ In) dx
- d(AXTB): = (BT ⊗ A) dXT:
- d(aTXTb) = (a ⊗ b)T dX: = (abT):T dXT: = (baT):T dX:
- d(|x|) = |x|-1xT dx
- [x: Complex] d(xHA): = AT dxC
- d(X[m#n] ⊗ A[p#q]): = (In ⊗ Tq,m ⊗ Ip)(Imn ⊗ A:) dX: = (Inq ⊗ Tm,p)(In ⊗ A: ⊗ Im) dX:
- d(A[p#q] ⊗ X[m#n]): = (Iq ⊗ Tn,p ⊗ Im)(A: ⊗ Imn) dX: = (Tm,n ⊗ Ipq)(In ⊗ A: ⊗ Im) dX:
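Two of the workhorse identities above, d(AXB): = (BT ⊗ A) dX: and the vectorized transpose XT: = Tm,nX:, can be checked as follows (an illustrative sketch; Tm,n is built here as an explicit permutation matrix rather than by the manual's TVEC routine).

```python
import numpy as np

# Check vec(A dX B) = (B^T ⊗ A) vec(dX) and the vectorized-transpose
# matrix T_{m,n} satisfying vec(X^T) = T_{m,n} vec(X).
rng = np.random.default_rng(5)
p, m, n, q = 2, 3, 4, 5
A  = rng.standard_normal((p, m))
B  = rng.standard_normal((n, q))
dX = rng.standard_normal((m, n))
vec = lambda M: M.flatten(order="F")

ok1 = np.allclose(vec(A @ dX @ B), np.kron(B.T, A) @ vec(dX))

T = np.zeros((m * n, m * n))
for i in range(m):
    for j in range(n):
        T[i * n + j, j * m + i] = 1      # vec(X)[j*m+i] -> vec(X^T)[i*n+j]
ok2 = np.allclose(T @ vec(dX), vec(dX.T))
```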
Differentials of Quadratic Products
- d((Ax+b)TC(Dx+e)) = ((Ax+b)TCD + (Dx+e)TCTA) dx
- d(xTCx) = xT(C + CT) dx = [C=CT] 2xTC dx
- d((Ax+b)T(Dx+e)) = ((Ax+b)TD + (Dx+e)TA) dx
- d((Ax+b)T(Ax+b)) = 2(Ax+b)TA dx
- d((Ax+b)TC(Ax+b)) = [C=CT] 2(Ax+b)TCA dx
- d((Ax+b)HC(Dx+e)) = (Ax+b)HCD dx + (Dx+e)TCTAC dxC
- d((Ax+b)HC(Ax+b)) = (Ax+b)HCA dx + (Ax+b)TCTAC dxC = [C=CH] 2((Ax+b)HCA dx)R
- d((Ax+b)H(Ax+b)) = 2((Ax+b)HA dx)R
- d(xHCx) = xHC dx + xTCT dxC = [C=CH] 2(xHC dx)R
- d(xHx) = 2(xH dx)R
- d(aTXTXb) = (X(abT + baT)):T dX:
- d(aTXTXa) = 2(XaaT):T dX:
- d(aTXTCXb) = (CTXabT + CXbaT):T dX:
- d(aTXTCXa) = ((C + CT)XaaT):T dX: = [C=CT] 2(CXaaT):T dX:
- d((Xa+b)TC(Xa+b)) = ((C + CT)(Xa+b)aT):T dX:
- [X[n#n]] d(X2): = (X dX + dX X): = (In ⊗ X + XT ⊗ In) dX:
- [X[m#n]] d(XTCX): = (In ⊗ XTC) dX: + (XTCT ⊗ In) dXT: = (In ⊗ XTC + Tn,n(In ⊗ XTCT)) dX:
- [X[m#n], C[m#m]=CT] d(XTCX): = (In×n + Tn,n)(In ⊗ XTC) dX:
- [X[m#n], C[m#m]=CT] d(diag(XTCX)) = 2diag(XTC dX)
- [X[m#n]] d(XTX): = (In ⊗ XT) dX: + (XT ⊗ In) dXT: = (In×n + Tn,n)(In ⊗ XT) dX:
- [X[m#n]] d(XHCX): = (XHC dX): + (d(XH) CX): = (In ⊗ XHC) dX: + (XTCT ⊗ In) dXH:
- grad((Ax+b)H(Ax+b)) = 2AH(Ax+b)
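The last result, the complex least-squares gradient grad((Ax+b)H(Ax+b)) = 2AH(Ax+b), can be verified numerically (a sketch, not from the manual) using the definition grad f = ∂f/∂xR + j ∂f/∂xI.

```python
import numpy as np

# Check grad((Ax+b)^H (Ax+b)) = 2 A^H (Ax+b) against finite differences
# of f(x) = |Ax+b|^2 with respect to the real and imaginary parts of x.
rng = np.random.default_rng(6)
m, n = 5, 3
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
b = rng.standard_normal(m) + 1j * rng.standard_normal(m)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
f = lambda v: np.linalg.norm(A @ v + b) ** 2

h = 1e-6
g_num = np.zeros(n, dtype=complex)
for k in range(n):
    e = np.zeros(n); e[k] = h
    g_num[k] = ((f(x + e) - f(x - e))
                + 1j * (f(x + 1j * e) - f(x - 1j * e))) / (2 * h)
ok = np.allclose(g_num, 2 * A.conj().T @ (A @ x + b), atol=1e-4)
```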
Differentials of Cubic Products
- d(xxTAx) = (xxT(A + AT) + xTAx × I) dx
- d(xxTx) = (2xxT + xTx × I) dx
- [X[m#n]] d(XAXTBX): = (XTBTXAT ⊗ Im + In ⊗ XAXTB) dX: + (XTB ⊗ XA) dXT: = (XTBTXAT ⊗ Im + Tn,m(XA ⊗ XTB) + In ⊗ XAXTB) dX:
- [X[m#n]] d(XXTX): = (XTX ⊗ Im + In ⊗ XXT) dX: + (XT ⊗ X) dXT: = (XTX ⊗ Im + Tn,m(X ⊗ XT) + In ⊗ XXT) dX:
- [X[m#n]] d(XAXBX): = (XTBTXTAT ⊗ Im + XTBT ⊗ XA + In ⊗ XAXB) dX:
- [X[n#n]] d(X3): = ((XT)2 ⊗ In + XT ⊗ X + In ⊗ X2) dX:
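The X3 result above can be checked directly (an illustrative sketch): the exact first-order term of (X + dX)3 is dX X2 + X dX X + X2 dX, whose vectorization should match the stated Kronecker form.

```python
import numpy as np

# Check d(X^3): = ((X^T)^2 ⊗ I + X^T ⊗ X + I ⊗ X^2) dX:
# against the exact first-order expansion dX X^2 + X dX X + X^2 dX.
rng = np.random.default_rng(7)
n = 4
X  = rng.standard_normal((n, n))
dX = rng.standard_normal((n, n))
vec = lambda M: M.flatten(order="F")
I = np.eye(n)

J = np.kron(X.T @ X.T, I) + np.kron(X.T, X) + np.kron(I, X @ X)
dY = dX @ X @ X + X @ dX @ X + X @ X @ dX
ok = np.allclose(J @ vec(dX), vec(dY))
```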
Differentials of Inverses
- d(X-1) = -X-1 dX X-1 [2.1]
- d(X-1): = -(X-T ⊗ X-1) dX:
- d(aTX-1b) = -(X-TabTX-T):T dX: = -(abT):T (X-T ⊗ X-1) dX: [2.9]
- d(tr(ATX-1B)) = d(tr(BTX-TA)) = -(X-TABTX-T):T dX: = -(ABT):T (X-T ⊗ X-1) dX:
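The basic inverse differential [2.1] can be checked with a finite difference (a sketch, not part of the manual):

```python
import numpy as np

# Check d(X^{-1}) = -X^{-1} dX X^{-1} by comparing with the
# finite difference ((X + hE)^{-1} - X^{-1}) / h for small h.
rng = np.random.default_rng(8)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # safely non-singular
E = rng.standard_normal((n, n))
h = 1e-7

Xinv = np.linalg.inv(X)
num = (np.linalg.inv(X + h * E) - Xinv) / h
ok = np.allclose(num, -Xinv @ E @ Xinv, atol=1e-5)
```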
Differentials of Trace
Note: matrix dimensions must result in an n#n argument for tr().
- d(tr(Y)) = tr(dY)
- d(tr(X)) = d(tr(XT)) = I:T dX: [2.4]
- d(tr(Xk)) = k(Xk-1)T:T dX:
- d(tr(AXk)) = (SUMr=0:k-1(XrAXk-r-1)T):T dX:
- d(tr(AX-1B)) = -(X-1BAX-1)T:T dX: = -(X-TATBTX-T):T dX: [2.5]
- d(tr(AX-1)) = d(tr(X-1A)) = -(X-TATX-T):T dX:
- d(tr(ATXBT)) = d(tr(BXTA)) = (AB):T dX: [2.4]
- d(tr(XAT)) = d(tr(ATX)) = d(tr(XTA)) = d(tr(AXT)) = A:T dX:
- d(tr(ATX-1BT)) = d(tr(BX-TA)) = -(X-TABX-T):T dX: = -(AB):T (X-T ⊗ X-1) dX:
- d(tr(AXBXTC)) = (ATCTXBT + CAXB):T dX:
- d(tr(XAXT)) = d(tr(AXTX)) = d(tr(XTXA)) = (X(A + AT)):T dX:
- d(tr(XTAX)) = d(tr(AXXT)) = d(tr(XXTA)) = ((A + AT)X):T dX:
- d(tr(XXT)) = d(tr(XTX)) = 2X:T dX:
- d(tr(AXBX)) = (ATXTBT + BTXTAT):T dX:
- d(tr((AXb+c)(AXb+c)T)) = 2(AT(AXb+c)bT):T dX:
- [C=CT] d(tr((XTCX)-1A)) = d(tr(A(XTCX)-1)) = -((CX(XTCX)-1)(A + AT)(XTCX)-1):T dX:
- [B=BT, C=CT] d(tr((XTCX)-1(XTBX))) = d(tr((XTBX)(XTCX)-1)) = 2(BX(XTCX)-1 - (CX(XTCX)-1)XTBX(XTCX)-1):T dX:
- [D=DH] d(tr((AXB+C)D(AXB+C)H)) = ((2AH(AXB+C)DBH):H dX:)R [2.6]
- d(tr((AXB+C)(AXB+C)H)) = ((2AH(AXB+C)BH):H dX:)R
- [D=DH] d(tr(XDXH)) = ((2XD):H dX:)R
- d(tr(XXH)) = (2X:H dX:)R
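One representative trace identity, d(tr(AXBX)) = (ATXTBT + BTXTAT):T dX:, can be checked numerically (an illustrative sketch):

```python
import numpy as np

# Check d(tr(AXBX)) = (A^T X^T B^T + B^T X^T A^T):^T dX:
# against a central finite difference of f(X) = tr(A X B X).
rng = np.random.default_rng(9)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))
E = rng.standard_normal((n, n))
vec = lambda M: M.flatten(order="F")
f = lambda M: np.trace(A @ M @ B @ M)

h = 1e-6
num = (f(X + h * E) - f(X - h * E)) / (2 * h)
G = A.T @ X.T @ B.T + B.T @ X.T @ A.T
ok = np.isclose(num, vec(G) @ vec(E), atol=1e-6)
```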
In the following expressions M# denotes the inverse of M or, if M is singular, any generalized inverse (including the pseudoinverse).
- [D=DH] argminX{tr((AXB+C)D(AXB+C)H)} = -(AHA)#AHCDBH(BDBH)# [2.7]
- [D=DH] argminX{tr((AX+C)D(AX+C)H)} = -(AHA)#AHC
- [D=DH] argminX{tr((AXB+C)HD(AXB+C))} = -(AHDA)#AHDCBH(BBH)#
- [D=DH] argminX{tr((AX+C)HD(AX+C))} = -(AHDA)#AHDC
- [D=DH] argminx{(Ax+c)HD(Ax+c)} = -(AHDA)#AHDc
- [D=DH, R=RH] argminX{tr((AXB+C)D(AXB+C)H + (AXP+Q)R(AXP+Q)H)} = -(AHA)#AH(CDBH + QRPH)(BDBH + PRPH)#
- [D=DH, R=RH] argminX{tr((AX+C)D(AX+C)H + (AX+Q)R(AX+Q)H)} = -(AHA)#AH(CD + QR)(D + R)#
- [D=DH, R=RH] argminX{tr((AXB+C)D(AXB+C)H + (AX)R(AX)H)} = -(AHA)#AH(CDBH)(BDBH + R)#
- [D=DH, R=RH] argminX{tr((XB+C)D(XB+C)H + XRXH)} = -(CDBH)(BDBH + R)#
- [D=DH] argminX{tr((AXB+C)D(AXB+C)H) | EXF=G} = (AHA)#(EH{E(AHA)#EH}#{E(AHA)#AHCDBH(BDBH)#F+G}{FH(BDBH)#F}#FH - AHCDBH)(BDBH)# [2.8]
- [D=DH] argminX{tr((AX+C)D(AX+C)H) | EX=G} = (AHA)#(EH{E(AHA)#EH}#{E(AHA)#AHC+G} - AHC)
- [D=DH] argminx{tr((Ax+c)D(Ax+c)H) | Ex=g} = (AHA)#(EH{E(AHA)#EH}#{E(AHA)#AHc+g} - AHc)
- [D=DH] argminX{tr((AXB+C)HD(AXB+C)) | EXF=G} = (AHDA)#(EH{E(AHDA)#EH}#{E(AHDA)#AHDCBH(BBH)#F+G}{FH(BBH)#F}#FH - AHDCBH)(BBH)#
- [D=DH] argminX{tr((AX+C)HD(AX+C)) | EX=G} = (AHDA)#(EH{E(AHDA)#EH}#{E(AHDA)#AHDC+G} - AHDC)
- [D=DH] argminx{(Ax+c)HD(Ax+c) | Ex=g} = (AHDA)#(EH{E(AHDA)#EH}#{E(AHDA)#AHDc+g} - AHDc)
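The unconstrained results can be spot-checked numerically. This sketch (an illustration, assuming A has full column rank so that (AHA)# is an ordinary inverse) tests argminX{tr((AX+C)D(AX+C)H)} = -(AHA)#AHC by sampling perturbed points.

```python
import numpy as np

# With D = D^H > 0 and A of full column rank, verify that
# X* = -(A^H A)^{-1} A^H C minimizes tr((AX+C) D (AX+C)^H).
rng = np.random.default_rng(10)
m, n, p = 6, 3, 4
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
C = rng.standard_normal((m, p)) + 1j * rng.standard_normal((m, p))
Bm = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
D = Bm @ Bm.conj().T + np.eye(p)         # D = D^H > 0

f = lambda X: np.trace((A @ X + C) @ D @ (A @ X + C).conj().T).real
X_opt = -np.linalg.solve(A.conj().T @ A, A.conj().T @ C)
ok = all(f(X_opt + 0.01 * (rng.standard_normal((n, p))
                           + 1j * rng.standard_normal((n, p))))
         >= f(X_opt) - 1e-9
         for _ in range(5))
```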
Differentials of Determinant
Note: matrix dimensions must result in an n#n argument for det(). Some of the expressions below involve inverses: these forms apply only if the quantity being inverted is square and non-singular; alternative forms involving the adjoint, ADJ(), do not have the non-singular requirement.
- d(det(X)) = d(det(XT)) = ADJ(XT):T dX: = det(X) (X-T):T dX: [2.10]
- d(det(ATXB)) = d(det(BTXTA)) = (A ADJ(ATXB)TBT):T dX: = [A,B: nonsingular] det(ATXB) × (X-T):T dX: [2.11]
- d(ln(det(ATXB))) = [A,B: nonsingular] (X-T):T dX: [2.12]
- d(ln(det(X))) = (X-T):T dX:
- d(det(Xk)) = d(det(X)k) = k × det(Xk) × (X-T):T dX: [2.13]
- d(ln(det(Xk))) = k × (X-T):T dX:
- d(det(XTCX)) = [C=CT] 2(CX ADJ(XTCX)):T dX: = 2det(XTCX) × (CX(XTCX)-1):T dX: [2.14]
  = [C=CT, CX: nonsingular] 2det(XTCX) × (X-T):T dX:
- d(ln(det(XTCX))) = [C=CT] 2(CX(XTCX)-1):T dX:
  = [C=CT, CX: nonsingular] 2(X-T):T dX:
- d(ln(det(DIAG(diag(XTCX))))) = [C=CT, X[m#n]] 2(CX ÷ (1m diag(XTCX)T)):T dX:
- d(det(XHCX)) = det(XHCX) × ((CTXC(XTCTXC)-1):T dX: + (CX(XHCX)-1):T dXC:) [2.15]
- d(ln(det(XHCX))) = (CTXC(XTCTXC)-1):T dX: + (CX(XHCX)-1):T dXC: [2.16]
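The log-determinant differential d(ln(det(X))) = (X-T):T dX:, which underlies many of the results above, can be checked numerically (an illustrative sketch):

```python
import numpy as np

# Check d(ln(det(X))) = (X^{-T}):^T dX: with a central finite difference.
rng = np.random.default_rng(11)
n = 4
X = rng.standard_normal((n, n)) + 2 * n * np.eye(n)   # keep det(X) > 0
E = rng.standard_normal((n, n))
vec = lambda M: M.flatten(order="F")

h = 1e-6
num = (np.log(np.linalg.det(X + h * E))
       - np.log(np.linalg.det(X - h * E))) / (2 * h)
ok = np.isclose(num, vec(np.linalg.inv(X).T) @ vec(E), atol=1e-6)
```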
Jacobian
dY/dX is called the Jacobian Matrix of Y: with respect to X: and JX(Y) = det(dY/dX) is the corresponding Jacobian. The Jacobian occurs when changing variables in an integration: Integral(f(Y)dY:) = Integral(f(Y(X)) det(dY/dX) dX:).
- JX(X[n#n]-1) = (-1)ndet(X)-2n
Hessian matrix
If f is a real function of x then the Hermitian matrix Hx f = (d/dx (df/dx)H)T is the Hessian matrix of f(x). A value of x for which grad f(x) = 0 corresponds to a minimum, maximum or saddle point according to whether Hx f is positive definite, negative definite or indefinite.
- [Real] Hx f = d/dx (df/dx)T
- Hx f is symmetric
- Hx(aTx) = 0
- Hx((Ax+b)TC(Dx+e)) = ATCD + DTCTA
- Hx((Ax+b)T(Dx+e)) = ATD + DTA
- Hx((Ax+b)TC(Ax+b)) = AT(C + CT)A = [C=CT] 2ATCA
- Hx((Ax+b)T(Ax+b)) = 2ATA
- Hx(xTCx) = C + CT = [C=CT] 2C
- Hx(xTx) = 2I
- [x: Complex] Hx f = (d/dx (df/dx)H)T = d/dxC (df/dx)T
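The quadratic-form Hessian Hx(xTCx) = C + CT can be verified by finite-differencing (a sketch, not part of the manual):

```python
import numpy as np

# Check H_x(x^T C x) = C + C^T using second-order central differences.
rng = np.random.default_rng(12)
n = 4
C = rng.standard_normal((n, n))
x = rng.standard_normal(n)
f = lambda v: v @ C @ v

h = 1e-5
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei = np.zeros(n); ei[i] = h
        ej = np.zeros(n); ej[j] = h
        H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                   - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
ok = np.allclose(H, C + C.T, atol=1e-4)
```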
This page is part of The Matrix Reference
Manual. Copyright © 1998-2022 Mike Brookes, Imperial
College, London, UK. See the file gfl.html for copying
instructions. Please send any comments or suggestions to "mike.brookes" at
"imperial.ac.uk".
Updated: $Id: calculus.html 11291 2021-01-05 18:26:10Z dmb $