For classification, we use a non-linear support vector machine
with a multi-channel kernel. We first select the overall most successful
channels and then choose the most successful combination of channels
for each action class individually. The figure to the left shows how often
each channel component occurs in the channel combinations optimized for
the KTH actions dataset and for our movie actions dataset.
We observe that HOG descriptors are chosen more frequently than HOFs, but both
are used in many channels. Among the grids, BoF representations are selected most frequently for movie actions. Finer grids, such as the horizontal 3x1 partitioning and 3-bin temporal grids, are frequently selected for the KTH dataset. This behavior is consistent with the fact that the KTH dataset is more structured in space-time than our movie actions dataset.
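The classification scheme described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes that each channel is a histogram, that channels are compared with a chi-square distance, that the per-channel distances are combined as exp(-sum_c D_c / A_c) with a per-channel normalizer A_c, and that the per-class channel combination is found by a greedy search that adds or removes one channel at a time. The names `chi2_distance`, `multichannel_kernel`, `greedy_select`, and the `score` callback (for example, a cross-validated accuracy for one action class) are hypothetical.

```python
import math


def chi2_distance(h1, h2):
    """Chi-square distance between two histograms of equal length."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def multichannel_kernel(x, y, channels, mean_dist):
    """Combine per-channel distances into one kernel value.

    x, y      : dicts mapping a channel name to its histogram
    channels  : iterable of channel names to use
    mean_dist : dict mapping a channel name to its normalizer A_c
                (assumed here to be a mean training-set distance)
    """
    s = sum(chi2_distance(x[c], y[c]) / mean_dist[c] for c in channels)
    return math.exp(-s)


def greedy_select(all_channels, score):
    """Greedy channel selection (an assumed search strategy).

    Starting from the empty set, toggle one channel at a time (add it if
    absent, remove it if present) and keep the change that gives the
    highest score. Stop when no single change improves the score.
    """
    selected = frozenset()
    best = float("-inf")
    while True:
        # Every set reachable by one addition or removal, skipping the empty set.
        candidates = [selected ^ {c} for c in all_channels]
        candidates = [s for s in candidates if s]
        if not candidates:
            return selected, best
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score <= best:
            return selected, best
        selected, best = top, top_score
```

In practice, `score` would train and validate a classifier for a single action class using the kernel restricted to the candidate channels, so that `greedy_select` is run once per class.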