Revised presentation slide for NLP-DL, 2016/6/22.
Recent Progress (from 2014) in Recurrent Neural Networks and Natural Language Processing.
Profile http://www.cl.ecei.tohoku.ac.jp/~sosuke.k/
Japanese ver. https://www.slideshare.net/hytae/rnn-63761483
Revised presentation slide for PFN Seminar, 2017/3/9.
Learning Communication with Neural Networks.
Presentation video: https://www.youtube.com/watch?v=ZrLiNAMHszo
(DL Hacks reading group) How to Train Deep Variational Autoencoders and Probabilistic Lad... - Masahiro Suzuki
This document discusses techniques for training deep variational autoencoders and probabilistic ladder networks. It proposes three advances: 1) Using an inference model similar to ladder networks with multiple stochastic layers, 2) Adding a warm-up period to keep units active early in training, and 3) Using batch normalization. These advances allow training models with up to five stochastic layers and achieve state-of-the-art log-likelihood results on benchmark datasets. The document explains variational autoencoders, probabilistic ladder networks, and how the proposed techniques parameterize the generative and inference models.
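To make the warm-up idea concrete, here is a minimal PyTorch sketch of a VAE loss with a KL warm-up coefficient. The linear schedule, the Bernoulli reconstruction term, and the `warmup_epochs` parameter are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar, epoch, warmup_epochs=100):
    """ELBO with a KL warm-up term.

    During the first `warmup_epochs` epochs the KL term is scaled by
    beta in [0, 1], which helps keep latent units active early in training.
    """
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    beta = min(1.0, epoch / warmup_epochs)  # linear warm-up schedule (illustrative)
    return recon + beta * kl
```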
This document summarizes a presentation about variational autoencoders (VAEs) presented at the ICLR 2016 conference. The document discusses 5 VAE-related papers presented at ICLR 2016, including Importance Weighted Autoencoders, The Variational Fair Autoencoder, Generating Images from Captions with Attention, Variational Gaussian Process, and Variationally Auto-Encoded Deep Gaussian Processes. It also provides background on variational inference and VAEs, explaining how VAEs use neural networks to model probability distributions and maximize a lower bound on the log likelihood.
Modern enterprise data, which tracks key performance indicators such as conversions or click-throughs, is extremely high-dimensional, and this forces a rethinking of data representation to make analysis tractable.
(DL reading group) Matching Networks for One Shot Learning - Masahiro Suzuki
1. Matching Networks is a neural network architecture proposed by DeepMind for one-shot learning.
2. The network learns to classify novel examples by comparing them to a small support set of examples, using an attention mechanism to focus on the most relevant support examples.
3. The network is trained using a meta-learning approach, where it learns to learn from small support sets to classify novel examples from classes not seen during training.
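As a rough illustration of point 2, the sketch below classifies a query by attending over an embedded support set. The cosine-similarity attention and the single shared embedding are simplifying assumptions; the full architecture also uses trained embedding networks and full context embeddings.

```python
import torch
import torch.nn.functional as F

def matching_predict(query_emb, support_embs, support_labels, n_classes):
    """Classify a query by attending over an embedded support set.

    query_emb:      (d,)    embedding of the test example
    support_embs:   (k, d)  embeddings of the k support examples
    support_labels: (k,)    integer class labels of the support examples
    """
    # Attention weights from cosine similarity between query and support items.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_embs, dim=1)
    attn = F.softmax(sims, dim=0)                            # (k,)
    one_hot = F.one_hot(support_labels, n_classes).float()   # (k, n_classes)
    # Predicted distribution is the attention-weighted sum of support labels.
    return attn @ one_hot                                    # (n_classes,)
```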
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention - Eun Ji Lee
1. The document summarizes a research paper on neural image caption generation using visual attention mechanisms. It introduces attention models that allow an image captioning model to focus on salient regions of the image dynamically.
2. It describes the image captioning model which uses an LSTM decoder conditioned on an encoded image representation and a context vector. The context vector is generated by taking a weighted sum of image features, with the weights determined by an attention model.
3. It discusses two types of attention mechanisms - "hard" or stochastic attention, which selects a single image location at each time step, and "soft" or deterministic attention, which blends all locations with learned weights. The model is trained end-to-end to maximize the likelihood of the ground-truth captions.
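A minimal sketch of the soft attention step described in point 3, assuming a small MLP scoring function; the parameter shapes (`W_f`, `W_h`, `v`) are illustrative, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def soft_attention_context(features, hidden, W_f, W_h, v):
    """Deterministic ("soft") attention over image feature vectors.

    features: (L, D)  annotation vectors from the CNN encoder (L locations)
    hidden:   (H,)    previous LSTM decoder state
    W_f, W_h, v: projection parameters of a small MLP scoring function
    Returns the context vector as a weighted sum of the feature vectors.
    """
    scores = torch.tanh(features @ W_f + hidden @ W_h) @ v   # (L,)
    alpha = F.softmax(scores, dim=0)                         # attention weights
    context = alpha @ features                               # (D,)
    return context, alpha
```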
Lecture 06 marco aurelio ranzato - deep learning - mustafa sarac
This document provides an overview of deep learning. It begins by contrasting traditional pattern recognition approaches with hierarchical compositional models used in deep learning. It then discusses different types of deep learning architectures including feedforward neural networks, convolutional neural networks, and recurrent neural networks. The document also covers unsupervised and supervised learning protocols for deep learning models. It emphasizes that deep learning models are able to learn complex functions by composing simpler nonlinear transformations.
The document discusses improved approaches to implementing dynamic tries in a space-efficient manner. It summarizes the Bonsai data structure, which supports dynamic trie operations in O(1) expected time but uses O(n log σ + n log log n) bits of space. The document then proposes a new approach called m-Bonsai that uses only O(n log σ) bits of space in expectation while also supporting O(1) expected-time operations, achieving the optimal space bound. Experimental results show m-Bonsai uses significantly less memory than Bonsai and has comparable or better performance.
The document discusses learning graphical models from data. It describes two main tasks: inference, which is computing answers to queries about a probability distribution described by a Bayesian network, and learning, which is estimating a model from data. It provides examples of learning for completely observed models, including maximum likelihood estimation for the parameters of a conditional Gaussian model. It also discusses supervised versus unsupervised learning of hidden Markov models, and techniques for dealing with small training sets like adding pseudocounts to estimates.
The document describes the sequence-to-sequence (seq2seq) model with an encoder-decoder architecture. It explains that the seq2seq model uses two recurrent neural networks - an encoder RNN that processes the input sequence into a fixed-length context vector, and a decoder RNN that generates the output sequence from the context vector. It provides details on how the encoder, decoder, and training process work in the seq2seq model.
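A minimal PyTorch sketch of this encoder-decoder structure; the GRU cells, embedding sizes, and shared vocabulary are illustrative choices rather than the exact configuration described in the document.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder compresses the input sequence into a
    fixed-length context (its final hidden state), and the decoder generates the
    output sequence conditioned on that context."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.embed(src_ids))   # fixed-length context vector
        dec_out, _ = self.decoder(self.embed(tgt_ids), context)
        return self.out(dec_out)                          # logits for each target step
```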
[DL reading group] Generative Models of Visually Grounded Imagination - Deep Learning JP
The document proposes a new model for visually grounded semantic imagination that can generate images from linguistic descriptions of concepts specified by attributes. The model uses a variational autoencoder with three inference networks to handle images, attributes, and missing modalities. It represents the attribute inference distribution as the product of expert Gaussians, allowing generation of concepts not seen during training by combining learned attributes. The paper introduces three criteria for evaluating such models: correctness, coverage, and compositionality.
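The product-of-experts combination has a simple closed form for diagonal Gaussians: precisions add, and the means are precision-weighted. The sketch below illustrates only that step; whether the prior is included as an extra expert is an assumption about the exact formulation.

```python
import numpy as np

def product_of_gaussian_experts(mus, logvars):
    """Combine diagonal-Gaussian experts q_i(z) = N(mu_i, var_i) into one Gaussian.

    For a product of Gaussians, precisions add:
        var = 1 / sum_i(1 / var_i),   mu = var * sum_i(mu_i / var_i)
    mus, logvars: lists of arrays of shape (d,), one entry per available modality.
    """
    precisions = [np.exp(-lv) for lv in logvars]
    var = 1.0 / sum(precisions)
    mu = var * sum(m * p for m, p in zip(mus, precisions))
    return mu, np.log(var)
```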
(Research group reading) Facial Landmark Detection by Deep Multi-task Learning - Masahiro Suzuki
The document summarizes a research paper on facial landmark detection using deep multi-task learning. It proposes a Tasks-Constrained Deep Convolutional Network (TCDCN) that uses facial landmark detection as the main task and related auxiliary tasks like pose estimation and attribute inference to improve performance. The TCDCN learns shared representations across tasks using a deep convolutional network. It introduces task-wise early stopping to halt learning on auxiliary tasks that reach optimal performance early to avoid overfitting and improve convergence on the main task of landmark detection. Experimental results showed the proposed approach outperformed existing methods.
The document summarizes the paper "Matching Networks for One Shot Learning". It discusses one-shot learning, where a classifier can learn new concepts from only one or a few examples. It introduces matching networks, a new approach that trains an end-to-end nearest neighbor classifier for one-shot learning tasks. The matching networks architecture uses an attention mechanism to compare a test example to a small support set and achieve state-of-the-art one-shot accuracy on Omniglot and other datasets. The document provides background on one-shot learning challenges and related work on siamese networks, memory augmented neural networks, and attention mechanisms.
This document provides an overview of VAE-type deep generative models, especially RNNs combined with VAEs. It begins with notations and abbreviations used. The agenda then covers the mathematical formulation of generative models, the Variational Autoencoder (VAE), variants of VAE that combine it with RNNs (VRAE, VRNN, DRAW), a Chainer implementation of Convolutional DRAW, other related models (Inverse DRAW, VAE+GAN), and concludes with challenges of VAE-like generative models.
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
[Paper Reading] Attention is All You Need - Daiki Tanaka
The document summarizes the "Attention Is All You Need" paper, which introduced the Transformer model for natural language processing. The Transformer uses attention mechanisms rather than recurrent or convolutional layers, allowing for more parallelization. It achieved state-of-the-art results in machine translation tasks using techniques like multi-head attention, positional encoding, and beam search decoding. The paper demonstrated the Transformer's ability to draw global dependencies between input and output with constant computational complexity.
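A minimal single-head sketch of the scaled dot-product attention at the core of the Transformer; multi-head projections, masking, and positional encodings are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (n_q, d_v)
```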
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
This document contains lecture notes on sparse autoencoders. It begins with an introduction describing the limitations of supervised learning and the need for algorithms that can automatically learn feature representations from unlabeled data. The notes then state that sparse autoencoders are one approach to learn features from unlabeled data, and describe the organization of the rest of the notes. The notes will cover feedforward neural networks, backpropagation for supervised learning, autoencoders for unsupervised learning, and how sparse autoencoders are derived from these concepts.
The document summarizes radial basis function (RBF) networks. Key points:
- RBF networks use radial basis functions as activation functions and can universally approximate continuous functions.
- They are local approximators compared to multilayer perceptrons which are global approximators.
- Learning involves determining the centers, widths, and weights. Centers can be randomly selected or via clustering. Widths are usually different for each basis function. Weights are typically learned via least squares or gradient descent methods.
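A minimal sketch of that pipeline: centers from k-means, a single shared width from a common heuristic (the summary notes widths are often set per basis function), and output weights by least squares. The heuristic and the use of scikit-learn's KMeans are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rbf_network(X, y, n_centers=10):
    """Fit a Gaussian RBF network: centers by clustering, weights by least squares."""
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_
    # Heuristic shared width: max inter-center distance / sqrt(2 * n_centers).
    d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
    sigma = d_max / np.sqrt(2 * n_centers)
    Phi = np.exp(-np.linalg.norm(X[:, None, :] - centers[None], axis=2) ** 2
                 / (2 * sigma ** 2))                     # design matrix (n, n_centers)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # output weights
    return centers, sigma, w

def rbf_predict(X, centers, sigma, w):
    Phi = np.exp(-np.linalg.norm(X[:, None, :] - centers[None], axis=2) ** 2
                 / (2 * sigma ** 2))
    return Phi @ w
```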
1. The document discusses using machine learning and deep learning techniques for trading, including classification, regression, clustering, and time series modeling with RNNs.
2. It provides an overview of different ML algorithms like decision trees, random forests, CNNs, RNNs and reinforcement learning and how they could be applied to problems in trading like predicting stock prices, generating trading signals, and portfolio optimization.
3. It presents some ideas for modeling trading problems using technical indicators or fundamental factors as inputs to classifiers, regressors or sequence models, and using reinforcement learning to optimize trading strategies.
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
From RNN to neural networks for cyclic undirected graphs - tuxette
This document discusses different neural network methods for processing graph-structured data. It begins by describing recurrent neural networks (RNNs) and their limitations for graphs, such as an inability to handle undirected or cyclic graphs. It then summarizes two alternative approaches: one that uses contraction maps to allow recurrent updates on arbitrary graphs, and one that employs a constructive architecture with frozen neurons to avoid issues with cycles. Both methods aim to make predictions at the node or graph level on relational data like molecules or web pages.
1. The document discusses various machine learning classification algorithms including neural networks, support vector machines, logistic regression, and radial basis function networks.
2. It provides examples of using straight lines and complex boundaries to classify data with neural networks. Maximum margin hyperplanes are used for support vector machine classification.
3. Logistic regression is described as useful for binary classification problems by using a sigmoid function and cross entropy loss. Radial basis function networks can perform nonlinear classification with a kernel trick.
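A minimal gradient-descent sketch of the logistic regression setup in point 3, minimizing cross-entropy with a sigmoid output; the learning rate and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Binary classifier trained by minimizing cross-entropy loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)            # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)   # gradient of cross-entropy w.r.t. w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```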
Machine listening is a field that encompasses research on a wide range of tasks, including speech recognition, audio content recognition, audio-based search, and content-based music analysis. In this talk, I will start by introducing some of the ways in which machine learning enables computers to process and understand audio in a meaningful way. Then I will draw on some specific examples from my dissertation showing techniques for automated analysis of live drum performances. Specifically, I will focus on my work on drum detection, which uses gamma mixture models and a variant of non-negative matrix factorization, and drum pattern analysis, which uses deep neural networks to infer high-level rhythmic and stylistic information about a performance.
Audio chord recognition using deep neural networks - bzamecnik
This document discusses using deep neural networks for audio chord recognition from music recordings. It describes the task of identifying chord labels for time segments of audio. The model uses convolutional and recurrent layers with chromagram features extracted from the audio as input. Evaluation shows the CNN+LSTM model achieves over 50% accuracy on a dataset of 180 annotated songs. Future work ideas include improving segmentation and exploring additional neural network architectures.
Generating Musical Notes and Transcription using Deep Learning - Varad Meru
Music has always been one of the most widely followed art forms, and a lot of research has gone into understanding it. In recent years, deep learning approaches for building unsupervised hierarchical representations from unlabeled data have gained significant interest. Progress in fields such as image processing and natural language processing has been substantial, but to my knowledge, methods for learning representations from auditory data have not been studied extensively. In this project I use two methods to generate music from a range of musical inputs, from MIDI to complex WAV formats, exploring RNN-RBMs and CDBNs.
The document provides an overview of Music Information Retrieval (MIR) techniques for analyzing music with computers. It discusses common MIR tasks like genre/mood classification, beat tracking, and music similarity. Recent approaches to music auto-tagging using deep learning are highlighted, such as using neural networks to learn features directly from audio rather than relying on hand-designed features. Recurrent neural networks are presented as a way to model temporal dependencies in music for applications like onset detection. As an example, the document describes a system for live drum transcription that uses onset detection, spectrogram slicing, and non-negative matrix factorization for source separation to detect drum activations in real-time performance audio.
Uncertainty Awareness in Integrating Machine Learning and Game Theory - Rikiya Takahashi
This document discusses integrating machine learning and game theory while accounting for uncertainty. It provides an example of previous work predicting travel time distributions on a road network using taxi data. It also discusses function approximation in reinforcement learning, noting that techniques like deep learning can represent functions with fewer parameters than nonparametric models like random forests. The document emphasizes avoiding unnecessary intermediate estimation steps and using approaches like fitted Q-iteration that are robust to estimation errors from small datasets.
This document summarizes a research paper that proposes a novel architecture for implementing a 1D lifting integer wavelet transform (IWT) using residue number system (RNS). The key aspects covered are:
1) RNS offers advantages over binary representations for digital signal processing by avoiding carry propagation. A ROM-based approach is proposed for RNS division.
2) The lifting scheme for discrete wavelet transforms is summarized, including split, predict, and update stages.
3) A novel RNS-based architecture is proposed using three main blocks - split, predict, and update - that repeat at each decomposition level. Pipelined implementations of the predict and update blocks are detailed.
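To illustrate the split/predict/update data flow (without modeling the RNS hardware blocks), here is one level of an integer lifting transform in plain Python; the 5/3 (LeGall) predict and update rules are an illustrative choice, not necessarily the filter used in the paper.

```python
def lifting_iwt53_one_level(x):
    """One level of a 5/3 lifting integer wavelet transform.

    Split the signal into even/odd samples, predict odd from even,
    then update even from the new detail coefficients.
    Assumes len(x) is even; boundary handling is simplified for illustration.
    """
    even = x[0::2]
    odd = x[1::2]
    # Predict: detail d[i] = odd[i] - floor((even[i] + even[i+1]) / 2)
    detail = [odd[i] - ((even[i] + even[min(i + 1, len(even) - 1)]) >> 1)
              for i in range(len(odd))]
    # Update: approx s[i] = even[i] + floor((d[i-1] + d[i] + 2) / 4)
    approx = [even[i] + ((detail[max(i - 1, 0)] + detail[i] + 2) >> 2)
              for i in range(len(even))]
    return approx, detail
```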
Tensor Spectral Clustering is an algorithm that generalizes graph partitioning and spectral clustering methods to account for higher-order network structures. It defines a new objective function called motif conductance that measures how partitions cut motifs like triangles in addition to edges. The algorithm represents a tensor of higher-order random walk transitions as a matrix and computes eigenvectors to find a partition that minimizes the number of motifs cut, allowing networks to be clustered based on higher-order connectivity patterns. Experiments on synthetic and real networks show it can discover meaningful partitions by accounting for motifs that capture important structural relationships.
This document discusses several methods for designing sequential circuits, including state tables, state assignment, and deriving flip-flop input equations. It then provides examples of implementing sequential circuits using ROMs, PLAs, CPLDs, and FPGAs. Specifically, it designs a comparator circuit and code converter as examples of iterative and sequential circuits. It also discusses implementing a parallel adder and shift register using an FPGA.
This document discusses several methods for designing sequential circuits, including state table reduction, state assignment, derivation of flip-flop input equations, and realization using logic gates. It provides an example of designing a comparator circuit using an iterative approach with identical cells. The document also describes implementing sequential circuits using ROMs, PLAs, CPLDs and FPGAs, giving examples of a code converter and parallel adder circuit designs for each method.
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION - ijwmn
Multiple-input multiple-output (MIMO) systems play an important role in recent wireless communication. The complexity of the different system models challenges researchers to find a good complexity-to-performance balance. Lattice reduction techniques and the Lenstra-Lenstra-Lovász (LLL) algorithm offer further resources to investigate and can contribute to complexity reduction. In this paper, we modify the LLL algorithm to reduce the number of computation operations by exploiting the structure of the upper triangular matrix without significant performance degradation. Basically, the first columns of the upper triangular matrix contain many zeros, so the original algorithm performs several operations with very limited benefit. We present a performance and complexity study, and our proposal shows that we can gain in terms of complexity while the performance remains almost the same.
SLAM of Multi-Robot System Considering Its Network Topology - toukaigi
This document proposes a new solution to the multi-robot simultaneous localization and mapping (SLAM) problem that takes into account the network topology between robots. Previous multi-robot SLAM research has expanded one-robot SLAM algorithms without considering how the relationship between robots changes over time. The proposed approach models the network structure and derives the mathematical formulation for estimating the multi-robot SLAM. It presents motion and observation update equations in an information filter framework that can be implemented in a decentralized way on individual robots. Future work will focus on specific challenges in multi-robot SLAM like map merging.
ON FINDING MINIMUM AND MAXIMUM PATH LENGTH IN GRID-BASED WIRELESS NETWORKS - ijwmn
The document discusses finding the minimum and maximum path lengths between cells in 3D grid-based wireless networks. It first derives formulas for the minimum path length between points in a 2D grid, showing that the minimum is the maximum difference between the corresponding coordinates, and then extends this to 3D grids. It determines that the maximum path length is the sum of the coordinate differences, while the minimum depends on the cell positions but is at most the maximum coordinate difference. It considers different cases to calculate the minimum path length between any source-destination cell pair.
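A small illustration of the two quantities, assuming unit-cost hops and that diagonal moves between neighboring cells are allowed; the paper's exact neighbor model may differ.

```python
def min_path_length(a, b):
    """Minimum hops when diagonal moves are allowed: the maximum coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def max_path_length(a, b):
    """Hops when moving along one axis at a time: the sum of coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Example for a 3D grid: from cell (0, 0, 0) to cell (2, 3, 1).
print(min_path_length((0, 0, 0), (2, 3, 1)))  # 3
print(max_path_length((0, 0, 0), (2, 3, 1)))  # 6
```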
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS - cscpconf
We present a new design for random number generation. The outputs of linear feedback shift registers (LFSRs) act as continuous inputs to the two boundaries of a one-dimensional (1-D) Elementary Cellular Automaton (ECA). The results show superior randomness features, and the output string has passed the Diehard statistical battery of tests. The design is a good candidate for parallel random number generation, has strong correlation immunity, and is inherently amenable to VLSI implementation.
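A minimal software sketch of the scheme: two LFSRs feed the left and right boundary cells of a 1-D elementary cellular automaton, and the history of the central cell is read out as the random bit stream. The LFSR tap positions, seeds, and the choice of ECA rule 30 are illustrative assumptions, not the paper's parameters.

```python
def lfsr_stream(seed, taps, nbits):
    """Fibonacci LFSR: output the low bit each step and shift in the XOR of the tap bits."""
    state = seed
    while True:
        out = state & 1
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = (state >> 1) | (fb << (nbits - 1))
        yield out

def eca_step(cells, left, right, rule=30):
    """One update of a 1-D elementary CA with externally supplied boundary bits."""
    padded = [left] + cells + [right]
    return [(rule >> (padded[i - 1] << 2 | padded[i] << 1 | padded[i + 1])) & 1
            for i in range(1, len(padded) - 1)]

def generate_bits(n, width=31):
    left_src = lfsr_stream(seed=0x1, taps=[0, 3], nbits=8)        # illustrative taps/seed
    right_src = lfsr_stream(seed=0x5, taps=[0, 2, 3, 4], nbits=8) # illustrative taps/seed
    cells = [0] * width
    cells[width // 2] = 1                                         # single seed cell
    out = []
    for _ in range(n):
        cells = eca_step(cells, next(left_src), next(right_src), rule=30)
        out.append(cells[width // 2])                             # central-cell readout
    return out
```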
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS - csitconf
This summarizes a document describing a new method for random number generation using linear feedback shift registers (LFSRs) as boundary conditions for a one-dimensional cellular automaton (CA). The outputs of two uncoupled LFSRs are used as inputs to the left and right boundary cells of the CA. Testing the output string of the central CA cell using the Diehard statistical tests showed it passed all tests, performing better than previous methods using fixed or periodic boundary conditions. The design exhibits good randomness, parallelism, and is suitable for VLSI implementation.
Comparative study of results obtained by analysis of structures using ANSYS, ... - IOSR Journals
The analysis of complex structures like frames, trusses and beams is carried out using the Finite Element Method (FEM) in software products like ANSYS and STAAD. The aim of this paper is to compare the deformation results of simple and complex structures obtained using these products. The same structures are also analyzed by a MATLAB program to provide a common reference for comparison. STAAD is used by civil engineers to analyze structures like beams and columns, while ANSYS is generally used by mechanical engineers for structural analysis of machines, automobile roll cages, etc. Since both products employ the same fundamental principle of FEM, there should be no difference in their results. The results, however, prove contradictory to this for complex structures. Since FEM is an approximate method, accuracy of the solutions cannot be a basis for their comparison and hence none of the varying results can be termed better or worse. Their comparison may, however, point to conservative results, significant digits and the magnitude of difference, so as to enable the analyst to select the software best suited for the particular application of his or her structure.
The paper examines the problem of systems redesign within the context of passive electrical networks and, through analogies, also provides the means of addressing issues of re-design of mechanical networks. The problems addressed here are special cases of the more general network redesign problem. Redesigning autonomous passive electric networks involves changing the network natural dynamics by modification of the types of elements, possibly their values, the interconnection topology, and possibly addition or elimination of parts of the network. We investigate the modelling of systems whose structure is not fixed but evolves during the system lifecycle. As such, this is a problem that differs considerably from a standard control problem, since it involves changing the system itself without control and aims to achieve the desirable system properties, as these may be expressed by the natural frequencies, through system re-engineering. In fact, this problem involves the selection of alternative values for dynamic and non-dynamic elements within a fixed interconnection topology and/or alteration of the network interconnection topology and possible evolution of the cardinality of physical elements (increase of elements, branches). The aim of the paper is to define an appropriate representation framework that allows the deployment of control theoretic tools for the re-engineering of properties of a given network. We use impedance and admittance modelling for passive electrical networks and develop a systems framework that is capable of addressing "life-cycle design issues" of networks, where the problems of alteration of existing topology and values of the elements, as well as issues of growth or death of parts of the network, are addressed.
We use the Natural Impedance/Admittance (NI-A) models and establish a representation of the different types of transformations on such models. This representation provides the means for an appropriate formulation of natural frequencies assignment using the Determinantal Assignment Problem framework defined on appropriately structured transformations. The developed natural representations of transformations are expressed as additive structured transformations. For the simpler case of RL or RC networks it is shown that the single-parameter variation problem (dynamic or non-dynamic) is equivalent to Root Locus problems.
follow IEEE NTUA SB on facebook:
https://www.facebook.com/IeeeNtuaSB
Continuum Modeling and Control of Large Nonuniform Networks - Yang Zhang
Presented at The 49th Annual Allerton Conference on Communication, Control, and Computing, 2011
Abstract—Recent research has shown that some Markov chains modeling networks converge to continuum limits, which are solutions of partial differential equations (PDEs), as the number of the network nodes approaches infinity. Hence we can approximate such large networks by PDEs. However, the previous results were limited to uniform immobile networks with a fixed transmission rule. In this paper we first extend the analysis to uniform networks with more general transmission rules. Then through location transformations we derive the continuum limits of nonuniform and possibly mobile networks. Finally, by comparing the continuum limits of corresponding nonuniform and uniform networks, we develop a method to control the transmissions in nonuniform and mobile networks so that the continuum limit is invariant under node locations, and hence mobility. This enables nonuniform and mobile networks to maintain stable global characteristics in the presence of varying node locations.
Transport and routing on coupled spatial networks - richardgmorris
This document discusses a model for route choice between two coupled spatial networks - a "fast but sparse" network and a "slow but dense" network. It defines metrics like average route distance, coupling between networks, and a Gini coefficient for betweenness centrality. Simulation results show two regimes based on the rewiring probability p between networks. For p > p*, optimization relies on routing behavior, while for p ≤ p* the Gini coefficient changes with coupling strength. The document advocates a problem-led approach to characterizing real-world coupled infrastructure systems.
This document discusses objectives related to number systems and conversion between binary, decimal, octal, and hexadecimal numbering systems. It covers arithmetic operations for whole numbers and fractions in different bases, as well as representation of negative binary numbers in sign-magnitude, one's complement, and two's complement forms. Key topics include conversion between numbering bases, binary arithmetic, representation of negative numbers, detecting overflow, and binary codes.
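A small worked example of the base conversions and of two's-complement representation covered in the document; the 8-bit width is an illustrative choice.

```python
def to_twos_complement(value, bits=8):
    """Two's-complement encoding of a signed integer as a bit string."""
    return format(value & ((1 << bits) - 1), f"0{bits}b")

def from_twos_complement(bit_string):
    """Decode a two's-complement bit string back to a signed integer."""
    bits = len(bit_string)
    value = int(bit_string, 2)
    return value - (1 << bits) if bit_string[0] == "1" else value

n = 0b101101                              # binary 101101
print(n, oct(n), hex(n))                  # 45 0o55 0x2d  (decimal, octal, hexadecimal)
print(to_twos_complement(-45))            # 11010011
print(from_twos_complement("11010011"))   # -45
```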
Using spectral radius ratio for node degree - IJCNCJournal
In this paper, we show that the spectral radius ratio for node degree can be used to analyze the variation of node degree during the evolution of complex networks. We focus on three commonly studied models of complex networks: random networks, scale-free networks and small-world networks. The spectral radius ratio for node degree is defined as the ratio of the principal (largest) eigenvalue of the adjacency matrix of a network graph to the average node degree. During the evolution of each of the above three categories of networks (using the appropriate evolution model for each category), we observe the spectral radius ratio for node degree to exhibit a high to very high positive correlation (0.75 or above) with the coefficient of variation of node degree (the ratio of the standard deviation of node degree to the average node degree). We show that the spectral radius ratio for node degree can be used as the basis to tune the operating parameters of the evolution models for each of the three categories of complex networks, as well as to analyze the impact of specific operating parameters for each model.
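A short numpy sketch of the metric as defined above (principal eigenvalue of the adjacency matrix divided by the average node degree), together with the coefficient of variation it is compared against; the star-graph example is illustrative.

```python
import numpy as np

def spectral_radius_ratio(adj):
    """Ratio of the adjacency matrix's principal eigenvalue to the average node degree."""
    eigvals = np.linalg.eigvals(adj)
    spectral_radius = max(abs(eigvals))   # principal (largest-magnitude) eigenvalue
    degrees = adj.sum(axis=1)
    return spectral_radius / degrees.mean()

def degree_coefficient_of_variation(adj):
    degrees = adj.sum(axis=1)
    return degrees.std() / degrees.mean()

# Example: a 4-node star graph (hub connected to three leaves).
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
print(spectral_radius_ratio(A))            # sqrt(3) / 1.5 ≈ 1.155
print(degree_coefficient_of_variation(A))
```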
Modelling Quantum Transport in Nanostructures - iosrjce
IOSR Journal of Electronics and Communication Engineering(IOSR-JECE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of electronics and communication engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in electronics and communication engineering. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
This document summarizes three methods for modeling quantum transport in nanostructures:
1) The non-equilibrium Green's function (NEGF) method provides a rigorous description of quantum transport by solving Poisson's equation and the quantum transport solver based on NEGF formalism self-consistently.
2) The recursive Green's function method computes the Green's function recursively without full matrix inversion, reducing computational efforts.
3) The Gauss estimation method computes spectral coefficients representing the Green's function to estimate current at discrete longitudinal field values rather than integrating over the entire field.
International Journal of Engineering and Science Invention (IJESI) - inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of Engineering, Science and Technology, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK-ON-CHIP - VLSICS Design
It is widely accepted that Network-on-Chip represents a promising solution for forthcoming complex embedded systems. Current SoC solutions are built from heterogeneous hardware and software components integrated around a complex communication infrastructure. The crossbar is a vital component of any NoC router. In this work, we have designed a crossbar interconnect for serial-bit data transfer and 128-bit parallel data transfer. We show a comparison between power and delay for serial-bit and parallel-bit data transfer through the crossbar switch. The design is implemented in 0.180 micron TSM technology. The bit rate achieved in serial transfer is slow compared with parallel data transfer. The simulation results show that the critical path delay is lower for parallel-bit data transfer but the power dissipation is higher.
New from BookNet Canada for 2025: BNC CataList - Tech Forum 2025 - BookNet Canada
Join BookNet Canada Associate Product Manager Vivian Luu for this presentation all about what’s new with BNC CataList over the last year. Learn about the new tag system, full book previews, bulk actions, and more. Watch to the end to see what’s ahead for CataList.
Learn more about CataList here: https://bnccatalist.ca/
Link to recording and transcript: https://bnctechforum.ca/sessions/new-from-booknet-canada-for-2025-bnc-catalist/
Presented by BookNet Canada on April 1, 2025 with support from the Department of Canadian Heritage.
Leveraging Knowledge Graphs for RAG: A Smarter Approach to Contextual AI Appl... - All Things Open
Presented at All Things Open AI 2025
Presented by David vonThenen - DigitalOcean
Title: Leveraging Knowledge Graphs for RAG: A Smarter Approach to Contextual AI Applications
Abstract: In the ever-evolving field of AI, retrieval-augmented generation (RAG) systems have become critical for delivering high-quality, contextually relevant answers in applications powered by large language models (LLMs). While vector databases have traditionally dominated RAG applications, graph databases, specifically knowledge graphs, offer a transformative approach to contextual AI that’s often overlooked. This approach provides unique advantages for applications requiring deep insights, intelligent search, and reasoning over both structured and unstructured sources, making it ideal for complex business scenarios.
Attendees will leave with an understanding of how to build a RAG system using a graph database and practical skills for data querying and insights retrieval. By comparing graph and vector database approaches, we’ll highlight when and why graph databases may offer superior benefits for managing complex data relationships. The session will provide concrete examples and advanced techniques, empowering participants to incorporate knowledge graphs into their AI systems for better data-driven outcomes and improved LLM performance. This discussion will conclude with a live demo showcasing key techniques and insights covered in this talk.
Find more info about All Things Open:
On the web: https://www.allthingsopen.org/
Twitter: https://twitter.com/AllThingsOpen
LinkedIn: https://www.linkedin.com/company/all-things-open/
Instagram: https://www.instagram.com/allthingsopen/
Facebook: https://www.facebook.com/AllThingsOpen
Mastodon: https://mastodon.social/@allthingsopen
Threads: https://www.threads.net/@allthingsopen
Bluesky: https://bsky.app/profile/allthingsopen.bsky.social
2025 conference: https://2025.allthingsopen.org/
Let's Create a GitHub Copilot Extension! - Nick Taylor, Pomerium - All Things Open
Presented at All Things Open AI 2025
Presented by Nick Taylor - Pomerium
Title: Let's Create a GitHub Copilot Extension!
Abstract: Get hands-on in this talk where we'll create a GitHub Copilot Extension from scratch.
We'll use the Copilot Extensions SDK, https://github.com/copilot-extensions/preview-sdk.js, and Hono.js, covering best practices like payload validation, progress notifications, and error handling.
We'll also go through how to set up a dev environment for debugging, including port forwarding to expose your extension during development as well as the Node.js debugger.
By the end, we'll have a working Copilot extension that the audience can try out live.
Mastering NIST CSF 2.0 - The New Govern Function.pdf - Bachir Benyammi
Mastering NIST CSF 2.0 - The New Govern Function
Join us for an insightful webinar on mastering the latest updates to the NIST Cybersecurity Framework (CSF) 2.0, with a special focus on the newly introduced "Govern" function delivered by one of our founding members, Bachir Benyammi, Managing Director at Cyber Practice.
This session will cover key components such as leadership and accountability, policy development, strategic alignment, and continuous monitoring and improvement.
Don't miss this opportunity to enhance your organization's cybersecurity posture and stay ahead of emerging threats.
Secure your spot today and take the first step towards a more resilient cybersecurity strategy!
Event hosted by Sofiane Chafai, ISC2 El Djazair Chapter President
Watch the webinar on our YouTube channel: https://youtu.be/ty0giFH6Qp0
You Don't Need an AI Strategy, But You Do Need to Be Strategic About AI - Jes... - All Things Open
Presented at All Things Open AI 2025
Presented by Jessica Hall - Hallway Studio
Title: You Don't Need an AI Strategy, But You Do Need to Be Strategic About AI
Abstract: There’s so much noise about creating an “AI strategy,” it’s easy to feel like you’re already behind. But here’s the thing: you don’t need an AI strategy or a data strategy. Those things need to serve your business strategy and that requires strategic thinking.
Here’s what you’ll get:
A clear understanding of why AI is a means to an end—not the end itself—and how to use it to solve problems traditional methods can’t touch.
How to align AI with strategy using questions like “Where do we play? How do we win?” from Roger L. Martin and A.G. Lafley.
What successful AI initiatives have in common: clear value, smart use of unique data, and meaningful business impact.
A checklist to evaluate AI opportunities—covering metrics, workflows, and the human factors that make or break AI efforts.
Revolutionizing GPU-as-a-Service for Maximum Efficiency - AI Infra Forum
In this session, we'll explore our cutting-edge GPU-as-a-Service solution designed to transform enterprise AI operations. Learn how our MemVerge.ai platform maximizes GPU utilization, streamlines workload management, and ensures uninterrupted operations through innovative features like Dynamic GPU Surfing. We'll dive into key use cases, from training large language models to enterprise-scale AI deployment. We'll demonstrate how our solution benefits various stakeholders – from platform engineers to data scientists and decision-makers. Discover how our platform optimizes costs while maintaining data security and sovereignty.
DON’T PANIC: AI IS COMING – The Hitchhiker’s Guide to AI - Mark Hinkle, Perip... - All Things Open
Presented at All Things Open AI 2025
Presented by Mark Hinkle - Peripety Labs
Title: DON’T PANIC: AI IS COMING – The Hitchhiker’s Guide to AI
Abstract: AI is coming of age, and much like discovering intergalactic travel, it’s equal parts thrilling and terrifying. Fears of job loss, doomsday scenarios, and bureaucratic AI overlords dominate the conversation—but I think the reality is far less apocalyptic and far more exciting. With the right guide, you can navigate this new universe, adapt, and even thrive. That’s what AllThingsOpen.AI is all about—building a community where people and businesses don’t just survive AI’s rise but flourish in it. So grab your towel, keep an open mind, and let’s explore the future—without the panic. Listen to Conference Co-Producer and publisher of the Artificially Intelligent Enterprise, Mark Hinkle, provide a vision on how AI will play out in our lives.
GDG Cloud Southlake #41: Shay Levi: Beyond the Hype: How Enterprises Are Using AI - James Anderson
Beyond the Hype: How Enterprises Are Actually Using AI
Webinar Abstract:
AI promises to revolutionize enterprises - but what’s actually working in the real world? In this session, we cut through the noise and share practical, real-world AI implementations that deliver results. Learn how leading enterprises are solving their most complex AI challenges in hours, not months, while keeping full control over security, compliance, and integrations. We’ll break down key lessons, highlight recent use cases, and show how Unframe’s Turnkey Enterprise AI Platform is making AI adoption fast, scalable, and risk-free.
Join the session to get actionable insights on enterprise AI - without the fluff.
Bio:
Shay Levi is the Co-Founder and CEO of Unframe, a company redefining enterprise AI with scalable, secure solutions. Previously, he co-founded Noname Security and led the company to its $500M acquisition by Akamai in just four years. A proven innovator in cybersecurity and technology, he specializes in building transformative solutions.
Don't just talk to AI, do more with AI: how to improve productivity with AI a... - All Things Open
Presented at All Things Open AI 2025
Presented by Sheng Liang - Acorn Labs
Title: Don't just talk to AI, do more with AI: how to improve productivity with AI agents
UiPath Automation Developer Associate Training Series 2025 - Session 8DianaGray10
In session 8, the final session of this series, you will learn about Implementation Methodology Fundamentals and about the additional self-paced study courses you will need to complete in order to finish the series and receive your credential.
Dev Dives: Unleash the power of macOS Automation with UiPathUiPathCommunity
Join us on March 27 to be among the first to explore UiPath innovative macOS automation capabilities.
This is a must-attend session for developers eager to unlock the full potential of automation.
📕 This webinar will offer insights on:
How to design, debug, and run automations directly on your Mac using UiPath Studio Web and UiPath Assistant for Mac.
We’ll walk you through local debugging on macOS, working with native UI elements, and integrating with key tools like Excel on Mac.
👨🏫 Speakers:
Andrei Oros, Product Management Director @UiPath
Silviu Tanasie, Senior Product Manager @UiPath
Leveraging Pre-Trained Transformer Models for Protein Function Prediction - T...All Things Open
Presented at All Things Open AI 2025
Presented by Tia Pope - North Carolina A&T
Title: Leveraging Pre-Trained Transformer Models for Protein Function Prediction
Abstract: Transformer-based models, such as ProtGPT2 and ESM, are revolutionizing protein sequence analysis by enabling detailed embeddings and advanced function prediction. This talk provides a hands-on introduction to using pre-trained open-source transformer models for generating protein embeddings and leveraging them for classification tasks. Attendees will learn to tokenize sequences, extract embeddings, and implement machine-learning pipelines for protein function annotation based on Gene Ontology (GO) or Enzyme Commission (EC) numbers. This session will showcase how pre-trained transformers can democratize access to advanced protein analysis techniques while addressing scalability and explainability challenges. After the talk, the speaker will provide a notebook to test basic functionality, enabling participants to explore the concepts discussed.
This is session #5 of the 5-session online study series with Google Cloud, where we take you on a journey of learning generative AI. You'll explore the dynamic landscape of Generative AI, gaining both theoretical insights and practical know-how of Google Cloud GenAI tools such as Gemini, Vertex AI, AI agents and Imagen 3.
UiPath NY AI Series: Session 3: UiPath Autopilot for Everyone with Clipboard AIDianaGray10
🚀 Embracing the Future: UiPath NY AI Series – Session 3: UiPath Autopilot for Everyone with Clipboard AI
📢 Event Overview
This session will provide a deep dive into how UiPath Clipboard AI and Autopilot are reshaping automation, offering attendees a firsthand look at their capabilities, use cases, and real-world benefits. Whether you're a developer, business leader, or automation enthusiast, you'll gain valuable insights into leveraging these AI-driven tools to streamline operations and maximize productivity. 🤖✨
12. [Jozefowicz+15]
[Cropped screenshot: RNN-RBM training equations (18)-(22) and the surrounding discussion of performance against two baseline models on motion capture (mean frame-level error, RBM sampling); only column fragments survive in this transcript.]
Figure 3. Receptive fields of 48 hidden units of an RNN-RBM trained on the bouncing balls dataset. Each square shows the input weights of a hidden unit as an image.
The human motion capture dataset [2] is represented by a sequence of joint angles, translations and rotations of the base of the spine in an exponential-map parameterization (Hsu et al., 2005; Taylor et al., 2007). Since the data consists of 49 real values per time step, we use the Gaussian RBM variant (Welling et al., 2005) for this task. We use up to 450 hidden units and an initial learning rate of 0.001. The mean squared prediction test error is 20.1 for the RTRBM and reduced substantially to 16.2 for the RNN-RBM.
6 Modeling sequences of polyphonic music
In this section, we show results for the main application of interest in this paper: probabilistic modeling of sequences of polyphonic music. We report our experiments on four datasets of varying complexity converted to our input format.
Piano-midi.de is a classical piano MIDI archive that was split according to Poliner & Ellis (2007).
Nottingham is a collection of 1200 folk tunes [3] with chords instantiated from the ABC format.
MuseData is an electronic library of orchestral and piano classical music from CCARH [4].
JSB chorales refers to the entire corpus of 382 four-part harmonized chorales by J. S. Bach with the split of Allan & Williams (2005).
2: people.csail.mit.edu/ehsu/work/sig05stf
3: ifdo.ca/~seymour/nottingham/nottingham.html
13. [Jozefowicz+15]
[Table caption fragment] … for Nottingham, N-dropout stands for Nottingham with nonzero dropout, and P stands for Piano-Midi.
Arch. 5M-tst 10M-v 20M-v 20M-tst
Tanh 4.811 4.729 4.635 4.582 (97.7)
LSTM 4.699 4.511 4.437 4.399 (81.4)
LSTM-f 4.785 4.752 4.658 4.606 (100.8)
LSTM-i 4.755 4.558 4.480 4.444 (85.1)
LSTM-o 4.708 4.496 4.447 4.411 (82.3)
LSTM-b 4.698 4.437 4.423 4.380 (79.83)
GRU 4.684 4.554 4.559 4.519 (91.7)
MUT1 4.699 4.605 4.594 4.550 (94.6)
MUT2 4.707 4.539 4.538 4.503 (90.2)
MUT3 4.692 4.523 4.530 4.494 (89.47)
Table 3. Perplexities on the PTB. The prefix (e.g., 5M) denotes the number of parameters in the model. The suffix "v" denotes validation negative log likelihood, the suffix "tst" refers to the test set. The perplexity for select architectures is reported in parentheses. We used dropout only on models that have 10M or 20M parameters, since the 5M models did not benefit from dropout at all, and most dropout-free models achieved a test perplexity of 108, and never greater than 120. In particular, the perplexity of the best models without dropout is below 110, which outperforms the results of Mikolov et al. (2014).
17.
There has been a resurgence of new structural designs for recurrent neural networks (RNNs). Most designs are derived from popular structures including vanilla RNNs, Long Short-Term Memory networks (LSTMs) [4] and Gated Recurrent Units (GRUs) [5]. Despite their varying characteristics, most of them share a common computational building block, described by the following equation:
    φ(Wx + Uz + b),    (1)
where x ∈ R^n and z ∈ R^m are state vectors coming from different information sources, W ∈ R^{d×n} and U ∈ R^{d×m} are state-to-state transition matrices, and b is a bias vector. This computational unit serves as a combinator for integrating information flow from x and z by a sum operation followed by a nonlinearity φ. We refer to it as the additive building block. Additive building blocks are widely implemented in various state computations in RNNs (e.g. hidden state updates of vanilla RNNs, gate/cell computations of LSTMs and GRUs).
We propose an alternative design for constructing the computational building block by changing the way of information integration. Specifically, instead of utilizing a sum operation, we use the Hadamard product "⊙" to fuse Wx and Uz:
    φ(Wx ⊙ Uz + b)    (2)
Structure Description and Analysis
General Formulation of Multiplicative Integration
The general idea behind Multiplicative Integration is to integrate different information flows Wx and Uz through the Hadamard product "⊙". A more general formulation of Multiplicative Integration includes bias vectors β1 and β2 added to Wx and Uz:
    φ((Wx + β1) ⊙ (Uz + β2) + b),
where β1, β2 ∈ R^d are bias vectors. Notice that such a formulation contains the first-order terms of the additive building block, i.e., β1 ⊙ Uh_{t-1} + β2 ⊙ Wx_t. In order to make the Multiplicative Integration more flexible, we introduce another bias vector α ∈ R^d to gate the term Wx ⊙ Uz, obtaining the following formulation:
    φ(α ⊙ Wx ⊙ Uz + β1 ⊙ Uz + β2 ⊙ Wx + b).
Note that the number of parameters of the Multiplicative Integration is about the same as that of the additive building block, since the number of new parameters (α, β1 and β2) is negligible compared to the total number of parameters. Also, Multiplicative Integration can be easily extended to LSTMs and GRUs, which adopt vanilla building blocks for computing gates and output states, where one can directly replace them with Multiplicative Integration. More generally, in any kind of structure where multiple information flows (k ≥ 2) are involved (e.g. an RNN with multiple skip connections or feedforward models like residual networks [12]), one can implement pairwise Multiplicative Integration for integrating all k information sources.
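To make the contrast between the additive and multiplicative building blocks concrete, here is a minimal numpy sketch of the two combinators. The dimensions, random initialization, constant bias values, and the tanh nonlinearity are illustrative assumptions, not the paper's exact setup.

import numpy as np

rng = np.random.default_rng(0)
d, n, m = 4, 3, 5                        # hidden size and dims of the two input streams (illustrative)

W = rng.standard_normal((d, n)) * 0.1    # transition matrix for x
U = rng.standard_normal((d, m)) * 0.1    # transition matrix for z
b = np.zeros(d)
alpha = np.ones(d)                       # gate on the second-order term
beta1 = np.full(d, 0.5)                  # first-order bias on Uz
beta2 = np.full(d, 0.5)                  # first-order bias on Wx

def additive_block(x, z):
    # Eq. (1): phi(Wx + Uz + b)
    return np.tanh(W @ x + U @ z + b)

def mi_block(x, z):
    # General Multiplicative Integration:
    # phi(alpha * Wx * Uz + beta1 * Uz + beta2 * Wx + b)
    wx, uz = W @ x, U @ z
    return np.tanh(alpha * wx * uz + beta1 * uz + beta2 * wx + b)

x = rng.standard_normal(n)
z = rng.standard_normal(m)
print(additive_block(x, z))
print(mi_block(x, z))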
25. Pixel Recurrent Neural Networks
Figure 2. Left: To generate pixel xi one conditions on all the previously generated pixels left and above of xi. Center: Illustration of a Row LSTM with a kernel of size 3. The dependency field of the Row LSTM does not reach pixels further away on the sides of the image. Right: Illustration of the two directions of the Diagonal BiLSTM. The dependency field of the Diagonal BiLSTM covers the entire available context in the image.
Figure 3. In the Diagonal BiLSTM, to allow for parallelization along the diagonals, the input map is skewed by offsetting each row by one position with respect to the previous row. When the spatial layer is computed left to right and column by column, the output map is shifted back into the original size. The convolution uses a kernel of size 2 × 1.
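The skewing step described in the Figure 3 caption is simple bookkeeping; below is a minimal numpy sketch of one way to skew and un-skew an n × n input map so that each row is offset by one position. The zero padding and array shapes are assumptions for illustration, not the paper's implementation.

import numpy as np

def skew(x):
    # Offset row i by i positions, padding with zeros to width 2n - 1.
    n = x.shape[0]
    out = np.zeros((n, 2 * n - 1), dtype=x.dtype)
    for i in range(n):
        out[i, i:i + n] = x[i]
    return out

def unskew(x):
    # Shift the skewed map back into the original (n, n) size.
    n = x.shape[0]
    return np.stack([x[i, i:i + n] for i in range(n)])

a = np.arange(16).reshape(4, 4)
assert np.array_equal(unskew(skew(a)), a)   # round trip recovers the input
print(skew(a))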
[…] (2015); Uria et al. (2014)). By contrast we model p(x) as a discrete distribution, with every conditional distribution […]
3.1. Row LSTM
The Row LSTM is a unidirectional layer that processes the image row by row from top to bottom, computing features for a whole row at once; the computation is performed with a one-dimensional convolution. For a pixel xi the layer captures a roughly triangular context above the pixel, as shown in Figure 2 (center). The kernel of the one-dimensional convolution has size k × 1; the larger the value of k, the broader the context that is captured. The weight sharing in the convolution ensures translation invariance of the computed features along each row.
The computation proceeds as follows. An LSTM layer has an input-to-state component and a recurrent state-to-state component that together determine the four gates inside the LSTM core. To enhance parallelization […]
Under review as a conference paper at ICLR 2016
[Figure panels: Standard LSTM block, 1d Grid LSTM Block, 2d Grid LSTM block, 3d Grid LSTM Block; labels include m, m′, h, h′, h1, h2, m1, m2 and the input I ∗ xi.]
Figure 1: Blocks form the standard LSTM and those that form Grid LSTM networks of N = 1, 2 and 3 dimensions. The dashed lines indicate identity transformations. The standard LSTM block does not have a memory vector in the vertical dimension; by contrast, the 2d Grid LSTM block has the memory vector m1 applied along the vertical dimension.
26.
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1556-1566, Beijing, China, July 26-31, 2015. © 2015 Association for Computational Linguistics
[Long Short-Term Memory net]works, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank).
1 Introduction
Most models for distributed representations of phrases and sentences—that is, models where real-valued vectors are used to represent meaning—fall into one of three classes: bag-of-words models, sequence models, and tree-structured models. In bag-of-words models, phrase and sentence representations are independent of word order; for example, they can be generated by averaging constituent word representations (Landauer and Dumais, 1997; Foltz et al., 1998). In contrast, sequence models construct sentence representations as an order-sensitive function of the sequence of tokens (Elman, 1990; Mikolov, 2012). Lastly, tree-structured models compose each phrase and sentence representation from its constituent subphrases according to a given syntactic structure over the sentence (Goller and Kuchler, 1996; Socher et al., 2011).
Figure 1: Top: A chain-structured LSTM network. Bottom: A tree-structured LSTM network with arbitrary branching factor.
Order-insensitive models are insufficient to fully capture the semantics of natural language due to their inability to account for differences in meaning as a result of differences in word order or syntactic structure (e.g., "cats climb trees" vs. "trees climb cats"). We therefore turn to order-sensitive sequential or tree-structured models. In particular, tree-structured models are a linguistically attractive option due to their relation to syntactic interpretations of sentence structure. A natural question, then, is the following: to what extent (if at all) can we do better with tree-structured models as opposed to sequential models for sentence representation? In this paper, we work towards addressing this question by directly comparing a type of sequential model that has recently been used to achieve state-of-the-art results in several NLP tasks against its tree-structured generalization.
Due to their capability for processing arbitrary-length sequences, recurrent neural networks […]
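As a concrete illustration of generalizing the LSTM from a chain to a tree, here is a minimal numpy sketch of a Child-Sum Tree-LSTM-style node update, with one forget gate per child. The dimensions, initialization, and the tiny two-leaf tree are illustrative assumptions; this is a sketch of the standard formulation, not code from the paper.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 5, 4                          # illustrative input / hidden sizes
W = {g: rng.standard_normal((d_h, d_in)) * 0.1 for g in "ifou"}
U = {g: rng.standard_normal((d_h, d_h)) * 0.1 for g in "ifou"}
b = {g: np.zeros(d_h) for g in "ifou"}
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def child_sum_treelstm_node(x, child_h, child_c):
    # x: input at this node; child_h, child_c: lists of child hidden/cell states.
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros(d_h)
    i = sigmoid(W["i"] @ x + U["i"] @ h_sum + b["i"])
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum + b["o"])
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum + b["u"])
    c = i * u
    for h_k, c_k in zip(child_h, child_c):
        # one forget gate per child, conditioned on that child's hidden state
        f_k = sigmoid(W["f"] @ x + U["f"] @ h_k + b["f"])
        c += f_k * c_k
    h = o * np.tanh(c)
    return h, c

# two leaves, then their parent
h1, c1 = child_sum_treelstm_node(rng.standard_normal(d_in), [], [])
h2, c2 = child_sum_treelstm_node(rng.standard_normal(d_in), [], [])
h_root, c_root = child_sum_treelstm_node(rng.standard_normal(d_in), [h1, h2], [c1, c2])
print(h_root)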
Figure 4: Generation of left and right dependents of node w0. [The figure shows four LSTMs (GEN-L / GEN-NX-L and GEN-R / GEN-NX-R) generating the left dependents (w1, w2, w3) and right dependents (w4, w5, w6) of w0; the surrounding explanation, which describes how the representation of all left dependents (LD) is used when generating the right dependents of the same node, is cropped in this transcript.]
Generated by four LSTMs with tied We and tied Who.
Figure 3: Generation process of left (w1, w2, w3) and right dependents of w0.
[…] Who ∈ R^{|V|×d} the output matrix of our model, where |V| is the vocabulary size, s the word embedding size and d the hidden unit size. We use tied We and tied Who for the four LSTMs to reduce the number of parameters in our model. The four LSTMs also share their hidden states. Let H ∈ R^{d×(n+1)} denote the shared hidden states of all time steps and e(wt) the one-hot vector of wt. Then H[:,t] represents D(wt) at time step t, and the computation is:
    xt = We · e(wt')    (2a)
[…]
27. Figure 2: Attentional Encoder-Decoder model.
dj is calculated as the summation of the source hidden states weighted by αj(i):
    dj = Σ_{i=1}^{n} αj(i) hi.    (6)
To incorporate the attention mechanism into the decoding process, the context vector is used for the j-th word prediction by putting an additional hidden layer s̃j:
    s̃j = tanh(W [sj; dj] + b),    (7)
Figure 3: Proposed model: Tree-to-sequence attentional NMT model.
[…] the syntactic structure inherent in language. We propose a novel tree-based encoder in order to explicitly take the syntactic structure into consideration in the NMT model. We focus on the phrase structure of a sentence and construct a sentence vector from phrase vectors in a bottom-up fashion. The sentence vector in the tree-based encoder is the […]
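A minimal numpy sketch of the attention computation in Eqs. (6)-(7): attention weights over the encoder states, the weighted context vector dj, and the additional hidden layer s̃j. The dot-product scoring used to obtain αj(i), and all shapes and initializations, are illustrative assumptions; the model's actual score function may differ.

import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                       # source length and hidden size (illustrative)
H = rng.standard_normal((n, d))   # encoder hidden states h_1..h_n
s_j = rng.standard_normal(d)      # decoder state at step j
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

alpha = softmax(H @ s_j)                              # alpha_j(i): dot-product scores, normalized
d_j = alpha @ H                                       # Eq. (6): attention-weighted sum of encoder states
s_tilde = np.tanh(W @ np.concatenate([s_j, d_j]) + b) # Eq. (7): attentional layer for the j-th prediction
print(alpha, s_tilde)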
28. [Figure: parsing "The hungry cat" with actions NT(S), NT(NP), NT(VP), GEN, REDUCE; the stack St, output buffer Tt, and action history a<t feed a state ut that defines p(at).]
Figure 5: Neural architecture for defining a distribution over at given representations of the stack (St), output buffer (Tt) and history of actions (a<t). Details of the composition architecture of the NP, the action history LSTM, and the other elements of the stack are not shown. This architecture corresponds to the generator state at line 7 of Figure 4.
[…] of the forward and reverse LSTMs are concatenated, passed through an affine transformation and a tanh nonlinearity to become the subtree embedding. Because each of the child node embeddings (u, v, w in Fig. 6) is computed similarly (if it corresponds to an internal node), this composition function is a kind of recursive neural network.
4.2 Word Generation
4.4 Discriminative Parsing Model
A discriminative parsing model can be obtained by replacing the embedding of Tt at each time step with an embedding of the input buffer Bt. To train this model, the conditional likelihood of each sequence of actions given the input string is maximized.
5 Inference via Importance Sampling
Our generative model p(x, y) defines a joint distribution […]
3.5 Comparison to Other Models
Our generation algorithm differs from previous stack-based parsing/generation algorithms in two ways. First, it constructs rooted tree structures top down (rather than bottom up), and second, the transition operators are capable of directly generating arbitrary tree structures rather than, e.g., assuming binarized trees, as is the case in much prior work that has used transition-based algorithms to produce phrase-structure trees (Sagae and Lavie, 2005; Zhang and Clark, 2011; Zhu et al., 2013).
4 Generative Model
RNNGs use the generator transition set just presented to define a joint distribution on syntax trees (y) and words (x). This distribution is defined as a sequence model over generator transitions that is parameterized using a continuous space embedding of the algorithm state at each time step (ut); i.e.,
    p(x, y) = Π_{t=1}^{|a(x,y)|} p(at | a<t)
            = Π_{t=1}^{|a(x,y)|} exp(r_{at}ᵀ ut + b_{at}) / Σ_{a' ∈ A_G(Tt, St, nt)} exp(r_{a'}ᵀ ut + b_{a'}),
and where action-specific embeddings ra and bias vector b are parameters in Θ.
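The per-step factor above is a softmax restricted to the currently valid actions. Here is a minimal numpy sketch of that restricted softmax; the action inventory, the stand-in state vector ut, and the random embeddings ra are illustrative assumptions, not the model's actual parameterization.

import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # size of the state embedding u_t (illustrative)
actions = ["NT(S)", "NT(NP)", "NT(VP)", "GEN(the)", "GEN(cat)", "REDUCE"]
r = {a: rng.standard_normal(d) * 0.1 for a in actions}   # action embeddings r_a
b = {a: 0.0 for a in actions}                            # action biases b_a

def action_distribution(u_t, valid_actions):
    # Softmax over only the actions valid in the current state, mirroring the
    # normalization over A_G(T_t, S_t, n_t).
    scores = np.array([r[a] @ u_t + b[a] for a in valid_actions])
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return dict(zip(valid_actions, probs))

u_t = rng.standard_normal(d)             # stands in for the combined stack/buffer/history encoding
print(action_distribution(u_t, ["NT(NP)", "GEN(the)", "REDUCE"]))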
The representation of the algorithm state at time t, ut, is computed by combining the representation of the generator's three data structures: the output […] a standard RNN encoding architecture. The stack (S) is more complicated for two reasons. First, the elements of the stack are more complicated objects than symbols from a discrete alphabet: open nonterminals, terminals, and full trees are all present on the stack. Second, it is manipulated using both push and pop operations. To efficiently obtain representations of S under push and pop operations, we use stack LSTMs (Dyer et al., 2015).
4.1 Syntactic Composition Function
When a REDUCE operation is executed, the parser pops a sequence of completed subtrees and/or tokens (together with their vector embeddings) from the stack and makes them children of the most recent open nonterminal on the stack, "completing" the constituent. To compute an embedding of this new subtree, we use a composition function based on bidirectional LSTMs, which is illustrated in Fig. 6.
Figure 6: Syntactic composition function based on bidirectional LSTMs that is executed during a REDUCE operation; the network on the right models the structure on the left.
The first vector read by the LSTM in both the forward and reverse directions is an embedding of the […]
[Dyer+16]
Input: The hungry cat meows .
Stack Buffer Action
0 The | hungry | cat | meows | . NT(S)
1 (S The | hungry | cat | meows | . NT(NP)
2 (S | (NP The | hungry | cat | meows | . SHIFT
3 (S | (NP | The hungry | cat | meows | . SHIFT
4 (S | (NP | The | hungry cat | meows | . SHIFT
5 (S | (NP | The | hungry | cat meows | . REDUCE
6 (S | (NP The hungry cat) meows | . NT(VP)
7 (S | (NP The hungry cat) | (VP meows | . SHIFT
8 (S | (NP The hungry cat) | (VP meows . REDUCE
9 (S | (NP The hungry cat) | (VP meows) . SHIFT
10 (S | (NP The hungry cat) | (VP meows) | . REDUCE
11 (S (NP The hungry cat) (VP meows) .)
Figure 2: Top-down parsing example.
Stackt = S, Termst = T, Open NTst = n; action NT(X): Stackt+1 = S | (X, Termst+1 = T, Open NTst+1 = n + 1
Stackt = S, Termst = T, Open NTst = n; action GEN(x): Stackt+1 = S | x, Termst+1 = T | x, Open NTst+1 = n
Stackt = S | (X | τ1 | ... | τℓ, Termst = T, Open NTst = n; action REDUCE: Stackt+1 = S | (X τ1 ... τℓ), Termst+1 = T, Open NTst+1 = n − 1
Figure 3: Generator transitions. Symbols defined as in Fig. 1 with the addition of T representing the history of generated terminals.
Stack Terminals Action
0 NT(S)
1 (S NT(NP)
2 (S | (NP GEN(The)
3 (S | (NP | The The GEN(hungry)
4 (S | (NP | The | hungry The | hungry GEN(cat)
5 (S | (NP | The | hungry | cat The | hungry | cat REDUCE
6 (S | (NP The hungry cat) The | hungry | cat NT(VP)
7 (S | (NP The hungry cat) | (VP The | hungry | cat GEN(meows)
8 (S | (NP The hungry cat) | (VP meows The | hungry | cat | meows REDUCE
9 (S | (NP The hungry cat) | (VP meows) The | hungry | cat | meows GEN(.)
10 (S | (NP The hungry cat) | (VP meows) | . The | hungry | cat | meows | . REDUCE
11 (S (NP The hungry cat) (VP meows) .) The | hungry | cat | meows | .
29. [Bowman+16]
[Figure (a) schematic: a buffer holding "down", "sat" and a stack holding "cat", "the", connected by 'tracking', 'transition', and 'composition' layers.]
(a) The SPINN model unrolled for two transitions during the processing of the sentence the cat sat down. 'Tracking', 'transition', and 'composition' are neural network layers. Gray arrows indicate connections which are blocked by a gating function.
[Figure (b) schematic: buffer and stack contents at t = 0 through t = 7 = T, ending with the stack holding (the cat) (sat down), which is the output to the model for the semantic task.]
(b) The fully unrolled SPINN for the cat sat down, with neural network layers omitted for clarity.
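To illustrate the shift/reduce control flow that the figures depict, here is a minimal numpy sketch of a SPINN-like pass over "the cat sat down". The tracking LSTM is omitted, and the composition layer is a stand-in affine-plus-tanh function with made-up parameters, so this only shows the stack/buffer mechanics, not the actual model.

import numpy as np

rng = np.random.default_rng(0)
d = 4
embed = {w: rng.standard_normal(d) * 0.1 for w in ["the", "cat", "sat", "down"]}

def compose(left, right):
    # Stand-in for SPINN's 'composition' layer: a fixed affine + tanh sketch.
    W = np.ones((d, 2 * d)) * 0.05
    return np.tanh(W @ np.concatenate([left, right]))

def run_spinn_like(tokens, transitions):
    # "shift" moves the next token from the buffer onto the stack;
    # "reduce" composes the top two stack elements into one phrase vector.
    buffer = [embed[t] for t in tokens]
    stack = []
    for op in transitions:
        if op == "shift":
            stack.append(buffer.pop(0))
        else:  # reduce
            right, left = stack.pop(), stack.pop()
            stack.append(compose(left, right))
    return stack[-1]   # sentence representation fed to the semantic task

sentence_vec = run_spinn_like(
    ["the", "cat", "sat", "down"],
    ["shift", "shift", "reduce", "shift", "shift", "reduce", "reduce"],  # ((the cat) (sat down))
)
print(sentence_vec)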
30.
[…] a column vector with the structure lkj = (1 − j/J) − (k/d)(1 − 2j/J) (assuming 1-based indexing), with J being the number of words in the sentence, and d is the dimension of the embedding. This sentence representation, which we call position encoding (PE), means that the order of the words now affects mi. The same representation is used for questions, memory inputs and memory outputs.
Temporal Encoding: Many of the QA tasks require some notion of temporal context, i.e. in the example of Section 2, the model needs to understand that Sam is in the bedroom after […]
[…] represented by a one-hot vector of length V (where the vocabulary is of size V = 177, reflecting the simplistic nature of the QA language). The same representation is used for the question q and answer a. Two versions of the data are used, one that has 1000 training problems per task and a second larger one with 10,000 per task.
Model Details
Unless otherwise stated, all experiments used a K = 3 hops model with the adjacent weight sharing scheme. For all tasks that output lists (i.e. the answers are multiple words), we take each possible combination of possible outputs and record them as a separate answer vocabulary word.
Sentence Representation: In our experiments we explore two different representations for the sentences. The first is the bag-of-words (BoW) representation that takes the sentence xi = {xi1, xi2, ..., xin}, embeds each word and sums the resulting vectors: e.g. mi = Σj A xij and ci = Σj C xij. The input vector u representing the question is also embedded as a bag of words: u = Σj B qj. This has the drawback that it cannot capture the order of the words in the sentence, which is important for some tasks.
We therefore propose a second representation that encodes the position of words within the sentence. This takes the form: mi = Σj lj · A xij, where · is an element-wise multiplication. lj is a column vector with the structure given above.
4.2 ATTENTION MECHANISMS
Neural models with memories coupled to a differentiable addressing mechanism have been successfully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bahdanau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al., 2015). Since we are interested in associative memories we employed a "content" based attention. This has the property that the vector retrieved from our memory would not change if we randomly shuffled the memory. This is crucial for proper treatment of the input set X as such. In particular, our process block based on an attention mechanism uses the following:
    qt = LSTM(q*_{t−1})    (3)
    e_{i,t} = f(mi, qt)    (4)
    a_{i,t} = exp(e_{i,t}) / Σj exp(e_{j,t})    (5)
    rt = Σi a_{i,t} mi    (6)
    q*_t = [qt rt]    (7)
Figure 1: The Read-Process-and-Write model.
where i indexes through each memory vector mi (typically equal to the cardinality of X), qt is a query vector which allows us to read rt from the memories, f is a function that computes a single scalar from mi and qt (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. q*_t is the state which this LSTM evolves, and is formed by concatenating the query qt with the resulting attention readout rt. t is the index which indicates how many "processing steps" are being carried out to compute the state to be fed to the decoder. Note that permuting mi and mi' has no effect on the read vector rt.
4.3 READ, PROCESS, WRITE
Our model, which naturally handles input sets, has three components (the exact equations and implementation will be released in an appendix prior to publication):
• A reading block, which simply embeds each element xi ∈ X using a small neural network onto a memory vector mi (the same neural network is used for all i).
• A process block, which is an LSTM without inputs or outputs performing T steps of computation over the memories mi. This LSTM keeps updating its state by reading mi repeatedly using the attention mechanism described in the previous section. At the end of this block, its hidden state q*_T is an embedding which is permutation invariant to the inputs. See eqs. (3)-(7) for more details.
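A minimal numpy sketch of the content-based process block in Eqs. (3)-(7): an LSTM that takes no external input, repeatedly attending over the memory vectors. The single-layer LSTM parameterization, the dot-product scoring function f, and all sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, n_mem, T = 4, 5, 3
M = rng.standard_normal((n_mem, d))           # memory vectors m_i (from the reading block)

Wg = rng.standard_normal((4 * d, 2 * d)) * 0.1  # LSTM gates depend only on q*_{t-1}
bg = np.zeros(4 * d)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def process_block(M, T):
    q_star = np.zeros(2 * d)                  # [q_t, r_t], eq. (7)
    c = np.zeros(d)
    for _ in range(T):
        z = Wg @ q_star + bg
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        q = sigmoid(o) * np.tanh(c)           # eq. (3): q_t = LSTM(q*_{t-1})
        e = M @ q                              # eq. (4): dot-product content score
        a = np.exp(e - e.max()); a /= a.sum()  # eq. (5)
        r = a @ M                              # eq. (6)
        q_star = np.concatenate([q, r])        # eq. (7)
    return q_star                              # q*_T, permutation invariant to the memories

print(process_block(M, T))
# Shuffling the memories leaves the readout unchanged (up to float error).
print(np.allclose(process_block(M, T), process_block(M[::-1].copy(), T)))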
33.
ht = f(Wh [xt, d(ht−1)] + bh),    (4)
where d is the dropout function from Equation 2. [The rest of this column is cropped. It contrasts feed-forward networks, where dropout masks every fully-connected layer only once, with recurrent layers, where each training example is composed of a number of inputs and dropout results in hidden units being masked on every step; this raises the question of how to sample the mask, with two options: sample it once per sequence (per-sequence) or sample a new mask on every step (per-step). The two strategies are discussed in detail in Section 3.4.]
    ht = ot ∗ f(ct),
where it, ft, ot are input, output and forget gates at step t; gt is the vector of cell updates and ct is the updated cell vector used to update the hidden state ht; σ is the sigmoid function and ∗ is the element-wise multiplication.
Our approach is to apply dropout to the cell update vector ct as follows:
    ct = ft ∗ ct−1 + it ∗ d(gt)
In contrast, Moon et al. (2015) propose to apply dropout directly to the cell values and use per-sequence sampling:
    ct = d(ft ∗ ct−1 + it ∗ gt)
We will discuss the limitations of the approach of Moon et al. (2015) in Section 3.4 […]
Figure 1: Illustration of the three types of dropout; circles represent connections and hidden states where we apply dropout. [caption cropped]
For the GRU:
    gt = f(Wg [xt, rt ∗ ht−1] + bg)
    ht = (1 − zt) ∗ ht−1 + zt ∗ gt
Similarly to the LSTMs, we propose to apply dropout to the hidden state update vector:
    ht = (1 − zt) ∗ ht−1 + zt ∗ d(gt)
To the best of our knowledge, this work is the first to study the effect of recurrent dropout in such networks.
3.4 Dropout and memory
Before going further with the explanation […]
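A minimal numpy sketch contrasting the two ways of dropping out inside an LSTM cell described above: dropping only the cell update g_t versus dropping the cell value itself (as in Moon et al., 2015). For brevity the mask is resampled at every step; the per-sequence versus per-step sampling strategies of Section 3.4 are not modeled, and all sizes and parameters are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, p = 3, 4, 0.5                  # illustrative sizes and dropout rate
W = rng.standard_normal((4 * d_h, d_x + d_h)) * 0.1
b = np.zeros(4 * d_h)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def dropout(v, p):
    # Inverted dropout: zero each unit with probability p, rescale the survivors.
    mask = (rng.random(v.shape) >= p) / (1.0 - p)
    return v * mask

def lstm_step(x, h, c, drop_update=True):
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[:d_h]); f = sigmoid(z[d_h:2*d_h])
    o = sigmoid(z[2*d_h:3*d_h]); g = np.tanh(z[3*d_h:])
    if drop_update:
        c = f * c + i * dropout(g, p)    # proposed: drop only the cell update g_t
    else:
        c = dropout(f * c + i * g, p)    # Moon et al. (2015): drop the cell value itself
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for _ in range(5):
    h, c = lstm_step(rng.standard_normal(d_x), h, c, drop_update=True)
print(h, c)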
34.
hidden-to-hidden transformations. We introduce the batch-normalizing transform BN( · ; γ, β) into the LSTM as follows:
    (f̃t, ĩt, õt, g̃t)ᵀ = BN(Wh ht−1; γh, βh) + BN(Wx xt; γx, βx) + b    (6)
    ct = σ(f̃t) ⊙ ct−1 + σ(ĩt) ⊙ tanh(g̃t)    (7)
    ht = σ(õt) ⊙ tanh(BN(ct; γc, βc))    (8)
network by discarding the absolute scale of activations. We want to preserve the information in the network, by normalizing the activations in a training example relative to the statistics of the entire training data.
3 Normalization via Mini-Batch Statistics
Since the full whitening of each layer's inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1. For a layer with d-dimensional input x = (x(1) ... x(d)), we will normalize each dimension
    x̂(k) = (x(k) − E[x(k)]) / sqrt(Var[x(k)])
where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated.
Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing […]
Let the normalized values be x̂1...m and their linear transformations be y1...m. We refer to the transform
    BNγ,β : x1...m → y1...m
as the Batch Normalizing Transform. We present the BN Transform in Algorithm 1. In the algorithm, ε is a constant added to the mini-batch variance for numerical stability.
    Input: Values of x over a mini-batch: B = {x1...m}; Parameters to be learned: γ, β
    Output: {yi = BNγ,β(xi)}
    µB ← (1/m) Σ_{i=1}^{m} xi            // mini-batch mean
    σ²B ← (1/m) Σ_{i=1}^{m} (xi − µB)²   // mini-batch variance
    x̂i ← (xi − µB) / sqrt(σ²B + ε)       // normalize
    yi ← γ x̂i + β ≡ BNγ,β(xi)            // scale and shift
Algorithm 1: Batch Normalizing Transform, applied to activation x over a mini-batch.
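A minimal numpy sketch of the Batch Normalizing Transform in Algorithm 1 (the population statistics used at inference time are omitted). In the BN-LSTM of Eq. (6) this transform would be applied separately, with its own γ and β, to the recurrent term Wh ht−1 and the input term Wx xt before they are summed; the sizes below are illustrative.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Batch Normalizing Transform over a mini-batch (rows = examples).
    mu = x.mean(axis=0)                        # mini-batch mean
    var = x.var(axis=0)                        # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize
    return gamma * x_hat + beta                # scale and shift

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)) * 3.0 + 1.0    # mini-batch of 8 examples, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(6))   # ~0 mean, ~1 std per feature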
35.
Overfitting in machine learning is addressed by restricting the space of hypotheses considered. This can be accomplished by reducing the number of parameters or with an inductive bias for simpler models, such as early stopping. Further regularization can be achieved by incorporating more sophisticated prior knowledge. Keeping activations on a reasonable path can be difficult, especially across long sequences. With this in mind, we devise a regularizer for the state representation learned by RNNs that aims to encourage stability of the path taken through representation space. Specifically, we propose the following additional cost term for Recurrent Neural Networks:
    (1/T) Σ_{t=1}^{T} (‖ht‖₂ − ‖ht−1‖₂)²
where ht is the vector of hidden activations at time-step t, and the weight placed on this term is a hyperparameter controlling the amount of regularization. We call this penalty the norm-stabilizer, as it encourages the norms of the hiddens to be stable (i.e. approximately constant across time steps). Unlike the "temporal coherence" penalty of Jonschkowski & Brock (2015), our penalty does not require the representation to remain constant, only its norm.
In the absence of inputs and nonlinearities, a constant norm would imply orthogonality of the hidden-to-hidden transition matrix for simple RNNs (SRNNs). However, even with an orthogonal transition matrix, inputs and nonlinearities can still change the norm of the hidden state, resulting in instability. This makes targeting the hidden activations directly a more attractive option for encouraging norm stability. Stability becomes especially important when we apply the model to longer sequences at test time than those seen during training (the "training horizon").
arXiv:1511.08400
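A minimal numpy sketch of the norm-stabilizer penalty: it penalizes changes in the L2 norm of consecutive hidden states (here averaged over the T−1 consecutive pairs, a small simplification of the 1/T normalization above). The hidden-state sequences are made-up illustrations.

import numpy as np

def norm_stabilizer_penalty(H):
    # H: array of hidden states with shape (T, d).
    norms = np.linalg.norm(H, axis=1)
    return np.mean((norms[1:] - norms[:-1]) ** 2)

rng = np.random.default_rng(0)
H_stable = rng.standard_normal((10, 6))
H_stable /= np.linalg.norm(H_stable, axis=1, keepdims=True)   # constant norm -> zero penalty
H_drift = H_stable * np.linspace(1.0, 3.0, 10)[:, None]       # growing norm -> positive penalty
print(norm_stabilizer_penalty(H_stable), norm_stabilizer_penalty(H_drift))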
36.
Published as a conference paper at ICLR 2016
[Figure 2 schematic: a single English encoder feeding decoders for English (unsupervised), German (translation), and Tags (parsing).]
Figure 2: One-to-many Setting – one encoder, multiple decoders. This scheme is useful for either multi-target translation as in Dong et al. (2015) or between different tasks. Here, English and German imply sequences of words in the respective languages. The α values give the proportions of parameter updates that are allocated for the different tasks.
[…] for constituency parsing as used in (Vinyals et al., 2015a), (b) a sequence of German words for machine translation (Luong et al., 2015a), and (c) the same sequence of English words for autoencoders or a related sequence of English words for the skip-thought objective (Kiros et al., 2015).
3.2 MANY-TO-ONE SETTING
This scheme is the opposite of the one-to-many setting. As illustrated in Figure 3, it consists of multiple encoders and one decoder. This is useful for tasks in which only the decoder can be shared, for example, when our tasks include machine translation and image caption generation (Vinyals et al., 2015b). In addition, from a machine translation perspective, this setting can benefit from a large amount of monolingual data on the target side, which is a standard practice in machine translation systems and has also been explored for neural MT by Gulcehre et al. (2015).
[Figure 3 schematic: encoders for English (unsupervised), Image (captioning), and German (translation) feeding a single English decoder.]
Figure 3: Many-to-one setting – multiple encoders, one decoder. This scheme is handy for tasks in which only the decoders can be shared.
3.3 MANY-TO-MANY SETTING
Lastly, as the name describes, this category is the most general one, consisting of multiple encoders and multiple decoders.
[Figure 4 schematic: English and German encoders paired with decoders for German (translation), English (unsupervised), and German (unsupervised).]
Figure 4: Many-to-many setting – multiple encoders, multiple decoders. We consider this setting in a limited context of machine translation to utilize the large monolingual corpora in both the source and the target languages. Here, we consider a single translation task and two unsupervised autoencoder tasks.
[…] consist of ordered sentences, e.g., paragraphs. Unfortunately, in many applications, including machine translation, we only have sentence-level data where the sentences are unordered. To address that, we split each sentence into two halves; we then use one half to predict the other half.
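The α values above only control how parameter updates are divided among tasks. Here is a minimal sketch of that scheduling idea: tasks are sampled in proportion to α, and each "update" touches the shared encoder plus that task's decoder. The parameters and the update itself are placeholders, not an actual seq2seq model.

import numpy as np

rng = np.random.default_rng(0)
tasks = ["translation", "autoencoder", "parsing"]
alpha = np.array([0.6, 0.2, 0.2])                          # illustrative mixing proportions

shared_encoder = {"W_enc": np.zeros(3)}                    # updated by every task
decoders = {t: {"W_dec": np.zeros(3)} for t in tasks}      # task-specific parameters

def train_step(task):
    # Placeholder "gradient step": a real system would run the shared encoder
    # plus this task's decoder and update both with the task's loss.
    shared_encoder["W_enc"] += 0.01
    decoders[task]["W_dec"] += 0.01

# alpha controls what fraction of parameter updates each task receives.
for _ in range(1000):
    train_step(rng.choice(tasks, p=alpha))
print({t: round(float(decoders[t]["W_dec"][0]), 2) for t in tasks})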
40.
[Figure schematic: the source sentence "hello , my name is Tony Jebara ." is encoded into states h1…h8 (memory M); the decoder states s1…s4 produce "hi , Tony Jebara" given "<eos> hi , Tony". Panels: (a) Attention-based Encoder-Decoder (RNNSearch) with an attentive read of M; (b) Generate-Mode & Copy-Mode, with a softmax over the vocabulary and the source words, where Prob("Jebara") = Prob("Jebara", g) + Prob("Jebara", c); (c) State Update, which combines the embedding for "Tony" with a selective read for "Tony" from M.]
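A minimal numpy sketch of the generate-mode/copy-mode mixture shown in panel (b): one softmax over the union of vocabulary scores and per-source-position copy scores, after which the probability of a word is the sum of its generate share and the copy shares of all source positions holding that word. The score vectors here are random stand-ins for the scores a real decoder state would produce.

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["hi", ",", "Tony", "Jebara", "<unk>"]
source = ["hello", ",", "my", "name", "is", "Tony", "Jebara", "."]

gen_scores = rng.standard_normal(len(vocab))     # stand-in generate-mode scores
copy_scores = rng.standard_normal(len(source))   # stand-in copy-mode scores per source position

# One softmax over both score sets, then split into the two modes.
joint = softmax(np.concatenate([gen_scores, copy_scores]))
p_gen, p_copy = joint[:len(vocab)], joint[len(vocab):]

def word_prob(w):
    p = p_gen[vocab.index(w)] if w in vocab else 0.0
    p += sum(p_copy[j] for j, src in enumerate(source) if src == w)
    return p

print(word_prob("Jebara"))   # = Prob("Jebara", g) + Prob("Jebara", c)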
41. […] forms and their meanings is non-trivial (de Saussure, 1916). While some compositional relationships exist, e.g., morphological processes such as adding -ing or -ly to a stem have relatively regular effects, many words with lexical similarities convey different meanings, such as the word pairs lesson ⇔ lessen and coarse ⇔ course.
3 C2W Model
Our compositional character to word (C2W) model is based on bidirectional LSTMs (Graves and Schmidhuber, 2005), which are able to learn complex non-local dependencies in sequence models. An illustration is shown in Figure 1. The input of the C2W model (illustrated on bottom) is a single word type w, and we wish to obtain a d-dimensional vector used to represent w. This model shares the same input and output of a word lookup table (illustrated on top), allowing it to easily replace them in any network.
As input, we define an alphabet of characters C. For English, this vocabulary would contain an entry for each uppercase and lowercase letter as well as numbers and punctuation. The input word w is decomposed into a sequence of characters c1, . . . , cm, where m is the length of w. Each ci […]
[Figure 1 schematic: a word lookup table maps "cats" directly to its embedding (top); the C2W model instead feeds the characters c, a, t, s through a character lookup table and a Bi-LSTM whose final states are combined into the embedding for the word "cats" (bottom).]
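A minimal numpy sketch of the C2W idea: read a word's characters left-to-right and right-to-left and combine the two final states into a word embedding. Plain tanh RNN cells stand in for the paper's LSTM cells to keep the sketch short, and all sizes and initializations are illustrative.

import numpy as np

rng = np.random.default_rng(0)
chars = list("abcdefghijklmnopqrstuvwxyz")
d_c, d_h, d_w = 8, 10, 12                     # char, hidden, and word-embedding sizes
C = {c: rng.standard_normal(d_c) * 0.1 for c in chars}   # character lookup table

def make_rnn():
    return (rng.standard_normal((d_h, d_c)) * 0.1,
            rng.standard_normal((d_h, d_h)) * 0.1,
            np.zeros(d_h))

fwd, bwd = make_rnn(), make_rnn()
D_f = rng.standard_normal((d_w, d_h)) * 0.1   # combine final forward state ...
D_b = rng.standard_normal((d_w, d_h)) * 0.1   # ... and final backward state
b_w = np.zeros(d_w)

def run_rnn(params, seq):
    W_x, W_h, b = params
    h = np.zeros(d_h)
    for v in seq:
        h = np.tanh(W_x @ v + W_h @ h + b)
    return h

def c2w_embedding(word):
    xs = [C[c] for c in word]
    h_fwd = run_rnn(fwd, xs)            # characters read left to right
    h_bwd = run_rnn(bwd, xs[::-1])      # and right to left
    return D_f @ h_fwd + D_b @ h_bwd + b_w

print(c2w_embedding("cats"))
print(c2w_embedding("cat"))             # related word forms share most characters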
43.
In order to apply the REINFORCE algorithm (Williams, 1992; Zaremba & Sutskever, 2015) to the problem of sequence generation we cast our problem in the reinforcement learning (RL) framework (Sutton & Barto, 1988). Our generative model (the RNN) can be viewed as an agent, which interacts with the external environment (the words and the context vector it sees as input at every time step). The parameters of this agent define a policy, whose execution results in the agent picking an action. In the sequence generation setting, an action refers to predicting the next word in the sequence at each time step. After taking an action the agent updates its internal state (the hidden units of the RNN). Once the agent has reached the end of a sequence, it observes a reward. We can choose any reward function. Here, we use BLEU (Papineni et al., 2002) and ROUGE-2 (Lin & Hovy, 2003) since these are the metrics we use at test time. BLEU is essentially a geometric mean over n-gram precision scores as well as a brevity penalty (Liang et al., 2006); in this work, we consider up to 4-grams. ROUGE-2 is instead recall over bi-grams. Like in imitation learning, we have a training set of optimal sequences of actions. During training we choose actions according to the current policy and only observe a reward at the end of the sequence (or after the maximum sequence length), by comparing the sequence of actions from the current policy against the optimal action sequence. The goal of training is to find the parameters of the agent that maximize the expected reward. We define our loss as the negative expected reward:
    L_θ = − Σ_{w^g_1,...,w^g_T} p_θ(w^g_1, ..., w^g_T) r(w^g_1, ..., w^g_T) = −E_{(w^g_1,...,w^g_T) ∼ p_θ}[ r(w^g_1, ..., w^g_T) ],    (9)
where w^g_n is the word chosen by our model at the n-th time step, and r is the reward associated with the generated sequence. In practice, we approximate this expectation with a single sample from the distribution of actions implemented by the RNN (right hand side of the equation above and Figure 9 of the Supplementary Material). We refer the reader to prior work (Zaremba & Sutskever, 2015; Williams, 1992) for the full derivation of the gradients. Here, we directly report the partial derivatives and their interpretation. The derivatives w.r.t. parameters are:
    ∂L_θ/∂θ = Σ_t (∂L_θ/∂o_t) (∂o_t/∂θ)    (10)
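A minimal numpy sketch of the single-sample REINFORCE estimate described above: sample a sequence from the current policy, score it with a reward, and scale the log-probability gradients by that reward. The "policy" here is just a softmax layer over a frozen random state (no recurrent model), the reward is a unigram-overlap stand-in for BLEU/ROUGE, and no baseline is used, so this only illustrates Eqs. (9)-(10), not MIXER.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "red", "dog", "runs", "<eos>"]
V, d = len(vocab), 6
Wo = rng.standard_normal((V, d)) * 0.1       # output projection (illustrative policy parameters)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def reward(words):
    # Stand-in reward: unigram overlap with a reference (BLEU/ROUGE would go here).
    ref = ["a", "red", "dog", "runs"]
    return len(set(words) & set(ref)) / len(ref)

def sample_episode(T=4):
    # Sample a sequence from the current policy and accumulate REINFORCE gradients:
    # grad log p(w_t) scaled by the (baseline-free) sequence reward.
    h = rng.standard_normal(d)               # frozen fake hidden state for brevity
    grads, words = np.zeros_like(Wo), []
    for _ in range(T):
        p = softmax(Wo @ h)
        w = rng.choice(V, p=p)
        words.append(vocab[w])
        dlogp = -np.outer(p, h); dlogp[w] += h   # d log p(w) / d Wo for a softmax layer
        grads += dlogp
    return reward(words) * grads, words

g, words = sample_episode()
Wo += 0.1 * g                                # gradient ascent on expected reward
print(words, reward(words))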
Published as a conference paper at ICLR 2016
[Figure 3 schematic: the unrolled decoder; after the first step, the top-k words w′_{1,...,k} and their probabilities under p_θ(w | w′_{1,...,k}, h_t) are propagated into the next hidden state h_{t+1} = φ_θ(w′_{1,...,k}, h_t), with a cross-entropy (XENT) loss at each step.]
Figure 3: Illustration of the End-to-End BackProp method. The first steps of the unrolled sequence (here just the first step) are exactly the same as in a regular RNN trained with cross-entropy. However, in the remaining steps the input to each module is a sparse vector whose non-zero entries are the k largest probabilities of the distribution predicted at the previous time step. Errors are back-propagated through these inputs as well.
While this algorithm is a simple way to expose the model to its own predictions, the loss function optimized is still XENT at each time step. There is no explicit supervision at the sequence level while training the model.
3.2 SEQUENCE LEVEL TRAINING
We now introduce a novel algorithm for sequence level training, which we call Mixed Incremental Cross-Entropy Reinforce (MIXER). The proposed method avoids the exposure bias problem, and […]
[Cropped excerpt from a second sequence-level approach: the loss L is computed using a two-step procedure; in the forward pass, candidate sequences and violations (sequences with […]) are computed, and in the backward pass errors are back-propagated through the seq2seq RNNs. Unlike standard training, the first step requires […]. The accompanying figure lists candidate sequences per time step, e.g. "a red dog runs quickly today".]