This article positions deep learning at the intersection of artificial intelligence and cognitive science, as part of a long quest toward human intelligence. First, I introduce the recent development of large language models built on transformer-based architectures, such as BERT and GPT-3. Then, I explain what these models can and cannot do, and why. Two essential problems are identified: embodiment and symbol grounding. To address these problems, deep reinforcement learning with world models is currently being studied. Disentanglement is shown to be an important concept for identifying controllable factors. Lastly, I present my perspective on future advances and conclude the paper.