Learning to Compose and Reason with Language Tree Structures for Visual Grounding
· 2019
· Open Access
· DOI: https://doi.org/10.1109/tpami.2019.2911066
· OA: W2946086442
Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it requires comprehending a fine-grained and compositional language space. However, existing solutions rely on the association between holistic language features and visual features, neglecting the compositional reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RVG-TREE: Recursive Grounding Tree, inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and that the grounding confidence score can be recursively accumulated from the grounding scores returned by the sub-trees. RVG-TREE can be trained end-to-end using the Straight-Through Gumbel-Softmax estimator, which allows gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning.
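The end-to-end training described above hinges on the Straight-Through Gumbel-Softmax trick: the forward pass commits to a discrete (hard) merge decision, while the backward pass uses the soft relaxation, so gradients still reach the continuous score functions that drive tree construction. The sketch below is a minimal PyTorch illustration of that trick for greedily composing a binary tree over adjacent constituents; it is not the paper's released implementation, and `score_fn`, `compose_fn`, and the toy usage at the bottom are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits, tau=1.0):
    """Straight-Through Gumbel-Softmax: sample a hard one-hot choice in the
    forward pass, but back-propagate through the soft relaxation."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    # Hard one-hot values forward, soft gradients backward.
    return (y_hard - y_soft).detach() + y_soft

def compose_tree(leaves, score_fn, compose_fn, tau=1.0):
    """Greedily build a binary tree over a list of embeddings: at each step,
    score every adjacent pair, sample one pair with the straight-through
    Gumbel-Softmax, and merge it into a single parent node."""
    nodes = list(leaves)                                    # list of (d,) tensors
    while len(nodes) > 1:
        pairs = [torch.cat([nodes[i], nodes[i + 1]]) for i in range(len(nodes) - 1)]
        logits = torch.stack([score_fn(p) for p in pairs])  # (num_pairs,)
        choice = st_gumbel_softmax(logits, tau)             # one-hot over pairs
        k = int(choice.argmax())
        # Multiplying by the sampled gate keeps score_fn in the autograd graph.
        parent = compose_fn(nodes[k], nodes[k + 1]) * choice[k]
        nodes = nodes[:k] + [parent] + nodes[k + 2:]
    return nodes[0]

# Toy usage with hypothetical scoring/composition modules, for illustration only.
d = 8
words = [torch.randn(d, requires_grad=True) for _ in range(5)]
scorer = torch.nn.Linear(2 * d, 1)
root = compose_tree(words,
                    score_fn=lambda p: scorer(p).squeeze(-1),
                    compose_fn=lambda a, b: torch.tanh(a + b))
root.sum().backward()  # gradients reach both the word embeddings and the scorer
```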