Researchers Present New Work in Natural Language Processing
A form of artificial intelligence, natural language processing allows computers to understand spoken and written language. It underlies a growing number of applications, from search engines that rapidly scan the Web for factoids to smart assistants like Siri and Cortana that can book a table at a nearby restaurant. Advances in NLP have made the automated translation of a growing number of languages more accurate, including in real-time conversation. NLP also helps sift through mountains of text by extracting summaries of long documents and flagging important emails in a tide of incoming mail.
Each year, leading experts in natural language processing gather to share new research at the Empirical Methods in Natural Language Processing (EMNLP) conference, held this year in Lisbon from Sept. 19 to 21. Listed below are some of the studies that Columbia researchers will be presenting.
Faced with pages of text, it helps to know where the story gets interesting. Kathleen McKeown, a computer scientist who heads Columbia’s Data Science Institute, and graduate student Jessica Ouyang have come up with an automated method for finding a story’s crux, or its “most reportable event.” They mined posts on the online bulletin board Reddit and tracked changes in complexity, meaning and emotion to come up with a way to identify the most compelling part of a story. Their next step is to develop an automated summary, which could be useful in recommendation systems that suggest related stories, among other applications. A panel of EMNLP judges recently recognized the study as a “Notable Resource.”
The ability to detect sarcasm is crucial to understanding what’s really being said. Faced with the statement, “I love going to the dentist,” an automated sentiment-analysis system can have trouble telling whether the speaker likes going to the dentist or hates it. Columbia computer scientist Smaranda Muresan and graduate students Debanjan Ghosh and Weiwei Guo propose treating sarcasm detection as a word-disambiguation problem, in which the literal and sarcastic meanings of “love” are distinguished by their context. Using Twitter data, they show that this distributional-semantics technique produces a 10 percent improvement over current approaches.
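To give a flavor of the idea, here is a minimal sketch (not the authors’ implementation) of disambiguating a word’s use by its context: the words surrounding “love” are compared, via cosine similarity over bag-of-words vectors, against made-up prototype contexts for its literal and sarcastic senses.

```python
# Toy word-disambiguation sketch: classify a use of "love" as literal
# or sarcastic by comparing its surrounding context to prototype
# contexts. The prototypes below are hypothetical stand-ins for
# contexts that would be learned from labeled tweets.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical prototype contexts for each sense of "love".
LITERAL = Counter("great time friends amazing food happy weekend".split())
SARCASTIC = Counter("dentist monday traffic waiting ugh root canal".split())

def classify(context: str) -> str:
    """Pick the sense whose prototype context is closer to this one."""
    c = Counter(context.lower().split())
    return "sarcastic" if cosine(c, SARCASTIC) > cosine(c, LITERAL) else "literal"

print(classify("i love going to the dentist"))          # → sarcastic
print(classify("i love spending time with friends"))    # → literal
```

A real system would use dense distributional word vectors learned from large corpora rather than raw word counts, but the disambiguation framing is the same: the sense of the word is whichever one its context most resembles.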
Arabic is the fourth most common language on the Web, but tools to quickly gauge opinions and other subjective information from Arabic-language news stories, ratings, product reviews and other texts remain poorly developed. Computer scientist Owen Rambow and research associate Ramy Eskander have come up with a dictionary, or lexicon, of Arabic-language sentiment that achieves a 38 percent improvement over similar lexicons. Future work will look at ways to infer the correct meaning of words with multiple definitions and in ambiguous contexts, said Eskander.
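Once built, a sentiment lexicon is straightforward to apply: each entry maps a word to a polarity score, and a text is scored by aggregating the scores of the lexicon words it contains. The sketch below illustrates this with a few hypothetical entries; it is not drawn from the actual lexicon.

```python
# Minimal sketch of sentiment scoring with a lexicon. The entries
# below are hypothetical examples, not the Rambow-Eskander lexicon:
# each word maps to a polarity score in [-1, 1].
lexicon = {
    "جميل": 1.0,    # "beautiful" — positive
    "ممتاز": 1.0,   # "excellent" — positive
    "سيئ": -1.0,    # "bad" — negative
}

def score(tokens):
    """Average polarity of the lexicon words found in the token list."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

print(score(["الفيلم", "ممتاز"]))  # → 1.0 (one positive hit)
```

The hard part, as the article notes, is building the lexicon itself and handling words whose polarity shifts with context; the lookup step is the easy half.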
Teaching a computer to translate from one language to another involves lots of tedious parsing, or breaking sentences down into their parts of speech and labeling each part, be it a noun, verb or adjective. Computer scientist Michael Collins and graduate student Mohammad Sadegh Rasooli have come up with a way to automatically parse the syntax of new languages with the help of translation data. From political speeches to movie subtitles, translation data is more readily available than language-specific data, making it easier to develop high-accuracy automated syntactic parsers. The research has applications in machine translation, text summarization and information extraction from the Web.
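The core trick of using translation data can be sketched as annotation projection: given a word-aligned translation pair in which one side (say, English) already has part-of-speech labels, each label is carried across the alignment to the unlabeled word in the other language. The example below is a toy illustration under that assumption, not the authors’ actual method.

```python
# Toy annotation-projection sketch: transfer part-of-speech tags from
# a tagged English sentence to its untagged translation, using a
# word alignment. Sentences, tags and alignment are made up for
# illustration (target side is French).
english = ["the", "cat", "sleeps"]
tags    = ["DET", "NOUN", "VERB"]          # tags for the English words
target  = ["le", "chat", "dort"]
alignment = {0: 0, 1: 1, 2: 2}             # English index -> target index

# Project each English tag onto its aligned target word.
projected = {}
for e_i, t_i in alignment.items():
    projected[target[t_i]] = tags[e_i]

print(projected)  # {'le': 'DET', 'chat': 'NOUN', 'dort': 'VERB'}
```

Real alignments are noisy and many-to-many, so a practical system must filter or weight the projected labels before training a parser on them, which is where most of the research effort goes.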
— Kim Martineau