Beyond the Keyword: Natural Language Processing Expert Heng Ji Joins Rensselaer as Edward P. Hamilton Development Chair

Research Seeks To Develop More Robust Computer Searches Based on Human Speech

December 10, 2013

Heng Ji, an expert in natural language processing, has been appointed as the Edward P. Hamilton Development Chair and Tenured Associate Professor in Computer Science at Rensselaer Polytechnic Institute. Ji was most recently a faculty member in computer science at the City University of New York.

“Heng brings an extraordinary track record of success in a very strategic research area to Rensselaer,” said Laurie Leshin, dean of the School of Science at Rensselaer. “She’s a well-published expert in natural language processing and her intellectual leadership and energy will be critical in both The Rensselaer Institute for Data Exploration and Applications, and our work with the Watson at Rensselaer cognitive computing system. We are thrilled to welcome her to Rensselaer.”

Ji has recently been honored with several awards, including the best paper awards at the Institute of Electrical and Electronics Engineers (IEEE) 2013 International Conference on Data Mining, and the Society for Industrial and Applied Mathematics (SIAM) 2013 International Conference on Data Mining. IEEE Intelligent Systems also honored Ji as one of 10 young stars in the field of artificial intelligence, naming her one of “AI’s 10 to Watch” for 2013. The recognition is given to young researchers who have “made impressive research contributions and had an impact in the literature” of AI.

The Edward P. Hamilton Development Chair is supported by an endowment established in 1976 by Rensselaer alumnus Edward P. Hamilton, Class of 1907. The endowment is intended to “encourage excellence in education in all fields and at all levels at Rensselaer by recognizing and rewarding an outstanding faculty member and providing resources to pursue the development of new programs.” Hamilton was a member of the Rensselaer Board of Trustees for 20 years, and was actively involved in the Rensselaer Alumni Association. Ji’s research will support the School of Science interdisciplinary science theme in network and data science, security, and visualization.

Ji’s current research focuses on natural language processing, with emphasis on the design of efficient algorithms that can extract knowledge and information on a massive scale from Web-based sources such as social media posts, Wikipedia articles, and news reports.

“Computer searches currently have certain limitations. If you want to use Google, for example, you have to come up with intelligent keywords, you can only search in your own language, and your search may return thousands of documents,” said Ji. “A computer that could understand natural language could overcome those limitations, and our goal is to build that computer.”

In order to understand a natural language question and provide relevant answers, Ji and her team combine the power of a sophisticated linguistic analysis function with the automation of machine learning. The system then seeks connections between the question and the Web-based information sources.

As an example, Ji referred to a social media search program she and her team created for journalists at The New York Times.

“The reporters wanted to use Twitter to discover breaking news, so we built an interface that analyzes Twitter messages and searches them for news events,” said Ji. “It’s not a trivial problem because, even when people are discussing the same event, they tend to use different expressions. We had to develop a computer program that could correctly group these messages together as news events.”
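The grouping problem Ji describes can be illustrated with a small sketch. It is not the team's system: it assumes a crude word-overlap (Jaccard) measure as a stand-in for real linguistic analysis, and the messages, the stopword list, the 0.3 threshold, and the function names are all hypothetical.

```python
# Hypothetical stopword list; a real system would use far richer analysis.
STOPWORDS = {"a", "an", "the", "in", "on", "at", "of", "just", "now", "is", "to"}

def tokens(text):
    """Lowercase the words, strip simple punctuation, and drop stopwords."""
    words = {w.strip(".,!?#@").lower() for w in text.split()}
    return {w for w in words if w and w not in STOPWORDS}

def jaccard(a, b):
    """Share of distinct words that two token sets have in common."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_messages(messages, threshold=0.3):
    """Greedily put each message in the first group it overlaps with enough."""
    groups = []
    for msg in messages:
        toks = tokens(msg)
        for group in groups:
            if jaccard(toks, group["tokens"]) >= threshold:
                group["messages"].append(msg)
                group["tokens"] |= toks  # the group's vocabulary grows
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append({"tokens": set(toks), "messages": [msg]})
    return [g["messages"] for g in groups]

# Made-up messages: two events, each described in two different ways.
messages = [
    "Earthquake shakes downtown Los Angeles",
    "Big earthquake in Los Angeles just now",
    "Senate passes the budget bill",
    "Budget bill passes Senate vote",
]
events = group_messages(messages)
```

Word overlap only works when the messages share vocabulary. Messages that describe the same event in entirely different words would land in separate groups, which is the harder case the quote points to.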

As a first step in its process, the team manually labels a sampling of documents – for example, Tweets of news events – to establish a set of ground rules for natural language. The computer uses a machine learning algorithm to apply the ground rules to additional documents it examines. The same process can also be used on Web pages, blogs, and multimedia sources.
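The label-then-learn workflow above can be sketched in a few lines. The article does not name the learning algorithm, so this example substitutes a simple multinomial Naive Bayes classifier with add-one smoothing; the labeled messages, the label names, and the helper functions are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(labeled_docs):
    """Count words per label from a small hand-labeled sample."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen during training
            count = word_counts[label][tok] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-labeled sample standing in for the manual labeling step.
labeled_sample = [
    ("earthquake hits the city center", "news"),
    ("breaking fire reported downtown", "news"),
    ("flood warning issued for the river", "news"),
    ("love this coffee so much", "other"),
    ("watching a movie with friends tonight", "other"),
    ("my cat is so funny", "other"),
]
model = train(labeled_sample)

# Apply what was learned to messages the model has not seen.
first = predict(model, "breaking earthquake reported near the river")
second = predict(model, "my friends love coffee")
```

With this sample, the first unseen message is classified as "news" and the second as "other". A production system would need far more labeled data and richer features than single words.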

When a user poses a question or request in natural language, a linguistic analysis examines the phrase on multiple separate levels. The “syntax” level, for example, parses the phrase into its constituent syntactical elements, such as the subject, object, or verb. Another level of analysis, semantics, determines the meaning of a word – for example, “fire” could mean “combustion” or “dismissed from employment.”
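The “fire” example can be made concrete with a toy word-sense sketch. It assumes a simplified overlap method in the spirit of the Lesk algorithm, not the team's actual semantic analysis; the sense inventory, its cue words, and the function names are hypothetical.

```python
import string

# Hypothetical sense inventory: each sense lists words that tend to appear near it.
SENSES = {
    "fire": {
        "combustion": {"flame", "flames", "smoke", "burn", "building", "firefighters"},
        "dismiss from employment": {"boss", "job", "employee", "company", "manager"},
    }
}

def context_words(sentence):
    """Lowercase the sentence, remove punctuation, and return its set of words."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def disambiguate(word, sentence):
    """Pick the sense whose cue words overlap most with the sentence.

    On a tie, the sense listed first in the inventory wins.
    """
    context = context_words(sentence)
    senses = SENSES[word]
    return max(senses, key=lambda sense: len(senses[sense] & context))

sense_a = disambiguate("fire", "The boss decided to fire the employee.")
sense_b = disambiguate("fire", "Smoke rose from the fire in the building.")
```

Here `sense_a` resolves to "dismiss from employment" and `sense_b` to "combustion", because each sentence contains cue words from only one of the two senses.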

The system matches the request with information found in a variety of sources, including Web pages, blogs, and social media posts, and in a variety of formats, such as text, speech, video, and images.

“We’re interested in how we can discover information from heterogeneous sources,” said Ji. “We want information to come from multiple languages, multiple genres, multiple data modalities, and multiple documents. When we say ‘big data,’ we are thinking more of the diversity than size of the information.”

Finally, the system organizes relevant results into a table of information, similar, said Ji, to the summary tables found on many biographical and other Wikipedia pages. This table allows users to easily digest the information the computer provides.

Big Data, broad data, high performance computing, data analytics, and Web science are creating a significant transformation globally in the way we make connections, make discoveries, make decisions, make products, and, ultimately, make progress. Ji’s research, under the auspices of The Rensselaer IDEA, is part of the university-wide effort at Rensselaer to maximize the capabilities of these tools and technologies for the purpose of expediting scientific discovery and innovation, developing the next generation of these digital enablers, and preparing our students to succeed and lead in this new data-driven world.

Ji received bachelor’s and master’s degrees in computational linguistics from Tsinghua University, and master’s and doctoral degrees in computer science from the Courant Institute of Mathematical Sciences at New York University.

Press Contact: Mary L. Martialay