I recently watched a talk by Doug Lenat, CEO of Cycorp, Inc., about his company’s technology, Cyc. Cyc is a knowledge base that contains a large body of information and, more importantly, semantically meaningful links between those pieces of information. By contrast, Google collects data, lots of it, and trusts that the links people put in the data are enough to express some sort of meaning. But this means that Google can’t answer fairly elementary questions like “Is the CN Tower taller than the Sears building?,” even though it “knows” the answers to the questions “How tall is the CN Tower?” and “How tall is the Sears building?” You still have to get out a calculator and compute the difference yourself, a fact which on the face of it seems perfectly normal but in reality is ridiculous. The difficulty lies in the fact that Google doesn’t understand that “towers” and “buildings” have a property “height,” or that “taller” refers to a comparison between heights (or that towers are a kind of building, for that matter).
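To make the point concrete, here is a minimal Python sketch of what becomes possible once “height” is stored as a property of typed things rather than as text on a page. It is not how Cyc or Google work internally; the class names, the `is_taller` helper and the height figures (approximate, in metres) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Building:
    name: str
    height_m: float  # approximate figure, for illustration only


class Tower(Building):
    """A tower is a kind of building, so it inherits the height property."""


def is_taller(a: Building, b: Building) -> bool:
    # "taller" is nothing more than a comparison between two heights
    return a.height_m > b.height_m


cn_tower = Tower("CN Tower", 553.0)
sears = Building("Sears building", 442.0)

print(is_taller(cn_tower, sears))
```

The comparison works without any special case for towers, because the type relationship already tells the program that a tower has a height.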
So how is Cyc different? Well, Cyc is an effort to build a description of the kind of knowledge that makes interpretation of language possible. Lenat gives several good examples of where humans interpret language in very complex ways. For example, two apparently very similar statements – “Every American has a mother” and “Every American has a president” – express fundamentally different concepts. We would never interpret the first as implying that all Americans have the same mother, and yet that is exactly the meaning we infer when we read the second. Another example: “Dave’s friend Jim was skiing on the TV. The idiot wasn’t wearing a jacket” versus “Dave’s friend Jim was skiing on the TV. The idiot didn’t recognise him.” “The idiot” refers to a different person depending on the context, and it’s very easy for us to identify which it is. Computers, on the other hand, find it very difficult to process that kind of language. To be able to, they would need to know a vast amount about the world. Just think of a few things that contribute to your ability to understand the difference:
The amount of implicit knowledge that we use to understand the world, and especially language, is simply staggering. Lenat suggests that the reason so many previous “learning AI” projects have failed is that they never reached the point where the machine could make sensible inferences about the world, as it simply didn’t possess the knowledge required to make those inferences accurate.
So how is Cyc different and what does that mean for our data? Cyc is different because the fundamental emphasis is on relating concepts to one another: “towers are a type of building,” “Americans all share one president,” “every animal has a mother, which is a female animal.” It does this through its own predicate calculus, which relates types of object to one another. That last example, for instance, would look something like this. (If you’ve never seen lambda calculus or Lisp/Scheme before, this will look weird. Don’t worry, it’s not crucial to the article.)
(#$implies
  (#$isa ?A #$Animal)
  (#$thereExists ?M
    (#$and (#$mother ?A ?M)
           (#$isa ?M #$FemaleAnimal))))
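For readers more comfortable with an imperative language, the same rule can be read as a check over a small set of facts. This is only a sketch under stated assumptions: the dictionaries, the entity names and the `satisfies_mother_rule` function are hypothetical, and none of it is Cyc’s actual API.

```python
# Hypothetical toy facts; these names are illustrative, not Cyc constants.
isa = {
    "Lassie": {"Animal"},
    "LassiesMother": {"Animal", "FemaleAnimal"},
}
mother = {"Lassie": "LassiesMother"}


def satisfies_mother_rule(a: str) -> bool:
    """If `a` is an Animal, some M must be recorded with
    mother(a, M) and isa(M, FemaleAnimal)."""
    if "Animal" not in isa.get(a, set()):
        return True  # the implication holds vacuously for non-animals
    m = mother.get(a)
    return m is not None and "FemaleAnimal" in isa.get(m, set())


print(satisfies_mother_rule("Lassie"))
print(satisfies_mother_rule("LassiesMother"))
```

Note the difference in spirit: the sketch only checks facts that have been written down, so it fails for `LassiesMother`, whose own mother is not recorded. The rule as written asserts that such a mother exists whether or not anyone has recorded her.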
It gets far more complex than this, of course, and Cyc has a huge number of different types of assertion, most, I would imagine, expressing more subtle concepts than the rigorously mathematical version above. The interesting thing is that Cyc is now getting to the point where it can make sensible logical deductions about new concepts (from the Internet or from people) based on the knowledge it already has. Even that’s not so simple when you think about it. As Lenat points out in his talk, the Internet is often a bad place to find information about the real world: the fact that water flows downhill, for example, is so fundamentally obvious that most references on the Internet are to water flowing uphill, either as “magic” or as a metaphor. There are yet more idiosyncrasies in dealing with the real world: to zoologists apes are not monkeys, but to most of the world making that distinction would seem pedantic. Likewise, Lenat gives the lovely example of being asked by his wife to pick up “those red flowers that I love” from the florist’s. The fact that those red flowers she loves are poinsettias, and thus not actually “flowers,” is true but nonetheless potentially dangerous to marital stability.
But what’s really amazing is what you can ask Cyc. “Does Lassie have a nose?” “Can a can can-can?” – Cyc can answer both of these by simple logical inference from what it already knows. “What terrorist-related attacks were perpetrated on Muslim targets on Christian holy days in the mid-1990s?” – Cyc semantically “understands” each of the concepts in that question far more deeply than Google can at the moment, and can do all the (fundamentally algorithmic) cross-referencing for you, a fact that clearly appeals to the US military, who have apparently invested some $25M in the project.
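The Lassie question gives a feel for the kind of reasoning involved. The following is a minimal sketch, assuming a toy knowledge base in which parts are asserted on general types and inherited by everything more specific; the dictionaries and the `has` function are hypothetical and say nothing about how Cyc stores or searches its own knowledge.

```python
# Hypothetical toy knowledge base; names are illustrative, not Cyc constants.
isa = {"Lassie": "Dog"}                          # individual -> its type
kind_of = {"Dog": "Mammal", "Mammal": "Animal"}  # type -> more general type
has_part = {"Mammal": {"nose"}}                  # parts asserted per type


def has(individual: str, part: str) -> bool:
    """Walk up the type hierarchy until some type is known to have the part."""
    t = isa.get(individual)
    while t is not None:
        if part in has_part.get(t, set()):
            return True
        t = kind_of.get(t)
    return False


print(has("Lassie", "nose"))
```

Nobody ever stated that Lassie has a nose. The answer follows from “Lassie is a dog,” “dogs are mammals” and “mammals have noses,” which is the sort of chain a collection of web pages does not make explicit.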
What’s different about Cyc is that it has looked for meaning in the data, and more than that, it has attacked the complexity of the world head on. The Semantic Web is all well and good, but descriptions of the world in RDF are extremely restrictive – unfortunately, information is inherently more complex and subtle than we would like it to be. As Cyc learns and builds on its knowledge base, it is only going to become more and more intelligent, and far more likely to pass the Turing test than any of the “annoying chatbots” (Lenat again) that currently win the competitions. Who knows, perhaps Cyc will become a real-world Skynet?
Update: Just found this article from a 1990 edition of Communications of the ACM – haven’t had a chance to read it yet, but it looks interesting.
“Computers versus Common Sense” – A Google Talk by Doug Lenat, CEO, Cycorp Inc.
“Computer boffins pop AI’s $60m question” – 2002 IOL article on Cyc, archived on the OpenCyc website.
“OpenCyc Documentation” – user docs for OpenCyc.