Schrödinger's Web
Looking back at my post Perfect or Sloppy - RDF, Shirky and Wittgenstein and Danny's detailed response, wittgensteins-laptop (sorry you lost the original post, Danny), a couple of things are clear. I didn't do a good job of explaining what I think the issue is, and it came across as a bit "them and us" (not the intention).
Ian also made a good point. I should clarify that the issues that were bugging me are not with RDF itself but with the layers further up the Semantic Web Stack, specifically the logic and proof layers built on top of RDF.
I would like to describe how I understand the proposed Semantic Web Stack and ask the community how certain questions have been addressed. It may be that I misunderstand the vision, or that the questions I have have already been answered.
As I understand it, the RDF and Ontology layers allow graphs of statements to be made and linked together. Multiple descriptions of a concept can be made, and RDF allows inconsistency. The query layer allows portions of graphs to be selected or joined together, and the logic layer allows new knowledge to be inferred from the statements and questions to be answered using the mass of RDF statements. I understand this logic to be first-order predicate calculus (FOPC)?
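For illustration only, here is a minimal sketch of those two layers: statements forming a graph, and a query selecting part of it. I have used Python's rdflib purely as an example toolkit, and all the URIs and property names below are invented.

```python
# A minimal sketch of the RDF and query layers, using Python's rdflib.
# The namespace, URIs and property names below are invented for illustration.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/library/")

g = Graph()
# Two statements about the same work, the kind of graph the RDF layer holds.
g.add((EX.harryPotter1, EX.title, Literal("Harry Potter and the Philosopher's Stone")))
g.add((EX.harryPotter1, EX.editionCount, Literal(1)))

# The query layer selects a portion of the graph.
results = g.query("""
    PREFIX ex: <http://example.org/library/>
    SELECT ?work ?count WHERE { ?work ex:editionCount ?count . }
""")
for work, count in results:
    print(work, count)
```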
My concern is that the logic layer is very intolerant of inconsistency or error. From what I have been able to find, the proposed answer to this is either to limit the scope of the logic to trusted, consistent statements, or to require user arbitration of conflicts. This is the root of my concern: I cannot see how this is possible. Inconsistency is not just generated at the system logic or schema level; it is deeper. It is the necessary result of allowing multiple descriptions of the same thing.
Inconsistency will always arise whenever humans have to make classification choices. This was one of the points in my previous post.
Danny was quite right to point out that most software today requires consistency. We all know the lengths programmers go to to ensure consistency, and this is because programmatic methods are based on predicate logic. If a program enters an inconsistent state, usually that thread of execution must end. If the inconsistency is in persistent data, you are in real trouble, because restarting won't fix the problem.
Compilers enforce the consistency of the code, but the data in the system must also be consistent if programmatic logic is to be based on it.
Two principal methods are used to achieve this:
1. Limit to one description of an entity, i.e. no competing descriptions, e.g. one record per entity ID.
2. Fields marked as non-programmatic, e.g. free-text descriptions. The contents of these fields will not be used by the program logic; they are for human use only.
With this approach, any uncertainty in programmatic fields cannot generate inconsistency, principally because there is only one version of the truth, i.e. statements are orthogonal.
Now contrast this with the Semantic Web, where by definition you will be working with descriptions from many systems. Inconsistency will be a natural feature, not an error condition.
Note the fundamental nature of the inconsistency: it is not a property of the different systems. Two identical systems will still yield inconsistency, because it is a function of how people use a system, not of the system itself.
I confused the previous example by suggesting two different systems with slightly different schemas.
This time, consider two identical library systems, both of which have a schema with the concept of editions of a work, both of which are defined by the same RDF URI. In one system, the librarian considers it a new edition whenever there is any difference, such as the two different covers for the same Harry Potter, and catalogues accordingly. The librarian using the other system considers it a different edition only if the contents are different.
Now, taking descriptions from both systems, you will get an inconsistency. Does the work Harry Potter have 1 or 2 editions?
This is not something you can fix by giving a different URI to the editions concept in each system, because the inconsistency is the result of the classification decision made by that person for that record in that system at that time, i.e. it is not systematic. The result is that inconsistency will arise in an unpredictable way even between identical software systems with identical schemas. (It is one reason why integration of different systems remains a pain even if you use RDF.)
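To make the conflict concrete, here is a rough sketch of the two identical systems asserting different edition counts for the same work, and the contradiction surfacing when their descriptions are merged. Again, the toolkit (rdflib) and all the URIs are just my own illustrative choices.

```python
# Sketch: two identical systems describe the same work with the same schema URIs,
# yet assert different edition counts. All URIs here are invented for illustration.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/library/")

system_a = Graph()
system_a.add((EX.harryPotter1, EX.editionCount, Literal(2)))  # cover changes counted as editions

system_b = Graph()
system_b.add((EX.harryPotter1, EX.editionCount, Literal(1)))  # only content changes counted

merged = system_a + system_b  # graph union, as an aggregator on the Semantic Web would do

counts = set(merged.objects(EX.harryPotter1, EX.editionCount))
if len(counts) > 1:
    # Strict logic has no good answer here; the merged data is simply inconsistent.
    print("Inconsistent edition counts:", sorted(c.toPython() for c in counts))
```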
This inconsistency isn't a problem in its own right. But if layers of predicate logic are working off this data, they will become unstable very quickly.
My current understanding is that the SW community is suggesting either that inconsistency is avoided (how, when it is a fundamental result of allowing multiple descriptions of the same thing?) or that the system should ask a user at that point to arbitrate (on what basis should they choose one over the other? They are both right).
It strikes me that if inconsistency is fundamental then it should be treated as such, not something to be avoided.
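As a purely illustrative sketch of what treating inconsistency as fundamental might look like, conflicting statements could be kept and turned into a distribution of belief rather than rejected. The weighting scheme below is my own invention, not a proposal from the SW community.

```python
# Illustrative only: instead of rejecting conflicting statements, keep them all
# and give each asserted value a belief proportional to how many sources assert it.
from collections import Counter

def belief_distribution(assertions):
    """assertions: values asserted by different sources for one property of one thing."""
    counts = Counter(assertions)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Two library systems disagree about the number of editions of the same work.
print(belief_distribution([1, 2]))        # {1: 0.5, 2: 0.5} -- a superposition of answers
print(belief_distribution([1, 1, 1, 2]))  # {1: 0.75, 2: 0.25} -- more sources, sharper belief
```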
Isn't the SW approach today, based on predicate logic, simply using the wrong maths? Just as the AI community was before it embraced fuzziness, uncertainty and statistics? Or the classical physics community before quantum mechanics?
That transition saw AI moving from "programming" AI systems with rules and logic to creating learning systems that needed training.
It seems to me that the internet has 1 billion users capable of training it. We see examples of this in things like Google spell checking which, rather than relying on a traditional dictionary, is based upon what people type and then retype when they get no results. When a spelling suggestion is given, the user choosing it provides further feedback, or training, as to what is and is not useful spelling help. This turns out to work much better than the programmed approach. Other examples that spring to mind are del.icio.us and Flickr.
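As a toy sketch of the "train it with user behaviour" idea (this is not how Google actually implements it, just an illustration of learning corrections from retypes instead of from a dictionary):

```python
# Toy sketch of learning spelling suggestions from user behaviour rather than a dictionary.
# The retype log and words are made up; this is not how Google's spell checker is built.
from collections import Counter, defaultdict

retype_log = [
    ("recieve", "receive"),
    ("recieve", "receive"),
    ("recieve", "recipe"),
    ("teh", "the"),
]

corrections = defaultdict(Counter)
for typed, retyped in retype_log:
    corrections[typed][retyped] += 1  # every accepted retype is a training signal

def suggest(word):
    # Suggest the retype users most often settled on, if the word has been seen at all.
    if word in corrections:
        return corrections[word].most_common(1)[0][0]
    return word

print(suggest("recieve"))  # receive
print(suggest("teh"))      # the
```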
Realising that a work has both 1 and 2 editions at the same time seems to me to be exactly the position classical physics found itself in at the birth of quantum physics. The maths of classical physics could not cope with particles being at several locations at the same time. Neither could the classical physicist!
A new maths was required, one based upon uncertainty and probability. This maths is very well understood and forms the basis of solid-state physics, upon which electronic engineering is based, upon which of course the computer is based!
So I guess my question is this: is the logic layer intended to be FOPC, and if so, why? Who is ensuring that the SW community isn't falling into the same traps the AI community did? What can be learned from the AI community?
What is the problem with using probability-based maths? It works for physics!
Maybe all of these have good answers. If so, I wasn't able to find them easily. Or is my understanding of the logic layers wrong?
Please let me know.
References (1)
Response: Schrödinger's Web
It seems to me people will eventually accept living with the notion of subjects being superpositions of mutually inconsistent states, but going further to admit that observed properties of a subject in a given representation context are probabilistically determined would certainly prove at least as difficult as it has been in physics.
Reader Comments (18)
>> Is the logic layer intended to be FOPC and if so why.
There's no single "logic layer"; that's really just a concept to shoot for. It's important to think of the Semantic Web as creating potential - it's a model and a set of languages in which to express knowledge.
In practice there are many types of processing that might be used according to application needs, as well as many software frameworks to use. FOPC is really quite limited in real-world terms: nothing time-varying, no degrees of true/false.
You might express your data in one of the OWL ontology languages and then do logic processing. You may also process your data in completely different ways, say by solving linear constraints, etc. It's up to application developers to do hard things - SW technologies don't make it easy, just interoperable.
Also, the application defines a domain of particular types and sources of data; the Semantic Web doesn't imply trying to process all the inferences of the whole web. The application must aggregate and cleanse the data to its ends. So if the application is working in a domain where 1000 people say a resource is "good" and 100 say "bad", that's an application issue, not a difficulty for some specific SW logic layer.
Thanks for your comments. Very interesting.
I think you misunderstand the example though. This is not extra information. It is contradictory information.
System 1 says there is a total of 1 edition for the work; system 2 says there is a total of 2 editions. However the publisher chooses to assign ISBNs, it is not relevant to the example. It is a distraction from the essence of the example, though; perhaps I should find another example.
From the sound of your description of the logic layer, I might have got the wrong impression. It would seem much clearer to me to call it an inference layer, with the inference methods not being defined. Logic is quite a restrictive label.
Here, two identical systems are used by many different staff. For the most part, 95% of the edition descriptions are consistent, because the concept of edition used by each librarian is very similar; they have both been on the same training course.
In using the data from these identical systems, all you have is a set of descriptions and the description source (system A or system B). Most of these descriptions match, but some of them are inconsistent. On what basis do you ascribe a different URI? The nature of the conflict is not apparent without referring to the actual work and seeing why the copies have been classified differently. And most importantly, one person using system A might have used a stricter form of the concept Work, but another user of system A (the same system) may have used a wider concept of Work on other records. The concept used to make classification decisions is in the person's head, so the conflicts arise differently record by record, depending on the users who catalogued that work, not at the system level, i.e. the conflict can be of opposing types for each record.
In order to refactor the URIs used, you need to refer to more than just the statements that conflict to understand the classification conflict; you have to look at the actual work to see that one classification is stricter than another. Does each conflict require a manual intervention in the system that is using these records? Won't this generate an enormous number of sub-classifications of edition, trying to capture all the possible blurry edges?
Imagine the whole install base of that library system, perhaps a few hundred sites. Now there is a lot more inconsistency. This is what I don't understand at the practical level. How are inconsistencies resolved, when they are due to individual user actions and are not consistent even within one system?
If the classification system available to users is not sufficiently constrained or if the users disagree about classification then it's time for more design thinking.
Ian and I are saying that there are solutions to the problem in a well designed Semantic Web app.
This is very far from being hypothetical.
I am the CTO for a library systems company. We have 112 installed systems in which highly trained cataloguers enter bibliographic metadata for the books on the shelf.
We also have a union database which aggregates all these descriptions from each system.
I can tell you that even with one of the world’s oldest classification systems and highly trained staff there are many inconsistencies.
Inconsistency is not a problem if the data is for human consumption; we are very good at dealing with it. But if you want to build programmatic logic off it, you are in trouble.
Individual computer systems remove inconsistency by not allowing multiple descriptions of the same thing. This is the fundamental difference with the Semantic Web.
The Semantic Web by definition spans multiple systems, each of which may provide different descriptions of the same thing.
The semantic web must allow inconsistency.
Your suggestion is that the Semantic Web only works where inconsistency can be removed, i.e. that the Semantic Web is extremely brittle.
Finally, how does a well-designed system ensure that everyone, including users of another system, makes the same choices when applying a classification? That would require everyone to share exactly the same conception of the meaning of the words in the classification. Believe me, if you could see the size of the cataloguing guidelines manual, you would realize that it is not possible. So inconsistency is not something that a well-designed system can remove, because it cannot control users' actions.
I very much like the notion of a subject (aka resource, concept, thing) as a superposition of various quantum states, and actually it's quite close to the notion of "hubject" in the above paper. The slight difference is that in the hubjects proposal there is not really a notion of inconsistency, since different representations are supported by somewhat orthogonal frameworks. Justin points out that even in the same framework, representations of the same thing can be inconsistent. But in this case, it might be that the underlying assumption of that "sameness" is what needs to be revisited.
Thanks for your comments. Your link doesn't seem to work and I would very much like to read the paper.
I was thinking along the lines of a computation being the equivalent of an observation, in that it must collapse the superposition and produce a definite answer.
It seems to me that an approach based on probability naturally gets better at larger scales, whereas one based on FOPC will become more unstable as you go to larger scales. Unfortunately, I expect most developers have no experience of maths other than FOPC, so probability seems very alien. In fact it is the basis of all modern physics!
It should also be noted that flight control systems in aircraft use some of these approaches. They have four identical systems and require 75% consistency of data before acting (not 100%).
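A crude sketch of that kind of voting, using the 75% threshold from the aircraft example and otherwise invented details, might look like this:

```python
# Sketch of acting only when enough redundant sources agree, as in the aircraft example.
# The 75% threshold comes from that example; everything else is invented.
from collections import Counter

def agreed_value(readings, threshold=0.75):
    """Return the majority value if at least `threshold` of the sources agree, else None."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    return value if count / len(readings) >= threshold else None

print(agreed_value([1, 1, 1, 2]))  # 1    (3 of 4 sources agree, 75%)
print(agreed_value([1, 1, 2, 2]))  # None (no value reaches the threshold, so don't act)
```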
"consider two identical library systems, both which have a schema with the concept of editions of a work , both of which are defined by the same RDF URI."
You are saying there is a concept called work identified by a URI. This concept is used by two identical systems. In your reply to my comment you say
"In using the data from these identical systems all you have is a set of descriptions and the description source (system A or system B). Most of these descriptions match but some of them are inconsistent. On what basis do you ascribe a different URI"
Here you are talking about URIs for the descriptions of actual works. Which type of URI are you debating in your post?
You then go on to say
"...one person using system A might have used a stricter form of the concept Work but another user of system A (the same system) may have used a wider concept of Work on other records"
Don't these people read manuals or the help file for their system? Imagine a relational database with a column for entering the postal town of an address. I might choose to enter Rothwell, which is the name of my town, even though I know that the postal town is actually the larger town of Kettering. I don't claim to live in Kettering, so I don't like putting it in my address. Perhaps if you were to publish a manual for this address system that says "Only enter town names corresponding to those approved by the post office", then I would be in the wrong. If there are no such guidelines, then you are going to get bad data.
In an RDF-based system the property "postalTown" would have a URI. If I visited that URI with my web browser, I would expect to see the definition of that property, hopefully as stated above. If my RDF-aware computer program visits the property URI, it should get some RDF that documents the property. This is a key difference between an RDF-based system and an XML one: given an XML element, how can a human find out what it means? (Hint: here's one to try it on:
I fail to see how the problem you describe is an RDF or even a semantic web one. Neither of those technologies claim to solve the problem of people differing in interpretations. They do claim to offer a solution for the problem of two people who wish to agree on an interpretation and use it to share information.
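As a rough sketch of that dereferencing step, an RDF-aware program could fetch the property URI and read whatever documentation is published there. The toolkit (Python's rdflib) and the URI below are my own illustrative assumptions.

```python
# Sketch: an RDF-aware program dereferences a property URI and reads whatever
# documentation is published there. The URI is invented; a deployed vocabulary
# would serve real RDF at its property URIs.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

prop = URIRef("http://example.org/address/postalTown")

g = Graph()
try:
    g.parse(str(prop))  # fetch the RDF the server publishes at the property URI
    comment = g.value(prop, RDFS.comment)
    print(comment or "No rdfs:comment found at this URI")
except Exception as exc:
    print("Could not dereference the (invented) URI:", exc)
```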
I very much agree with the notion of computation (at large) as a semantic collapse. Of course my background in maths, physics and astronomy helps :))
The link to the hubjects paper did not work in my previous post because your too-smart formatting tool added the final "." of the sentence to the URL... The following should work:
http://www.mondeca.com/lab/bernard/hubjects.pdf
BTW, I intend to publish an updated version in the coming days, including new aspects, and will certainly add a dash of quantum sauce. Stay tuned at universimmedia for announcements.
The URI with the definition of Edition on it is only a description; it is not an authority. I cannot show two physical books to the URI and have it tell me whether they are the same edition or not. But that is what the post office does with your address.
I am not trying to claim the SW should solve the problem of people differing in interpretations. I am saying that people's interpretations will always differ, that this will create inconsistency, and that this should not break Semantic Web applications. It should be a fundamental feature of the Semantic Web. If the Semantic Web's upper layers are built on FOPC, then the result is a very fragile system, because logic doesn't deal with inconsistency very well.
"Don't these people read manuals or the help file for their system?"
If the Semantic Web solution to these inconsistencies is better user training, then I think that is a real issue. Surely an approach that is less sensitive to inconsistency is better than trying to change user behaviour?
So what would you write in a description of the concept Edition so that thousands of librarians who read it, when faced with two books on their desk, one with a slight alteration, would make the same classification decision? Some will be better trained than others, some will have English as their second language, and some just won't give a damn.
There are millions of possible alterations between copies of a work; which ones should be considered a new edition (from different printing issues, to CDs stuck on, to different cover sizes, to different paper weights)? Some are obviously one or the other, but the edge between the two is where inconsistency arises. For example, printers may start putting the author's web address on the back; does that require a new edition?
English is extremely ambiguous, but more fundamentally, nobody shares exactly the same understanding of any single word of English (this is Wittgenstein's point). The edges of each and every concept in the language are different for each person. Note I said edges.
If this weren't the case, we wouldn't need lawyers to draw up contracts and then fight over the meanings in court when there is a disagreement.
If you were to give 10 people the same descriptions of Happy Face and Sad Face and then show them 100 pictures of faces that are somewhere in between, do you think they would all answer the same way, no matter how good the descriptions? This is because the continuous world has been forced into one of only two states, and the edge between them is where inconsistency arises in classification.
It is the edge between concepts that causes the problem, so if you try and break the classification down into smaller and smaller pieces you simply create more and more edges and will in fact get more inconsistency.
What I don’t understand is why the Semantic Web community appears to be bent on removing inconsistency when it is obvious our world and computer systems are full of it. Why can’t the Semantic Web include inconsistency as a fundamental part of it?
>> community appears to be bent on removing inconsistency
>> when it is obvious our world and computer systems
>> are full of it.
I think you don't understand why because your premise is not true.
Take a look at http://www.w3.org/DesignIssues/RDFnot.html and its links.
That's one of the foundation documents for the SW from 1998 and TBL clearly says
"I am not proposing any FPOC or HOL inference engine." and
"The goal of the semantic web is to express real life. Many things in real life, real questions which we will face are not efficiently computable."
and the solution is to
"Create subsets of the web in which specific constraints give you specific computational properties."
Please provide references to back this up.
Incorrect. The post office is one entity that can define that term. There are many others. I have my own definition that I described above. All these can co-exist. But I shouldn't say my concept of postal town is the same as the post office's concept that they label as postal town unless I'm willing to allow fuzziness. Most SW applications embrace that fuzziness - it's good enough for most applications.
Communication is negotiation. No two people share identical conceptualisations of their perceived reality. Evidently this doesn't preclude sharing information about the world. We can use analogies, similes and metaphors to suggest how our concepts relate to others'.
I'm using a metaphor now to attempt to communicate my concept of RDF to you. RDF has similar mechanisms to those I describe above - by design. It denotes concepts by URIs, and anyone can have URIs (concepts) which are distinct from everyone else's. It allows relations to be asserted between URIs, such as equivalence and type (analogies, metaphors).
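A small sketch of those mechanisms: coining distinct URIs and asserting relations such as equivalence between them. The URIs below are invented, and rdflib is just one convenient way to show it.

```python
# Sketch: two parties coin their own URIs for "postal town" and then assert how
# the concepts relate, rather than being forced to share one definition.
# All URIs below are invented for illustration.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDFS

MINE = Namespace("http://example.org/justin/terms/")
POST = Namespace("http://example.org/postoffice/terms/")

g = Graph()
# Loose relation: my concept is related to theirs, but not claimed to be identical.
g.add((MINE.postalTown, RDFS.seeAlso, POST.postalTown))
# Strong claim, made only if I accept the fuzziness of treating them as the same concept.
g.add((MINE.postalTown, OWL.sameAs, POST.postalTown))

print(g.serialize(format="turtle"))
```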
You sadly miss the point completely.
You are seeking a computational solution to a non-computational problem. Some things are better left to human judgement.