Strategies for Successful Development

Process and technology for small teams building real software


URLs, References, Shallow Copies, Deep Linking, and Wikipedia

Posted by S4SD (Seth Morris) on 2009/04/19


I have a friend who’s a relatively junior programmer; the other day he was asking me about copying reference objects in .NET. References, pointers, value/stack objects, and the difference between deep and shallow copy semantics are things no language has really gotten right, and they are a source of bugs, performance and memory hits, and time lost writing test code on many, or maybe most, projects.

.NET makes it both easier and harder with the clear distinction between reference types and value types. It’s easier in that the behavior is clear if you know which type you have (although each can contain members of the other type), but harder in that far too many developers don’t really know the difference. Also, the notion of boxing comes up in all the wrong contexts: it isn’t really important as a performance consideration in most real-world applications, but it does hide logic errors that are unlikely to be detected by unit tests, since the developer writing the unit tests didn’t understand the distinction in the first place.

The second-worst language I’ve used for reference vs. value confusion and the need to code review shallow vs. deep copy semantics in every function is JavaScript/ECMAScript. The language works fine, but for some reason even extremely careful developers seem to get lost in the almost-scripting and almost-object-oriented[1] language when combined with multiple, almost-stateless frames in a large application.

But it’s the worst language that causes the most concern: HTML. And here we have to go into an issue of overloaded operators: what is a "reference"?

  • In software, a reference is a stand-in name for something. In lower-level languages it’s usually backed by a pointer to memory and in higher-level languages it is likely to be an index into an object map, but in either case you can have multiple pieces of code using the reference and all of them are accessing the same object instance. If one function changes the object, subsequent accesses will see the latest changes.
  • In HTML, a reference is an HREF: a Hypertext REFerence. It is a stand-in name for a document (or a location in a document). HTML makes no assumptions about whether a document’s content is stable. In fact, most pages on the web are assumed to change, with comments added to blogs, forum posts edited, online catalogs updated, etc.
  • In nonfiction or scholarly writing—including encyclopedias—a reference is a stand-in name that identifies not only data, but who takes responsibility for that data and exactly which version of the data is being referenced. This is so important in that context that scholars and journalists have defined standards for representing references consistently.
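
The software sense of "reference" can be sketched in a few lines of Python. This is only an illustration: the names `page` and `bookmark` are made up for the example, and a plain dict stands in for any mutable object.

```python
# Hypothetical names; the dict stands in for any mutable object.
page = {"title": "Reference", "body": "first draft"}

# Assignment copies the reference, not the object behind it.
bookmark = page

# A change made through one name...
page["body"] = "edited by someone else"

# ...is visible through the other, because both names refer to one object.
print(bookmark["body"])
print(bookmark is page)
```

Nothing was copied here except the name, which is the behavior the first bullet describes.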

The Wikipedia Problem

Which brings us to Wikipedia. And to a way to teach references and values that might resonate with "kids these days."[2]

It’s common to link to wikis, especially to Wikipedia, when explaining or defining things. Wikis are great: they facilitate communication, collaboration, and community and they are a good way to get the most-wanted content in a reference populated first. They have their limitations, but every tool does.

The problem with linking to Wikipedia (or another wiki, although to a lesser degree with slower-changing wikis) is that HTML links are true, blind references. You get whatever is there right now, and Wikipedia is known for (often transient) spam content and inaccuracies.

In software, the solution is a deep copy: actually copy the content to your own site and reference that (probably with a link to the current Wikipedia content). It isn’t a bad choice, but it can get tedious. Another solution is to link to the Wikipedia article history. For example, the article’s ordinary URL is a link to the Wikipedia article on References (of several kinds). Looking at it now, as I’m writing this, I see several sections I would love to comment on here, and several sections that have tags indicating they are likely to change soon. If I use that link, my comments are likely to be irrelevant later. I’m keeping value data (my text) that doesn’t correspond to the reference data (the content beyond the link). The article’s permanent link, by contrast, is the link to the version I see while writing this[3]. If I want to make comments, or if I am worried that the page will be modified in some way I don’t want (say, a page prone to political, religious, or commercial spam), Wikipedia promises that this URL will be stable. In the language of C++, it is a "const reference."
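
Here is a rough Python sketch of the difference between the two kinds of link. The `WikiPage` class, its methods, and the sample text are all invented for the illustration; this is a toy model of "newest revision" versus "one fixed revision," not how Wikipedia is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class WikiPage:
    """A toy wiki page that keeps every revision of its text (hypothetical)."""

    revisions: list = field(default_factory=list)

    def edit(self, text):
        """Store a new revision and return its revision number."""
        self.revisions.append(text)
        return len(self.revisions) - 1

    def current(self):
        """What an ordinary link shows: whatever is newest right now."""
        return self.revisions[-1]

    def permanent(self, revision):
        """What a permanent link shows: one fixed revision."""
        return self.revisions[revision]


page = WikiPage()
seen_while_writing = page.edit("References are stand-in names.")
page.edit("BUY CHEAP WATCHES")  # a later, unwanted edit

live_view = page.current()                        # follows the latest edit
pinned_view = page.permanent(seen_while_writing)  # still the text I commented on
```

The ordinary link now shows the spam; the pinned one still shows the text my comments were written against.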

Of particular interest to me, writing blog entries, is the References[4] section of the Wikipedia page. This is the section where a wiki article contains external reference data (the rest of the page is value data and internal references to other Wikipedia pages). I’m likely to want my readers to see the references. Because that section is very likely to change, the purportedly-stable link lets me refer to these references-in-my-reference safely. But notice that I don’t know if those objects (the pages behind links in the References and External Links sections of a Wikipedia page, even a stable one) are stable, although in some cases I can make a good guess (links to PDFs of published papers, for example, are likely to be very stable).
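
The same caveat shows up in code. As a loose analogy (my choice of illustration, with made-up names), a Python tuple behaves a little like the stable page: its own slots can’t be reassigned, but anything it refers to can still change underneath it.

```python
# Hypothetical data: a mutable "external page" that the stable page links to.
external_page = {"url": "paper.pdf", "body": "as published"}

# The tuple plays the stable revision: its slots cannot be reassigned.
stable_page = ("article text as of this revision", external_page)

try:
    stable_page[0] = "vandalised"
    rejected = False
except TypeError:
    rejected = True  # tuples do not support item assignment

# The linked object is not protected by the tuple's immutability.
external_page["body"] = "quietly revised"
seen_through_stable_page = stable_page[1]["body"]
```

The stable page itself never changed, yet what a reader sees by following its link did.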


"So this is all interesting, but what is the point?" There are two lessons here, one for passing on information (in blog posts, emails, twitters, etc.) and one for programmers, especially of the more-junior sort.

When you’re passing on information:

  • Know if you’re giving someone a reference or a value. Copy information you need stable and available, just as you would in code.
  • Don’t link to Wikipedia’s "current page" unless that’s what you mean.

And for coders, if you understand hyperlinks, you understand references and values:

  • A reference is like a hyperlink: It goes to something else and you don’t own it.
  • A value is like content in a page: It’s owned by the page and no one else can change it unless they can edit the page.
  • A reference copy is copying the URL: it’s still a reference to the same object. If someone edits the page, you will get new data the next time you access it.
  • Remember that a reference may contain references: Pages can contain links to other pages.
  • A shallow copy is copying a page: If the page had references (links) on it, your new copy has those same links.
  • A deep copy is a web spider: You can copy a page and every page it links to (and every page they link to, and so on). If you do that, you have a new, unchanging page and no one can even see it unless you tell them the URL.
  • A const reference is a link to a page that doesn’t change: But it may link to pages that do change!
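
For the coders, the list above maps fairly directly onto Python’s standard `copy` module. This is a minimal sketch with invented data: `page` and `linked` are hypothetical stand-ins for a page and a page it links to.

```python
import copy

# A "page" that contains a "link" to another page (both hypothetical).
linked = {"title": "Pointer", "body": "original"}
page = {"title": "Reference", "links": [linked]}

ref_copy = page               # copying the URL: same object, no copy at all
shallow = copy.copy(page)     # copying the page: new dict, same links inside
deep = copy.deepcopy(page)    # the web spider: the page and everything it links to

# Someone edits the original page and the page it links to.
page["title"] = "Reference (computer science)"
linked["body"] = "edited"

print(ref_copy["title"])             # follows the edit to the page
print(shallow["title"])              # keeps the old title...
print(shallow["links"][0]["body"])   # ...but its link leads to the edited page
print(deep["links"][0]["body"])      # the spidered copy is untouched
```

The shallow copy is the one that surprises people: the page is yours, but the links on it still lead to objects someone else can change.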

1) ECMAScript is an object-oriented language, and whether JavaScript in a web page is "scripting" in the usual sense is an interesting question, but most interesting JavaScript applications are "almost" scripting and "almost" object oriented. And yes, those two are different axes; there are many excellent object-oriented scripting languages.

2) I’m feeling old. I was listening in on a discussion this week between a Senior Architect and an Architect where the Architect didn’t seem to be following some of the more fiddly details; the Senior commented to me later that the Architect may never have written a WNDPROC.

3) You can get this from the History tab, but it’s probably better to click the "Permanent Link" or "Cite this Page" item in the "Toolbox" section of the navbar.

4) At this time, the relationship between the "External Links" and "References" sections of the Wikipedia page template is unclear and apparently undocumented.

Listening to: The Alan Parsons Project – Best Of – Damned if I Do


Posted in Opinion