The Internet Archive Wayback Machine and Copyrights
Posted by admin on October 29, 2011 in Uncategorized
“Most institutions cannot touch this because it hits every privacy, copyright, and export controversy.” – 1996 quote from Brewster Kahle, Founder of the Internet Archive.
The Internet Archive’s Wayback Machine is a great resource where many Internet sites are archived, but what about the copyright issues? About 10 years ago I found they had republished my web sites without permission, so I made a removal request. I was surprised that they kept pushing back on my requests and told me to use the “robots.txt” method of removing material instead. They certainly have no legal authority to demand changes to a robots.txt file, and many people who publish blogs and the like do not have access to change the file. Also, according to their latest tax returns, they generated over $8 million in revenue for “providing web crawling and hosting.” It is unclear where all this revenue comes from, but it looks like they are selling the data.
The Internet Archive apparently refuses to publicly discuss many of their policies because there is no legal basis for what they are doing. The Internet Archive has an article on their web site called Internet Archives and Copyright, but it never actually discusses the operation of the Internet Archive, robots.txt, removals, or posting all this material without permission. See the paper Proceed With Caution: How Digital Archives Have Been Left in the Dark, which states:
Digital archives face many legal barriers, including practically perpetual copyright terms in the material they include, an uncertain fair use doctrine, a chaotic licensing scheme, and a proliferation of online contracting that threaten archivists’ efforts to construct comprehensive digital libraries.
Also see Authors v. Archivers: The Copyright Infringement Battle Over Web Pages.
The Internet Archive Wayback Machine links to their “Recommendations for Managing Removal Requests And Preserving Archival Integrity,” but the policy makes no sense. They recommend that the “robots.txt” file be used to exclude sites, yet they also say that removal requests “will not be made public.” Placing something in a web site’s robots.txt file automatically makes it public. The policy goes on to state that if the robots.txt method cannot be used, then a removal request should be sent. I submitted a large number of web sites to be removed, but they kept finding excuses not to honor the request and kept insisting on the robots.txt method. The reason appears to be that once the restrictions in the robots.txt file are no longer there, the site automatically reappears in the archive and all the old data they collected is restored. Under the manual removal method nothing is restored until it is manually added again. The Internet Archive apparently refuses to honor requests to remove the old data, but this is unclear. Their removal instructions state a removal request will “remove documents from the Wayback Machine,” yet the Internet Archive refers requests to private e-mail and will not publicly explain how the data is handled.
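To make the robots.txt method concrete, here is a minimal sketch of what an exclusion entry looks like and how a crawler that chooses to respect it would read it. The user-agent name "ia_archiver" is my assumption of the name the Archive's exclusion instructions used at the time, the example.com URL is a placeholder, and the helper function is hypothetical. This is not the Archive's actual code, only an illustration using Python's standard robots.txt parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. Anyone can fetch this file from a site's
# root, so the exclusion request it contains is visible to the public.
# ASSUMPTION: "ia_archiver" is the user-agent name the Archive asks sites to block.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, for illustration only.
PAGE = "http://example.com/page.html"

archive_allowed = is_allowed(ROBOTS_TXT, "ia_archiver", PAGE)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", PAGE)
# With an empty robots.txt (the entry removed), nothing is excluded any more.
after_removal = is_allowed("", "ia_archiver", PAGE)

print("archive crawler allowed:", archive_allowed)
print("other crawler allowed:", other_allowed)
print("archive crawler allowed after entry is removed:", after_removal)
```

The last check shows why this method worries me: the exclusion lasts only as long as the entry stays in the file, and the file itself is a public statement that the site owner wants out.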
Upon further investigation, I found that the discussion board on their own web site is full of removal requests that are apparently not honored. One person is even offering a free service to help people file legal notices in the form of Digital Millennium Copyright Act takedown notices. However, on their bulletin boards about feature films, removal requests are honored almost immediately.
The information in the Internet Archive Wayback Machine is being used for all sorts of purposes. Unfortunately, some unscrupulous attorneys are using the information to file frivolous disputes, for example through the domain dispute mechanism. This dispute mechanism was developed and is administered by trademark attorneys. In some cases companies demand domains that were registered years before the company even existed, and they sometimes use (unverified) Wayback Machine printouts as “evidence.” I mentioned this in one of my removal requests. They have now latched onto this and send me long-winded responses about evidence and subpoenas without ever addressing the fact that they are not honoring their removal policy or that they never had permission to post the material to begin with. They use the classic “wear down” technique and hope complainers give up trying to make removal requests.
I have contacted the Internet Archive Wayback Machine and some of the people who initially set it up and asked about the copyright issues. I asked them to explain why it is alright for them to do this while other people face having their Internet connection turned off, lawsuits, or even arrest for republishing material protected by copyright. I either don’t get a response or they mumble something nonsensical. It is clearly a touchy subject they don’t want to address.
As for removals, they simply won’t honor some requests and they do not follow their own removal policy. If you contact them they often act as if they never heard of these issues and use that as an excuse to delay a response. Some of their excuses (and replies) are discussed below:
- Use the robots.txt removal mechanism: Websites cannot be compelled to publish anything just to protect their copyrights, and the Wayback Machine has no right to republish the information without permission. Publishing a robots.txt file requires websites to make their removal request public, which contradicts the Internet Archive’s stated policy.
- Internet Archive data is not used for commercial purposes: The fact is the Internet Archive is sponsored by Alexa and other private companies, and the Archive has run ads in the past for the “Alexa toolbar.” Alexa is somewhat like the Nielsen ratings for web sites, and they provide commercial services. The Internet Archive will not explain the exact relationship with Alexa or how the data is used.
- Internet Archive is on a tight budget: I checked several years ago and the Internet Archive had many millions in a trust fund. The Internet Archive also spent money on legal fees fighting a lawsuit (see http://webdev.archive.org/post/119669/lawsuit-settled) rather than simply honoring their removal policy.
- Removing a domain will remove all content for the domain forever: The removal request can always be reversed.
- The Internet Archive asked me to confirm I own the copyrights on pages that display a copyright notice. Apparently the Internet Archive will not accept a copyright notice posted on a web page as sufficient.
- The Internet Archive is now demanding I certify there is no pending or foreseeable legal action before they will honor the removal request. This was because I mentioned that the Internet Archive was being used to file frivolous legal actions. The Internet Archive claims they don’t want to be accused of concealing evidence (but they don’t say anything about being accused of stealing material covered by copyright). In any case the officially posted policy states “Internet Archive has no interest in including materials in the Wayback Machine of persons who do not wish to have their Web content archived.” However, the real policy of the IA staff is that removals reduce the integrity of the archive and make it less complete. As a result the staff has an interest in denying removal requests.
- Someone made an inquiry about whether the Internet Archive kept a copy of archival material after a removal request was made. He complained that he received conflicting replies from Internet Archive personnel. A legitimate archive or library would have no problem posting such a policy. Shockingly, the Internet Archive refused to comment publicly and claimed the complainer was the only person who ever asked that! Such a claim is not credible. This is an example of the Internet Archive operating like a private company that holds data and doesn’t want anyone to know what it is doing with it.
- The Internet Archive apparently argued in court that they are not bound by their web site policy under contract law because, in that specific case, no contract was formed. Companies often claim they are not contractually bound by their web site policies when someone merely visits a web site because no exchange of consideration takes place. (Microsoft, Cisco and even the TRUSTe privacy seal program have all argued in court that companies are not contractually bound to honor their privacy policies merely because someone visits their web sites.) It may be advisable to sign up for an account at the Internet Archive if you wish to enforce their terms of service.
Read about a lawsuit that got complicated because the robots.txt exclusion was not honored promptly.
It is interesting to note that on their feature films bulletin boards, when they get a notice of an official copyright registration, the material is removed almost immediately. When an individual posts a similar request, they sometimes get stonewalled or get unclear, incomplete, or evasive responses. If questions are raised, the individuals are often stonewalled again. For institutions like the Library of Congress there is recourse under the Freedom of Information Act, but a private entity like the Internet Archive does not have to answer. It looks like they operate by immediately complying with requests from large copyright holders to avoid litigation, but they roll over the little guy who can’t afford litigation and say nothing when questions are asked. They just sit back, call themselves the Internet Library, and allow others to ridicule anyone who complains. They spice it up with insinuations that people who complain about material not being removed must have something to hide or could be criminals.
A legitimate Internet Library would fully disclose all its policies and encourage a debate of the legal and operational issues. The US Library of Congress responded to my inquiries about the Internet Archive. They were able to fully disclose what they are doing and pointed me to their archives of specific subjects at http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html. If you click on any archive you will see an explanation of the copyright issues. They also maintain a FAQ that explains what they are doing at http://www.loc.gov/webarchiving/faq.html#faqs_06.
