Tuesday, November 22, 2005

how google print could help fight plagiarism

there's been a lot of chatter in the past couple months about google print.

the idea is this: google wants to scan millions of books using OCR technology and create a massive index of book content. users could search this index, and google print would return abstracts of books that fit the search query, along with short excerpts from the books. if users like what they see in the excerpt, they can go to amazon or their favorite brick-and-mortar store to buy the book. publishers could mandate how short or long the excerpts from their books would be, or could opt their books out of the whole thing.

this would be a fantastic tool for helping readers find books. struggling creators know that the biggest threat to their livelihood is obscurity. thousands of books are published every year, with countless older books in the back catalog (we call 'em "back list") waiting for new readers. the majority of these books vanish into obscurity, to be read only by a tiny minority. google print would help readers find books they would like, and thus it would sell more books.

but a lot of big publishers and celebrity authors don't see it that way. in fact, they want to sue google print for using their copyrighted material without permission. really they just want to be the ones to control any indexes of their content, in the event that they someday get off their asses and implement something similar (amazon's search-in-a-book feature is similar, but is opt-in and contains a fraction of the number of books google print would contain). but that's what search engines do: they index content, without asking permission first. if google print is illegal then all search engines are illegal.

the publishers' arguments are somewhat disingenuous and require some logical contortions. in effect, to believe the publishers' arguments, you must accept that google is lying. you might hear the canard that google wants to "give away our books for free", which is ridiculous since google makes it clear that only short excerpts will be offered to readers—you won't be able to read the da vinci code on google.com. or maybe you'll hear the "they want to make money off our content" line, despite the fact that google says google print will not feature advertisements.

i hadn't posted about this to date, despite it being a hot IP story, despite working in the publishing industry as i do, because i didn't have much to add that, say, the folks at boingboing or the eff hadn't already said better. but recently i had a revelation about how google print could actually help me, as an editor, do my job better. it would actually be a very powerful tool for tracking down plagiarism.

it might seem a bit odd for me to blog about plagiarism, as i strongly believe in fair use rights, sampling rights, and the like. but there is a world of difference between sampling or parodying a work and taking the whole thing and passing it off as your own work. the former is a fragmentary, transformative use (and a creative one), whereas the latter isn't. samplers and remixers are generally pretty honest about what they have taken. the literary equivalent of sampling/remixing is called quoting. quoting is perfectly acceptable as long as sources are cited; in some situations quoting is even strongly encouraged. in contrast, the musical equivalent of plagiarism would be stealing someone else's song and claiming you wrote it. besides, as an editor for a multinational publishing/entertainment company, it's my job to be vigilant for plagiarism issues. so i hope we're clear on the distinction.

i don't actively check for plagiarism too often. generally i give my authors the benefit of the doubt unless i spot something suspicious in the text. if an author's text is usually awful but i come across a passage that is quite well-written, that's suspicious. or if an author has been consistently spelling things one way and suddenly switches to a different spelling, or somehow changes voice in mid-chapter, these are red flags.

when i do decide to start looking for plagiarized content, my first stop is naturally google. i start plugging phrases into google and see what turns up. this technique is remarkably effective. i have even found instances of seeming plagiarism by accident: i came across something a little confusing, went to google to verify the information, and the first page i found contained the exact text and figures from the chapter. oops.

but as powerful as google's web search is, it can only search content that is online. obviously. the internet is a very popular place to plagiarize from (just ask high school teachers), perhaps the #1 most popular place to do so, but it's not the only place. a smart plagiarist, one who doesn't want to be caught, will realize that maybe copying text from the web isn't wise. "if i was able to find this website in 15 seconds," the plagiarist might think, "then my teacher/editor might be able to find it too."

so a smart plagiarist will want to copy from sources that are not indexed online, like printed materials. like books.

some books are online, but most aren't. often excerpts of them, articles adapted from them, and so on exist online, but the bulk of the book doesn't. and i'm pretty sure google's web search doesn't index ebooks. so catching such plagiarism is not really possible online. teachers can still use the old trick of tracking down any books listed in a paper's bibliography and manually searching for copied content, but man is that tedious. and the trick relies on the writer including the source of their plagiarized content in the bibliography. a smart plagiarist probably would not want to cite the source that he's plagiarizing from. and most books don't have bibliographies.

google print could change all that. if google is successful (and isn't forced to stop by short-sighted legal challenges), google print could be a remarkable tool for catching plagiarists. if i came across suspicious text, i could paste it into google's web search, check there, and then with a couple more clicks, switch to google print and check there. if i got no results from either search, i could be fairly confident that the phrase in question was not plagiarized.
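
for the curious, here is a rough sketch of that two-step check in python. it is only an illustration: it does not talk to google at all, the two "indexes" are plain dictionaries of text that i made up, and the function names (shingles, find_matches, check_passage) are hypothetical. it looks for runs of six consecutive words that a suspicious passage shares with any known source, first in the web texts and then in the book texts.

```python
import re


def shingles(text, n=6):
    """return the set of n-word runs in text, lowercased, punctuation ignored."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_matches(suspect, corpus, n=6):
    """return {source name: shared n-word runs} for every source with any overlap.

    corpus is an assumed stand-in for a search index: a dict of name -> full text.
    """
    suspect_runs = shingles(suspect, n)
    matches = {}
    for name, text in corpus.items():
        shared = suspect_runs & shingles(text, n)
        if shared:
            matches[name] = sorted(shared)
    return matches


def check_passage(suspect, web_corpus, book_corpus, n=6):
    """check the web texts first, then the book texts, and report both results."""
    return {
        "web": find_matches(suspect, web_corpus, n),
        "books": find_matches(suspect, book_corpus, n),
    }


if __name__ == "__main__":
    # toy data, invented for this example
    web = {"some-site": "our widget ships in three colors and two sizes"}
    books = {"some-book": "the harbor lights flickered twice before the storm rolled in from the east"}
    suspect = "That night, the harbor lights flickered twice before the storm rolled in."
    print(check_passage(suspect, web, books))
```

an empty result from both lookups only means nothing in those particular texts matched. a plagiarist who paraphrases, or who copies from a source that isn't in either collection, would sail right through, which is why i say "fairly confident" and not "certain".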

thus google print, rather than infringing on publishers' copyrights, would be a powerful tool for protecting copyright. and it would increase book sales by helping readers find books they want to read.

as wonderful as this would be for the publishing industry, i suspect it would be even more useful for teachers, who could almost instantly determine whether students' papers are plagiarized.

do it for the children! save our kids by saving google print!
