
Elsevier Opens Its Papers To Text-Mining

samzenpus posted about 3 months ago | from the take-a-look dept.

Science 52

ananyo writes "Publishing giant Elsevier says that it has now made it easy for scientists to extract facts and data computationally from its more than 11 million online research papers. Other publishers are likely to follow suit this year, lowering barriers to the computer-based research technique. But some scientists object that even as publishers roll out improved technical infrastructure and allow greater access, they are exerting tight legal controls over the way text-mining is done. Under the arrangements, announced on 26 January at the American Library Association conference in Las Vegas, Nevada, researchers at academic institutions can use Elsevier's online interface (API) to batch-download documents in computer-readable XML format. Elsevier has chosen to provisionally limit researchers to 10,000 articles per week. These can be freely mined — so long as the researchers, or their institutions, sign a legal agreement. The deal includes conditions: for instance, that researchers may publish the products of their text-mining work only under a license that restricts use to non-commercial purposes, can include only snippets (of up to 200 characters) of the original text, and must include links to original content."
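The workflow the summary describes (keyed API access, XML responses, a weekly cap) can be sketched in Python. Everything here is illustrative: the endpoint, parameter names, and key are hypothetical placeholders, not Elsevier's actual API.

```python
import time
import urllib.request

API_BASE = "https://api.example-publisher.com/article"  # hypothetical endpoint
WEEKLY_CAP = 10_000  # articles per week, per the announcement

def article_url(doi, api_key):
    """Build the request URL for one article's XML (parameter names are made up)."""
    return f"{API_BASE}/{doi}?apiKey={api_key}&httpAccept=text/xml"

def fetch_articles(dois, api_key, delay=1.0):
    """Download article XML for each DOI, never exceeding the weekly cap."""
    results = {}
    for doi in dois[:WEEKLY_CAP]:  # truncate the batch to the allowed quota
        with urllib.request.urlopen(article_url(doi, api_key)) as resp:
            results[doi] = resp.read()
        time.sleep(delay)  # be polite between requests
    return results
```

A real client would also have to present the signed institutional agreement's credentials and respect whatever rate limits the publisher documents.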


start up the algorithms (0)

Anonymous Coward | about 3 months ago | (#46143143)

Time to disprove some punks.

Re:start up the algorithms (1)

i kan reed (749298) | about 3 months ago | (#46143223)

What exactly are punks saying that can be deconstructed with statistical sampling of published papers?

I mean, are there some really dumb people alleging that academics don't use enough words starting with K?

Re: start up the algorithms (0)

Anonymous Coward | about 3 months ago | (#46145537)

Aren't these research papers published with funding that comes from grants that were originally taxpayer money? Why should I, as a taxpayer, have to pay for it again? Where's my report?

Google spamming (1)

Florian Weimer (88405) | about 3 months ago | (#46143177)

Isn't this called search engine spamming, and haven't several publishing outfits been doing it for about a decade, with varying degrees of success?

Re:Google spamming (0)

Anonymous Coward | about 3 months ago | (#46143269)

Google and Google, what is Google?! How does batch-download of XML documents have anything at all to do with Google?

Re:Google spamming (3, Interesting)

John Bokma (834313) | about 3 months ago | (#46143463)

Several sites that have paywalled PDFs somehow manage to get the contents of those PDFs crawled by Google (and probably other search engines as well). Google has rules against this, but somehow those sites get away with it. E.g. if one googles for "some keywords filetype:pdf" (without the quotes), the results Google shows might give the impression that the full PDF is available, but on clicking one lands on an HTML page showing only the abstract and a "buy this document" link. Access is in the 30+ USD range, so about 2 USD/page or more... One of those sites is Elsevier. Or at least it was; I can't find an example now.

When this happens to me, I contact one of the authors and end up with the paper anyway, for free, most of the time.

Another parasite is scribd.

Re:Google spamming (1)

pepty (1976012) | about 3 months ago | (#46144097)

Several sites that have paywalled PDFs somehow manage to get the contents of those PDFs crawled by Google (probably others as well). Google has rules against this.

Really? I would have thought they would be fine with it; Google Scholar would have been hamstrung from the get go if they didn't present results from paywalled databases, and Google Books is a similar situation for books under copyright.

Re:Google spamming (2)

John Bokma (834313) | about 3 months ago | (#46144247)

The technique is called cloaking. You basically check if a page request is coming from Googlebot or not to decide what to return (or redirect). See: https://support.google.com/web... [google.com]

The services you mentioned have different rules, of course.

Re:Google spamming (1)

jafiwam (310805) | about 3 months ago | (#46144755)

The technique is called cloaking. You basically check if a page request is coming from Googlebot or not to decide what to return (or redirect). See: https://support.google.com/web... [google.com]

The services you mentioned have different rules, of course.

Some of those tools use the browser identifier to decide whether to let you in or not.

That's something that, in some browsers, can be modified by the end user....
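The cloaking and browser-identifier tricks discussed in this thread come down to the server branching on the request's User-Agent header, and the client being free to set that header to anything. A toy sketch of both sides (all strings and function names are illustrative):

```python
def handle_request(user_agent, full_text, abstract):
    """Toy cloaking: serve full text to crawlers, a paywall page to everyone else."""
    if "Googlebot" in user_agent:
        return full_text  # what the search index gets to see
    return abstract + " [Buy this document]"  # what a human visitor sees

def fake_crawler_headers():
    """The client side of the game: a browser or script can claim to be anything."""
    return {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                          "+http://www.google.com/bot.html)"}
```

This is also why spot checks from an unmarked user agent (as mentioned below in the thread) are the only reliable way for a search engine to catch the trick.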

Re:Google spamming (1)

wiredlogic (135348) | about 3 months ago | (#46145035)

Google will masquerade Googlebot as an ordinary browser to spot check cloaking but it isn't thorough enough to catch everything. With AJAX rendered content it is even harder for them to tell what is and isn't shown to normal users.

Re:Google spamming (1)

JaredOfEuropa (526365) | about 3 months ago | (#46147977)

I'd be fine with this if the search results clearly marked entries sitting behind a paywall or requiring registration to access. I'm sure we've all been frustrated multiple times by the likes of Experts-Exchange (who show answers to tech questions in Google but won't let you at them unless you pay up).

Re:Google spamming (0)

Anonymous Coward | about 3 months ago | (#46145535)

When this happens to me, I contact one of the authors and end up with the paper anyway, for free, most of the time.

When I come across a research paper or article where the research was funded by a publicly funded college or university or a government grant, I simply search for the paper's title and the institution and retrieve it for free 99.9% of the time. I need not contact the author(s), though I commend your forthrightness.

Re:Google spamming (0)

Anonymous Coward | about 3 months ago | (#46148289)

Google makes an exception to the rule you've stated above for academic content, i.e. even if a paper is paywalled, it'll crawl it. (I work for an academic publisher.)

Re:Google spamming (0)

Anonymous Coward | about 3 months ago | (#46159519)

Not entirely true. Google doesn't like it if you show them different content from what you'd show a regular user. At the same time, they offer ways for site owners to have Google index paywalled or otherwise password-protected content.

Re:Google spamming (1)

c0lo (1497653) | about 3 months ago | (#46145123)

Isn't this called search engine spamming, and haven't several publishing outfits been doing it for about a decade, with varying degrees of success?

While it may be SEO spamming, I'm inclined to see this as an attempt to outsource the cost of indexing. Along the lines of: "You fools, I have a trove of papers you are drooling for. How about... I'll let you index it however your brilliant minds discover works best for you, then I'll use it to increase the value of my trove."

In other words (1)

dacullen (1666965) | about 3 months ago | (#46143179)

1. Please generate as many sales leads as you can
2. Profit!!!

Re:In other words (0)

Anonymous Coward | about 3 months ago | (#46143609)

Elsevier doesn't bother marketing to individuals. They market exclusively to librarians, i.e. institutions.

Re:In other words (1)

pepty (1976012) | about 3 months ago | (#46144115)

They're probably using it as a way to justify the prices the institutions are forced to pay.

IEEE (0)

Anonymous Coward | about 3 months ago | (#46143185)

Wake me up when I can get all those taxpayer-funded IEEE papers online for free. *grumble*

200 characters (4, Funny)

Anonymous Coward | about 3 months ago | (#46143227)

Publishing giant Elsevier says that it has now made it easy for scientists to extract facts and data computationally from its more than 11 million online research papers. Other publishers are likely t

If the Internet is killing Newspapers (1)

ScottCooperDotNet (929575) | about 3 months ago | (#46143307)

If the Internet is killing newspapers, why isn't it killing this dead tree company?

Re:If the Internet is killing Newspapers (4, Insightful)

dj245 (732906) | about 3 months ago | (#46143389)

If the Internet is killing newspapers, why isn't it killing this dead tree company?

When people stop buying newspapers, they fire the reporters and news correspondents.

When people stop buying scientific journals (and electronic access to such), it doesn't matter. There are still hundreds of professors lined up around the block to try to get published, since it is basically required for them to earn tenure. Anytime you have a barrier to career advancement, the people who own that barrier have a near monopoly and can charge whatever the market will bear. And the market of people trying to advance their career will bear a lot.

Re:If the Internet is killing Newspapers (3, Informative)

John Bokma (834313) | about 3 months ago | (#46143503)

Because news or "news" [1] can be gotten for free on the Internet, while peer-reviewed scientific papers are a bit harder. My experience is that quite a few sites bait Google search results (see my earlier post; you google for PDFs but end up on a landing page which offers one-time access for 30+ USD for a handful of pages). My successful workaround (so far) has been contacting one of the authors for a copy (for personal study).

[1] a lot of people don't seem to care if it's made up or not

Re:If the Internet is killing Newspapers (0)

Anonymous Coward | about 3 months ago | (#46144727)

My successful workaround (so far) has been contacting one of the authors for a copy (for personal study).

Yeah, that's pretty much what anyone does, even at a research institute, if it isn't part of your library subscription. I don't know anyone who actually pays the $30 unless: (a) they need the data now (note: this has been me one time), or (b) they work for a company with deep pockets that is paying for them (note: this has not been me ever... if you know someone with deep pockets who is hiring, though...). Anyone else is just an idiot.

Re:If the Internet is killing Newspapers (3, Funny)

Jane Q. Public (1010737) | about 3 months ago | (#46143505)

"If the Internet is killing newspapers, why isn't it killing this dead tree company?"

It isn't a dead tree company, per se. Elsevier publishes as much online as offline. And more than most.

Having said that: they can still die in a fire.

More access coming to other journals (1)

1_brown_mouse (160511) | about 3 months ago | (#46143337)

I like this bit from TFA:
Shillum says that Elsevier is ahead of the curve — but that other publishers are likely to follow soon. CrossRef, a non-profit collaboration of thousands of scholarly publishers, will in the next few months launch a service that lets researchers agree to standard text-mining terms and conditions by clicking a button on a publisher’s website, a ‘one-click’ solution similar to Elsevier’s set-up.

I would like to see that.

One click? (0)

Anonymous Coward | about 3 months ago | (#46143421)

Lawyers for Amazon are envisioning enlarging their swimming pools...

It would be nicer if... (3)

DeadDecoy (877617) | about 3 months ago | (#46143401)

... publishers removed the paywall to publicly funded literature, or at least made the prices more sane.

Also, while we're on the topic of text mining, would it be possible to get text-only or XML-based articles, with figures attached and cross-references as needed? It's quite annoying to manually convert a PDF when trying to set up an automated analysis over several documents. I know one could set up a shell script to dump it out using the pdftoxml converter, but the output is a bit messy to parse.

Re:It would be nicer if... (0)

Anonymous Coward | about 3 months ago | (#46143873)

The output is a bit messy to parse? Scroll a few lines upwards... voila, Perl programmer for hire. In my experience, they are darn easy to handle, just throw a box of twinkies in the cellar workspace every few hours.

Perl Programmerq (0)

Anonymous Coward | about 3 months ago | (#46143893)

Oh, never mind, I just noticed that he charges money instead of Twinkies. 120 euros, or 163 dollars, per hour. Lordy..

Re:It would be nicer if... (1)

DeadDecoy (877617) | about 3 months ago | (#46144007)

There are a few issues with the output of pdftoxml that make it difficult to parse (mostly Adobe's fault). For 2-column articles, the columns are interleaved. That means you'll get a little bit of text from column A followed by a little bit of text from column B. The xml tags contain the x/y coordinates, so you can develop some heuristics to cleave out segments of text for one journal. This is not particularly suitable when you want to analyze text across different journal formats, as you'll have to develop a one-off solution for each journal.

It would also be useful to have clearly demarcated sections for the abstract, results, references, etc. Again, you could set BIO (Begin-In-Out) tags based on the section title and formatting style, but you may run into a few false positives if those words are used elsewhere in the text, and the two-column issue mentioned earlier may dump in text from other sections. Finally, there's little distinction between the body of the manuscript and the header/footer information.

Overall, the text is a bit messy. If you're just looking for keywords, then it's not a big deal. If you are trying to extract more complicated syntactic structures within the document, then it becomes a problem.
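The x/y-coordinate heuristic described above can be sketched directly: split fragments at the page midline, sort each column top-to-bottom, then concatenate. A rough illustration (the fragment format is simplified; real pdftoxml output needs XML parsing first, and real layouts need per-journal tuning, as the parent comment notes):

```python
def reorder_two_columns(fragments, page_width=612):
    """Reassemble interleaved two-column text.

    Each fragment is (x, y, text). Fragments left of the page midline belong
    to column A, the rest to column B; within a column, read top to bottom.
    """
    mid = page_width / 2
    left = sorted((f for f in fragments if f[0] < mid), key=lambda f: f[1])
    right = sorted((f for f in fragments if f[0] >= mid), key=lambda f: f[1])
    return " ".join(f[2] for f in left + right)
```

Even this simple version breaks on full-width elements (titles, wide tables, headers/footers), which is exactly why one-off per-journal solutions keep being written.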

Re:It would be nicer if... (0)

Anonymous Coward | about 3 months ago | (#46144533)

Someone has to pay for servers, archiving, and management; in short, general overhead. The audience for academic papers is not broad enough to fund this via ads, so either the author or the reader (or their respective proxies) has to pay. The open access movement broadens readership at the price of restricting publication to those who can afford it, pricing out those from poorer institutions/countries. For some areas (high energy physics and the life sciences come to mind) the cost of the research involved makes $2000-3000 a rounding error, and open access makes far more sense. In areas where grants are small (the humanities, for instance), or for those working without grants (albeit often in state institutions), that $2000-3000 has a chilling effect on publication, and sticking with the subscription paywall might make more sense.

Re:It would be nicer if... (2)

RuffMasterD (3398975) | about 3 months ago | (#46146945)

Elsevier [wikipedia.org] had a profit margin of 36% on revenues of US$3.2 billion in 2010. They publish about 250,000 articles a year, and these are downloaded about 240 million times a year. Their content is written for them, but the authors actually have to pay (public money) for the privilege, and their peer review is free labour. Then the readers have to pay too (usually with public money again), and not a cent goes to the author!

Meanwhile Wikipedia's [dreamsrain.com] operating cost was $20.1 Million (mostly funded by donations), they had over 3 million articles, and they are one of the most visited sites on the Internet. The content is written for free and massively peer reviewed for free. All their content can be read by anyone, for free.

Elsevier and Wikipedia seem to have similar technical requirements and business models, but one costs WAY more than the other. That difference is pure profit. If anything, Wikipedia should cost more than Elsevier.
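For scale, the figures quoted above work out roughly as follows (simple division; the inputs are the comment's own numbers):

```python
revenue = 3.2e9      # USD revenue, 2010
margin = 0.36        # reported profit margin
articles = 250_000   # articles published per year
downloads = 240e6    # downloads per year

per_article = revenue / articles    # revenue per article published
per_download = revenue / downloads  # revenue per download
profit = revenue * margin           # absolute profit

print(f"${per_article:,.0f} per article, "
      f"${per_download:.2f} per download, "
      f"${profit / 1e9:.2f}B profit")
# -> $12,800 per article, $13.33 per download, $1.15B profit
```

That is, nearly sixty times Wikipedia's quoted annual operating cost in profit alone.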

Re:It would be nicer if... (0)

Anonymous Coward | about 3 months ago | (#46147773)

Why? Elsevier has scientific articles, while wikipedia has endless flamewars on how to spell Aluminium. Half of wikipedia is just plain wrong. So is half of scientific articles, but at least they are right as far as we currently know. Free access to scientific articles would be a damn good thing anyway.

Re:It would be nicer if... (0)

Anonymous Coward | about 3 months ago | (#46146047)

... I know one could setup a shell script to dump it out using the pdftoxml converter, but the output is a bit messy to parse.

"A bit messy to parse" is quite an understatement. There is no known general purpose method for reconstructing PDF structure, and the ones that are close enough require extensive knowledge to configure for each class of documents. The authors of the Dolores system [researchgate.net] claim to have a system that can be taught in a few minutes per system at least for simple elements like titles, subtitles and paragraphs. They don't seem to handle images, though, and tables are most likely out of their scope.

Re:It would be nicer if... (1)

martin-boundary (547041) | about 3 months ago | (#46148733)

It wouldn't be nicer. It would be the least they should possibly do.

Publishers like Elsevier are leeches sucking at the teat of scientific institutions, weakening their libraries, which are the cornerstone of humanity's research efforts. The sooner they FOAD the better.

LongStrider (0)

Anonymous Coward | about 3 months ago | (#46143703)

ALA Midwinter was in Philadelphia, PA this year. The upcoming ALA conference this summer will be in Las Vegas.

Elsevier hasn't DIAF yet? (1)

atari2600a (1892574) | about 3 months ago | (#46143735)

Soon... once the exclusive contracts and the End User License Agreements expire, the users will revolt. It was foretold in the Scientific Prophecy of Rebirth.

Greed (1)

Anonymous Coward | about 3 months ago | (#46145045)

Haha, back in the 90's, I worked at a company that built some websites for Elsevier. The effort was overseen by a young Dutch woman who came to our offices and wanted to know why we didn't have orange juice and buns for her every morning.

We designed a background image that looked great at normal viewing distances from the screen, but when seen from far away it looked like it really said "GReed-Elsevier". The sites went public, but we were made to change the background about a week after launch.

The data-mining agreement seems to suck (1)

shtrom (1251560) | about 3 months ago | (#46145723)

According to "Why you and I should NOT sign up for Elsevier's TDM service" [0], this is not all that good, as the Text and Data Mining policy is actually overly restrictive. Most notably, it forces you to go through their API to do the work, rather than parsing things locally at your leisure, and imposes conditions on the release of the uncovered data (namely a non-free CC-NC licence).

[0] http://blogs.ch.cam.ac.uk/pmr/... [cam.ac.uk]

This is why we can't have anything nice ... (-1)

Anonymous Coward | about 3 months ago | (#46146157)

So the Publisher-Overlord-Elsevier chooses to make it easier for scientists to do their job, and are they thanked for it? No, instead the complaints are already flying that Big-Meanie-Elsevier is preventing them from giving away the papers for free. Admit it: if Vicious-Hegemonistic-Elsevier didn't implement the restrictions, one|some|many of you would be scraping the site and releasing the content into the wild without a thought or care. I've got three words for you, and the last one is Hippie!

In the world of science, QUALITY peer review costs time and money, and free peer review just doesn't cut it. (Feel free to make your arguments that it should, but for now it doesn't.) You just got half a loaf; how about for once you take it and STFU.

Now get off my lawn!

VALE AARON (0)

Anonymous Coward | about 3 months ago | (#46146917)

nuff said really

Free for their definition of free, not yours (1)

ghmh (73679) | about 3 months ago | (#46147363)

Note:

If you have to sign or agree to something in order to access it, it's not free, even if they say otherwise.

Re:Free for their definition of free, not yours (1)

Antique Geekmeister (740220) | about 3 months ago | (#46148781)

Even a "Public Domain" copyrighted work has rules embedded in copyright law, which apply whether you agree or not. Games played entierly without rules get very strange, very quickly, and inevitably wind up with rules evolved very quickly and not necessarily well.

Having the rules spelled out, in writing, is very helpful to let both sides know what _is_ allowed. This is often far better than the very confusing and potentially dangerous lawsuits involving what is _not_ allowed. Whether these agreements are reasonable is a different question: they do seem pretty aggressive, and restrict the document use far more than even "fair use" restricts it.
