
Elsevier Opens Its Papers To Text-Mining

samzenpus posted about 3 months ago | from the take-a-look dept.

Science 52

ananyo writes "Publishing giant Elsevier says that it has now made it easy for scientists to extract facts and data computationally from its more than 11 million online research papers. Other publishers are likely to follow suit this year, lowering barriers to the computer-based research technique. But some scientists object that even as publishers roll out improved technical infrastructure and allow greater access, they are exerting tight legal controls over the way text-mining is done. Under the arrangements, announced on 26 January at the American Library Association conference in Las Vegas, Nevada, researchers at academic institutions can use Elsevier's online interface (API) to batch-download documents in computer-readable XML format. Elsevier has chosen to provisionally limit researchers to 10,000 articles per week. These can be freely mined — so long as the researchers, or their institutions, sign a legal agreement. The deal includes conditions: for instance, that researchers may publish the products of their text-mining work only under a license that restricts use to non-commercial purposes, can include only snippets (of up to 200 characters) of the original text, and must include links to original content."
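The workflow the summary describes (keyed API access, XML responses, a weekly cap) can be sketched in Python. Everything here is illustrative: the endpoint, parameter names, and key are hypothetical placeholders, not Elsevier's actual API.

```python
import time
import urllib.request

API_BASE = "https://api.example-publisher.com/article"  # hypothetical endpoint
WEEKLY_CAP = 10_000  # articles per week, per the announcement

def article_url(doi, api_key):
    """Build the request URL for one article's XML (parameter names are made up)."""
    return f"{API_BASE}/{doi}?apiKey={api_key}&httpAccept=text/xml"

def fetch_articles(dois, api_key, delay=1.0):
    """Download article XML for each DOI, never exceeding the weekly cap."""
    results = {}
    for doi in dois[:WEEKLY_CAP]:  # truncate the batch to the allowed quota
        with urllib.request.urlopen(article_url(doi, api_key)) as resp:
            results[doi] = resp.read()
        time.sleep(delay)  # be polite between requests
    return results
```

A real client would also have to present the signed institutional agreement's credentials and respect whatever rate limits the publisher documents.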


start up the algorithms (0)

Anonymous Coward | about 3 months ago | (#46143143)

Time to disprove some punks.

Re:start up the algorithms (1)

i kan reed (749298) | about 3 months ago | (#46143223)

What exactly are punks saying that can be deconstructed with statistical sampling of published papers?

I mean, are there some really dumb people alleging that academics don't use enough words starting with K?

Re: start up the algorithms (0)

Anonymous Coward | about 3 months ago | (#46145537)

Aren't these research papers published with funding that comes from grants that were originally taxpayer money? Why should I, as a taxpayer, have to pay for it again? Where's my report?

Google spamming (1)

Florian Weimer (88405) | about 3 months ago | (#46143177)

Isn't this called search engine spamming, and haven't several publishing outfits been doing it for about a decade, with varying degrees of success?

Re:Google spamming (0)

Anonymous Coward | about 3 months ago | (#46143269)

Google and Google, what is Google?! How does batch-download of XML documents have anything at all to do with Google?

Re:Google spamming (3, Interesting)

John Bokma (834313) | about 3 months ago | (#46143463)

Several sites that have paywalled PDFs somehow manage to get the contents of those PDFs crawled by Google (and probably other search engines as well). Google has rules against this, but somehow those sites get away with it. E.g. if one googles for "some keywords filetype:pdf" (without the quotes), the results Google shows might give the impression that the full PDF is available, but on clicking one lands on an HTML page showing only the abstract and a "buy this document" link. Access is in the 30+ USD range, so about 2 USD/page or more... One of those sites is Elsevier. Or at least it was; I can't find an example now.

When this happens to me, I contact one of the authors and end up with the paper anyway, for free, most of the time.

Another parasite is scribd.

Re:Google spamming (1)

pepty (1976012) | about 3 months ago | (#46144097)

Several sites that have paywalled PDFs somehow manage to get the contents of those PDFs crawled by Google (probably others as well). Google has rules against this.

Really? I would have thought they would be fine with it; Google Scholar would have been hamstrung from the get go if they didn't present results from paywalled databases, and Google Books is a similar situation for books under copyright.

Re:Google spamming (2)

John Bokma (834313) | about 3 months ago | (#46144247)

The technique is called cloaking. You basically check if a page request is coming from Googlebot or not to decide what to return (or redirect). See: https://support.google.com/web... [google.com]

The services you mentioned have different rules, of course.

Re:Google spamming (1)

jafiwam (310805) | about 3 months ago | (#46144755)

The technique is called cloaking. You basically check if a page request is coming from Googlebot or not to decide what to return (or redirect). See: https://support.google.com/web... [google.com]

The services you mentioned have different rules, of course.

Some of those tools use the browser identifier to decide whether to let you in or not.

That's something that, in some browsers, can be modified by the end user....
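The cloaking and browser-identifier tricks discussed in this thread come down to the server branching on the request's User-Agent header, and the client being free to set that header to anything. A toy sketch of both sides (all strings and function names are illustrative):

```python
def handle_request(user_agent, full_text, abstract):
    """Toy cloaking: serve full text to crawlers, a paywall page to everyone else."""
    if "Googlebot" in user_agent:
        return full_text  # what the search index gets to see
    return abstract + " [Buy this document]"  # what a human visitor sees

def fake_crawler_headers():
    """The client side of the game: a browser or script can claim to be anything."""
    return {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                          "+http://www.google.com/bot.html)"}
```

This is also why spot checks from an unmarked user agent (as mentioned below in the thread) are the only reliable way for a search engine to catch the trick.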

Re:Google spamming (1)

wiredlogic (135348) | about 3 months ago | (#46145035)

Google will masquerade Googlebot as an ordinary browser to spot check cloaking but it isn't thorough enough to catch everything. With AJAX rendered content it is even harder for them to tell what is and isn't shown to normal users.

Re:Google spamming (1)

JaredOfEuropa (526365) | about 3 months ago | (#46147977)

I'd be fine with this if the search results clearly marked entries sitting behind a paywall or requiring registration to access. I'm sure we've all been frustrated multiple times by the likes of Experts-Exchange (who show answers to tech questions in Google but won't let you at them unless you pay up).

Re:Google spamming (0)

Anonymous Coward | about 3 months ago | (#46145535)

When this happens to me, I contact one of the authors and end up with the paper anyway, for free, most of the time.

When I come across a research paper or article where the research was funded by a publicly funded college or university or a government grant, I simply search for the paper's title and the institution and retrieve it for free 99.9% of the time. I need not contact the author(s), though I commend your forthrightness.

Re:Google spamming (0)

Anonymous Coward | about 3 months ago | (#46148289)

Google makes an exception to the rule you've stated above for academic content, i.e. even if a paper is paywalled, it'll crawl it. (I work for an academic publisher.)

Re:Google spamming (0)

Anonymous Coward | about 3 months ago | (#46159519)

Not entirely true. Google doesn't like it if you show them different content from what you'd show a regular user. At the same time, they offer ways for site owners to have Google index paywalled or otherwise password-protected content.

Re:Google spamming (1)

c0lo (1497653) | about 3 months ago | (#46145123)

Isn't this called search engine spamming, and haven't several publishing outfits been doing it for about a decade, with varying degrees of success?

While it may be SEO spamming, I'm inclined to see this as an attempt to outsource the cost of indexing. Along the lines of: "You fools, I have a trove of papers you are drooling for. How about... I'll let you index it however your brilliant minds discover works best for you, then I'll use it to increase the value of my trove."

In other words (1)

dacullen (1666965) | about 3 months ago | (#46143179)

1. Please generate as many sales leads as you can
2. Profit!!!

Re:In other words (0)

Anonymous Coward | about 3 months ago | (#46143609)

Elsevier doesn't bother marketing to individuals. They market exclusively to librarians, i.e. institutions.

Re:In other words (1)

pepty (1976012) | about 3 months ago | (#46144115)

They're probably using it as a way to justify the prices the institutions are forced to pay.

IEEE (0)

Anonymous Coward | about 3 months ago | (#46143185)

Wake me up when I can get all those taxpayer-funded IEEE papers online for free. *grumble*

200 characters (4, Funny)

Anonymous Coward | about 3 months ago | (#46143227)

Publishing giant Elsevier says that it has now made it easy for scientists to extract facts and data computationally from its more than 11 million online research papers. Other publishers are likely t

If the Internet is killing Newspapers (1)

ScottCooperDotNet (929575) | about 3 months ago | (#46143307)

If the Internet is killing newspapers, why isn't it killing this dead tree company?

Re:If the Internet is killing Newspapers (4, Insightful)

dj245 (732906) | about 3 months ago | (#46143389)

If the Internet is killing newspapers, why isn't it killing this dead tree company?

When people stop buying newspapers, they fire the reporters and news correspondents.

When people stop buying scientific journals (and electronic access to such), it doesn't matter. There are still hundreds of professors lined up around the block to try to get published, since it is basically required for them to earn tenure. Anytime you have a barrier to career advancement, the people who own that barrier have a near monopoly and can charge whatever the market will bear. And the market of people trying to advance their career will bear a lot.

Re:If the Internet is killing Newspapers (3, Informative)

John Bokma (834313) | about 3 months ago | (#46143503)

Because news or "news" [1] can be gotten for free on the Internet, while peer-reviewed scientific papers are a bit harder. My experience is that quite a few sites bait Google search results (see my earlier post; you google for PDFs but end up on a landing page which offers one-time access for 30+ USD for a handful of pages). My successful workaround (so far) has been contacting one of the authors for a copy (for personal study).

[1] a lot of people don't seem to care if it's made up or not

Re:If the Internet is killing Newspapers (0)

Anonymous Coward | about 3 months ago | (#46144727)

My successful workaround (so far) has been contacting one of the authors for a copy (for personal study).

Yeah, that's pretty much what anyone does, even at a research institute, if it isn't part of your library subscription. I don't know anyone who actually pays the $30 unless: (a) they need the data now (note: this has been me one time), or (b) they work for a company with deep pockets that is paying for them (note: this has not been me ever... if you know someone with deep pockets who is hiring, though...). Anyone else is just an idiot.

Re:If the Internet is killing Newspapers (3, Funny)

Jane Q. Public (1010737) | about 3 months ago | (#46143505)

"If the Internet is killing newspapers, why isn't it killing this dead tree company?"

It isn't a dead tree company, per se. Elsevier publishes as much online as offline. And more than most.

Having said that: they can still die in a fire.

More access coming to other journals (1)

1_brown_mouse (160511) | about 3 months ago | (#46143337)

I like this bit from TFA:
Shillum says that Elsevier is ahead of the curve — but that other publishers are likely to follow soon. CrossRef, a non-profit collaboration of thousands of scholarly publishers, will in the next few months launch a service that lets researchers agree to standard text-mining terms and conditions by clicking a button on a publisher’s website, a ‘one-click’ solution similar to Elsevier’s set-up.

I would like to see that.

One click? (0)

Anonymous Coward | about 3 months ago | (#46143421)

Lawyers for Amazon are envisioning enlarging their swimming pools...

It would be nicer if... (3)

DeadDecoy (877617) | about 3 months ago | (#46143401)

... publishers removed the paywall to publicly funded literature, or at least made the prices more sane.

Also, while we're on the topic of text mining, would it be possible to get text-only or XML-based articles, with figures attached and cross-references as needed? It's quite annoying to manually convert a PDF when trying to set up an automated analysis over several documents. I know one could set up a shell script to dump it out using the pdftoxml converter, but the output is a bit messy to parse.

Re:It would be nicer if... (0)

Anonymous Coward | about 3 months ago | (#46143873)

The output is a bit messy to parse? Scroll a few lines upwards... voila, Perl programmer for hire. In my experience, they are darn easy to handle, just throw a box of twinkies in the cellar workspace every few hours.

Perl Programmerq (0)

Anonymous Coward | about 3 months ago | (#46143893)

Oh, never mind, I just noticed that he charges money instead of Twinkies. 120 euros, or 163 dollars, per hour. Lordy..

Re:It would be nicer if... (1)

DeadDecoy (877617) | about 3 months ago | (#46144007)

There are a few issues with the output of pdftoxml that make it difficult to parse (mostly Adobe's fault). For 2-column articles, the columns are interleaved. That means you'll get a little bit of text from column A followed by a little bit of text from column B. The xml tags contain the x/y coordinates, so you can develop some heuristics to cleave out segments of text for one journal. This is not particularly suitable when you want to analyze text across different journal formats, as you'll have to develop a one-off solution for each journal.

It would also be useful to have clearly demarcated sections for the abstract, results, references, etc. Again, you could set BIO (Begin-In-Out) tags based on the section title and formatting style, but you may run into a few false positives if those words are used elsewhere in the text, and the two-column issue mentioned earlier may dump in text from other sections. Finally, there's little distinction between the body of the manuscript and the header/footer information.

Overall, the text is a bit messy. If you're just looking for keywords, then it's not a big deal. If you are trying to extract more complicated syntactic structures within the document, then it becomes a problem.
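The x/y-coordinate heuristic described above can be sketched directly: split fragments at the page midline, sort each column top-to-bottom, then concatenate. A rough illustration (the fragment format is simplified; real pdftoxml output needs XML parsing first, and real layouts need per-journal tuning, as the parent comment notes):

```python
def reorder_two_columns(fragments, page_width=612):
    """Reassemble interleaved two-column text.

    Each fragment is (x, y, text). Fragments left of the page midline belong
    to column A, the rest to column B; within a column, read top to bottom.
    """
    mid = page_width / 2
    left = sorted((f for f in fragments if f[0] < mid), key=lambda f: f[1])
    right = sorted((f for f in fragments if f[0] >= mid), key=lambda f: f[1])
    return " ".join(f[2] for f in left + right)
```

Even this simple version breaks on full-width elements (titles, wide tables, headers/footers), which is exactly why one-off per-journal solutions keep being written.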

Re:It would be nicer if... (0)

Anonymous Coward | about 3 months ago | (#46144533)

Someone has to pay for servers, archiving, and management; in short, general overhead. The audience for academic papers is not broad enough to fund this via ads, so either the author or the reader (or their respective proxies) has to pay. The open access movement broadens readership at the price of restricting publication to those who can afford it, pricing out those from poorer institutions/countries. For some areas (high energy physics and the life sciences come to mind) the cost of the research involved makes $2000-3000 a rounding error, and open access makes far more sense. In areas where grants are small (the humanities, for instance), or for those working without grants (albeit often in state institutions), that $2000-3000 has a chilling effect on publication, and sticking with the subscription paywall might make more sense.

Re:It would be nicer if... (2)

RuffMasterD (3398975) | about 3 months ago | (#46146945)

Elsevier [wikipedia.org] had a profit margin of 36% on revenues of US$3.2 billion in 2010. They publish about 250,000 articles a year, and these are downloaded about 240 million times a year. Their content is written for them, but the authors actually have to pay (public money) for the privilege, and their peer review is free labour. Then the readers have to pay too (usually with public money again), and not a cent goes to the author!

Meanwhile Wikipedia's [dreamsrain.com] operating cost was $20.1 Million (mostly funded by donations), they had over 3 million articles, and they are one of the most visited sites on the Internet. The content is written for free and massively peer reviewed for free. All their content can be read by anyone, for free.

Elsevier and Wikipedia seem to have similar technical requirements and business models, but one costs WAY more than the other. That difference is pure profit. If anything, Wikipedia should cost more than Elsevier.
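For scale, the figures quoted above work out roughly as follows (simple division; the inputs are the comment's own numbers):

```python
revenue = 3.2e9      # USD revenue, 2010
margin = 0.36        # reported profit margin
articles = 250_000   # articles published per year
downloads = 240e6    # downloads per year

per_article = revenue / articles    # revenue per article published
per_download = revenue / downloads  # revenue per download
profit = revenue * margin           # absolute profit

print(f"${per_article:,.0f} per article, "
      f"${per_download:.2f} per download, "
      f"${profit / 1e9:.2f}B profit")
# -> $12,800 per article, $13.33 per download, $1.15B profit
```

That is, nearly sixty times Wikipedia's quoted annual operating cost in profit alone.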

Re:It would be nicer if... (0)

Anonymous Coward | about 3 months ago | (#46147773)

Why? Elsevier has scientific articles, while wikipedia has endless flamewars on how to spell Aluminium. Half of wikipedia is just plain wrong. So is half of scientific articles, but at least they are right as far as we currently know. Free access to scientific articles would be a damn good thing anyway.

Re:It would be nicer if... (0)

Anonymous Coward | about 3 months ago | (#46146047)

... I know one could setup a shell script to dump it out using the pdftoxml converter, but the output is a bit messy to parse.

"A bit messy to parse" is quite an understatement. There is no known general purpose method for reconstructing PDF structure, and the ones that are close enough require extensive knowledge to configure for each class of documents. The authors of the Dolores system [researchgate.net] claim to have a system that can be taught in a few minutes per system at least for simple elements like titles, subtitles and paragraphs. They don't seem to handle images, though, and tables are most likely out of their scope.

Re:It would be nicer if... (1)

martin-boundary (547041) | about 3 months ago | (#46148733)

It wouldn't be nicer. It would be the least they should possibly do.

Publishers like Elsevier are leeches sucking at the teat of scientific institutions, weakening their libraries, which are the cornerstone of humanity's research efforts. The sooner they FOAD the better.

LongStrider (0)

Anonymous Coward | about 3 months ago | (#46143703)

ALA Midwinter was in Philadelphia, PA this year. The upcoming ALA conference this summer will be in Las Vegas.

Elsevier hasn't DIAF yet? (1)

atari2600a (1892574) | about 3 months ago | (#46143735)

Soon... once the exclusive contracts and the End User License Agreements expire, the users will revolt. It was foretold in the Scientific Prophecy of Rebirth.

Greed (1)

Anonymous Coward | about 3 months ago | (#46145045)

Haha, back in the 90's, I worked at a company that built some websites for Elsevier. The effort was overseen by a young Dutch woman who came to our offices and wanted to know why we didn't have orange juice and buns for her every morning.

We designed a background image that looked great at normal viewing distances from the screen, but when seen from far away it looked like it really said "GReed-Elsevier". The sites went public, but we were made to change the background about a week after launch.

The data-mining agreement seems to suck (1)

shtrom (1251560) | about 3 months ago | (#46145723)

According to "Why you and I should NOT sign up for Elsevier's TDM service" [0], this is not all that good, as the Text and Data Mining policy is actually overly restrictive. Most notably, it forces you to go through their API to do the work, rather than parsing things locally at your leisure, and imposes conditions on the release of the uncovered data (namely a non-free CC-NC licence).

[0] http://blogs.ch.cam.ac.uk/pmr/... [cam.ac.uk]

This is why we can't have anything nice ... (-1)

Anonymous Coward | about 3 months ago | (#46146157)

So the Publisher-Overlord-Elsevier chooses to make it easier for scientists to do their job, and are they thanked for it? No, instead the complaints are already flying that Big-Meanie-Elsevier is preventing them from giving away the papers for free. Admit it: if Vicious-Hegemonistic-Elsevier didn't implement the restrictions, one|some|many of you would be scraping the site and releasing the content into the wild without a thought or care. I've got three words for you, and the last one is Hippie!

In the world of science, QUALITY peer review costs time and money, and free peer review just doesn't cut it. (Feel free to make your arguments that it should, but for now it doesn't.) You just got half a loaf; how about for once you take it and STFU.

Now get off my lawn!

VALE AARON (0)

Anonymous Coward | about 3 months ago | (#46146917)

nuff said really

Free for their definition of free, not yours (1)

ghmh (73679) | about 3 months ago | (#46147363)

Note:

If you have to sign or agree to something in order to access it, it's not free, even if they say otherwise.

Re:Free for their definition of free, not yours (1)

Antique Geekmeister (740220) | about 3 months ago | (#46148781)

Even a "Public Domain" copyrighted work has rules embedded in copyright law, which apply whether you agree or not. Games played entierly without rules get very strange, very quickly, and inevitably wind up with rules evolved very quickly and not necessarily well.

Having the rules spelled out, in writing, is very helpful to let both sides know what _is_ allowed. This is often far better than the very confusing and potentially dangerous lawsuits involving what is _not_ allowed. Whether these agreements are reasonable is a different question: they do seem pretty aggressive, and restrict the document use far more than even "fair use" restricts it.
