International Challenge To Computationally Interpret Protein Function

samzenpus posted about 2 years ago | from the working-together dept.

Biotech 59

Shipud writes "We live in the post-genomic era, when DNA sequence data is growing exponentially. However, for most of the genes that we identify, we have no idea of their biological functions. They are like words in a foreign language, waiting to be deciphered. The Critical Assessment of Function Annotation, or CAFA, is a new experiment to assess the performance of the multitude of computational methods developed by research groups worldwide and help channel the flood of data from genome research to deduce the function of proteins. Thirty research groups participated in the first CAFA, presenting a total of 54 algorithms. The researchers participated in blind-test experiments in which they predicted the function of protein sequences for which the functions are already known but haven't yet been made publicly available. Independent assessors then judged their performance. The study authors explain: 'The accurate annotation of protein function is key to understanding life at the molecular level and has great biochemical and pharmaceutical implications; however, with its inherent difficulty and expense, experimental characterization of function cannot scale up to accommodate the vast amount of sequence data already available. The computational annotation of protein function has therefore emerged as a problem at the forefront of computational and molecular biology.'"
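The summary does not say how the independent assessors scored the blind-test predictions. As a rough sketch only, one common way to compare a set of predicted function labels against held-back true labels is precision, recall and their harmonic mean. The function name and the example labels below are invented for illustration and are not taken from CAFA.

```python
def precision_recall_f1(predicted, actual):
    """Score one protein's predicted function terms against the held-back truth."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: how many of the predictions were right.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: how many of the true functions were found.
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels, made up for this example.
predicted_terms = {"kinase activity", "ATP binding", "DNA binding"}
true_terms = {"kinase activity", "ATP binding",
              "protein phosphorylation", "membrane"}

p, r, f = precision_recall_f1(predicted_terms, true_terms)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Treating the labels as a flat set is a simplification: if the terms come from a hierarchy, a near miss on a closely related term would arguably deserve partial credit, which this sketch does not give.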

Super Bowl (-1)

Anonymous Coward | about 2 years ago | (#42781797)

I guess during the Super Bowl, even a lowly anonymous gets to be the premier comment!

just took a super satisfying shit (-1)

Anonymous Coward | about 2 years ago | (#42781799)

man, my ass feels so empty and relaxed it's like a meditating buddhist!

Protein (-1)

Anonymous Coward | about 2 years ago | (#42781837)

Your mom interpreted my proteins after I shot my wad into her eye last night.

Re:Protein (-1)

Anonymous Coward | about 2 years ago | (#42782415)

she said she didn't see it coming

Dang. (1, Offtopic)

Syelnicar (2831945) | about 2 years ago | (#42781879)

Here I was, hoping for another Folding@Home.

Re:Dang. (2)

interkin3tic (1469267) | about 2 years ago | (#42782733)

I wouldn't rule it out. My understanding was that folding at home was brute force taking these sequences, testing all possible conformations, and seeing what was the lowest energy conformation. That's still what happens to actual proteins when they fold up, so it's not like the approach doesn't make sense.
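To make the brute-force idea concrete (this is a toy, and not a claim about what Folding@Home actually runs), here is a sketch using the simplified "HP" lattice model: each residue is either hydrophobic (H) or polar (P), the chain lives on a 2D grid, and the energy drops by one for every pair of H residues that touch without being chain neighbours. All names are made up for this example.

```python
from itertools import product

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_coordinates(moves):
    """Return grid coordinates for a move sequence, or None if the chain crosses itself."""
    position = (0, 0)
    coords = [position]
    seen = {position}
    for move in moves:
        dx, dy = MOVES[move]
        position = (position[0] + dx, position[1] + dy)
        if position in seen:
            return None
        seen.add(position)
        coords.append(position)
    return coords


def energy(sequence, coords):
    """-1 for each pair of H residues adjacent on the grid but not adjacent in the chain."""
    index_at = {xy: i for i, xy in enumerate(coords)}
    score = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Only look right and up so that every touching pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                score -= 1
    return score


def brute_force_fold(sequence):
    """Try every move string; return (lowest energy, its moves, count of valid folds)."""
    best_energy, best_moves, valid = None, None, 0
    for moves in product(MOVES, repeat=len(sequence) - 1):
        coords = fold_coordinates(moves)
        if coords is None:
            continue
        valid += 1
        e = energy(sequence, coords)
        if best_energy is None or e < best_energy:
            best_energy, best_moves = e, "".join(moves)
    return best_energy, best_moves, valid


if __name__ == "__main__":
    lowest, moves, valid = brute_force_fold("HPHPPHHPH")
    print(f"lowest energy {lowest} via moves {moves}; {valid} valid conformations")
```

The loop visits 4^(n-1) move strings for a chain of n residues (rotated and mirrored folds are counted separately), so even this cartoon model becomes impractical within a few dozen residues. That is the scaling problem behind the "lot of computing power" remark.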

It's possible that some protein out there will cure a lot of cancers. It could be in platypus, or in some fungus in a desert, some coral, or some other exotic species. We're never going to test all proteins in very many species to see if they're useful. However, we've already sequenced a lot of genomes, and will sequence a lot more. We thus have a lot of protein sequences. We're never going to purify most of them and determine the structure that way. Computing them and using that to identify proteins that may be useful on the other hand, that's within reason. It will take a lot of computing power though. So there will probably be a use in something like folding@home.

Re:Dang. (0)

Anonymous Coward | about 2 years ago | (#42785529)

This has already been going on for the past 9 years. The HPF has folded thousands of domains in more than one hundred genomes. The structural information is combined with sequence information and has been used for protein function annotation. A protein's function is directly tied to its structure, and structure is much more conserved than sequence.

http://en.wikipedia.org/wiki/Human_Proteome_Folding_Project

Re:Dang. (1)

the gnat (153162) | about 2 years ago | (#42785771)

My understanding was that folding at home was brute force taking these sequences, testing all possible conformations, and seeing what was the lowest energy conformation.

Incorrect. Folding@Home uses proteins whose structure (and usually function) is already exceptionally well characterized. That's how they can tell if their simulation actually worked. The point of the project isn't to predict the structure, because that's still extraordinarily difficult to do by purely physical simulation (as opposed to more "knowledge-based" methods like Rosetta), but to explain the physical process.

Also note that although there is an immense number of uncharacterized proteins, there are many fewer actual folds. What will eventually happen is that we experimentally determine the structure of representatives of every family of proteins, and the rest can be guessed (within some reasonable margin of error) by homology modeling, which is quite a bit simpler than predicting a truly unknown structure.

Re:Dang. (1)

Tablizer (95088) | about 2 years ago | (#42783205)

That's when my wife asks me to help out with the laundry.

No idea... like words in a foreign language (0, Offtopic)

Professr3 (670356) | about 2 years ago | (#42781887)

Your similes are magnificent. They are like eggs on a pancake, butternut waiting to be waffle-ironed.

Re:No idea... like words in a foreign language (5, Insightful)

Dutchmaan (442553) | about 2 years ago | (#42782125)

Actually, I don't think the parent is off topic. When we do in fact decipher a gene's function, it doesn't necessarily mean we will grasp the more subtle nuances of how it functions as part of the whole organism. In other words, we could read specific functionality literally but misinterpret the functionality of the whole.

Re:No idea... like words in a foreign language (0)

Anonymous Coward | about 2 years ago | (#42782191)

Plus, the summary was pretty awful. Painful to read, actually.

Thank you (0)

Anonymous Coward | about 2 years ago | (#42781941)

It's about time we start focusing on the future and on what we know will work. Understanding how matter organizes itself into life is one of the biggest challenges ahead. I propose that we understand how life works and how to extend it before this decade is out.

Can't be done... Yet (-1)

Anonymous Coward | about 2 years ago | (#42781945)

experimental characterization of function cannot scale up to accommodate the vast amount of sequence data already available.

Yet. Computational power can scale infinitely, and scales geometrically with time. Just wait. It'll be done soon enough.

Or make the problem more efficient. I bet with each completed protein the process gets faster and more efficient. It will be done sooner than expected.

Re:Can't be done... Yet (1)

khallow (566160) | about 2 years ago | (#42782929)

Computational power can scale infinitely, and scales geometrically with time.

No, it can't. There are fundamental limits to information storage and computation. Those limits are a lot better than we can achieve, but they exist.

Or make the problem more efficient.

A better algorithm always works. It's worth noting here that at worst, one can just make the protein physically and see what happens in real time. So it can't be that hard computationally.

Re:Can't be done... Yet (1)

blue trane (110704) | about 2 years ago | (#42783333)

"There are fundamental limits to information storage and computation."

Do those limits rely on assumptions about dimensions and time? Might dark energy and/or dark matter change some of those assumptions and thus make limits that feel so fundamental now evaporate?

640k ought to be enough for anybody.

Re:Can't be done... Yet (1)

julesh (229690) | about 2 years ago | (#42784955)

Do those limits rely on assumptions about dimensions and time?

No. They derive from the second law of thermodynamics, which assumes very little...
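One well-known limit of this kind is Landauer's bound, which says that erasing one bit of information dissipates at least k_B * T * ln 2 of energy. Whether this is the specific limit the posters above had in mind is an assumption. The function names below are illustrative.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant, exact in the SI


def landauer_limit_joules(temperature_kelvin):
    """Minimum energy in joules dissipated by erasing one bit at the given temperature."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


def max_bit_erasures_per_second(power_watts, temperature_kelvin):
    """Upper bound on irreversible bit erasures per second for a given power budget."""
    return power_watts / landauer_limit_joules(temperature_kelvin)


if __name__ == "__main__":
    room_temperature = 300.0  # kelvin
    per_bit = landauer_limit_joules(room_temperature)
    print(f"Landauer limit at {room_temperature} K: {per_bit:.3e} J per bit")
    print(f"Erasures per second on 100 W: "
          f"{max_bit_erasures_per_second(100.0, room_temperature):.3e}")
```

The bound is far below what current hardware spends per operation, which matches the earlier remark that the limits "are a lot better than we can achieve, but they exist."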

Re:Can't be done... Yet (1)

blue trane (110704) | about 2 years ago | (#42803307)

Second law of thermodynamics is statistical (Fluctuation Theorem). Can we exploit statistics to find ways to violate the Second Law consistently enough to expand the current "fundamental" limits?

Re:Can't be done... Yet (1)

khallow (566160) | about 2 years ago | (#42785479)

Do those limits rely on assumptions about dimensions and time?

Yes. But these are assumptions borne out by our observations of our reality.

Might dark energy and/or dark matter change some of those assumptions and thus make limits that feel so fundamental now evaporate?

No. Dark energy is somewhat relevant in that an expanding universe does have an easier time of dissipating heat and a higher theoretical limit on information that can be packed into a cosmologically large space-time ball of given radius (the surface area (which is proportional to the maximum information a space can contain) of the ball becomes an exponential function of the radius rather than a fixed power). Against this, you have the problem of greatly reducing the number of states of the universe that you can access and change.

So as I understand it, with dark energy, you can cram more information into a given space, but you have less information to cram.

Dark matter has little bearing except being something which can occupy the same space as your computational system and hence, reduce the maximum theoretical information density before a black hole is formed (since that bit of space has both information and dark matter in it).

Re:Can't be done... Yet (1)

blue trane (110704) | about 2 years ago | (#42808165)

So, at the very least, our "fundamental limits" might not hold, if our observations about reality turn out to be like the flatlanders', and there are really more dimensions than we can sense?

Re:Can't be done... Yet (1)

WillKemp (1338605) | about 2 years ago | (#42783841)

No, it can't. There are fundamental limits to information storage and computation. Those limits are a lot better than we can achieve, but they exist.

What are these fundamental limits?

A plan of action (-1)

Anonymous Coward | about 2 years ago | (#42782133)

Without a good plan, we'll be at it for decades. Here's what I think genomic researchers should do.

Genes (and proteins) are obviously organized hierarchically. Which means there must be a control hierarchy in there somewhere. To unravel and properly classify the genome, researchers must first identify and understand the hierarchical control system. Only then can they begin to populate the branches with the correct genes.

After the tree is completely built and all the genes have found their correct locations on the tree, then it's a matter of going through the tree from the top down and switching the branches of the tree off/on one at a time to see what happens. It's hard but it can be done.

Re:A plan of action (5, Insightful)

pepty (1976012) | about 2 years ago | (#42782261)

Without a good plan, we'll be at it for decades. Here's what I think genomic researchers should do.

Genes (and proteins) are obviously organized hierarchically. Which means there must be a control hierarchy in there somewhere. To unravel and properly classify the genome, researchers must first identify and understand the hierarchical control system. Only then can they begin to populate the branches with the correct genes.

After the tree is completely built and all the genes have found their correct locations on the tree, then it's a matter of going through the tree from the top down and switching the branches of the tree off/on one at a time to see what happens. It's hard but it can be done.

Unfortunately there doesn't have to be "a" control hierarchy: each subsystem can have its own hierarchy (or none) that uses its own unique control mechanisms, they don't have to operate by the same rules, they can mess with each other by lots of different ad hoc means. And that's just the genes: the proteins are much harder to model, at least as far as useful predictions go.

It's been ad hoc with no code review for over 3 billion years.

Re:A plan of action (2)

Stirling Newberry (848268) | about 2 years ago | (#42782309)

"It's been ad hoc with no code review for over 3 billion years." This again, is immensely stupid. First, natural selection is constantly weeding out undesirable variations, and second the genome is highly tectonic, constantly removing or altering pathways. It's not teleological, but DNA is the coding mechanism precisely because it is not a passive storage medium.

Re:A plan of action (2)

Forty-3 (2563965) | about 2 years ago | (#42782431)

Don't discount that as stupid. Most of what he said is true. Evolution makes you write code that works, not good or clean code, just code that works. The only time evolution comes into play is when the code can't even compile.

Re:A plan of action (1)

Stirling Newberry (848268) | about 2 years ago | (#42782461)

No, actually it's mostly not true, and science has a word for mostly not true; that word is "junk."

Nature doesn't design out of Knuth, and it is a big mistake to act or think like we will find nice analogs of human type design.

Code obfuscation (2)

goombah99 (560566) | about 2 years ago | (#42782491)

Don't discount that as stupid. Most of what he said is true. Evolution makes you write code that works, not good or clean code, just code that works. The only time evolution comes into play is when the code can't even compile.

Indeed there's even some selective pressure for code obfuscation. Viruses take advantage of compression, for example. New functions usually evolve from faulty events in old genes. There's no pressure to remove accidental calls to the wrong subroutine if they don't matter, hence a lot of messages go to the wrong place as well as the right place. You see this even in higher animals: a dog's leg that scratches when you scratch its ribs is probably some back propagation on the nerve network that was not necessary to remove for proper operation of the dog.

Re:A plan of action (0)

Anonymous Coward | about 2 years ago | (#42782475)

He/she didn't say the output wasn't reviewed. It is obviously held to a rather stringent standard. They said the *code* was ad hoc and not reviewed for 3 billion years. As someone who works with said code on a daily basis, I thought that summed it up rather nicely. It isn't judged for readability. It's judged for getting the job done under a fair degree of urgency. Imagine a shop coding in PERL under a continuous deadline of now with no commenting.

Re:A plan of action (0)

Anonymous Coward | about 2 years ago | (#42783973)

So what we need now is some kind of all seeing/knowing entity to sort it out? ;p

Re:A plan of action (4, Funny)

ColdWetDog (752185) | about 2 years ago | (#42782293)

Stunning. Absolutely astounding. Yet another AC has taken Science by the balls and shaken the Universe to it's core. Dizzying intellect, artistic prose. He's probably six feet tall, blonde and with the chiseled features of a Grecian statue.

Oh. Wait.

Re:A plan of action (0)

Anonymous Coward | about 2 years ago | (#42782373)

So much sarcasm, and yet you can't even tell it's from its...

Re:A plan of action (0)

Anonymous Coward | about 2 years ago | (#42782467)

AC ==?

Re:A plan of action (2)

Stirling Newberry (848268) | about 2 years ago | (#42782303)

This is the dumbest thing, not related to football, that I have read all day. "obviously" hierarchical? That's utterly idiotic. And I mean utterly, betraying a complete lack of any experience with metabolic processes. Many, perhaps even most, proteins do many things in many circumstances, and have dynamic equilibria within more than one metabolic chain, as do many of the small molecules which are produced.

Re:A plan of action (-1)

Anonymous Coward | about 2 years ago | (#42782311)

lol kiss dicks faggit

Re:A plan of action (1)

TapeCutter (624760) | about 2 years ago | (#42782473)

Genes (and proteins) are obviously organized hierarchically. Which means there must be a control hierarchy in there somewhere.

Obvious nonsense. If not, then point to the "control hierarchy" in an ant colony (no, the queen ant does not issue orders to the soldiers and workers).

How does it work? (2)

tsa (15680) | about 2 years ago | (#42782537)

I am not a biologist so forgive me my ignorance but when people say that DNA is the blueprint for an organism I never understand how a bunch of proteins can determine an organism's shape and behavior. Aren't there more factors that determine those things, like the surroundings in which the DNA is used, like chemicals that the growing organism is surrounded with, temperature, etc?

Re:How does it work? (1)

tsa (15680) | about 2 years ago | (#42782559)

BTW I do know that DNA codes for proteins and that the proteins plus certain self-assembly mechanisms account for most of the work done in growing an organism. But there my knowledge ends.

Re:How does it work? (1)

Anonymous Coward | about 2 years ago | (#42782821)

I am not a biologist so forgive me my ignorance but when people say that DNA is the blueprint for an organism I never understand how a bunch of proteins can determine an organism's shape and behavior. Aren't there more factors that determine those things, like the surroundings in which the DNA is used, like chemicals that the growing organism is surrounded with, temperature, etc?

I think the details are not fully understood. However, in answer to your question I think nature and nurture both play a role. Lots of research has been done on identical twins who have the same DNA. Lots of research has also been done on dizygotic twins who do not share the same DNA. We know that identical twins look the same until the environment changes them. For example, if one of the twins works out to become a body builder, the pair will look quite different. We also know dizygotic twins look different from the beginning.

While identical twins look the same, they have different fingerprints, so something else must be influencing the formation of fingerprints. We also know tadpoles (which later develop into frogs) change dramatically with the environment. For example, small levels of pollution can cause tadpoles to develop into frogs with three legs instead of four.

Re:How does it work? (1)

interkin3tic (1469267) | about 2 years ago | (#42782847)

The proteins are what move and shape the cells and thereby the organism. Literally. The proteins are what a lot of the cell is made up of, they are what gives the cell its structure, and they're all the motors in the cell. Cells are mostly water, and they have lipid envelopes, but what makes them more than bubbles is proteins, which are set by DNA. Environment, like nutrition, can have dramatic effects on the final product, but genetics is really what determines what the product is. There's no combination of environmental factors which will make a fly into Einstein. At least not without affecting proteins and/or genetics.

Maternal effects are probably the most important thing besides genetics and proteins. Absolutely critical for life in our cases, and we don't know what all it does for the embryo yet. Still, the uterus doesn't physically make the embryo, it seems the embryo does most of the self-organizing.

Re:How does it work? (2)

macklin01 (760841) | about 2 years ago | (#42782857)

I am not a biologist so forgive me my ignorance but when people say that DNA is the blueprint for an organism I never understand how a bunch of proteins can determine an organism's shape and behavior. Aren't there more factors that determine those things, like the surroundings in which the DNA is used, like chemicals that the growing organism is surrounded with, temperature, etc?

You're absolutely right. Microenvironment -- the cell's chemical, mechanical, and physical environment, determines which genes are switched on, whether those proteins get made, and how and whether they interact with other proteins to alter cell behavior.

This has been a challenge (and perhaps even a failure) of many current genome projects, which are often reductionist to the point of ignoring much of these features, whereas "context" may well be more important than the genome.

There was a big splashy paper in the New England Journal of Medicine last year, where multiple regions of a single tumor were sequenced. It was found that while there were significant differences in the genome across a single tumor, the cell phenotypes (their behavior) were much more convergent. That is, even with significantly different genes, these cells found a way to function similarly when presented a similar environmental context.

Re:How does it work? (1)

EvilSS (557649) | about 2 years ago | (#42782917)

Think of it like building a chain restaurant. You don't grab a blueprint of a building and run off and "poof", you have a fully functioning business. There is a whole process that surrounds it. Find a location, get permits, contract out the work, hire staff, advertise, etc. With a chain, the process is fairly standardized each time, with some minor (hopefully, at least at the individual level) variations. It's kind of the same with an organism. The DNA isn't so much the blueprint, it's the entire project plan. Instructions for little steps that accumulate and build until you've gone from a single cell (for example) to a fully functional individual organism. In this case, instead of the building blocks being wood, concrete, metal, hammers, builders, staff, cheese sticks, bananas, advertisements, etc, they are all proteins, each with an individual function within the process.

Re:How does it work? (1)

Anonymous Coward | about 2 years ago | (#42783489)

Biologist here.

Proteins do all the work. Here's the background:

DNA data is transcribed (think of DNA as a sequence of information, stream of bytes, if that helps) to mRNA (the m stands for 'messenger'). The DNA has twice the redundancy, if you will, as the mRNA. The DNA is for long-term storage, and the mRNA serves as a template for protein production. DNA is read to make mRNA, which is in turn read (executed, perhaps? I'm bad at analogies) to create proteins. There are molecular machines that perform these steps, which are made out of protein and bits of RNA. Amino acids (the twenty molecules that proteins in living things are typically made from) form the basic building block of the molecular machines that make your cells work.

Here are some examples of some such machines:
http://en.wikipedia.org/wiki/RNA_polymerase (copies DNA to mRNA, among other things)
http://en.wikipedia.org/wiki/Ribosome (creates protein from mRNA)
http://en.wikipedia.org/wiki/Na%2B/K%2B-ATPase (pumps sodium out of the cell, to enable an electrochemical gradient across the cell membrane. this phenomenon is responsible for your nervous system's ability to send messages!)
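The DNA-to-mRNA-to-protein flow described above can be sketched in a few lines. This is a deliberately simplified illustration: it treats the input as the coding strand (so transcription is just swapping T for U), uses the standard genetic code only, and ignores splicing, regulation and everything else a real cell does. The function names are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons ordered by base U, C, A, G at each position.
# '*' marks a stop codon.
_AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
                "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR"
                "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product("UCAG", repeat=3), _AMINO_ACIDS)
}


def transcribe(coding_strand_dna):
    """Produce mRNA from the coding strand by replacing T with U (a simplification)."""
    return coding_strand_dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon; return one-letter amino acids."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGGCCAAGTAA"          # toy sequence, not a real gene
    mrna = transcribe(dna)
    print(mrna, "->", translate(mrna))
```

In the analogy used above, `transcribe` plays the role of RNA polymerase and `translate` the role of the ribosome, with the codon table standing in for the tRNA machinery.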

All (well almost all) of your cells have pretty much the same set of DNA as the others. Your cells are different because they 'express' different genes. This is controlled basically by exposing/hiding certain pieces of DNA to the transcription equipment through a set of chemical reactions involving proteins. See, most DNA is coiled up all the time on other types of protein. The bits that are not coiled up can be transcribed to make mRNA, which can be translated into protein.

It's pretty weird.

Proteins are kind of like cosmic legos. Evolution or God or both (depending on your take on the issue) built a whole lot of weird and awesome things with them.

Re:How does it work? (0)

Anonymous Coward | about 2 years ago | (#42783587)

The way it works is that DNA provides instructions for ribosomes to build proteins. Ribosomes manufacture proteins, which are then distributed through the cell for various purposes. Each protein has a shape which dictates what its behavior will be. For example: some proteins form cylinders, and are used as tunnels between different cells. Some proteins are built with little biochemical legs that literally walk along the cell's super structure to transport other materials. Chemistry and shape dictate emergent behavior that forms a cell, and in turn an organism.

Sometimes the cell does not have sufficient building materials, or conditions are poor, in which case it cannot manufacture the substances it is instructed to. In this case the cell will sometimes fail (perhaps resulting in death, perhaps just with some function not working properly). Some cells have contingency plans, such that if there is not enough, say, phosphorus, a plant might start activating other genes that build proteins that fulfill similar functions but do not require phosphorus.

Re:How does it work? (0)

Anonymous Coward | about 2 years ago | (#42783599)

Your body attempts (and will succeed for the most part) to regulate temperature and chemical concentrations to bounds within which the proteins will behave as "designed".

Re:How does it work? (0)

Anonymous Coward | about 2 years ago | (#42784151)

DNA is ultimately what guides the cells in taking the ingredients at hand and making them into a cat or alligator or what have you. Proteins aren't just skin and fingernails, they are organic catalysts -- enzymes -- which provide the metabolic basis for life.

As for externalities, sure. Generally speaking, the recipes require water, a pH near 7, a tonicity near seawater, a temperature near 298K, etc. It's safe to consider life as an emergent property of matter, given the right conditions. Proteins are what do the mixing and assembling of base components and provide the reactions to keep it all up and running. For example hemoglobin (a protein) transports the oxygen you breathe through your body for use by the mitochondria in your individual cells.

Meanwhile, your mitochondria were probably bacteria that some distant ancestor subsumed billions of years ago, so yeah, it gets complicated. "All cells come from pre-existing cells" should give you an idea of how big the problem is, yet how simple it seems.

First cell? Membranes, which define cell boundaries, will form spontaneously given the right chemical soup. Add a billion years of physics (lightning strikes, thermal vents, meteors) and count your blessings. We breathe oxygen, photosynthesizers oxygenated the planet's atmosphere long ago. We can't live without consuming essential amino acids and fatty acids produced by organisms that evolved before us. We ride their coattails.

Re:How does it work? (0)

Anonymous Coward | about 2 years ago | (#42786519)

DNA is not a blueprint, it is a recipe.

Assumptions (1)

pesho (843750) | about 2 years ago | (#42782617)

That is all nice, but most of these prediction algorithms are based on one or more of the following assumptions, which are not always true:

  • 1. We have accurate mapping of the genes.
  • 2. We can predict the protein sequence from the sequence of the gene.
  • 3. One protein can not be the product of two genes.
  • 4. We have a good understanding of what the functions of the proteins in the training set are.
  • 5. If two proteins have similar sequence, they must have similar functions.
  • 6. One protein has one function.
  • 7. A protein has a function.

So any prediction should be taken with a grain of salt and experimentally verified, which brings us back to " ... with its inherent difficulty and expense, experimental characterization of function cannot scale up to accommodate the vast amount of sequence data already available ...."
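Assumption 5 in the list above is the easiest one to show in code. The sketch below copies a function label from the most similar annotated sequence whenever the similarity clears a cutoff. It is a toy: it compares equal-length, already aligned sequences position by position, whereas real tools perform proper alignment, and the sequences, labels and the 0.4 cutoff are all invented for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def transfer_annotation(query, annotated, threshold=0.4):
    """Return (function, identity) of the best match, or (None, identity) if below threshold.

    `annotated` maps a protein name to a (sequence, function label) pair.
    The threshold is an arbitrary illustrative cutoff, not an established value.
    """
    best_function, best_score = None, 0.0
    for _name, (sequence, function) in annotated.items():
        score = percent_identity(query, sequence)
        if score > best_score:
            best_function, best_score = function, score
    if best_score < threshold:
        return None, best_score
    return best_function, best_score


if __name__ == "__main__":
    # Hypothetical training set, made up for this example.
    training_set = {
        "protA": ("MKTAYIAKQR", "hypothetical kinase"),
        "protB": ("MGSSHHHHHH", "hypothetical binding protein"),
    }
    print(transfer_annotation("MKTAYLAKQR", training_set))
```

The code makes the weak spot visible: the returned label is only as good as the claim that similar sequence implies similar function, and as good as the labels in the training set (assumption 4).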

Re:Assumptions (3, Informative)

the biologist (1659443) | about 2 years ago | (#42783249)

1. We have accurate mapping of the genes.

We have a pretty good idea on this one. Specific polymerases have specific sequences which they respond to, defining the start sequences of genes. It is possible we have missed some polymerase, but the likelihood is low given the extensive searches which have been done for them. As well, regions which are genes have a distinctively different character than regions which are not genes (at least in the general sense).

2. We can predict the protein sequence from the sequence of the gene.

We also have a pretty good idea about this, due to decades and decades of biologists trying to figure out the answer to this problem. The genetic code turns out to differ in some organisms from what we think of as the default. Sometimes multiple amino acids are coded for by the same sequence of bases, and so multiple proteins are produced from the identical coding region of DNA. Sometimes proteins are produced with modified amino acids, which are not explicitly coded for in the DNA of the gene, but rather by the activity of other proteins defined elsewhere by DNA. (This is a stochastic process and interference in the distribution of outcomes can sometimes result in pathological consequences.) In some organisms, the DNA is decompressed into RNA which is then translated into protein in a more typical way. (Extra bases are incorporated into the RNA in a repeatable way that results in amino acids added which were not defined in the sequence of DNA of the gene being added to proteins.) There's a whole bunch of stuff on alternate splicing, which we explicitly know that we don't know how to predict, that produces variations in protein sequence from a single gene sequence.

3. One protein can not be the product of two genes.

There are plenty of ways in which two separate genes can produce an identical protein. This actually happens ALL THE TIME in mammals, since we have two copies of every gene and most of these pairs have identical sequence. Even if the genes produce the identical protein through different mechanisms, if the protein is identical... then the protein is identical.

4. We have a good understanding of what the functions of the proteins in the training set are.

We do have a good idea of what the functions of the proteins in the training set are. See all of molecular biology for your citations.

5. If two proteins have similar sequence, they must have similar functions.

This is explicitly known to be false and is not expected under the evolutionary model. Look up the category of proteins known as 'crystallins' for a specific case counter to your assumption.

6. One protein has one function.

It is generally thought that there is a primary function for every protein. All things in biology are fuzzy, such that every protein probably has secondary side reactions or functions which may or may not be biologically relevant. (Arsenic is poisonous to us because our enzymes have a hard time distinguishing it from phosphorus, so the enzymes which incorporate phosphorus also 'function' to incorporate arsenic.)

7. A protein has a function.

Any protein synthesized by a cell costs energy. Under the evolutionary model of biology, proteins which don't have a function should have been discarded because their synthesis was wasting energy. That said, lots and lots of proteins are continuously created and then rapidly degraded because they were improperly folded or had other problems which brought them to the attention of intracellular systems with the 'function' of degrading such errant protein and returning their components to the cell for more productive use. Some genetic diseases are the consequence of the buildup of proteins which are otherwise non-symptomatic, but don't get degraded properly by the degradation systems.

Re:Assumptions (1)

the biologist (1659443) | about 2 years ago | (#42783259)

In short, biologists are aware of the limitations of their assumptions and have some solid idea as to when their assumptions are valid or not.

Doing the bioinformatics will help a researcher sort through the dramatically large number of gene sequences to find a set which is likely enriched for the characteristic they are looking for. They know they will miss interesting cases which don't match the models used. Without these sorts of predictions, they would have to rely on random guessing as a strategy with a much lower payout.

Re:Assumptions (1)

Forever Wondering (2506940) | about 2 years ago | (#42783613)

As I am not a biologist, feel free to correct anything I say here.

---

As I understand it, the ribosome is responsible for taking the RNA and creating the protein. IIRC, it also folds the protein. A single protein can be folded in a few different ways to produce different building blocks.

There is also some sort of checking mechanism that checks for proper sequence and proper folding. If there is an error, the constructed/folded protein is broken down and the process is retried. Sometimes, the error checking mechanism fails and bad proteins are released, resulting in certain diseases such as Parkinson's or Creutzfeldt-Jakob.

There are parallel projects to computationally model protein folding. Also, there is a highly successful project based on crowdsourcing, the "protein folding game".

So, my question is, isn't it necessary to understand the folding first, and then go for the protein function annotation? Or, more broadly, how do the two projects/approaches interact?

Re:Assumptions (1)

the biologist (1659443) | about 2 years ago | (#42784635)

The ribosome is a complex of protein and ribosomal RNA (rRNA). The catalytic subunit of the ribosome, which adds new amino acids to the nascent protein, is the rRNA. A single protein can be folded an infinite number of ways, but only a small subset of that possibility is stable. Proteins which have failed to fold 'properly' will be bound by 'heat shock proteins' (HSPs) which assist the new protein in folding. These complexes provide some buffering against the problems of incorrectly manufactured or mutant proteins. If the protein can't fold 'correctly' even with this help, it will be degraded by a complex called a 'proteasome'.

Parkinson's disease is characterized by the death of dopamine-producing neurons in the brain, which can be caused by the buildup of toxic protein precipitates that the proteasomes cannot degrade (but can also be caused by alternate mechanisms). Creutzfeldt-Jakob disease occurs when prion proteins refold into a super-stable configuration. This super-stable configuration induces other prion proteins to also fold into the super-stable configuration, resulting in large amounts of the protein folding 'correctly' as far as the proteasome is concerned, but 'incorrectly' as far as neurobiology is concerned. These proteins also form into large protein precipitates which interfere with cell function. The presence of large protein precipitates is a characteristic of many neurodegenerative diseases, though the proteins which form the precipitates differ in each case.

It is definitely helpful to understand the folding, as it provides lots of information which can be useful in making predictions about what a protein will do. It is not needed, however, to understand protein folding completely in order to make useful predictions about proteins from sequence. We've known for a while how to robustly predict small structural motifs called alpha-helices and beta-pleated-sheets. We can recognize certain patterns of helices and sheets as similar to the structures of proteins we've already solved. We can also infer that a protein functions within a membrane if it has enough helices of a certain length with a high density of hydrophobic amino acids along its length. The more we learn about how proteins fold, the better our predictions will become.
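For anyone who wants to see what the membrane inference looks like in practice, here is a minimal sketch, assuming the classic Kyte-Doolittle hydropathy scale with a 19-residue sliding window and a cutoff of 1.6. The function name and the toy sequence are made up for illustration; real predictors are considerably more sophisticated than this.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_windows(seq, window=19, threshold=1.6):
    """Return (start, mean_hydropathy) for each window at or above threshold.

    Assumes seq contains only the 20 standard one-letter amino acid codes;
    any other character raises KeyError.
    """
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        if mean >= threshold:
            hits.append((start, mean))
    return hits


# Hypothetical toy sequence: charged flanks around a 21-residue hydrophobic core.
toy = "DEKRDEKRDE" + "LLIVALLIVALLIVALLIVAL" + "DEKRDEKRDE"
candidate_starts = [start for start, _ in hydrophobic_windows(toy)]
```

A run of consecutive hits is only a candidate membrane-spanning helix. It says nothing on its own about what the protein does once it is in the membrane.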

There are many proteins which include a 'random coil' domain, which is generally thought to be an unfolded/unstructured sequence... it turns out these sequences are often critical to the protein in its recognition of diverse binding partners, using subtle features of electrostatic binding and thermodynamics to 'correctly' recognize other proteins for interactions. This category of functions is currently being studied mostly by one lab (to my knowledge), in large part because they're hard to study and researchers tend to be drawn to things they know how to approach.

Re:Assumptions (1)

pesho (843750) | about 2 years ago | (#42785629)

1. We have accurate mapping of the genes.

We have a pretty good idea on this one. Specific polymerases have specific sequences which they respond to, defining the start sequences of genes. It is possible we have missed some polymerase, but the likelihood is low given the extensive searches which have been done for them. As well, regions which are genes have a distinctively different character than regions which are not genes (at least in the general sense).

You justify an assumption with an assumption. The core promoter sequences are so degenerate that they can be found pretty much anywhere. This has led to misannotation of long genes as multiple single genes. There are a number of other causes of annotation errors.

  • Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies Alexandra M. Schnoes, Shoshana D. Brown, Igor Dodevski, Patricia C. Babbitt
  • Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies. Tripp HJ, Hewson I, Boyarsky S, Stuart JM, Zehr JP.

There are also numerous examples of manually curated entries that are wrong because people studied non-existent proteins as a result of cloning artifacts or ignoring nonsense-mediated decay (NMD). Here is one example where transcripts containing unspliced introns, which are eliminated by NMD, were studied and ascribed a function: Zhu J, Chen X. MCG10, a novel p53 target gene that encodes a KH domain RNA-binding protein, is capable of inducing apoptosis and cell cycle arrest in G(2)-M. Mol Cell Biol. 2000 Aug;20(15):5602-18. (accessions AF257770, AF257771)

2. We can predict the protein sequence from the sequence of the gene.

We also have a pretty good idea about this, due to decades and decades of biologists trying to figure out the answer to this problem. The genetic code turns out to differ in some organisms from what we think of as the default. Sometimes multiple amino acids are coded for by the same sequence of bases, and so multiple proteins are produced from the identical coding region of DNA. Sometimes proteins are produced with modified amino acids, which are not explicitly coded for in the DNA of the gene, but rather by the activity of other proteins defined elsewhere by DNA. (This is a stochastic process and interference in the distribution of outcomes can sometimes result in pathological consequences.) In some organisms, the DNA is decompressed into RNA which is then translated into protein in a more typical way. (Extra bases are incorporated into the RNA in a repeatable way that results in amino acids added which were not defined in the sequence of DNA of the gene being added to proteins.) There's a whole bunch of stuff on alternate splicing, which we explicitly know that we don't know how to predict, that produces variations in protein sequence from a single gene sequence.
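To illustrate the point about the genetic code differing between systems, here is a small sketch that translates the same DNA with the standard code and with the vertebrate mitochondrial code (where TGA reads as Trp, ATA as Met, and AGA/AGG as stops). The `translate` helper and the example sequence are hypothetical, written only for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondrial code: four codons are reassigned.
VERT_MITO = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")


def translate(dna, table):
    """Translate dna in frame 0 and stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Made-up sequence: ATG ATA TGA GCC AGA TTT
dna = "ATGATATGAGCCAGATTT"
standard_protein = translate(dna, STANDARD)   # "MI"
mito_protein = translate(dna, VERT_MITO)      # "MMWA"
```

The same stretch of DNA gives two different peptides depending on which table applies, and that is before splicing, RNA editing, or post-translational modification enter the picture.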

Your pretty good idea is applicable to about 60% of the long reading frames and even less applicable to short ORFs: Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011 Nov 11;147(4):789-802. Mind you, this does not include processes like RNA editing, which can further complicate how we predict protein sequence based on gene sequence.

3. One protein can not be the product of two genes.

There are plenty of ways in which two separate genes can produce an identical protein. This actually happens ALL THE TIME in mammals, since we have two copies of every gene and most of these pairs have identical sequence. Even if the genes produce the identical protein through different mechanisms, if the protein is identical... then the protein is identical.

I wasn't commenting on ploidy. I had in mind things like trans-splicing, where you assemble mature RNA from transcripts that belong to different genes, sometimes located on different chromosomes, or the way protozoan genomes are rearranged prior to expression in the macronucleus.

4. We have a good understanding of what the functions of the proteins in the training set are.

We do have a good idea of what the functions of the proteins in the training set are. See all of molecular biology for your citations.

See the MCG10 example above. Even for well-studied proteins like p53 (there are over 65,000 publications out there on p53) we keep finding new functions (p53 controlling energy metabolism, for example).

5. If two proteins have similar sequence, they must have similar functions.

This is explicitly known to be false and is not expected under the evolutionary model. Look up the category of proteins known as 'crystallins' for a specific case counter to your assumption.

Yet sequence homology is at the base of all algorithms for predicting protein function. I know it is the best tool we have (I use it on a daily basis), but this still limits its applicability to generating a testable hypothesis. Which brings us back to my point that we have to experimentally validate all these computational predictions.
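For readers wondering what "similar sequence" means numerically, here is a deliberately simple sketch: percent identity over two sequences that are assumed to be already aligned, with `-` marking gaps. The function name and example strings are made up. Real homology searches use substitution matrices and alignment statistics rather than a raw identity count.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned columns in which the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned fragments: 4 matches out of 5 gap-free columns.
identity = percent_identity("MKT-AY", "MKTLAF")
```

A high score here makes shared function a reasonable hypothesis to test, nothing more, which is the crystallin point made above.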

6. One protein has one function.

It is generally thought that there is a primary function for every protein. All things in biology are fuzzy, such that every protein probably has secondary side reactions or functions which may or may not be biologically relevant. (Arsenic is poisonous to us because our enzymes have a hard time distinguishing it from phosphorus, so the enzymes which incorporate phosphorus also 'function' to incorporate arsenic.)

Again, you are supporting one assumption with another. Here are a couple of examples that fly in the face of it: VPS39/Vam6/TLP is involved in lysosome fusion, but it also regulates TGF-beta signaling. Disruption of either of these functions is lethal for the organism. So which one is the primary? FAM48A in the nucleus is part of a chromatin remodeling complex that controls transcription. In the cytoplasm it engages in a completely unrelated set of interactions that regulate EMT and autophagy. Which one is FAM48A's primary function? A number of RNA binding proteins (SRSF1, PTBP1, hnRNPK, PGC1-alpha) can have multiple functions in regulating RNA splicing, protein translation, and gene transcription. Again, how do you assign a primary function?

7. A protein has a function.

Any protein synthesized by a cell costs energy. Under the evolutionary model of biology, proteins which don't have a function should have been discarded because their synthesis was wasting energy. That said, lots and lots of proteins are continuously created and then rapidly degraded because they were improperly folded or had other problems which brought them to the attention of intracellular systems with the 'function' of degrading such errant protein and returning their components to the cell for more productive use. Some genetic diseases are the consequence of the buildup of proteins which are otherwise non-symptomatic, but don't get degraded properly by the degradation systems.

Organisms, especially higher eukaryotes, are by no means energy efficient and "waste" energy on a whole bunch of processes. Adding a thousand useless proteins that are rapidly eliminated soon after they are synthesized does not change the energy equation by much. Look at Ingolia's paper above. There are thousands of proteins that are translated from alternate ORFs. Some of them have functions, others don't and are rapidly degraded (arguably the ORFs encoding them may have the function of regulating the translation of nearby reading frames). Gene duplication events often render proteins redundant, and inactivating mutations in one locus can produce a useless protein that is passed to the progeny simply because there is no selective pressure against it.

Re:Assumptions (2)

the biologist (1659443) | about 2 years ago | (#42785979)

You justify an assumption with an assumption. The core promoter sequences are so degenerate that they can be found pretty much anywhere. This has led to misannotation of long genes as multiple single genes. There are a number of other causes of annotation errors.

  • Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies Alexandra M. Schnoes, Shoshana D. Brown, Igor Dodevski, Patricia C. Babbitt
  • Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies. Tripp HJ, Hewson I, Boyarsky S, Stuart JM, Zehr JP.

There are also numerous examples of manually curated entries that are wrong because people studied non-existent proteins as a result of cloning artifacts or ignoring nonsense-mediated decay (NMD). Here is one example where transcripts containing unspliced introns, which are eliminated by NMD, were studied and ascribed a function: Zhu J, Chen X. MCG10, a novel p53 target gene that encodes a KH domain RNA-binding protein, is capable of inducing apoptosis and cell cycle arrest in G(2)-M. Mol Cell Biol. 2000 Aug;20(15):5602-18. (accessions AF257770, AF257771)

Those long single genes which are sometimes misannotated as a series of smaller genes... are sometimes transcribed as a long single gene and sometimes as a series of smaller genes. You've primarily pointed out that biology is hard and that most published papers are full of crap.

Your pretty good idea is applicable to about 60% of the long reading frames and even less applicable to short ORFs: Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011 Nov 11;147(4):789-802. Mind you, this does not include processes like RNA editing, which can further complicate how we predict protein sequence based on gene sequence.

This counter-argument doesn't counter my argument.

I wasn't commenting on ploidy. I had in mind things like trans-splicing, where you assemble mature RNA from transcripts that belong to different genes, sometimes located on different chromosomes, or the way protozoan genomes are rearranged prior to expression in the macronucleus.

I wasn't commenting on ploidy either. Protozoans do things in all sorts of ways, most of which we have no idea about... and don't care about for the most part. The knowledge we have about the systems we have applies best to the systems we have studied.

See the MCG10 example above. Even for well-studied proteins like p53 (there are over 65,000 publications out there on p53) we keep finding new functions (p53 controlling energy metabolism, for example).

You're referring to the network of downstream effects which are influenced by p53. That is an order of magnitude or more of difficulty beyond identifying protein function.

Yet sequence homology is at the base of all algorithms for predicting protein function. I know it is the best tool we have (I use it on a daily basis), but this still limits its applicability to generating a testable hypothesis. Which brings us back to my point that we have to experimentally validate all these computational predictions.

...which isn't a point I ever disagreed with. We can most easily predict functions which are the result of sequences we have seen before... but we don't assume that similar sequences will result in similar functions.

Again, you are supporting one assumption with another. Here are a couple of examples that fly in the face of it: VPS39/Vam6/TLP is involved in lysosome fusion, but it also regulates TGF-beta signaling. Disruption of either of these functions is lethal for the organism. So which one is the primary? FAM48A in the nucleus is part of a chromatin remodeling complex that controls transcription. In the cytoplasm it engages in a completely unrelated set of interactions that regulate EMT and autophagy. Which one is FAM48A's primary function? A number of RNA binding proteins (SRSF1, PTBP1, hnRNPK, PGC1-alpha) can have multiple functions in regulating RNA splicing, protein translation, and gene transcription. Again, how do you assign a primary function?

"It is generally thought" is an assumption. I don't force reality to fit my assumptions, but use them as guidelines. In the case you mention, which function is more required? You could determine this by working with temperature sensitive alleles or other methods. Theoretically, one function will always be more important than the other. Which function is most important may vary with time and/or environmental situation, but one will always be more important... there really isn't any other way for real-valued things in the real world to exist.

Organisms, especially higher eukaryotes, are by no means energy efficient and "waste" energy on a whole bunch of processes. Adding a thousand useless proteins that are rapidly eliminated soon after they are synthesized does not change the energy equation by much. Look at Ingolia's paper above. There are thousands of proteins that are translated from alternate ORFs. Some of them have functions, others don't and are rapidly degraded (arguably the ORFs encoding them may have the function of regulating the translation of nearby reading frames). Gene duplication events often render proteins redundant, and inactivating mutations in one locus can produce a useless protein that is passed to the progeny simply because there is no selective pressure against it.

And now you're just ranting for the sake of ranting...? "Some of them have functions, others don't..." You have evidence for a negative function? Yes, duplicated/dysfunctional genes exist which no longer work for their primary function. This is not some profound revelation. There are always side reactions which may or may not be important enough for you to care.

I can readily have assumptions that I use in my daily work as a research biologist, so long as I pay attention to when they don't apply/work. This is sort of at the root of what science is. ALL models of the world are assumptions, but that does not mean they are not useful under certain circumstances. One of the reasons to dismiss an old scientist when they say something isn't possible, while a younger one says it might be, is that the older scientist has become "married to the model" and no longer observes the reality they are studying.

Patents (0)

Anonymous Coward | about 2 years ago | (#42782851)

Do these algorithms autonomously file for patents on their findings and issue legal threats to competing algorithms as well?

Re:Patents (0)

Anonymous Coward | about 2 years ago | (#42783131)

No. This would violate my patent on automated lawyer bots.

Soon we'll discover that our ancestors were assass (0)

Anonymous Coward | about 2 years ago | (#42783241)

I just realized that the word assassin has ass in there twice. Thank you /. for limiting how much I can write in the title.

Wouldn't... (1)

DirtyLiar (796951) | about 2 years ago | (#42784205)

...a post-genomic world be one in which we had stopped fiddling with genes and DNA and such?

Aren't we more in the midst of a Genomics Revolution?

Or more accurately, we are in the infancy of the Genomics Revolution.

TED Talk: Understanding cancer through proteomics (1)

rbrandis (735555) | about 2 years ago | (#42784649)

This TED.com talk by Danny Hillis is informative on this topic: http://www.ted.com/talks/danny_hillis_two_frontiers_of_cancer_treatment.html "Danny Hillis makes a case for the next frontier of cancer research: proteomics, the study of proteins in the body. As Hillis explains it, genomics shows us a list of the ingredients of the body -- while proteomics shows us what those ingredients produce. Understanding what's going on in your body at the protein level may lead to a new understanding of how cancer happens."