Data integration or die: the importance of biologist input in efficiently sharing data
Structural standards for data formats are critical to the value of any analysis: they determine how readily data can be retrieved, shared, validated and reproduced and, above all, integrated and interpreted.
Integrating data is imperative for the advancement of research; blending results of diverse disciplines is often an essential step in answering meaningful biological questions. To achieve this, standards should be implemented at the source of the data for the sake of efficiency, particularly since the datasets are constantly increasing in size, and it may be almost impossible to achieve unification further downstream.
To engage the biologist community, the paper aims to familiarise experimental biologists with the definitions and terms used by computational biologists, and so foster cooperation towards cohesive data flow pipelines. It identifies four main classes of data format (tables, FASTA, GenBank and tag-structured), a major step in defining how the multitude of existing formats might be curated.
Data integration in biological research centres on the adoption of standards, which promise easier conversion between data and file formats. The scale and infrastructure of a given database determine whether it should be stored in a centralised or a distributed manner, a choice that trades off the difficulty of updating against the difficulty of querying, respectively. Either way, when the data need to be integrated further with other data, the computational burden of unifying formats should be eased wherever possible.
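As an illustration of what converting between two of these format classes can involve, the following is a minimal sketch in Python that turns FASTA text into a tab-delimited table. It is not taken from the paper: the function names (parse_fasta, fasta_to_table), the column names and the example sequences are all hypothetical, chosen only for this example.

```python
import csv
import io


def _make_record(header, chunks):
    """Split a FASTA header into identifier and description; join sequence lines."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def fasta_to_table(text):
    """Render FASTA records as a tab-delimited table (column names are assumptions)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "description", "sequence", "length"])
    for identifier, description, sequence in parse_fasta(text):
        writer.writerow([identifier, description, sequence, len(sequence)])
    return out.getvalue()


# Hypothetical input, for illustration only.
example = """>seq1 hypothetical example protein
MKTAYIAK
QRQISFVK
>seq2
ACGT
"""
table = fasta_to_table(example)
print(table)
```

Even in this small case the conversion has to make decisions the FASTA format leaves open, such as how to split the header line into fields, which is the kind of ambiguity that agreed standards at the source of the data would remove.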
Ideally, biologists should work with bioinformaticians and computer scientists and become more involved in standardising their data structures, easing the ongoing burden of database management and of writing programming tools to parse data. This will boost biological research by giving data analysis a more robust structure.
Senior Author, Dr Vicky Schneider, Head of the 361° Division at TGAC, said: “Data integration should not just rely on software engineers and computational scientists, but needs to be driven by the actual users whose communities need to define, adopt and use standards, ontologies and annotation best practice. Therefore, it is particularly important for the biological research community to get acquainted with the conceptual basis of data integration, its limitations, challenges and terminology.”
Senior Author, Dr Allegra Via, Assistant Professor in the Biocomputing Group of Sapienza, University of Rome, added: "The importance of biologists in data integration is huge. They are those who produce and analyse data, which need to be shared for a better science. There cannot be data sharing without good practice in data integration."
The paper, titled “Data Integration in Biological Research: An overview”, is published in PubMed. The publication is a collaborative effort between TGAC, the Department of Informatics at Ionian University, the ELIXIR Hub and the Biocomputing Group, Sapienza University.
TGAC is strategically funded by BBSRC and operates a National Capability to promote the application of genomics and bioinformatics to advance bioscience research and innovation.
Notes to Editors
For more information, please contact:
Marketing & Communications Officer, The Genome Analysis Centre (TGAC)
T: +44 (0)1603 450107
The Genome Analysis Centre (TGAC) is a world-class research institute focusing on the development of genomics and computational biology. TGAC is based within the Norwich Research Park and receives strategic funding from the Biotechnology and Biological Sciences Research Council (BBSRC), £7.4M in 2013/14, as well as support from other research funders. TGAC is one of eight institutes that receive strategic funding from BBSRC. TGAC operates a National Capability to promote the application of genomics and bioinformatics to advance bioscience research and innovation.
TGAC offers a state-of-the-art DNA sequencing facility, unique in its operation of multiple complementary technologies for data generation. The Institute is a UK hub for innovative bioinformatics through research, analysis and interpretation of multiple, complex data sets. It hosts one of the largest computing hardware facilities dedicated to life science research in Europe. It is also actively involved in developing novel platforms to provide access to computational tools and processing capacity for multiple academic and industrial users, and in promoting applications of computational bioscience. Additionally, the Institute offers a training programme through courses and workshops, and an outreach programme targeting schools, teachers and the general public through dialogue and science communication activities. www.tgac.ac.uk
ELIXIR, the European life-science infrastructure for biological information, is a unique and unprecedented initiative that consolidates Europe’s national centres, services, and core bioinformatics resources into a single, coordinated infrastructure.
ELIXIR brings together Europe’s major life-science data archives and, for the first time, connects these with national bioinformatics infrastructures throughout ELIXIR’s member states. By coordinating local, national and international resources the ELIXIR infrastructure will meet the data-related needs of Europe’s 500,000 life-scientists. ELIXIR supports users addressing the Grand Challenges in diverse domains ranging from marine research via plants and agriculture to health research and medical sciences. www.elixir-europe.org
BBSRC invests in world-class bioscience research and training on behalf of the UK public. Our aim is to further scientific knowledge, to promote economic growth, wealth and job creation and to improve quality of life in the UK and beyond.
Funded by Government, and with an annual budget of around £467M (2012-2013), we support research and training in universities and strategically funded institutes. BBSRC research and the people we fund are helping society to meet major challenges, including food security, green energy and healthier, longer lives. Our investments underpin important UK economic sectors, such as farming, food, industrial biotechnology and pharmaceuticals.