SESSION III: OPERABILITY AND INTEROPERABILITY
Overview and Goals of Session
Dr. Kenneth Buetow of the NCI presented the overview. The goal of this session was to formulate the role of information technology (IT) in facilitating communication: What am I doing? How can I communicate it to you? How do I compare my results to your results? Is there a means to synthetically interpret all of our results?
Operability was defined as providing intuitive interfaces and well-defined operating characteristics. Interoperability was defined as providing common operating characteristics (what types of things are comparable), interchangeable parts (software components, data interfaces), access (local and distributed), and integration.
Key considerations:
Nomenclature issues - how do we describe what we're doing, with respect to genes, samples, experiments?
Data formats - how do we exchange our information?
Software engineering - how can we efficiently/effectively use resources?
Nomenclature
Greg Schuler of the National Center for Biotechnology Information (NCBI), NIH, discussed nomenclature issues. The usual stumbling block is getting consensus on the definition of "gene". Because names and identifiers can become meaningless, the only "constant" is the sequence. At present, the most useful identifier is the Locus ID; this also provides a direct link to Medline and other information sources concerning that locus. Using the genomic sequence as the organizing principle would allow one to generate "gene tags"; both oligos and cDNAs derived from a given gene tag would then share the same Locus ID.
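As a minimal sketch of this organizing principle, reagents from different platforms can be grouped by the locus they derive from. The probe names and Locus ID values below are invented for illustration and do not come from the session or from any NCBI record.

```python
from collections import defaultdict

# Hypothetical probe records: (probe name, probe type, locus ID).
# All names and ID values are made up for illustration only.
PROBES = [
    ("oligo_A1", "oligo", 1001),
    ("cdna_clone_17", "cDNA", 1001),
    ("oligo_B4", "oligo", 2002),
]


def group_by_locus(probes):
    """Group probes by locus ID so results from different platforms line up."""
    groups = defaultdict(list)
    for name, kind, locus_id in probes:
        groups[locus_id].append((name, kind))
    return dict(groups)


by_locus = group_by_locus(PROBES)
for locus_id, members in by_locus.items():
    print(locus_id, members)
```

Under this scheme an oligo and a cDNA that measure the same locus end up under one key, regardless of what each platform calls the gene.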
Data formats
Alex Lash from NCBI discussed data formats, that is, how we exchange information. Starting with the data set, results are recorded in an electronic format and sent to a "central" database where the relevant information must be extracted to permit integration with other data sets. Numerous issues must be considered at each step of the process.
Should the format of the document be public or proprietary? Hierarchical (HTML, XML - can be extensively marked up) or flat (concise, easier to read, can be tab delimited)? Verbose or compact? A mix of formats or a single pure format? Should the annotation be simple or precise? Should the vocabulary be designed all at once, or should it evolve over time?
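To make the flat versus hierarchical trade-off concrete, the sketch below writes the same two measurements as tab-delimited text and as XML. The field names, element names, and values are assumptions chosen for illustration; they are not taken from GEML or any other format discussed at the session.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical measurements; field names and values are illustrative only.
ROWS = [
    {"probe": "oligo_A1", "sample": "tumor_1", "ratio": "2.5"},
    {"probe": "oligo_B4", "sample": "tumor_1", "ratio": "0.8"},
]
FIELDS = ["probe", "sample", "ratio"]


def to_tab_delimited(rows):
    """Flat form: one header line, then one tab-separated line per row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Hierarchical form: every value is wrapped in a named element."""
    root = ET.Element("measurements")
    for row in rows:
        measurement = ET.SubElement(root, "measurement")
        for field in FIELDS:
            ET.SubElement(measurement, field).text = row[field]
    return ET.tostring(root, encoding="unicode")


flat = to_tab_delimited(ROWS)
hier = to_xml(ROWS)
print(len(flat), "characters as tab-delimited text")
print(len(hier), "characters as XML")
```

The XML form repeats the field names around every value, which makes it self-describing and easy to extend but noticeably larger; the flat form is compact but depends on the reader knowing what each column means.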
Data import requires a parser: an algorithm for examining the document and extracting all relevant information. A parser can either create a structured representation of the document in memory or generate state events based on content as it reads. At a minimum, errors will be generated if the document is not well formed; the question is therefore whether to choose a simple or a robust programming language and interface. Imported data must also be validated to check that it conforms to the defined schema.
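The two parsing styles and the separate validation step can be sketched with Python's standard xml.etree.ElementTree module. The document layout and the "schema" (a set of required fields plus a numeric check) are hypothetical stand-ins; a real exchange format would define its own.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document and minimal schema, for illustration only.
DOC = ("<measurements><measurement><probe>oligo_A1</probe>"
       "<ratio>2.5</ratio></measurement></measurements>")
REQUIRED = {"probe", "ratio"}


def is_well_formed(text):
    """Well-formedness check: the parser raises ParseError on broken markup."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def parse_tree(text):
    """Tree style: build the whole document in memory, then walk it."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in m}
            for m in root.iter("measurement")]


def parse_events(text):
    """Event style: react to end-of-element events as the document streams."""
    records, current = [], {}
    source = io.BytesIO(text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "measurement":
            records.append(current)
            current = {}
            elem.clear()  # release memory for the finished record
        elif elem.tag != "measurements":
            current[elem.tag] = elem.text
    return records


def validate(records):
    """Return a list of problems; an empty list means the records conform."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED - set(rec)
        if missing:
            problems.append(f"record {i}: missing {sorted(missing)}")
            continue
        try:
            float(rec["ratio"])
        except (TypeError, ValueError):
            problems.append(f"record {i}: ratio is not a number")
    return problems


records = parse_events(DOC)
print(records, validate(records))
```

Both parsers yield the same records here. The tree style is simpler to write, while the event style keeps memory use flat on large files, which matters given the concern about file size raised in the discussion. Note that a document can be well formed and still fail validation.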
Critical examination and evaluation of many of these issues (formats, parsers, annotation, schema) is required in order to permit the optimal sharing of data.
Software engineering
Jed Rifkin of NCI discussed the issue of interoperability - making things work together on a large scale. Interoperability allows for adaptation to new data sources, reuse of resources, and the integration of data. In addition to sharing data and software, researchers could share computer resources. This may be necessary, because these problems are too large for a single lab to solve, and there is too much data for analysis without IT tools.
Design considerations for sharing data include:
Format
Operational
Integration
Interfaces
Global design considerations include:
The rise of the Internet means we are seeing increasing connectivity, better approaches to machine-to-machine communication, and more common use of distributed resources.
Discussion and Action Items - Operability and Interoperability
Stephen Friend of Rosetta began the discussion by emphasizing the need for a common data interchange language to facilitate data sharing, analysis, and integration of data generated by diverse platforms. At the crux of this problem is the need to agree on the information that is both necessary and sufficient to be included and to create a language that is sufficiently robust and able to evolve. Rosetta's Gene Expression Markup Language (GEML) is an XML file format for storing DNA microarray and gene expression data, intended to enable the exchange of data between gene expression databases and analysis systems. GEML has two document type definitions: "pattern" files, which describe chip layout, probes, and gene annotations, and "profile" files, which contain gene expression data, tissue sample annotations, and RNA sample preparation and hybridization information. Further information about GEML can be found at http://www.rii.com/geml. It was suggested that the group define XML specifications and communicate these specifications to the developers of GEML at an upcoming meeting in August. Dr. Eisen stated that he would be willing to communicate these specifications to participants at the European Bioinformatics Institute meeting next month; this group is also working toward XML representation of gene expression data.
Dr. Buetow mentioned that some tab-delimited formats are in use and that, among XML formats, there is GEML, as described by Dr. Friend, and another XML format being worked up by EBI. Dr. Masys mentioned that because the language is extensible, it is easily changeable; however, data storage could be a problem because the files are large.
It was also stressed that the definition of "gene" is a potentially huge problem for interoperability. It was suggested that the group instead start talking about the sequence or the probe/feature description, the "detector".