private data | Edinburgh Research Data Blog
[Reposted from
In a recent blog post, we looked at the four quadrants of research data curation systems. This categorised systems that manage or describe research data assets by whether their primary role is to store metadata or data, and whether the information is for private or public use. Four systems were then put into these quadrants. We then started to investigate further the requirements of a Data Asset Register in another blog post.
This blog post will look at the requirements and characteristics of a Data Vault, and how this component fits into the data curation system landscape.
What?
The first aspect to consider is what exactly a Data Vault is. For the purposes of this blog post, we’ll simply consider it to be a safe, private store of data that is only accessible by the data creator or their representative. For simplicity, it could be considered very similar to a safety deposit box within a bank vault. However, beyond the basic concept this analogy starts to break down quite quickly, as we’ll discuss later.
Why?
There are different use cases where a Data Vault would be useful. A few are described here:
A paper has been published, and according to the research funder’s rules, the data underlying the paper must be made available upon request. It is therefore important to store a date-stamped golden-copy of the data associated with the paper. Even if the author’s own copy of the data is subsequently modified, the data at the point of publication is still available.
Data containing personal information, perhaps medical records, needs to be stored securely. The data is ‘complete’ and unlikely to change, yet hasn’t reached the point where it should be deleted.
Data analysis of a data set has been completed, and the research finished. The data may need to be accessed again, but is unlikely to change, so needn’t be stored in the researcher’s active data store. An example might be a set of completed crystallography analyses, which whilst still useful, will not need to be re-analysed.
Data is subject to retention rules and must be kept securely for a given period of time.
How?
Clearly the storage characteristics for a Data Vault are different to an open data repository or active data filestore for working data. The following is a list of some of the characteristics a Data Vault will need, or could use:
Write-once file system: The file system should only allow each file to be written once. Deleting a file should require extra effort so that it is hard to remove files.
Versioning: If the same file needs to be stored again, then rather than overwriting the existing file, it should be stored alongside the file as a new version. This should be an automatic function.
File security: Only the data owner or their delegate can access the data.
Storage security: The Data Vault should only be accessible through the local university network, not the wider Internet. This reduces the vectors of attack, which is important given the potential sensitivity of the data contained within the Data Vault.
Additional security: Encrypt the data, either via key management by the depositors, or within the storage system itself?
Upload and access: Options include via a web interface (issues with very large files), special shared folders, dedicated upload facilities (e.g. GridFTP), or an API for integration with automated workflows.
Integration: How would the Data Vault integrate with the Data Asset Register? Could the register be the main user interface for accessing the Data Vault?
Description: What level of description, or metadata, is required for data sets stored in the Data Vault, to ensure that they can be found and understood in the future?
Assurance: Facilities to ensure that the file uploaded by the researcher is intact and correct when it reaches the vault, and periodic checks to ensure that the file has not become corrupted. What about more active preservation functions, including file format migration to keep files up to date (e.g. convert Word 95 documents to Word 2013 format)?
Speed: Can the file system be much slower, perhaps a Hierarchical Storage Management (HSM) system that stores frequently accessed data on disk, but relegates older or less frequently accessed data to slower storage mechanisms such as tape? Access might then be slow (it takes a few minutes for the data to be automatically retrieved from the tape) but the cost of the service is much lower.
Allocation: How much allocation should each person be given, or should it be unrestricted so as to encourage use? What about costing for additional space? Costings may be hard, because if the data is to be kept for perpetuity, then whole-life costing will be needed. If allocation is free, how to stop it being used for routine backups of data rather than golden-copy data?
Who: Who is allowed access to the Data Vault to store data?
Review periods: How to remind data owners what data they have in the Data Vault so that they can review their holdings, and remove unneeded data?
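To make the write-once, versioning and assurance characteristics above a little more concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions, not a design for the service: the class name `VaultSketch`, the `v1`, `v2` folder layout and the in-memory manifest are all hypothetical, and a real vault would enforce write-once behaviour in the storage layer rather than in application code.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class VaultSketch:
    """Hypothetical write-once store: a deposit never overwrites, it adds a version."""

    def __init__(self, root: Path) -> None:
        self.root = root
        # Relative stored path -> checksum recorded at deposit time.
        self.manifest: dict[str, str] = {}

    def deposit(self, source: Path, expected_sha256: str) -> Path:
        # Assurance on arrival: compare with the checksum supplied by the depositor.
        if sha256_of(source) != expected_sha256:
            raise ValueError("checksum mismatch: file changed before reaching the vault")
        folder = self.root / source.name
        folder.mkdir(parents=True, exist_ok=True)
        version = len(list(folder.iterdir())) + 1
        target = folder / f"v{version}"
        # Mode "xb" fails if the target already exists, so each version is written once.
        with target.open("xb") as out, source.open("rb") as src:
            shutil.copyfileobj(src, out)
        target.chmod(stat.S_IRUSR)  # owner read-only, so removal takes extra effort
        self.manifest[str(target.relative_to(self.root))] = expected_sha256
        return target

    def audit(self) -> list[str]:
        """Periodic fixity check: stored files whose checksum no longer matches."""
        return [rel for rel, digest in self.manifest.items()
                if sha256_of(self.root / rel) != digest]
```

Depositing the same file name twice yields `v1` and `v2` side by side, and `audit()` returns an empty list for as long as every stored copy still matches the checksum recorded at deposit.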
Feedback on these issues and discussion points is very welcome! We will keep this blog updated as these services develop.
Tony Weir, Head of Unix Section, IT Infrastructure
Stuart Lewis, Head of Research and Learning Services, Library & University Collections.
[Reposted from
In my last blog post, I looked at the four quadrants of research data curation systems. This categorised systems that manage or describe research data assets by whether their primary role is to store metadata or data, and whether the information is for private or public use. Four systems were then put into these quadrants.
The University of Edinburgh already has two active services from this diagram: PURE, our Current Research Information System, and DataShare, our open data repository.
This blog post will start to unpack some of the requirements for a Data Asset Register.
The first aspect to cover is its name. What should it be called? Traditionally systems like this, which only hold metadata records that either just describe, or describe and point to other resources, are known as registers, catalogues, directories, indexes, or inventories.
The University already has a ‘Data Catalogue’, maintained by the Data Library. However this list has a different purpose: to hold details of external data. Oxford University, rather than opting for a name such as this, has chosen to name its service after the verb ‘find’: DataFinder. Whilst there may be some brand or service name applied to the system we create at the University of Edinburgh, for now its working title is ‘Data Asset Register’, as one of its main functions will be to allow data creators to ‘register’ their data assets by describing them and, if the data is published online, linking to the data.
But what should the Data Asset Register provide? The following diagram shows some early thoughts:
The diagram splits this up into three broad areas:
Description – what the asset register should describe
Functions – the functions needed to allow data asset description
Services – the value-added services that will add benefit to people who register their data
Description
The core purpose of the system is to describe data. This is split into two categories: being able to describe single items or data assets, and describing collections of data assets. Many data assets are created on their own, for example a population health longitudinal study. As such, this should be described on its own. In contrast, some data are created in large sets, where it isn’t necessarily useful to describe every part of that set on its own. In this case, the collection as a whole can be described. A good example of this is the Research Data Australia service from the Australian National Data Service.
We’ll need to decide how to describe the data. A likely initial candidate will be the DataCite Metadata Schema, but we may find this needs to be extended to cover extra elements relevant to the University or the discipline of the data asset being described. There will also be requirements coming from a possible UK research data registry, development of which is being led by the Digital Curation Centre.
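As a rough illustration of what a register record based on a DataCite-style description might look like, here is a small Python sketch. The five required fields follow what we understand to be the long-standing mandatory DataCite properties (identifier, creator, title, publisher, publication year); this should be checked against the current schema, which may require more. The class name, the `extra` field for local or discipline-specific elements, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumed set of required properties; verify against the DataCite schema in use.
REQUIRED = ("identifier", "creators", "title", "publisher", "publication_year")


@dataclass
class DataAssetRecord:
    identifier: str = ""
    creators: list[str] = field(default_factory=list)
    title: str = ""
    publisher: str = ""
    publication_year: int = 0
    # Hypothetical extension point for University or discipline-specific elements.
    extra: dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> list[str]:
        """Names of required fields that have not been filled in yet."""
        return [name for name in REQUIRED if not getattr(self, name)]


# Example with made-up values: a partly described data asset.
draft = DataAssetRecord(title="Example longitudinal study", creators=["A. Researcher"])
print(draft.missing_fields())
```

A check like this could run when a record is saved, so that a data creator can register an asset early and complete the description later.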
Functions
In order to enable data asset description, a register will need certain functions. So far three have been identified:
CRUD: Create / Read / Update / Delete are the basic functions required when manipulating data. The system should allow records of research data to be created, read later, updated, and if needed, deleted.
User Interface (UI): In order to enable CRUD functionality, a user interface will be required. To be useful, this will need to provide search and display functionality, for example using faceted search and browse.
Log: Some funders have requirements to keep data for certain lengths of time, or for periods of time that must be reset each time a data set is accessed. For this reason each access of a data asset must be logged by the system. An example is from the EPSRC: “Research organisations will ensure that EPSRC-funded research data is securely preserved for a minimum of 10-years from the date that any researcher ‘privileged access’ period expires or, if others have accessed the data, from last date on which access to the data was requested by a third party;”
It may also be that the Data Asset Register can be a front-end for our Data Vault too – more about that in another blog post.
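The reason for logging each access becomes clearer with a small worked example of the retention rule quoted from the EPSRC. The Python sketch below is one reading of that rule, not an authoritative interpretation: we assume the ten-year clock starts at the later of the end of the privileged access period and the most recent third-party access request. The function names are hypothetical.

```python
from datetime import date

RETENTION_YEARS = 10  # minimum period in the quoted EPSRC expectation


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def retain_until(privileged_access_ends: date, third_party_requests: list[date]) -> date:
    """Earliest date on which the retention period could end, under our reading."""
    # Each logged third-party request can push the start of the clock forward.
    start = max([privileged_access_ends, *third_party_requests])
    return add_years(start, RETENTION_YEARS)


# No third-party access: ten years from the end of privileged access.
print(retain_until(date(2014, 1, 1), []))
# A later request resets the clock to the date of that request.
print(retain_until(date(2014, 1, 1), [date(2016, 2, 29)]))
```

Without an access log there would be no list of request dates to pass in, and the retention date could not be recalculated.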
Services
Extra value-added services are required in order to make the Data Asset Register useful to people. Our initial thoughts about these services include the following:
Identify: The ability to assign identifiers to data assets. Some of these identifiers will need to be persistent.
DOI: DataCite DOIs allow DOIs to be assigned to data assets, in the same way that DOIs are assigned to journal articles. This allows them to be persistently identified over time even if they move between systems, and also allows them to be cited using a well-known identifier.
TinyURL: Short URLs such as those provided by TinyURL or bitly are useful to give easy web identifiers to objects. For example it might be nice to be able to issue URLs such as http://data.ed.ac.uk/abcd.
Other: Are there any other identifier systems that we should consider using?
Discover: It is important that the data records held in the Data Asset Register are searchable and can be indexed by external services. This may be by national, international, or discipline-based data aggregators, or by normal web search engines.
Share: Whilst often the data assets will be described online but kept offline by the researcher, they may wish to share the data. The Data Asset Register may need to facilitate this in a number of ways:
Deposit: If the data is held in the Data Vault, along with a description in the Data Asset Register, then using a deposit protocol such as SWORD it would be possible to deposit the data into the institutional data repository, or into an external repository. The Data Asset Register can then record the identifier for the hosted data set.
Redirect: Where the data is hosted online elsewhere, the Data Asset Register could automatically redirect users. For example visiting http://data.ed.ac.uk/abcd could redirect a visitor directly to the repository, rather than showing them just the data asset record description. If the data is not shared openly, then contact details can be provided of the data owner.
RCUK: Some funders, such as the RCUK members (Research Councils UK) require funded journal papers to include “a statement on how the underlying research materials – such as data, samples or models – can be accessed”. The data asset register could facilitate this by automatically writing statements such as “Details about accessing the data referenced in this paper may be found at http://data.ed.ac.uk/abcd”
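The short identifier, redirect and funder statement ideas above fit together quite naturally, as this small Python sketch tries to show. It is only an illustration: the `RegisterEntry` class and both functions are hypothetical, the repository address in the example is made up, and http://data.ed.ac.uk/abcd is simply the example address used in this post rather than a live service.

```python
from dataclasses import dataclass
from typing import Optional

BASE_URL = "http://data.ed.ac.uk/"  # example address from the post, not a real service


@dataclass
class RegisterEntry:
    short_id: str
    title: str
    hosted_url: Optional[str] = None      # where the data is published, if anywhere
    owner_contact: Optional[str] = None   # shown when the data is not shared openly


def resolve(entry: RegisterEntry) -> tuple[int, str]:
    """Return an HTTP-style (status, target) pair for a visit to the short URL."""
    if entry.hosted_url:
        # Data is hosted elsewhere: send the visitor straight to the repository.
        return 302, entry.hosted_url
    # Otherwise show the record description with contact details for the owner.
    return 200, f"Record for '{entry.title}'. Contact: {entry.owner_contact}"


def access_statement(entry: RegisterEntry) -> str:
    """Draft the data access sentence that a funder may expect in a paper."""
    return ("Details about accessing the data referenced in this paper "
            f"may be found at {BASE_URL}{entry.short_id}")


shared = RegisterEntry("abcd", "Example survey", hosted_url="https://repository.example/item/1")
print(resolve(shared))
print(access_statement(shared))
```

The same short identifier serves both purposes: it is what the paper cites, and it is what decides whether a visitor sees a redirect or a record page.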
It is very early days in our thinking about what features a Data Asset Register should offer, and like many components of a modern research data management infrastructure, there are very few existing examples to look at. Our thoughts will be refined over the coming months so that we can start looking at implementation options. Is there an existing system that can do all of this for us, or is it better to build something new, either alone or with collaborators?
Stuart Lewis, Head of Research and Learning Services, Library & University Collections.
[Reposted from
The University of Edinburgh, like many other universities, is currently undertaking extensive work to build infrastructure that supports and enables good practice in the area of Research Data Management. This infrastructure ranges from large-scale research storage facilities to data management planning tools.
One aspect of Research Data Management highlighted in the University’s RDM Roadmap is ‘Data stewardship: tools and services to aid in the description, deposit, and continuity of access to completed research data outputs.’
To help describe how these systems fit together yet how they differ from each other, I use a model with two axes to differentiate what they hold, and who can access them. The first axis is used to differentiate between systems that hold only metadata from those that hold files (typically with some level of metadata), while the second differentiates between private systems and public systems.
Research information and data management and associated systems aren’t a new phenomenon. We have been offering services in these areas for some time. To demonstrate this, we have two existing systems that provide services in two of the areas:
PURE is our Current Research Information System (CRIS). It is a private system for the University to record the research outputs it generates. It therefore falls into the metadata / private quadrant. (It can hold files, and has a public interface, but this is primarily for Open Access publications rather than research data).
DataShare is our open research data repository. It holds and curates data (and associated metadata) for public consumption on behalf of the data creators. It therefore falls into the data / public quadrant.
What about the other two quadrants? Are there systems or infrastructure needed to fill these? Is there a case where we need a public store of metadata about research data, or a private store of finished data sets?
The rest of this blog post will argue that there is a need for these, and will describe two pieces of infrastructure that could fill them. Further blog posts will be written that start to unpick the requirements of these systems in more depth.
Public Metadata:
Not only is it good practice for a research institution to know what research data it is creating, some research funders require us to do so. In addition the University’s RDM policy requires: “Any data which is retained elsewhere, for example in an international data service or domain repository should be registered with the University.”
The following is an extract from the EPSRC’s expectations for research data management: “Research organisations will ensure that appropriately structured metadata describing the research data they hold is published (normally within 12 months of the data being generated) and made freely accessible on the internet; in each case the metadata must be sufficient to allow others to understand what research data exists, why, when and how it was generated, and how to access it. Where the research data referred to in the metadata is a digital object it is expected that the metadata will include use of a robust digital object identifier (For example as available through the DataCite organisation – ).”
This need can be fulfilled by the creation of a Data Asset Register.
Private Data:
Whilst some data will be suitable for public sharing, for various reasons some will not, or will need to have access controlled by the data creator. Therefore there is a need for a safe place for keeping data that will be kept secure, both in terms of access and change. Once lodged/archived there, files should only be accessible by the data creator or data manager, and it should not be possible to change files, but only to create newer versions or to remove/delete them.
This need can be fulfilled by the creation of a Data Vault.
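For readers who find a data structure easier to follow than a diagram, the four quadrants described so far can be written down as a simple lookup. This Python sketch is purely illustrative; the enum and function names are my own, and only the four system names and their placement come from the text.

```python
from enum import Enum


class Holds(Enum):
    """First axis: what the system primarily stores."""
    METADATA = "metadata"
    DATA = "data"


class Access(Enum):
    """Second axis: who the information is for."""
    PRIVATE = "private"
    PUBLIC = "public"


QUADRANTS = {
    (Holds.METADATA, Access.PRIVATE): "PURE",
    (Holds.DATA, Access.PUBLIC): "DataShare",
    (Holds.METADATA, Access.PUBLIC): "Data Asset Register",
    (Holds.DATA, Access.PRIVATE): "Data Vault",
}


def system_for(holds: Holds, access: Access) -> str:
    """Look up which system occupies a given quadrant."""
    return QUADRANTS[(holds, access)]


print(system_for(Holds.DATA, Access.PRIVATE))
```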
Systems however do not live in isolation, and become more powerful, more useful, and more likely to be used if they are able to integrate with each other. With the ever-growing number of ‘systems’ provided by a large research-intensive university, the last thing that a research data management programme wants to do is to introduce further systems that need to be fed with duplicate information. This means that some or all of the components will need to be integrated together.
There are three obvious integrations between these systems, as shown below:
First, because PURE is the master system for holding data and relationships about research outputs (THIS grant, funded THAT piece of equipment, which was used to create THIS data set, that was described in THESE journal articles), records of data sets need to exist within it. However if some or all of these are being created in the Data Asset Register, then they will need to be pushed into PURE. Equally if some data are being registered directly in PURE, it will be useful to pull this out of PURE and into the Data Asset Register.
Secondly, because the Data Asset Register may become the main user interface for entering details of data sets, it could also be the main administrative user interface for uploading files into the Data Vault. If that is the case, then the Data Asset Register and the Vault will need to be integrated.
Finally, for instances where metadata is held in the Data Asset Register, corresponding files are held in the Data Vault, and the data owner decides to make the data openly available, then the Data Asset Register should be able to deposit these as a new item in the Data Repository.
The next challenge will be to describe the requirements for the Data Vault and Data Asset Register. We have some early thoughts about this, and will share these in future blog posts.
Related blog posts:
Thinking about Research Data Asset Registers
Thinking about a Data Vault
Stuart Lewis, Head of Research and Learning Services, Library & University Collections.