6 Before we start

Let’s imagine that we need a dataset for our research. There are several things we have to do before we get too far with it.

6.1 Data management plan

It is becoming more common for funders to demand a data management plan, either in the application or soon after it. It is a document that specifies how the data is handled during and after the project. It thereby covers what data is used, how it is processed, and how it is stored afterwards. It usually includes the following (adapted from Aalto University’s Data Management Plan page):

  • How the data is collected or what existing data is used
  • How the data will be stored and processed by the team
  • How ownership and user rights are arranged
  • Whether the data will be open and, whether open or closed, under which conditions and licenses
  • How the research ethics and information security are taken into account
  • What the costs of data management are and how much time it needs

It is also often emphasized that the data management plan should live along with the project and be updated when something is understood better. It is probably a good idea to version the plan and to document what kind of changes you have introduced during the project. This will also help you when you do it the next time.

6.2 Thinking of publishing starts here

Where the data will be stored and made available for further research should be specified at this point already, if possible. Some repositories may be able to give you very early information about the processes they require for the data they can take in. For example, archives may want very specific documentation and require certain forms. Some archives also have a procedure where they issue a statement that they are willing to archive the data from your project.

What you absolutely have to do is make sure that the policies outlined by the archive and by your own organization are compatible with each other. Your organization, usually a university, probably also has some recommendations or rules for data management. These should be read closely and followed. But it is also very possible that in your own specific field the practices are still emerging.

If you plan to publish the data openly somewhere, you also have to think about how anonymization or pseudonymization will be performed. Again, some data types, such as audio or video recordings, cannot be anonymized in a sensible manner.
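To make the idea of pseudonymization a bit more concrete, here is a minimal sketch of one possible approach: replacing speaker names in metadata with stable codes derived from a keyed hash. The record fields, the names, the `SPK_` prefix and the key are all invented for this illustration, and this is only one way among many of doing it, not a recommended procedure for any particular archive.

```python
import hashlib
import hmac

# Hypothetical secret key. In practice it would be kept separately from the
# data and never published, since whoever holds it can re-link pseudonyms
# to names by hashing candidate names.
SECRET_KEY = b"replace-with-a-project-specific-secret"


def pseudonym(name: str, key: bytes = SECRET_KEY, length: int = 8) -> str:
    """Return a stable pseudonym for a name using a keyed hash (HMAC-SHA256)."""
    # Normalize so that trivial spelling variants map to the same pseudonym.
    normalized = name.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return "SPK_" + digest[:length]


# Invented metadata records for the example.
records = [
    {"speaker": "Anna Example", "recording": "rec001.wav"},
    {"speaker": "anna example ", "recording": "rec002.wav"},
    {"speaker": "Ben Sample", "recording": "rec003.wav"},
]

# Replace the name in every record, leaving the other fields untouched.
pseudonymized = [
    {"speaker": pseudonym(r["speaker"]), "recording": r["recording"]}
    for r in records
]

for r in pseudonymized:
    print(r["speaker"], r["recording"])
```

Note that this only touches the metadata. The recordings themselves still contain the speakers’ voices, which is exactly why audio and video are so hard to anonymize. Pseudonymized data is also not anonymous data: as long as the key or a name list exists somewhere, the link can be restored.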

6.3 How big are we going?

A bit above, when discussing the data management plan, we already noted that it is important to estimate how much time data management needs. This, at least in the humanities, is often connected to how much it costs. Our data is usually not tied to very expensive equipment, and we often do rather qualitative analysis and annotation, so a lot tends to boil down to how much time something takes.

How large a dataset you create is thereby really dependent on how big your team is. It’s easy to see what this means when you are working alone! If you have to manage and publish a huge database and also do your research, you may find yourself in a pretty tight spot. This is one reason why we will discuss quite thoroughly the benefits of trying to use existing datasets. You simply have less to do with data management, and can focus on further work on the materials and on your own research.
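The relation between dataset size, team size and time can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration: the function name and all the figures (number of items, minutes per item, monthly hours available for data work) are invented, and it assumes that work divides evenly between people, which is optimistic.

```python
def annotation_months(items: int, minutes_per_item: float,
                      people: int, hours_per_month: float) -> float:
    """Roughly estimate the calendar months needed to annotate `items` units.

    Assumes the work splits evenly across `people`, each of whom can spend
    `hours_per_month` hours on the data.
    """
    if people < 1 or hours_per_month <= 0:
        raise ValueError("people must be >= 1 and hours_per_month > 0")
    total_hours = items * minutes_per_item / 60
    return total_hours / (people * hours_per_month)


# Invented figures: 10,000 utterances, 3 minutes each,
# 40 hours per person per month available for data work.
solo = annotation_months(10_000, 3, people=1, hours_per_month=40)
team = annotation_months(10_000, 3, people=5, hours_per_month=40)

print(f"alone: {solo:.1f} months, team of five: {team:.1f} months")
# prints "alone: 12.5 months, team of five: 2.5 months"
```

With these made-up numbers, working alone would mean more than a year of annotation on top of the actual research. In a real team the gain would be smaller than the simple division suggests, since agreeing on conventions and checking each other’s work also take time.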

When there is a team working with the dataset, things are different. With several specialists who know the data type, the content and the best practices, we can create very impressive new datasets in a reasonable time, which still often means years. We have thereby had plenty of research projects that create new datasets and do exclusively that. The result of the project is a new corpus or resource.

At this point I have grown a little bit suspicious of this approach. I think datasets that are not used in research may not always be entirely battle tested. And this is a problem. How can we know that the conventions of the dataset are actually useful if nobody has seriously used it in any research? Was the sampling reasonable? The same question can be asked about the tens of little details we have to decide. Still, I do admit that when we are speaking about really large datasets, large projects and teamwork are the way to go.

And, as a final word of caution, a single PhD student should usually not be responsible for the project’s data management while also doing their PhD. This is a common recipe, but I think lots of alarms should start ringing under this arrangement. Generally, the data management of a project should not be in the hands of one individual, as it stretches into everything we do with the data as a team and should connect to everyone’s work. At the same time, the practices should not only be dictated from above, but be discussed in detail with the entire team, based on current experiences and the tools available.

6.4 How narrow are we staying?

Besides the size of the dataset, there is also the question of breadth. Even a small dataset, if annotated in great detail, can become a lot of work. At the same time, some small datasets are really suitable for only a few individual studies. I would never say for just one, as replication and changing the angle can always be done with real benefits, but it is true that some datasets are wider than others. For example, a general corpus of a language can suit countless different studies, whereas a small corpus of specific pronounced words may have been created for just one specific purpose. These examples are also related in that the latter could often have been derived from the former.
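The last point, that a narrow dataset can often be derived from a wider one, can be sketched in a few lines. The token records, field names and target words below are invented for the example. A real corpus would have its own format and would need its own reader.

```python
# A toy "general corpus": invented token records, each pointing to a
# hypothetical audio file and the time span where the word is pronounced.
general_corpus = [
    {"word": "the",   "speaker": "S1", "audio": "rec001.wav", "start": 0.10, "end": 0.22},
    {"word": "House", "speaker": "S1", "audio": "rec001.wav", "start": 0.22, "end": 0.71},
    {"word": "is",    "speaker": "S1", "audio": "rec001.wav", "start": 0.71, "end": 0.85},
    {"word": "water", "speaker": "S2", "audio": "rec002.wav", "start": 1.30, "end": 1.82},
    {"word": "house", "speaker": "S2", "audio": "rec002.wav", "start": 2.05, "end": 2.49},
]

# The specific words a narrower study might be interested in.
target_words = {"house", "water"}


def derive_subcorpus(corpus, targets):
    """Keep only the tokens whose lowercased word form is in `targets`."""
    return [token for token in corpus if token["word"].lower() in targets]


word_corpus = derive_subcorpus(general_corpus, target_words)

for token in word_corpus:
    print(token["word"], token["speaker"], token["audio"], token["start"], token["end"])
```

The derived corpus keeps the references back to the recordings, so it can stay linked to the general corpus it came from, rather than becoming a separate resource to manage.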