Statistical processing
Contact info
Science, Technology and Culture: Anne-Sofie Dam Bjørkman
+45 20 37 54 60
The statistics are compiled annually on the basis of questionnaires. The collected data are validated carefully, with focus on a number of prioritized variables, notably the R&D expenditure. This is performed at macro as well as micro level.
Source data
The statistics are compiled on the basis of questionnaires. Separate questionnaires are used for university hospitals and for other institutions.
Frequency of data collection
Annual.
Data collection
Web-questionnaire. A paper version is available on request.
Data validation
A number of variables have been chosen for intensive data validation, based on experience of their impact on the total results.
The variables prioritized for the detailed data validation are:
- R&D labour costs
- Other R&D current costs
- R&D expenditures in total
- R&D personnel in full-time equivalents
- Average labour costs per person (full-time equivalents). This variable is calculated from R&D labour costs and R&D personnel in full-time equivalents, as sketched below.
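As a minimal illustration of the derived control variable (not the production code, which is written in SAS), the calculation could look like this; the function name and the figures are hypothetical:

```python
# Hypothetical sketch of the derived control variable "average labour costs
# per person (FTE)"; the actual processing is implemented in SAS.
def average_labour_cost_per_fte(rd_labour_costs, rd_personnel_fte):
    """Return labour costs per full-time equivalent, or None if FTE is missing/zero."""
    if not rd_personnel_fte:
        return None
    return rd_labour_costs / rd_personnel_fte

# Fictitious example: 12.5 million DKK in labour costs over 20.8 FTE
print(average_labour_cost_per_fte(12_500_000, 20.8))  # ~601,000 DKK per FTE
```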
The variables are numerical, and errors in the reported data may have considerable effects on the totals. Hence, they are given special attention in the data validation process.
The remaining variables are also validated. In those cases, good estimates for the corrections are pro rata calculations (for relative distributions) or the reports from the preceding year.
In the micro validation, each individual report is validated at several levels:
- at the level of variables
- at the level of combinations of variables (cross-checks)
- at the level of time (years)
- comparison with other data sources
Each variable is validated against its definition. This is followed by cross-checks combining relevant variables. If available, information from historical reports is used to validate consistency over time.
Basically, a distinction is made between two types of errors: logical errors (where there is a contradiction in the data for comparable questions) and potential errors (where an error is probable but not necessarily real).
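The distinction can be illustrated with a simplified check. The field names and the plausibility bounds below are assumptions made for the example, not the actual validation rules:

```python
# Simplified micro validation sketch; field names and plausibility bounds
# are illustrative assumptions, not the production rules.
def validate_report(report):
    flags = []

    # Logical error: cost components contradicting the reported total.
    current_costs = report["labour_costs"] + report["other_current_costs"]
    if current_costs > report["total_expenditure"]:
        flags.append("LOGICAL: current cost components exceed total R&D expenditure")

    # Potential error: probable but not necessarily real, e.g. an implausible
    # average labour cost per FTE that has to be confirmed with the respondent.
    if report.get("personnel_fte"):
        avg = report["labour_costs"] / report["personnel_fte"]
        if not 200_000 <= avg <= 1_500_000:  # DKK per FTE, assumed bounds
            flags.append("POTENTIAL: average labour cost per FTE looks implausible")

    return flags

# Fictitious report triggering both kinds of flags
print(validate_report({"labour_costs": 9_000_000, "other_current_costs": 4_000_000,
                       "total_expenditure": 12_000_000, "personnel_fte": 4.0}))
```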
Macro validation is based on a comparison of each report with the aggregates. The validation is performed using the Banff software developed by Statistics Canada.
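The principle of the macro validation can be sketched as follows: each report's contribution to the aggregate is inspected, and dominant or strongly deviating reports are flagged for manual review. The 10 per cent threshold below is purely illustrative; the actual checks are carried out in Banff.

```python
# Sketch of the macro validation principle: flag reports that account for a
# large share of the aggregate. The 10 % threshold is illustrative only; the
# production checks are performed with Statistics Canada's Banff software.
def flag_influential_units(expenditure_by_unit, threshold=0.10):
    total = sum(expenditure_by_unit.values())
    return [unit for unit, value in expenditure_by_unit.items()
            if total and value / total > threshold]

reports = {"unit_A": 420.0, "unit_B": 35.0, "unit_C": 12.5}  # fictitious DKK millions
print(flag_influential_units(reports))  # ['unit_A']
```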
Various forms of imputation are used:
- Imputation of missing values
- Pro rata imputation
- Imputation for missing reporting
Imputation of missing values is applied for several questions, primarily those representing relative distributions of R&D activities by Fields of Science, purpose and strategic area. In general, values are imputed using information from the preceding year. In cases where reliable data on labour costs are not available, an estimate is imputed based on the number of R&D personnel.
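A minimal sketch of the two rules just described; the field names and the assumed cost rate per FTE are hypothetical, and the production imputation runs in SAS:

```python
# Sketch of the missing-value rules above; field names and the assumed cost
# rate per FTE are hypothetical.
def impute_distribution(current, previous_year):
    """Carry the preceding year's relative distribution forward when missing."""
    return dict(current) if current else dict(previous_year)

def impute_labour_costs(labour_costs, personnel_fte, assumed_cost_per_fte=600_000):
    """Estimate labour costs from the number of R&D personnel (FTE) when no
    reliable figure is reported."""
    if labour_costs is not None:
        return labour_costs
    return personnel_fte * assumed_cost_per_fte

# Example: no distribution reported this year, so last year's shares are reused
print(impute_distribution(None, {"basic_research": 60, "applied_research": 40}))
print(impute_labour_costs(None, 8.0))  # 4,800,000 DKK estimated from 8 FTE
```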
Pro rata imputation is applied primarily to the above-mentioned questions representing relative distributions, in cases where the shares do not sum to 100. Furthermore, pro rata imputation is used if R&D expenditure covered by external sources exceeds the total expenditure.
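Pro rata imputation simply rescales the reported shares so that they sum to 100; a sketch with fictitious figures:

```python
# Pro rata imputation sketch: rescale reported shares so they sum to 100.
def pro_rata(shares):
    total = sum(shares.values())
    if not total:
        return shares
    return {key: 100 * value / total for key, value in shares.items()}

# Fictitious distribution by Fields of Science that only sums to 90
print(pro_rata({"natural_sciences": 40, "engineering": 30, "medical_sciences": 20}))
# {'natural_sciences': 44.4..., 'engineering': 33.3..., 'medical_sciences': 22.2...}
```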
Imputation for missing reporting, i.e. cases where the full questionnaire is filled in automatically because of non-response even after several re-contacts, occurs very seldom due to the very high response rate.
Statistics Denmark has established agreements with local contacts at the universities and university hospitals who can provide further information at a very detailed level if necessary. The contacts also collect the data for their institution, which ensures efficient contact with the relevant statistical units that might otherwise cause problems. The procedure streamlines the data collection process for institutions with multiple statistical units and lays the ground for a high degree of continuity.
The population is constructed on the basis of information from the preceding year. New units are primarily identified through information on structural changes reported by the contacts at the universities and university hospitals. For the remaining part of the population, similar information is used, and a survey of potentially relevant new units is also carried out.
Data compilation
The main aim of the data collection is to produce statistics on research and development (R&D) activities in the Danish public sector. The data have to:
- Be accurate and cover the R&D activities in the public sector and the private non-profit (PNP) sector as a whole.
- Be suitable for statistics at a more detailed level as well.
- Burden the respondents as little as possible.
The statistics are based on a census of units in the public sector and the PNP sector performing R&D activities. Reporting of data is mandatory according to the Act on Statistics Denmark, paragraphs 6 and 8.
The aim of Statistics Denmark is that data are reported digitally. This is provided via the web portal http://www.Virk.dk. Besides this, most data from universities and university hospitals are also reported in electronic format, notably spreadsheets. Finally, the questionnaire is available in a paper version on request.
The aim of the further data treatment is to process the collected data and validate them for errors and missing values, bringing the data to a quality adequate to give a true picture of the R&D activities, also over time.
Due to the size of the questionnaire (600 variables), the data management process is comprehensive. As the number of respondents exceeds 700 units, the first set of results comprises more than 420,000 cells and hence a considerable number of potential errors or misunderstandings in the reported information.
This implies that:
- Data validation as well as data correction is carried out automatically as far as possible. Since many of the questions are interrelated, the corrections need to be performed in a planned and systematic manner.
- It is necessary to take into consideration that some questions have a greater impact on the overall picture than others, and to perform a prioritized validation accordingly (see the sketch below). The validation tools are based on the statistical software SAS.
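A sketch of what a planned, prioritized correction order might look like; the step names are illustrative assumptions, and the production pipeline is implemented in SAS:

```python
# Illustrative ordering of automatic checks and corrections: prioritized
# variables first, then derived controls and the relative distributions.
# Step names are assumptions; the production pipeline is written in SAS.
VALIDATION_ORDER = [
    "labour_costs",
    "other_current_costs",
    "total_expenditure",
    "personnel_fte",
    "avg_labour_cost_per_fte",   # derived control variable
    "relative_distributions",    # pro rata corrections last
]

def run_validation(report, checks):
    """Apply each check/correction function to the report in a fixed order."""
    for step in VALIDATION_ORDER:
        report = checks[step](report)
    return report
```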
Many of the questions deal with the same issues. This provides a basis for internal verification within the questionnaire itself. At the same time, it is a source for identifying errors across questions containing contradictory information.
Adjustment
No further adjustments are made.