Statistical processing
Contact info
Science, Technology and Culture, Business Statistics
Mille Wilhjelm Poulsen
+45 40 18 78 40
The statistics are based on questionnaires and are collected annually from approximately 650 public and private non-profit units, which together are assumed to carry out all significant R&D in the public sector. The reported data undergo extensive validation, focusing on a number of high-priority variables, particularly R&D expenditure and R&D full-time equivalents. Validation is performed at both the macro and micro level.
Source data
The statistics are questionnaire-based and are collected from approximately 650 public and private non-profit units, which together are assumed to carry out all significant R&D in the public sector. Data are collected using two separate questionnaires: one for university hospitals and one for other reporting units (primarily universities).
Frequency of data collection
Annual.
Data collection
Web questionnaire. A paper version is available on request.
Data validation
A number of variables have been selected for intensive data validation, based on experience of their impact on the total results.
The variables prioritized for the detailed data validation are:
- R&D labour costs
- Other R&D current costs
- R&D expenditures in total
- R&D personnel in full time equivalents
- Average labour costs per person (full-time equivalent). This variable is calculated from R&D labour costs and R&D personnel in full-time equivalents (see the sketch below).
- External funding
- Distribution by field and type of research
The variables are numerical, and errors in the reported data may have considerable effects on the totals. Hence, they are given special attention in the data validation process.
The remaining variables are also validated. In those cases, good estimates for the corrections are pro-rata calculations (in the case of relative distributions) or the reports from the preceding year.
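As an illustration of such a derived control variable, the following is a minimal sketch in Python. The field names and the acceptance interval are hypothetical, and the production validation itself is implemented in SAS (see Data compilation).

```python
# Illustrative only: the production validation tools are SAS-based.
# Field names and the acceptance interval are hypothetical.

def average_labour_cost(report):
    """Derived control variable: R&D labour costs per full-time equivalent."""
    fte = report.get("rd_fte")
    costs = report.get("labour_costs_dkk_1000")
    if not fte or costs is None:          # avoid division by zero and missing data
        return None
    return costs / fte

report = {"labour_costs_dkk_1000": 45_000, "rd_fte": 60.0}
avg = average_labour_cost(report)         # 750 thousand DKK per FTE

# Values far outside a plausible range are flagged for manual follow-up.
if avg is not None and not 300 <= avg <= 1_200:
    print(f"Follow up: average labour cost per FTE is {avg:.0f} thousand DKK")
```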
In the micro validation, the individual reporting is validated at several levels:
- at the level of variables
- at the level of crossing variables
- at the level of time (years)
- comparison with other data sources. This can, for example, include the institutions’ annual reports or grant awards, which can be compared with external funding.
Each variable is validated against its definition. This is followed by cross-checking combinations of relevant variables. If available, information from historical reporting is included in the consistency validation.
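A minimal sketch of what checks at these levels could look like is given below. The variable names and thresholds are hypothetical and the Python code is purely illustrative; the actual validation is performed with SAS-based tools.

```python
# Illustrative sketch of the micro-validation levels; names and thresholds are hypothetical.

def validate_report(current, previous=None, external=None):
    flags = []

    # Level of variables: each value is checked against its definition (non-negative here).
    for var in ("labour_costs", "other_current_costs", "capital_costs"):
        if current.get(var, 0) < 0:
            flags.append(f"{var} is negative")

    # Level of crossing variables: cost components should not exceed the reported total.
    components = sum(current.get(v, 0) for v in
                     ("labour_costs", "other_current_costs", "capital_costs"))
    total = current.get("total_rd_expenditure", 0)
    if components > total:
        flags.append("sum of cost components exceeds total R&D expenditure")

    # Level of time: large year-on-year changes are flagged for follow-up.
    if previous and previous.get("total_rd_expenditure") and total:
        ratio = total / previous["total_rd_expenditure"]
        if not 0.5 <= ratio <= 2.0:
            flags.append(f"total expenditure changed by a factor of {ratio:.2f}")

    # Comparison with other sources, e.g. grant awards versus reported external funding.
    if external and current.get("external_funding", 0) > 1.5 * external.get("grant_awards", 0):
        flags.append("reported external funding is much higher than known grant awards")

    return flags
```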
Two basic types of errors are distinguished:
- Certain errors - These can include, for example, formatting errors, technical errors, or logical errors. Certain errors are usually identified quickly during the initial stages of the validation process. An example from the statistics could be a “thousands” error, where the reporter has entered costs without taking into account that the question asks for figures in thousands.
- Probable errors - These are, for example, observations that fall outside predefined acceptance levels, observations that appear unlikely when compared with historical data, or observations whose impact on the total is improbably high. Probable errors are usually identified during selective validation. An example from the statistics could be a reporter receiving a large external grant that they have not received in previous years. This is not necessarily an error, but the institution will most likely be contacted to ensure that the reported data are correct.
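The sketch below illustrates how the two error types could be detected programmatically; the ratios and thresholds are hypothetical and only meant to show the logic.

```python
# Illustrative detection of the two error types; thresholds are hypothetical.

def classify(value, last_year=None):
    if last_year:
        ratio = value / last_year
        # Certain error: a "thousands" error typically shows up as a value roughly
        # 1000 times larger than the comparable figure from the previous year.
        if 500 <= ratio <= 2000:
            return "certain error (suspected thousands error)"
        # Probable error: outside the acceptance level; not necessarily wrong,
        # but the institution is contacted to confirm the figure.
        if not 0.5 <= ratio <= 2.0:
            return "probable error (confirm with the reporter)"
    return "accepted"

print(classify(45_000_000, last_year=44_000))  # certain error (suspected thousands error)
print(classify(120_000, last_year=44_000))     # probable error (confirm with the reporter)
print(classify(46_000, last_year=44_000))      # accepted
```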
Macro validation is based on a comparison of each reporting with the aggregates. The validation is performed using the Banff software, which is developed by Statistics Canada.
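The sketch below illustrates the idea behind macro validation, namely identifying the reports that contribute most to a change in the aggregate. It is not the Banff software, and all figures are made up.

```python
# Illustration of the macro-validation idea (not the Banff software); figures are made up.
this_year = {"unit_A": 1200, "unit_B": 800, "unit_C": 350}   # reported totals, current year
last_year = {"unit_A": 1150, "unit_B": 790, "unit_C": 150}   # reported totals, previous year

total_change = sum(this_year.values()) - sum(last_year.values())
contributions = {u: (this_year[u] - last_year.get(u, 0)) / total_change for u in this_year}

# The unit accounting for most of the movement in the aggregate is reviewed first.
for unit, share in sorted(contributions.items(), key=lambda x: -abs(x[1])):
    print(f"{unit}: {share:.0%} of the change in the aggregate")
```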
Various forms of imputation are used:
- Imputation of missing values
- Pro rata imputation
- Imputation for missing reporting
Imputation of missing values is applied to several questions, primarily those representing relative distributions of R&D activities by fields of science, purpose and strategic area. In general, values are imputed using information from the preceding year. In cases where reliable data on labour costs are not available, an estimate is imputed based on the number of R&D personnel in full-time equivalents (a sketch follows the examples below).
Example:
- No breakdown of R&D activity by purpose is provided, but this can be retrieved from the same unit for the year preceding the reference year.
- Total R&D personnel and full-time equivalents are provided, but salary information is missing.
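A minimal sketch of these two imputation cases is shown below; the field names and the average cost figure are hypothetical.

```python
# Illustrative sketch of the two imputation cases above; field names are hypothetical.

def impute_purpose_distribution(current, previous):
    """Missing breakdown by purpose: carry forward last year's distribution."""
    if not current.get("purpose_distribution"):
        current["purpose_distribution"] = dict(previous["purpose_distribution"])
    return current

def impute_labour_costs(current, avg_cost_per_fte):
    """Missing salary information: estimate it from the reported full-time equivalents."""
    if current.get("labour_costs") is None and current.get("rd_fte"):
        # avg_cost_per_fte would in practice be derived from the unit's previous
        # reporting or from comparable units.
        current["labour_costs"] = round(current["rd_fte"] * avg_cost_per_fte)
    return current

unit = {"rd_fte": 60.0, "labour_costs": None, "purpose_distribution": {}}
unit = impute_purpose_distribution(unit, {"purpose_distribution": {"health": 70, "energy": 30}})
unit = impute_labour_costs(unit, avg_cost_per_fte=750)   # thousand DKK per FTE, hypothetical
```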
Pro-rata imputation is applied primarily to the above-mentioned questions representing relative distributions, in cases where the sums do not equal 100. Furthermore, pro-rata imputation is used if R&D expenditures covered by external sources exceed the total expenditures (see the sketch after the example below).
Example:
- A breakdown of R&D activity by purpose is provided, but the total does not sum to 100.
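The pro-rata correction can be sketched as a simple rescaling; the categories and figures below are made up.

```python
# Illustrative pro-rata correction: rescale a reported distribution so it sums to 100.
reported = {"basic research": 30, "applied research": 45, "development": 35}   # sums to 110
total = sum(reported.values())
rescaled = {k: round(100 * v / total, 1) for k, v in reported.items()}
# {'basic research': 27.3, 'applied research': 40.9, 'development': 31.8}
```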
Imputation for missing reporting, i.e. cases where the full questionnaire is filled in automatically due to non-response even after several re-contacts, occurs very seldom because of the very high response rate.
Statistics Denmark has agreements with local contacts at most universities and larger university hospitals, who provide supplementary information as needed for each institution. In many cases, these contacts also handle the collection of data from the units within their institution, ensuring efficient communication with individual statistical units that might otherwise be difficult to reach. This procedure also helps streamline the reporting process for institutions with multiple statistical units and preserves institutional experience with reporting to the statistics to the greatest extent possible.
The population is constructed based on the population from the previous reference year. New units are identified, and units that have ceased operations or are no longer active in research are excluded, primarily by asking the main contacts about developments within the relevant institutions. The quality of the population outside these main contacts is maintained through ongoing contact with reporters and continuous monitoring of potential new units. It should be noted, however, that the coverage of the population is generally considered most reliable for universities and university hospitals, which are covered by the main contacts.
Data compilation
The main aim of the data collection is to produce statistics on research and development (R&D) activities in the Danish public sector. The data have to:
- Be accurate and cover the R&D activities in the public sector and the private non-profit (PNP) sector as a whole.
- Be suitable for statistics at more detailed levels as well.
- Burden the respondents as little as possible.
The statistics are based on a census of the expected research-active institutions. The reporting of data is mandatory according to the Act on Statistics Denmark, paragraphs 6 and 8.
The aim of Statistics Denmark is that the reporting of data is digital. This is provided via the web portal http://www.Virk.dk. Besides this, most data for universities and university hospitals are also reported in electronic format, notably spreadsheets. Finally, the questionnaire is available in a paper version on request.
The aim of the further data treatment is to validate the collected data for errors and missing data, bringing it to a quality adequate to give a true picture of the R&D activities, also over time.
Due to the large questionnaire (600 variables), the process of data management is comprehensive. As the number of respondents is approximately 650 units, this gives a first set of results of more than 400,000 cells and hence a considerable number of potential errors or misunderstandings in the reported information.
This implies that:
- Data validation as well as data correction is, as far as possible, carried out automatically. Since many of the questions are interrelated, the corrections need to be performed in a planned and systematic manner.
- It is necessary to take into consideration that some questions have a greater impact on the overall picture than others, and to perform a prioritized validation accordingly. The tools for the validation are based on the statistical software SAS.
Many of the questions deal with the same issues. This provides the basis for internal verification within the questionnaire itself. At the same time, it is a source for identifying errors across questions containing contradictory information.
Adjustment
No further adjustments are made.