Values researchers commonly derive country-level measures of shared values constructs by aggregating scores on individual-level belief variables. However, different composition models may underlie this approach, and it remains to be seen whether consensus in beliefs among individuals from the same country is large enough to warrant aggregation. To examine these issues, I first outline a typology of composition models, developed by Chan (1998), for specifying the functional relationships among constructs at different levels of analysis. Then, I show that statistical interrater agreement measures for a small selection of constructs in the 2017 wave of the European Values Study provide weak evidence for warranting aggregation of individual-level variables for deriving country-level constructs. I compare these results to those from a response data-based weighted mean aggregation method described by Van Bruggen et al. (2002) that incorporates the degree of agreement among respondents’ responses into the calculation of the country-specific mean scores on the constructs. For three out of four constructs considered, the ranking of some countries on the construct concerned changes considerably. However, the correlation between the unweighted and weighted country rankings is still high, which suggests that the position of countries can be estimated with sufficient reliability.
Loek Halman and I go way back when it comes to studying values in a cross-national context. We often discussed the workings, possibilities, and limitations of multilevel analysis for cross-national values research, and – at times even more interesting – Loek’s enormous endeavours to make each new wave of the European Values Study (EVS) a success. One of the empirical issues in values research that intrigued Loek in our discussions is the role of “heterogeneity” in beliefs within the context of cross-national research. Together with Veerle Draulans, he studied heterogeneity in beliefs as a country-level characteristic to be explained in a country-level analysis (Halman & Draulans, 2004). Their approach to operationally defining a central country-level construct in that contribution – as the within-country dispersion of scores – relates to the topic of the current chapter: composing a group-level variable from individual-level data. However, rather than focusing on the dispersion of beliefs among the inhabitants of a country, I focus on the probably more familiar and popular approach of composing contextual-level constructs from individual-level survey data, using the degree of within-group agreement of scores as a criterion for aggregation.
Aggregation of individual-level scores of people’s beliefs to the regional and country level plays an important descriptive role in values research, as it allows us to map the cultural diversity between the populations of such higher-order units. In explanatory research, developing such group-level constructs may be of interest to values researchers because one may want to test whether a predictor has a differential effect at both levels of the hierarchy (Enders & Tofighi, 2007). Researchers commonly use the individual-level scores and the aggregated group scores (for countries, regions, or lower levels such as neighbourhoods) as predictors in the model. Including the predictor at multiple levels allows testing whether it has a differential effect at the individual level (i.e., a composition effect) and the group level (i.e., a contextual effect). For example, Van Oorschot, Arts, and Halman (2005) included a measure of ethnocentrism at both the individual and country level as a predictor of informal solidarity. Oberwittler (2004) used a measure of violence tolerance as an individual- and neighbourhood-level predictor of serious juvenile offending.
Even though the group-level predictor is an aggregate of the individual scores within each cluster, it is not necessarily true that both share the same meaning or measure the same construct (Firebaugh, 1978). Chan (1998) has argued that researchers who apply multilevel analysis should use explicit composition models ‘for the development and validation of constructs that specify the functional relationships among phenomena or constructs at different levels of analysis that reference essentially the same content but that are qualitatively different at different levels’ (Chan, 1998, p. 234). The idea of using composition models is that they force the researcher to precisely describe how constructs at different levels of analysis are functionally related to each other, with the benefits of conceptual precision and parsimony.
By its design, the EVS has brought forward over many decades an enormous number of descriptive and explanatory empirical studies in which aggregation of individual-level belief scores to the regional or country-level is a crucial operational procedure. These studies commonly specify a straightforward functional relationship between variables at the different levels of analysis, namely, as a simple unweighted sum or mean of the scores on the individual-level variable to represent the value on the regional or country-level variable. However, different composition models, each with their methodological requirements, may underlie this approach.
To clarify this, I will first outline a seminal typology of composition models for specifying the functional relationships among constructs at different levels of analysis, as developed by Chan (1998). As we will see, whether one must examine levels of agreement within countries depends on the composition model used; for a selection of constructs from the EVS 2017, I therefore map the degree of within-country agreement. Then, I determine the country-specific means of these constructs using a method of estimation that explicitly incorporates the degree of agreement among citizens into these aggregated measures, and I compare these to the ‘standard’ unweighted country-mean approach. By comparing countries’ positions on the agreement-unrelated measure to their positions on the agreement-related measure, I determine whether accounting for the degree of agreement within countries changes the relative position of countries on the values constructs under consideration. Finally, I reflect on the implications of the findings and the usage of different composition models for empirical values research.
Chan (1998) developed a typology of composition models that is concerned with multilevel research situations in which scores on a construct of some lower-level units (e.g., individuals) are used to generate a group-level construct that is of an aggregate or collective nature. Chan’s (1998) typology distinguishes five ideal types of composition models: additive, direct consensus, referent-shift consensus, dispersion, and process composition. The additive and direct consensus model are most relevant for the current practice in which regional or country-level values constructs are developed for descriptive and explanatory values research.
When values researchers use an additive composition procedure, the meaning of a country-level construct is a summation of the individual-level scores within a country, regardless of the variance among these individuals. Importantly, in additive composition models, the variance of lower-level units is of no theoretical or operational concern for composing the regional or country-level construct from the individual-level construct. In such an additive conceptualization, a values researcher assumes that all countries in the sample are characterized by a cultural climate that can be evaluated as low or high on various dimensions, irrespective of the within-country individual agreement. Glick (1985, p. 605) discussed this issue for multilevel organizational research, and he pointed out that with an additive composition model approach, any within-group agreement is interpreted in terms of measurement accuracy, reflecting individual-level random error and sources of bias. Consequently, values researchers average the individual beliefs of citizens from different countries, irrespective of the within-country variance, to operationalize the country-level variable. The degree of association with other group-level constructs, using the correlation coefficient as a validity index, is used as empirical support for aggregating individual scores to the country level (see, for example, Kalmijn & Uunk, 2007; Hagenaars, Halman, & Moors, 2003). However, I have found no study in the field of values research in which within-group agreement is used as additional evidence for the appropriateness of aggregating individual-level data to the regional or country level.
A somewhat similar model has been described by Bliese (2000), which he denotes as the pure compilation process model. In such models, individual-level responses are expected to vary within groups; however, when the data are aggregated, the aggregate variable is expected to measure some phenomenon not evident at the lower level. According to Bliese (2000), there is no theoretical need to establish agreement about the aggregated variable in such models. Consequently, when a researcher uses this model, an inspection of within-group agreement is also not used as empirical support for aggregating individual scores to the country level.
Chan’s (1998) direct consensus model also uses aggregation of individual scores to the group level as the typical operational procedure for establishing the group-level construct, but it differs from the additive model in significant ways. First, whereas in the additive model variance among lower-level units is of no theoretical or operational concern, in the direct consensus model within-group agreement is considered a necessary condition for the construct validity of the group-level construct. Using this composition model demands that the researcher evaluates the values of within-group agreement indices developed in the methodological literature to establish empirical support for justifying aggregation of individual opinions and beliefs to represent scores on a country-level values construct.
Second, the fact that within-group agreement is explicitly required for aggregating individual scores to the group level has as a consequence that the meaning of the higher-level construct resides in the consensus among the lower-level units. Only when some cut-off level of agreement is reached on the agreement indices can the aggregation of individual responses be justified. Only then does the group-level construct represent some degree of ‘sharedness’ or ‘consensus’ among the lower-level units, which can be interpreted as a shared cultural climate characteristic. How countries included in the European Values Study perform in this respect for certain constructs is presented in the remainder of this contribution.
To assess the degree of agreement among respondents’ responses and the impact of ignoring this information on the estimation of the country means for values-orientations, I use data from the 2017 wave of the European Values Study, including 34 countries (see Table 8.1).
Measures
I use the following small selection of constructs available in the 2017 EVS wave (see Hagenaars, Halman, & Moors, 2003): Personal-sexual permissiveness is indicated by the acceptability of ‘homosexuality’; ‘abortion’; ‘divorce’; ‘euthanasia (terminating the life of the incurably sick)’; ‘suicide’; and ‘having casual sex’ (1 = never justified; 10 = always justified).
Self-interest permissiveness refers to the acceptance of ‘claiming state benefits which you are not entitled to’; ‘cheating on tax if you have the chance’; ‘lying in your own interest’; and ‘paying cash for services to avoid taxes’ (1 = never justified; 10 = always justified).
Confidence in legitimacy institutions is indicated by the degree to which respondents have confidence in the education system, the press, trade unions, the police, parliament, the civil service, the social security system, the health care system, and the justice system, with response categories ‘a great deal’ (=1), ‘quite a lot’ (=2), ‘not very much’ (=3), and ‘none at all’ (=4). The scoring of the items was reversed for ease of interpretation.
For measuring a tolerance towards immigrants construct, respondents were asked to place their views on the following 10-point bipolar items: Immigrants take jobs away from [NATIONALITY] (1) – Immigrants do not take jobs away from [NATIONALITY] (10); Immigrants make crime problems worse (1) – Immigrants do not make crime problems worse (10); Immigrants are a strain on a country’s welfare system (1) – Immigrants are not a strain on a country’s welfare system (10); It is better if immigrants maintain their distinct customs and traditions (1) – It is better if immigrants do not maintain their distinct customs and traditions (10). The scoring of the last item was reversed.
In the composition model of direct consensus, the conceptual definition of the higher-level construct resides in the consensus or sharedness of perceptions among lower-level units, and therefore it is necessary to assess within-group agreement to index consensus and justify aggregation (Chan, 1998, p. 236). Researchers have a multitude of interrater agreement indices at their disposal, each with its benefits and drawbacks (O’Neill, 2017). Here, I report two indices: rwg(j) (James, Demaree, & Wolf, 1984) and awg(j) (Brown & Hauenstein, 2005). rwg(j) is well-known and commonly used among researchers. A value of 1 indicates complete agreement, whereas a value of 0 indicates completely random responding. One drawback of this index is that it may not be directly comparable across groups that differ in their mean rating or in the number of raters. Therefore, I also report awg(j) index values, which “estimates agreement as the proportion of observed agreement of the maximum disagreement possible given the observed mean rating” (Brown & Hauenstein, 2005, p. 174). A value of +1 indicates perfect agreement given the group mean, a value of 0 indicates that the observed variance is 50 percent of the maximum variance given the group mean, and a value of -1 indicates maximum disagreement given the group mean. Compared to rwg(j), awg(j) overcomes problems of sample size dependency, scale dependency, and dependence on a specific underlying null distribution (Brown & Hauenstein, 2005).
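To make the computation of these indices concrete, the sketch below shows how rwg(j) and awg(j) could be calculated for one country from a respondents-by-items matrix of Likert-type scores. It is a minimal sketch, assuming the multi-item rwg(J) formula of James et al. (1984) with a uniform null distribution (expected variance (A² − 1)/12 for A response categories) and an item-averaged version of Brown and Hauenstein’s (2005) awg; the function names and the toy ratings matrix are mine and are not part of the EVS data.

```python
import numpy as np


def rwg_j(scores: np.ndarray, n_categories: int) -> float:
    """Multi-item r_wg(J) (James, Demaree, & Wolf, 1984) for one group.

    scores: array of shape (n_respondents, n_items) with Likert-type responses.
    n_categories: number of response options A; the uniform null distribution
    has expected variance (A**2 - 1) / 12.
    """
    n_items = scores.shape[1]
    sigma2_eu = (n_categories ** 2 - 1) / 12.0          # null (random) variance
    mean_item_var = scores.var(axis=0, ddof=1).mean()   # mean observed item variance
    ratio = 1.0 - mean_item_var / sigma2_eu
    return (n_items * ratio) / (n_items * ratio + mean_item_var / sigma2_eu)


def awg_j(scores: np.ndarray, low: float, high: float) -> float:
    """Item-averaged a_wg (Brown & Hauenstein, 2005) for one group.

    Each item's a_wg compares the observed variance to the maximum variance
    possible given the observed item mean and the scale bounds low/high.
    """
    n = scores.shape[0]
    awg_items = []
    for item in scores.T:
        m, s2 = item.mean(), item.var(ddof=1)
        max_var = ((high + low) * m - m ** 2 - high * low) * n / (n - 1)
        awg_items.append(1.0 - 2.0 * s2 / max_var)
    return float(np.mean(awg_items))


# Toy example: five respondents rating four items on a 1-10 scale.
ratings = np.array([[7, 8, 6, 7],
                    [8, 8, 7, 7],
                    [6, 7, 7, 8],
                    [7, 9, 6, 8],
                    [8, 7, 7, 7]])
print(rwg_j(ratings, n_categories=10), awg_j(ratings, low=1, high=10))
```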
To compare the impact of the degree of agreement within populations on the estimate of the country mean scores, I apply two aggregation methods. For both the additive and the direct consensus model (Chan, 1998), a standard approach for determining a country’s position on a particular value orientation is to first average the scores on thematically related questions into a single index score for each respondent in a sample and then average the individual-level index scores to yield country-level means (Welzel, Brunkert, Kruse, & Inglehart, 2021). Following Van Bruggen, Lilien, and Kacker (2002), the unweighted arithmetic mean of the individual responses of the members of group i on a variable X equals

$$\bar{X}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij},$$
with $X_{ij}$ being respondent j’s response on variable X in group i, and $n_i$ being the number of respondents in group i. Wagner, Rau, and Lindemann (2010) point out that this method of aggregation presumes that all respondents are equally knowledgeable. Consequently, the method attaches the same weight to the responses of all respondents.
However, it may be that the true value is closer to the responses provided by agreeing respondents than to those of deviating respondents, and that the responses of the deviating respondents contain a larger systematic error component (Van Bruggen et al., 2002). To account for the fact that the degree of agreement among respondents’ responses contains information that is important to include in an aggregate measure, Van Bruggen et al. (2002) developed an alternative aggregation method – the response data-based weighted mean – which I also apply below. To compute the appropriate weights, this method first calculates the absolute distance of respondent j’s response on variable X from the unweighted mean of all responses of group i to which respondent j belongs,

$$D_{ij} = \left| X_{ij} - \bar{X}_i \right|.$$
Then, a weight for each respondent’s response is calculated:

$$w_{ij} = \frac{1 - \alpha \, \dfrac{D_{ij}}{\sum_{k=1}^{n_i} D_{ik}}}{n_i - \alpha}.$$
The weight of a response is inversely related to the absolute deviation of that response from the unweighted mean, relative to the sum of the absolute deviations of all respondents’ responses. The formula also contains a parameter α that corrects for the effect of systematic error in the respondents’ responses. As α approaches 1, less weight is attached to responses that are farther away from the unweighted mean (i.e., the responses from respondents whose scores are expected to contain substantial systematic error). If α equals 0, the weight of all responses is identical, which results in the unweighted mean, and it is assumed that respondents’ responses do not contain systematic error. Advanced procedures exist for estimating an optimal α value (see Van Bruggen et al., 2002, pp. 473-474 for details), but here I use α = 1 to obtain the maximum contrast between the unweighted country means (with α = 0) and the weighted country means (with α = 1). Finally, the response data-based weighted mean is estimated as follows:

$$\bar{X}_i^{\,w} = \sum_{j=1}^{n_i} w_{ij} X_{ij}.$$
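As a computational illustration of this aggregation step, the sketch below implements the unweighted mean and the response data-based weighted mean as described above, with the weights following the reconstructed formula $w_{ij} = (1 - \alpha D_{ij}/\sum_k D_{ik})/(n_i - \alpha)$. It is a minimal sketch under those assumptions; the function name `country_means` and the toy scores are mine, not EVS data.

```python
import numpy as np


def country_means(x: np.ndarray, alpha: float = 1.0) -> tuple[float, float]:
    """Unweighted and response data-based weighted mean of one group's scores.

    alpha = 0 reproduces the unweighted mean; alpha = 1 attaches the least
    weight to responses that deviate most from the unweighted mean.
    """
    n = len(x)
    unweighted = float(x.mean())
    d = np.abs(x - unweighted)                          # absolute deviations D_ij
    rel = d / d.sum() if d.sum() > 0 else np.zeros(n)   # relative deviations
    w = (1.0 - alpha * rel) / (n - alpha)               # weights sum to one
    return unweighted, float(np.sum(w * x))


# Toy example: ten respondents' index scores on a 1-10 construct,
# with two respondents deviating strongly from the rest.
scores = np.array([2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 4.0, 4.5, 9.0, 9.5])
print(country_means(scores, alpha=0.0))   # both values equal the unweighted mean
print(country_means(scores, alpha=1.0))   # weighted mean moves towards the majority
```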
Table 8.1 Unweighted Country Means and Response Data-Based Weighted Country Means, Interrater Agreement Statistics, and Rank-orders and Rank-order Correlations for EVS 2017 Countries
Country | PSP Unw. | Rank | PSP Wtd. | Rank | SIP Unw. | Rank | SIP Wtd. | Rank | CLI Unw. | Rank | CLI Wtd. | Rank | TTI Unw. | Rank | TTI Wtd. | Rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Albania | 2.41 | 32 | 2.10 | 32 | 1.41 | 34 | 1.08 | 34 | 2.05 | 32 | 2.01 | 32 | 7.34 | 1 | 7.71 | 1 |
Armenia | 2.33 | 33 | 2.02 | 33 | 2.95 | 5 | 2.70 | 4 | 2.25 | 28 | 2.21 | 27 | 5.97 | 3 | 5.98 | 3 |
Austria | 5.75 | 12 | 5.80 | 12 | 1.96 | 15 | 1.73 | 14 | 2.71 | 6 | 2.72 | 6 | 4.36 | 29 | 4.26 | 30 |
Azerbaijan | 2.49 | 31 | 2.31 | 31 | 2.42 | 8 | 2.18 | 8 | 2.78 | 3 | 2.85 | 2 | 5.76 | 5 | 5.79 | 5 |
Belarus | 3.96 | 23 | 3.80 | 23 | 3.15 | 3 | 3.01 | 2 | 2.56 | 13 | 2.61 | 11 | 4.94 | 22 | 5.02 | 22 |
Bosnia and Herzegovina | 3.05 | 28 | 2.66 | 29 | 1.85 | 23 | 1.40 | 30 | 2.16 | 30 | 2.18 | 30 | 5.08 | 21 | 5.10 | 19 |
Bulgaria | 3.68 | 25 | 3.51 | 25 | 1.63 | 32 | 1.26 | 33 | 2.01 | 33 | 2.01 | 32 | 4.17 | 31 | 4.04 | 32 |
Croatia | 3.97 | 22 | 3.81 | 22 | 2.17 | 11 | 1.86 | 11 | 1.98 | 34 | 1.97 | 34 | 5.25 | 14 | 5.29 | 14 |
Czechia | 5.64 | 14 | 5.68 | 14 | 2.24 | 9 | 2.07 | 9 | 2.28 | 25 | 2.28 | 23 | 3.36 | 34 | 3.20 | 34 |
Denmark | 7.18 | 1 | 7.40 | 1 | 1.56 | 33 | 1.33 | 32 | 2.75 | 5 | 2.76 | 5 | 4.74 | 25 | 4.75 | 26 |
Estonia | 4.49 | 17 | 4.36 | 17 | 1.93 | 17 | 1.69 | 16 | 2.62 | 9 | 2.64 | 9 | 4.68 | 27 | 4.65 | 27 |
Finland | 6.27 | 8 | 6.38 | 8 | 1.86 | 22 | 1.63 | 20 | 2.83 | 2 | 2.83 | 3 | 5.25 | 14 | 5.28 | 15 |
France | 6.33 | 6 | 6.44 | 6 | 2.48 | 7 | 2.25 | 7 | 2.57 | 12 | 2.59 | 13 | 5.48 | 10 | 5.45 | 10 |
Georgia | 2.19 | 34 | 1.93 | 34 | 1.85 | 23 | 1.50 | 27 | 2.38 | 21 | 2.37 | 20 | 5.90 | 4 | 5.88 | 4 |
Germany | 6.24 | 9 | 6.36 | 9 | 1.66 | 31 | 1.37 | 31 | 2.56 | 13 | 2.58 | 14 | 5.32 | 13 | 5.33 | 13 |
Great Britain | 6.12 | 10 | 6.19 | 10 | 1.78 | 28 | 1.47 | 28 | 2.56 | 13 | 2.55 | 15 | 5.64 | 6 | 5.59 | 8 |
Hungary | 4.45 | 18 | 4.35 | 18 | 1.96 | 15 | 1.73 | 14 | 2.33 | 22 | 2.33 | 22 | 3.70 | 33 | 3.54 | 33 |
Iceland | 6.81 | 4 | 6.88 | 4 | 1.98 | 14 | 1.75 | 13 | 2.67 | 8 | 2.66 | 8 | 6.31 | 2 | 6.41 | 2 |
Italy | 5.15 | 15 | 5.12 | 15 | 1.81 | 27 | 1.61 | 23 | 2.40 | 20 | 2.37 | 20 | 4.63 | 28 | 4.60 | 28 |
Lithuania | 4.04 | 20 | 3.99 | 20 | 2.68 | 6 | 2.45 | 6 | 2.49 | 17 | 2.51 | 17 | 4.74 | 25 | 4.78 | 25 |
Montenegro | 2.68 | 30 | 2.42 | 30 | 1.87 | 20 | 1.67 | 17 | 2.26 | 27 | 2.19 | 28 | 5.35 | 12 | 5.41 | 11 |
Netherlands | 6.95 | 3 | 7.09 | 3 | 1.93 | 17 | 1.67 | 17 | 2.59 | 10 | 2.62 | 10 | 4.86 | 24 | 4.84 | 24 |
North Macedonia | 3.41 | 27 | 3.07 | 27 | 1.87 | 20 | 1.51 | 26 | 2.21 | 29 | 2.19 | 28 | 5.50 | 9 | 5.46 | 9 |
Norway | 6.67 | 5 | 6.83 | 5 | 1.83 | 26 | 1.61 | 23 | 2.90 | 1 | 2.91 | 1 | 5.18 | 16 | 5.16 | 16 |
Poland | 3.80 | 24 | 3.57 | 24 | 1.75 | 30 | 1.45 | 29 | 2.30 | 23 | 2.27 | 25 | 5.10 | 19 | 5.08 | 20 |
Portugal | 4.00 | 21 | 3.97 | 21 | 1.77 | 29 | 1.63 | 20 | 2.58 | 11 | 2.61 | 11 | 5.44 | 11 | 5.39 | 12 |
Romania | 2.96 | 29 | 2.68 | 28 | 2.11 | 12 | 1.64 | 19 | 2.28 | 25 | 2.25 | 26 | 5.09 | 20 | 5.14 | 17 |
Russia | 4.19 | 19 | 4.07 | 19 | 3.71 | 1 | 3.53 | 1 | 2.47 | 18 | 2.49 | 18 | 4.29 | 30 | 4.27 | 29 |
Serbia | 3.55 | 26 | 3.35 | 26 | 2.20 | 10 | 1.89 | 10 | 2.07 | 31 | 2.05 | 31 | 4.88 | 23 | 4.89 | 23 |
Slovakia | 4.84 | 16 | 4.89 | 16 | 2.99 | 4 | 2.70 | 4 | 2.45 | 19 | 2.46 | 19 | 4.14 | 32 | 4.08 | 31 |
Slovenia | 5.66 | 13 | 5.70 | 13 | 1.91 | 19 | 1.63 | 20 | 2.29 | 24 | 2.28 | 23 | 5.14 | 18 | 5.08 | 20 |
Spain | 6.04 | 11 | 6.06 | 11 | 3.24 | 2 | 2.90 | 3 | 2.53 | 16 | 2.54 | 16 | 5.59 | 8 | 5.64 | 6 |
Sweden | 7.03 | 2 | 7.24 | 2 | 1.85 | 23 | 1.58 | 25 | 2.76 | 4 | 2.81 | 4 | 5.62 | 7 | 5.64 | 6 |
Switzerland | 6.29 | 7 | 6.39 | 7 | 1.99 | 13 | 1.81 | 12 | 2.68 | 7 | 2.72 | 6 | 5.17 | 17 | 5.13 | 18 |
Mean | 5.14 | | 5.11 | | 2.41 | | 2.15 | | 2.49 | | 2.49 | | 5.00 | | 4.98 | |
Number of categories | 10 | | | | 10 | | | | 4 | | | | 10 | | | |
Number of items | 6 | | | | 4 | | | | 9 | | | | 4 | | | |
Mean rwg(j) | 0.314 | | | | 0.796 | | | | 0.901 | | | | 0.334 | | | |
Range rwg(j) | 0.866 | | | | 0.934 | | | | 0.142 | | | | 0.678 | | | |
Mean awg(j) | 0.030 | | | | 0.126 | | | | 0.429 | | | | 0.216 | | | |
Range awg(j) | 0.529 | | | | 0.501 | | | | 0.401 | | | | 0.670 | | | |
Kendall’s Tau-b | 0.996 | | | | 0.886 | | | | 0.973 | | | | 0.961 | | | |

Note: PSP = Personal-Sexual Permissiveness; SIP = Self-Interest Permissiveness; CLI = Confidence in Legitimacy Institutions; TTI = Tolerance towards Immigrants; Unw. = unweighted country mean; Wtd. = response data-based weighted country mean. Range = range of the respective agreement index across countries.
Table 8.1 reports the findings from assessing the interrater agreement for the four constructs, and it maps the changes that occur in the ranking of countries when their position on each of the four constructs is estimated using unweighted country means and response data-based weighted country means, respectively. For self-interest permissiveness and confidence in legitimacy institutions, the mean rwg(j) values across countries indicate a sufficient level of agreement (using the heuristic threshold of .70). However, except for confidence in legitimacy institutions, the findings also indicate a broad range in interrater agreement across the separate countries. The mean awg(j) estimates suggest unacceptable levels of agreement for all four constructs, with the awg(j) values falling between the heuristic values of 0 and 0.59 (Brown & Hauenstein, 2005, p. 178). Finally, the awg(j) values also show a relatively broad range of estimates across countries. In summary, an inspection of some critical interrater agreement statistics suggests only weak evidence for aggregating individual responses to represent country-level constructs.
Turning to the comparison of unweighted and response data-based weighted country means on the four constructs, we see, first, that applying the weighting procedure has some – and occasionally considerable – effects on the relative ranking of countries, in particular for self-interest permissiveness, confidence in legitimacy institutions, and tolerance towards immigrants. The Kendall’s Tau-b values indicate that country rankings on self-interest permissiveness are affected most by the weighting procedure (Tau-b = .886). We see, for example, that Bosnia and Herzegovina drops considerably from rank 23 (unweighted) to rank 30 (weighted), and Georgia drops from rank 23 (tied with Bosnia and Herzegovina in the unweighted ranking) to rank 27. North Macedonia and Romania also show considerable drops in ranking. For confidence in legitimacy institutions and tolerance towards immigrants, changes in rankings are generally limited to at most two positions, although Romania moves up from position 20 (unweighted) to position 17 (weighted) on the tolerance towards immigrants construct. Only for the personal-sexual permissiveness construct do country rankings remain largely the same between the two methods of aggregation (Tau-b = .996).
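For readers who want to reproduce this type of rank-order comparison, Kendall’s Tau-b between the unweighted and weighted country scores can be obtained with a standard routine; the short sketch below uses scipy (whose kendalltau function applies the tau-b variant, handling ties, by default) on the first six countries’ personal-sexual permissiveness means from Table 8.1, purely as an illustration.

```python
from scipy.stats import kendalltau

# First six countries' personal-sexual permissiveness means from Table 8.1
# (Albania, Armenia, Austria, Azerbaijan, Belarus, Bosnia and Herzegovina).
unweighted = [2.41, 2.33, 5.75, 2.49, 3.96, 3.05]
weighted = [2.10, 2.02, 5.80, 2.31, 3.80, 2.66]

tau, p_value = kendalltau(unweighted, weighted)  # tau-b by default
print(f"Kendall's Tau-b = {tau:.3f} (p = {p_value:.3f})")
```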
In this contribution, I have shown that statistical interrater agreement measures for a small selection of constructs in the most recent wave of the EVS provide weak evidence for warranting aggregation of individual-level scores for deriving country-level constructs. I have also applied a response data-based weighted mean aggregation method described by Van Bruggen et al. (2002) that incorporates the degree of agreement among respondents’ responses into the calculation of the country-specific mean scores on the constructs. When comparing the results from the weighted approach to the unweighted approach, I found that for three out of four constructs, the ranking of some countries on the construct concerned changed considerably. Despite this, I also found that the correlation between the unweighted and weighted rankings was still high, which suggests that countries’ positions can be estimated with sufficient reliability.
Of course, my simple descriptive analysis raises more fundamental questions about the explanation and the methodological implications of the findings that need to be addressed in future research. For example, the low interrater agreement for the constructs may result from the wording of the survey items that indicate the construct: items with a group (here: country) referent may be better able to capture country-level constructs (Klein, Conn, Smith, & Sorra, 2001). Also, in my analysis, I have not compared the accuracy of the two estimation methods. Finally, I have assumed that the constructs’ measures are invariant across countries and not systematically biased, a psychometric property that should be tested (although the necessity of such testing has recently been disputed; see Welzel et al., 2021).
A key question is whether high within-group agreement is necessary for aggregation in multilevel research. LeBreton and Senter (2008) argue that the researcher’s research question and composition model should be guiding. In particular, they point out that agreement measures do not play a role in the aggregation process in the additive composition model or the pure compilation process model. On the other hand, in a composition model such as the direct consensus model, aggregated variables are assumed to be essentially identical – isomorphic – to the lower-level constructs in form and function (Bliese, 2000). In the latter case, the evaluation of agreement measures, the assessment of construct validity, and attention to systematic measurement error are crucial for the empirical analysis, and the response data-based weighted mean approach may be necessary for estimating more accurate country means. Thus, an essential task for values researchers who intend to measure collective beliefs as contextual characteristics inferred from individual beliefs lies first in considering the type of composition model appropriate for their research. The EVS then provides us with invaluable data for testing hypotheses about the effects of aggregated variables that may show little agreement among the citizens of countries; the researcher may have solid theoretical reasons that warrant aggregation anyway.
Bliese, P. D. (2000). Within-Group Agreement, Non-Independence, and Reliability: Implications for Data Aggregation and Analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel Theory, Research, and Methods in Organizations: Foundations, Extensions, and New Directions (pp. 349-381). San Francisco: Jossey-Bass.
Brown, R. D., & Hauenstein, N. M. A. (2005). Interrater Agreement Reconsidered: An Alternative to the rwg Indices. Organizational Research Methods, 8(2), 165-184.
Chan, D. (1998). Functional Relations Among Constructs in the Same Content Domain at Different Levels of Analysis: A Typology of Composition Models. Journal of Applied Psychology, 83(2), 234-246.
Enders, C. K., & Tofighi, D. (2007). Centering Predictor Variables in Cross-Sectional Multilevel Models: A New Look at an Old Issue. Psychological Methods, 12(2), 121-138.
Firebaugh, G. (1978). A Rule for Inferring Individual-level Relationships from Aggregate Data. American Sociological Review, 557-572.
Glick, W. H. (1985). Conceptualizing and Measuring Organizational and Psychological Climate: Pitfalls in Multilevel Research. Academy of Management Review, 10(3), 601-616.
Hagenaars, J. A. P., Halman, L. C. J. M., & Moors, G. B. D. (2003). Exploring Europe’s Basic Values Map. In: The Cultural Diversity of European Unity. Findings, Explanations and Reflections from the European Values Study (pp. 23-58). Leiden: Brill.
Halman, L. C. J. M., & Draulans, V. J. R. (2004). Religious Beliefs and Practices in Contemporary Europe. In W. A. Arts & L. C. J. M. Halman (Eds.), European Values at the Turn of the Millennium (pp. 283-316). Leiden: Brill.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating Within-group Interrater Reliability with and without Response Bias. Journal of Applied Psychology, 69(1), 85-98.
Kalmijn, M., & Uunk, W. (2007). Regional Value Differences in Europe and the Social Consequences of Divorce: A Test of the Stigmatization Hypothesis. Social Science Research, 36(2), 447-468.
Klein, K. J., Conn, A. B., Smith, D. B., & Sorra, J. S. (2001). Is Everyone in Agreement? An Exploration of Within-group Agreement in Employee Perceptions of the Work Environment. Journal of Applied Psychology, 86(1), 3-16.
LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 Questions about Interrater Reliability and Interrater Agreement. Organizational Research Methods, 11(4), 815-852.
Moor, R. A., de (Ed.). (1995). Values in Western Societies (Vol. 2). Leiden: Brill.
Van Oorschot, W., Arts, W., & Halman, L. (2005). Welfare State Effects on Social Capital and Informal Solidarity in the European Union: Evidence from the 1999/2000 European Values Study. Policy & Politics, 33(1), 33-54.
Oberwittler, D. (2004). A Multilevel Analysis of Neighbourhood Contextual Effects on Serious Juvenile Offending: The Role of Subcultural Values and Social Disorganization. European Journal of Criminology, 1(2), 201-235.
O’Neill, T. A. (2017). An Overview of Interrater Agreement on Likert Scales for Researchers and Practitioners. Frontiers in Psychology, 8, 777.
Van Bruggen, G. H., Lilien, G. L., & Kacker, M. (2002). Informants in Organizational Marketing Research: Why Use Multiple Informants and How to Aggregate Responses. Journal of Marketing Research, 39(4), 469-478.
Wagner, S. M., Rau, C., & Lindemann, E. (2010). Multiple Informant Methodology: A Critical Review and Recommendations. Sociological Methods & Research, 38(4), 582-618.
Welzel, C., Brunkert, L., Kruse, S., & Inglehart, R. F. (2021). Non-invariance? An Overstated Problem With Misconceived Causes. Sociological Methods & Research, 0(0), 0049124121995521.