How to get all possible combinations of k elements from a group of n elements programmatically

by François Perruchas - INGENIO [CSIC-UPV]

Some time ago, I was asked to create an Access macro to recover all item pairs which share a common element. For example, imagine you want to link cited articles through citing articles, and you want to know which articles are the most cited together (a pair).

We set one table with 2 columns, the first one is the citing article ID, the second one is the cited article ID, and then the macro calculates all the possible pairs, with one loop inside another. With an example it's easier to understand:

Input Table

ID Cited Article (IDCited) ID Citing Article (IDCiting)
1 A
1 B
1 C
1 D
2 E
2 A
2 B
2 D

The first loop is used to get the first element of the pair, the second loop is to get the second element. As we don't want repetitions (AB is the same than BA), we have to loop from the next element until the last (with elements ordered).

In our example, the result will be:

For IDCited = 1, we will have AB, AC, AD, BC, BD, CD.
For IDCited = 2, we will have AB, AD, AE, BD, BE, DE.

But then somebody came and told me: “I'm using your macro to have pairs of keywords to search for similar articles, but the search is too 'wide'. Is it possible to do the same but instead of pairs, to have group of 3, 4 or more elements?” Indeed, it's a combination, as we studied in high school, written C(n,k) – and the pair we calculated before is just C(n,2). It's easy to calculate the number of possible combinations (see wikipedia), but more difficult to produce the list.

The first solution that came to me was to do the same than with pairs, for 3 elements it will be 3 loops, 4 elements 4 loops, and so on... The problem here is that the number of elements is a parameter, so if we want to follow this logic, we need to write specific instructions for each value of k.

The second solution, as we need a loop in any case, is to use a “while” with a boolean variable to control the loop end. Then, after some thought, I realized that, following the same logic as with the pairs, getting all the possibles combinations is like counting, without repeating numbers and combinations.

Let's have another example, imagine your group is {1,2,3,4,5,6}, then all the possible combinations of C(6,2) will be:

12, 13, 14, 15, 16, 23, 24, 25, 26, 34, 35, 36, 45, 46, 56

As we can see from this example, the right numeral is always strictly superior to the value of the left numeral, and always less or equal to the number of elements (n). The left numeral is also always equal or inferior to the number of elements minus 1. When the numeral value is superior to the number of elements, we add 1 to the numeral to its left and reinitialize it.

All of this means that for every loop, we just need to add 1 to the far right numeral, and then check if each numeral follows these rules. We will have all the possible combinations when the far left numeral value is over the number of elements minus one.

The only problem here is that the cited articles are not integers but strings. The easiest way to sort this out is to use arrays: the cited articles will be the values and the numbers will be the keys, and then we will have another array that holds every “numeral value”, one per element.

Below is the part of the algorithm inside the loop that goes through every ID Citing

Dim Arr_Group() As String 'the array that receives the actual combination elements
Dim Arr_Elements() As String ' the array that holds all the elements ordered
Dim Arr_IdGroup() As Integer 'the array that holds the numeral values
Dim StopLoop As Boolean
Dim GroupSize As Integer 'GroupSize is set at the beginning

'First we initialize Arr_IdGroup that holds the numeral values

ReDim Arr_IdGroup(GroupSize - 1)
Arr_IdGroup(0) = 1
For i = 1 To GroupSize - 1
  Arr_IdGroup(i) = Arr_IdGroup(i - 1) + 1
Next i

StopLoop = False
Do Until StopLoop
  'We erase the array that holds the values of the group
  Erase Arr_Group()
  ReDim Arr_Group(GroupSize - 1)

  'We get the actual combination
  For i = 0 To GroupSize - 1
    Arr_Group(i) = Arr_Elements(Arr_IdGroup(i) - 1)
  Next i

  'Here we apply the rules defined before - from right to left
  For i = 0 To GroupSize - 1
    'We add 1 to the far right numeral
    If i = 0 Then Arr_IdGroup(GroupSize - i - 1) = Arr_IdGroup(GroupSize - i - 1) + 1
    'Then we check the 3 rules for the actual numeral
    'If it's above the number of elements, then we add + 1 to the left numeral and we reinitialize right numeral
    If Arr_IdGroup(GroupSize - i - 1) > (UBound(Arr_Elements) - i) Then
      'We check if it's not the far left numeral
      If GroupSize - i - 2 > -1 Then
        Arr_IdGroup(GroupSize - i - 2) = Arr_IdGroup(GroupSize - i - 2) + 1
        Arr_IdGroup(GroupSize - i - 1) = Arr_IdGroup(GroupSize - i - 2) + 1
      Else
        'Because if so it means that we already get all the possible combinations, so we stop the loop
        StopLoop = True
      End If
      'Here we reinitialize the actual numeral if it's superior to the limit
      If Arr_IdGroup(GroupSize - i - 1) > (UBound(Arr_Elements) - i) Then
        Arr_IdGroup(GroupSize - i - 1) = UBound(Arr_Elements) - i
      End If
    End If
  Next i

'Here you just have to save the array Arr_Group
Loop

Then of course you need to implement the other part of the code.

Temas: