I am creating an inverted index to get a dictionary of words with an associated list of the line numbers that the word appears on (starting the line numbers and a list of words that appear in a given cell within that line).
I have managed to get some code working for this, but I found dealing with adding to the arrays (the values in the dictionary) to be a little cumbersome and I wonder is there is a more efficient or more elegant way to handle this.
I am open to using arrays, collections or any other data type that can be easily searched to store the list of line numbers in the values of the dictionary. I have pasted a cut down version of my code to demonstrate the core problem below, the question is really just about the BuildInvertedIndex procedure, but the rest is included to try to make it easier to recreate the scenario:
Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure
Dim vRange As Range
Dim vDict As Dictionary
Set vRange = ActiveSheet.Range("F2:F20585")
Set vDict = New Dictionary
BuildInvertedIndex vDict, vRange
' test values returned in dictionary (word: [line 1, ..., line n])
Dim k As Variant, vCounter As Long
vCounter = 0
For Each k In vDict.Keys
Debug.Print k & ": " & ArrayToString(vDict.Item(k))
vCounter = vCounter + 1
If vCounter >= 10 Then
Exit For
End If
Next
End Sub
Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)
Dim cell As Range
Dim words As Variant, word As Variant, val As Variant
Dim tmpArr() As Long
Dim newLen As Long, i As Long
' loop through cells (one col wide so same as looping through lines)
For Each cell In pRange.Cells
' loop through words in line
words = Split(cell.Value)
For Each word In words
If Not pDict.exists(word) Then
' start line array with first row number
pDict.Add word, Array(cell.Row())
Else
i = 0
If Not InArray(cell.Row(), pDict.Item(word)) Then
newLen = UBound(pDict.Item(word)) + 1
ReDim tmpArr(newLen)
For Each val In tmpArr
If i < newLen Then
tmpArr(i) = pDict.Item(word)(i)
Else
tmpArr(i) = cell.Row()
End If
i = i + 1
Next val
pDict.Item(word) = tmpArr
End If
End If
Next word
Next cell
End Sub
Function ArrayToString(vArray As Variant, _
Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)
Dim vDelimString As String
Dim i As Long
For i = LBound(vArray) To UBound(vArray)
vDelimString = vDelimString & CStr(vArray(i)) & _
IIf(vCounter < UBound(vArray), vDelim, "")
Next
ArrayToString = vDelimString
End Function
To run this you will need values in column F of the active sheet (sentences), if you do not already have it you will also need to add a reference to the Microsoft Scripting Runtime in your VBA environment for the dictionary data type to be available (tools -> references -> Microsoft Scripting Runtime).
As you will see from the code this gets a bit messy where I have to insert a new line number into an existing array (that is stored as a value within the dictionary). As I do not know of a way to just extend this array (without clearing the existing values), I have used the variable tmpArr to create an array of the appropriate size and then copy the values one by one from the existing array in the dictionary and then add the current row number to the end. The temporary array is then used to replace the existing value for that key (the current word).
Any advice on this would be greatly appreciated.
