1

I am creating an inverted index to get a dictionary of words with an associated list of the line numbers that the word appears on (starting the line numbers and a list of words that appear in a given cell within that line).

I have managed to get some code working for this, but I found dealing with adding to the arrays (the values in the dictionary) to be a little cumbersome and I wonder is there is a more efficient or more elegant way to handle this.

I am open to using arrays, collections or any other data type that can be easily searched to store the list of line numbers in the values of the dictionary. I have pasted a cut down version of my code to demonstrate the core problem below, the question is really just about the BuildInvertedIndex procedure, but the rest is included to try to make it easier to recreate the scenario:

Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary

    Set vRange = ActiveSheet.Range("F2:F20585")
    Set vDict = New Dictionary

    BuildInvertedIndex vDict, vRange

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ArrayToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next


End Sub


Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)

    Dim cell As Range
    Dim words As Variant, word As Variant, val As Variant
    Dim tmpArr() As Long
    Dim newLen As Long, i As Long

    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pRange.Cells

        ' loop through words in line
        words = Split(cell.Value)
        For Each word In words

            If Not pDict.exists(word) Then
                ' start line array with first row number
                pDict.Add word, Array(cell.Row())
            Else
                i = 0
                If Not InArray(cell.Row(), pDict.Item(word)) Then
                    newLen = UBound(pDict.Item(word)) + 1
                    ReDim tmpArr(newLen)
                    For Each val In tmpArr
                        If i < newLen Then
                            tmpArr(i) = pDict.Item(word)(i)
                        Else
                            tmpArr(i) = cell.Row()
                        End If
                        i = i + 1
                    Next val
                    pDict.Item(word) = tmpArr
                End If
            End If
        Next word
    Next cell

End Sub


Function ArrayToString(vArray As Variant, _
                       Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)

    Dim vDelimString As String
    Dim i As Long

    For i = LBound(vArray) To UBound(vArray)
        vDelimString = vDelimString & CStr(vArray(i)) & _
                       IIf(vCounter < UBound(vArray), vDelim, "")
    Next

    ArrayToString = vDelimString
End Function

To run this you will need values in column F of the active sheet (sentences), if you do not already have it you will also need to add a reference to the Microsoft Scripting Runtime in your VBA environment for the dictionary data type to be available (tools -> references -> Microsoft Scripting Runtime).

As you will see from the code this gets a bit messy where I have to insert a new line number into an existing array (that is stored as a value within the dictionary). As I do not know of a way to just extend this array (without clearing the existing values), I have used the variable tmpArr to create an array of the appropriate size and then copy the values one by one from the existing array in the dictionary and then add the current row number to the end. The temporary array is then used to replace the existing value for that key (the current word).

Any advice on this would be greatly appreciated.

1
  • You cannot work directly with an array stored in a dictionary - the usual approach is to pull it out of the dictionary, modify it, and then re-store it in the same slot. EG: stackoverflow.com/questions/16447088/… Commented Mar 3, 2014 at 1:50

1 Answer 1

1

I am open to using arrays, collections or any other data type

As I see, using collection instead array would be much simplier:

Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)
    Dim cell As Range
    Dim words, word
    Dim i As Long    
    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pRange.Cells    
        ' loop through words in line
        words = Split(cell.Value)
        For Each word In words    
            If Not pDict.Exists(word) Then
                ' initialize collection
                pDict.Add word, New Collection
            End If
            'try to add to collection. If row is already in collecton, nothing happend. Storing key makes you sure there're only unique rows
            On Error Resume Next
            pDict.Item(word).Add Item:=cell.Row, Key:=CStr(cell.Row)
            On Error GoTo 0                
        Next word
    Next cell
End Sub

Next step, is to slightly modify ArrayToString to ColToString:

Function ColToString(vCol As Collection, _
                   Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)

    Dim vDelimString As String
    Dim i As Long

    For i = 1 To vCol.Count
        vDelimString = vDelimString & CStr(vCol.Item(i)) & _
                       IIf(i < vCol.Count, vDelim, "")
    Next

    ColToString = vDelimString
End Function

and the test subroutine (changed only one row - Debug.Print k & ": " & ColToString(vDict.Item(k)) and target range to "F2:F5"):

Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary

    Set vRange = ActiveSheet.Range("F2:F5")
    Set vDict = New Dictionary

    BuildInvertedIndex vDict, vRange

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ColToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next

    'clean up memory
    Set vDict = Nothing
End Sub

RESULT:

enter image description here


UPDATE:

to improve speed of your code you could store range in array (next approach work only with single-column range, but you could easily modify it):

Test sub:

Sub TestWirhArray()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary
    Dim myArr As Variant

    Set vDict = New Dictionary
    Set vRange = ActiveSheet.Range("F2:F20585")
    myArr = vRange.Value
    BuildInvertedIndexWithArr vDict, myArr, vRange.Row

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ColToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next

    'clean up memory
    Set vDict = Nothing
End Sub

new version of BuildInvertedIndexWithArr:

Sub BuildInvertedIndexWithArr(pDict As Dictionary, pArr, firstRow As Long)
    Dim cell, words, word
    Dim i As Long, j As Long

    j = firstRow
    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pArr

        ' loop through words in line
        words = Split(cell)
        For Each word In words

            If Not pDict.exists(word) Then
                ' initialize collection
                pDict.Add word, New Collection
            End If

            On Error Resume Next
            pDict.Item(word).Add Item:=j, Key:=CStr(j)
            On Error GoTo 0

        Next word
        j = j + 1
    Next cell
End Sub
Sign up to request clarification or add additional context in comments.

6 Comments

Thanks for the detailed response, just trying it out now.
That's fantastic, thank you. Not only does it look cleaner there is > 10x speed improvement. On just over 20k rows the timings are: Built index for 20585 rows in 11327.0713492417 ms using arrays as values in dictionary. and Built index for 20585 rows in 727.502338194183 ms using collections as values in dictionary.
That's cool:) one more tip how to speed up your code: instead working with range you could store it in array (it should give you ~10-20X speed improvement). I've updated my asnwer with this approach. Can you try it please and say timing now, just curious:).
There is another slight improvement: Built index for 20585 rows in 508.731182722124 ms using collections as values in dictionary and array for range.. I have also done a comparison of the results by writing them all to a sheet and found that the results are not identical for some of the words that appear in a lot of lines. I am investigating now to see which is correct.
It looks like this is my original version, there is an occasional duplicate value in the arrays, but they seem to be the same other than this. Thanks for all your help.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.