
I'm doing some integrations against MS-based web applications, which forces me to fetch the data into my PHP application via SOAP. That part is fine.

I get the structure of a file system as XML, which I convert to an object. Every document has an ID and its path. To be able to place the documents in a tree view, I've built some methods that work out each document's position in the file and folder structure. This worked fine until I started testing with large file lists.

What I need is a faster method (or a faster way of doing things) than a foreach loop.

The method below is the troublemaker.

/**
 * Find parent id based on path
 * @param array $documents
 * @param string $parentPath
 * @return int 
 */
private function getParentId($documents, $parentPath) {
    $parentId = 0;
    foreach ($documents as $document) {
        if ($parentPath == $document->ServerUrl) {
            $parentId = $document->ID;
            break;
        }
    }
    return $parentId;
}
// With 20 documents nested in different folders this method completes in 0.00033712387084961 seconds
// With 9000 documents nested in different folders it takes 60 seconds

The array sent to the method looks like this:

Array
(
    [0] => testprojectDocumentLibraryObject Object
        (
            [ParentID] => 0
            [Level] => 1
            [ParentPath] => /Shared Documents
            [ID] => 163
            [GUID] => 505d70ea-51d7-4ef0-bf79-8e912553249e
            [DocIcon] => 
            [FileType] => 
            [Title] => Folder1
            [BaseName] => Folder1
            [LinkFilename] => Folder1
            [ContentType] => Folder
            [FileSizeDisplay] => 
            [_UIVersionString] => 1.0
            [ServerUrl] => /Shared Documents/Folder1
            [EncodedAbsUrl] => http://dev1.example.com/Shared%20Documents/Folder1
            [Created] => 2011-10-08 20:57:47
            [Modified] => 2011-10-08 20:57:47
            [ModifiedBy] => 
            [CreatedBy] => 
            [_ModerationStatus] => 0
            [WorkflowVersion] => 1
        )
...

A bit bigger example of the data array is available here http://www.trikks.com/files/testprojectDocumentLibraryObject.txt

Thanks for any help!

=== UPDATE ===

To illustrate how long the different steps take, I've added this breakdown.

  1. Packet downloaded in 8.5031080245972 seconds
  2. Packet decoded in 1.2838368415833 seconds
  3. Packet unpacked in 0.051079988479614 seconds
  4. List data organized in 3.8216209411621 seconds
  5. Standard properties filled in 0.46236896514893 seconds
  6. Custom properties filled in 40.856066942215 seconds
  7. TOTAL: This page was created in 55.231353998184 seconds!

Now, it's the custom properties step that I'm describing; the other steps are already somewhat optimized. The data sent from the WCF service is compressed and encoded at roughly a 10:1 ratio (about 10 MB uncompressed to 1 MB compressed).

The current priority is to optimize the custom properties part, where the getParentId method accounts for 99% of the execution time!

  • Need more speed? Either get better hardware or switch to a faster language. That shouldn't be a hard task considering PHP is one of the slowest languages out there. Commented Oct 9, 2011 at 17:51
  • Well, I agree with you, but in this case I don't have an option. The "same" method in C# on a server with the same specs runs the same data in less than 2 seconds. Commented Oct 9, 2011 at 17:53
  • That seems unlikely. PHP loops aren't exactly speedy, but yours doesn't do much. It's more likely that the SOAP unpacking and object-tree generation are slower. -- In case you run your function multiple times, and forgot to mention that crucial detail in your question, it might be advisable to construct a separate ->ServerURL to ->ID array map once, and use that instead. Commented Oct 9, 2011 at 18:04
  • @mario, I've updated the post a bit so you can see what the stopwatch tells me. It's not the soap consuming my time! Commented Oct 9, 2011 at 18:16
  • 1
    Next time please look at a profiler graph instead of manual stopwatch generation. Then it becomes appearant that it's not the showcased loop, but the outer loop that's the issue. Commented Oct 9, 2011 at 19:31

3 Answers


You may see faster results by using XMLReader or expat instead of SimpleXML. Both of these read the XML sequentially and won't store the entire document in memory.
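A minimal sketch of the streaming approach with XMLReader, under the assumption (taken from a later comment mentioning //List/Content) that each document lives in a `Content` element; only one node is materialized at a time:

```php
<?php
// Hypothetical sketch: stream document nodes with XMLReader so the full
// multi-megabyte list never sits in memory at once. The element name
// 'Content' is an assumption based on the comments, not a known schema.
function streamDocuments($xmlPath, callable $handler) {
    $reader = new XMLReader();
    if (!$reader->open($xmlPath)) {
        return;
    }
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'Content') {
            // Materialize only this one subtree as SimpleXML and hand it off.
            $handler(simplexml_load_string($reader->readOuterXml()));
        }
    }
    $reader->close();
}
```

Each node is processed and then eligible for garbage collection, so memory stays flat regardless of document count.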

Also make sure you have the APC extension enabled; for the actual loop it makes a big, big difference. Some benchmarks of the actual loop would be nice.

Lastly, if you cannot make it faster, then rather than trying to optimize reading the large XML document, you should look into ways of making this slowness a non-issue. Some ideas include an asynchronous process, proper caching, etc.

Edit

Are you actually calling getParentId for every document? This just occurred to me. If you have 1000 documents, that already implies 1000 × 1000 iterations. If this is truly the case, you need to rewrite your code so it becomes a single loop.
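The rewrite can be sketched like this (a hypothetical, standalone version of the question's method; it assumes ServerUrl values are unique): build a path-to-ID index in one pass, then every parent lookup is a hash access instead of a scan.

```php
<?php
// Build the ServerUrl => ID lookup table once: O(n).
function buildIndex(array $documents) {
    $index = array();
    foreach ($documents as $document) {
        $index[$document->ServerUrl] = $document->ID;
    }
    return $index;
}

// Resolve a parent in O(1) instead of scanning all documents per call.
function getParentId(array $index, $parentPath) {
    return isset($index[$parentPath]) ? $index[$parentPath] : 0;
}
```

With 9000 documents this turns roughly 9000 × 9000 comparisons into 9000 insertions plus 9000 array lookups.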


5 Comments

Thanks, that's clever. But by the time I'm at that part, the XML is already in an object ready to use. What you are describing is the 4th point in the flow I've added at the bottom of my post. It's my second priority to look into, but not really a problem right now. Thanks for your advice though, +1!
How big is the actual xml document in bytes?
Just added another wild guess to my answer
Something like 9100000 bytes (8.7 MB). My lab stream is approx 207000 rows. I tried the XMLReaders and so on but they were very slow. If you'd like to have a look at the xml I put it at trikks.com/files/fileXml.xml; it's obviously the //List/Content parts I'm working with :)
I rewrote getParentId; see my second post at the bottom of this topic!

How are you populating the array in the first place? Perhaps you could arrange the items in a hierarchy of nested arrays, where each key relates to one part of the path.

e.g.

['Shared Documents']
    ['Folder1']
        ['Yet another folder']
            ['folderA']
            ['folderB']

Then in your getParentId() method, extract the various parts of the path and just search that section of data:

private function getParentId($documents, $parentPath) {
    $keys = explode('/', $parentPath);

    // Walk down the nested arrays, one path segment at a time
    $docs = $documents;
    foreach ($keys as $key) {
        if (isset($docs[$key])) {
            $docs = $docs[$key];
        } else {
            return 0;
        }
    }

    foreach ($docs as $document) {
        if ($parentPath == $document->ServerUrl) {
            return $document->ID;
        }
    }

    return 0;
}

I haven't fully checked that this will do what you're after, but it might help set you on the right path.

Edit: I missed that you're not populating the array yourself initially, but doing some sort of indexing ahead of time might still save you time overall, especially if getParentId is called on the same data multiple times.



As usual, this was a matter of program design, and there are a few lessons to be learned from it.

In a file system the parent is always a folder. To speed up such a process in PHP, you can put all the folders in a separate array with each one's corresponding ID as the key, and search that much smaller array when you want to find the parent of a file, instead of searching the entire file-structure array!
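A minimal sketch of this idea, with one tweak: keying the folder array by ServerUrl so the lookup is a direct array access rather than a scan. The `ContentType === 'Folder'` test is an assumption taken from the sample dump above.

```php
<?php
// Hypothetical sketch: index only the folders, keyed by path.
// Assumes folders are marked ContentType === 'Folder' as in the sample.
function buildFolderMap(array $documents) {
    $map = array();
    foreach ($documents as $document) {
        if ($document->ContentType === 'Folder') {
            // ServerUrl => ID, so a parent lookup is isset() + array access
            $map[$document->ServerUrl] = $document->ID;
        }
    }
    return $map;
}
```

Since only folders can be parents, the map stays small even with thousands of documents, and each parent lookup no longer touches the full document list.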

  1. Packet downloaded in 6.9351849555969 seconds
  2. Packet decoded in 1.2411289215088 seconds
  3. Packet unpacked in 0.04874587059021 seconds
  4. List data organized in 3.7993721961975 seconds
  5. Standard properties filled in 0.4488160610199 seconds
  6. Custom properties filled in 0.15889382362366 seconds
  7. This page was created in 11.578738212585 seconds!

Compare the custom properties figure with the one from my original post.

Cheers

