18

I have a PHP script that builds a binary search tree over a rather large CSV file (5MB+). This is nice and all, but it takes about 3 seconds to read/parse/index the file.

Now I thought I could use serialize() and unserialize() to quicken the process. When the CSV file has not changed in the meantime, there is no point in parsing it again.

To my horror I find that calling serialize() on my index object takes 5 seconds and produces a huge (19MB) text file, whereas unserialize() takes an unbearable 27 seconds to read it back. Improvements look a bit different. ;-)

So - is there a faster mechanism to store/restore large object graphs to/from disk in PHP?

(To clarify: I'm looking for something that takes significantly less than the aforementioned 3 seconds to do the de-serialization job.)

8 Comments
  • Why not store the information that is in the file into a database? Commented Mar 30, 2010 at 13:24
  • Because the script is part of a tool that specifically does not want to use a database dependency. Commented Mar 30, 2010 at 13:26
  • What do your index objects look like? Commented Mar 30, 2010 at 13:29
  • If you have full access to the web service writing a PHP extension module specifically for faster IP2country searches could be an option. Also a service that monitors the CSV file modification date and provides the data via a named pipe could also fit your needs. Commented Mar 30, 2010 at 13:32
  • @stereofrog: It is a tree of nested node objects, each having a $value (float), a $payload (string) and $left and $right node references. Nothing fancy, but it contains > 100,000 of such objects. Commented Mar 30, 2010 at 13:33

8 Answers

15

var_export should be lots faster, as PHP won't have to process the string at all:

// export the processed CSV to export.php
$php_array = read_parse_and_index_csv($csv); // takes 3 seconds
$export = var_export($php_array, true);
file_put_contents('export.php', '<?php $php_array = ' . $export . '; ?>');

Then include export.php when you need it:

include 'export.php';

Depending on your web server setup, you may have to chmod export.php so that the web server can read it.
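
To also cover the "only when the CSV file has not changed" requirement from the question, here is a minimal sketch of a freshness check (assuming $csv holds the CSV path; the return/include variant from the comments below is used):

// Rebuild export.php only when the CSV is newer than the cached export.
if (!is_file('export.php') || filemtime($csv) > filemtime('export.php')) {
    $php_array = read_parse_and_index_csv($csv); // slow path, ~3 seconds
    file_put_contents('export.php', '<?php return ' . var_export($php_array, true) . ';');
}
$php_array = include 'export.php'; // fast path on every other run

Note that much of the win usually comes from an opcode cache (e.g. OPcache) keeping the compiled export.php in memory; a plain include still has to parse the file on every request.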


3 Comments

I know this is old, but there is a better way, still using the same code. Instead of file_put_contents('export.php', '<?php $php_array = ' . $export . '; ?>');, just use file_put_contents('export.php', '<?php return ' . $export . '; ?>');. And instead of include 'export.php';, use $data = include 'export.php';.
This is an awesome solution. I always use var_export'ed data in includes, and this makes it a little easier!
Reading 27MB of data in var_export format was horribly slow. Creating the var_export was very quick.
7

Try igbinary...did wonders for me:

http://pecl.php.net/package/igbinary
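
For reference, a minimal sketch of how the extension is used once installed (the file name and the surrounding freshness check are assumptions):

// Store the index in igbinary's compact binary format...
file_put_contents('index.bin', igbinary_serialize($index));

// ...and restore it on subsequent runs.
$index = igbinary_unserialize(file_get_contents('index.bin'));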

1 Comment

This is actually slower on my machine somehow; it went from 0.4s with native serialization to 0.5s with igbinary.
5

First you have to change the way your program works: divide the CSV file into smaller chunks. This is an IP datastore, I assume.

Convert all IP addresses to integers.

That way, when a query comes in, you know which chunk to look in. PHP's ip2long() and long2ip() functions do the conversion. Splitting the 0 to 2^32 address range into, say, 100 smaller files gives you much quicker serialization per file.
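
A rough sketch of the chunking idea (the cache directory layout, chunk count, and lookup helper are illustrative assumptions, not part of this answer):

define('CHUNKS', 100);
define('CHUNK_SIZE', 4294967296 / CHUNKS); // 2^32 addresses split evenly

// Which chunk file is responsible for a given IP address?
function chunk_file(string $ip): string {
    $n = (int) (sprintf('%u', ip2long($ip)) / CHUNK_SIZE);
    return "cache/chunk_$n.ser";
}

// A lookup only unserializes the one small chunk that can contain the IP.
function lookup(string $ip): ?string {
    $file = chunk_file($ip);
    if (!is_file($file)) {
        return null;
    }
    $chunk = unserialize(file_get_contents($file)); // array: long => payload
    return $chunk[(int) sprintf('%u', ip2long($ip))] ?? null;
}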

Think smart, code tidy ;)

Comments

4

It seems that the answer to your question is no.

Even if you discover a "binary serialization format" option, most likely even that would be too slow for what you envisage.

So, what you may have to look into using (as others have mentioned) is a database, memcached, or an online web service.

I'd like to add the following ideas as well:

  • caching of requests/responses
  • your PHP script does not shut down but becomes a network server that answers queries (see the sketch after this list)
  • or, dare I say it, change the data structure and method of query you are currently using
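
For the resident-server idea, a minimal sketch using a plain TCP line protocol (the port, the protocol, and the index_lookup() helper are hypothetical):

// Parse once at startup, keep the index in memory, then answer queries
// over a local socket so no request ever pays the 3-second parse again.
$index = read_parse_and_index_csv('data.csv'); // the OP's existing parser

$server = stream_socket_server('tcp://127.0.0.1:9999', $errno, $errstr);
if ($server === false) {
    die("Could not start server: $errstr ($errno)\n");
}

while ($conn = stream_socket_accept($server, -1)) { // -1: wait indefinitely
    $query = trim((string) fgets($conn)); // e.g. one IP address per line
    fwrite($conn, index_lookup($index, $query) . "\n"); // index_lookup() is made up
    fclose($conn);
}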

1 Comment

You have a rich data source which offers many creative ideas, I'm sure you'll come up with something very smooth.
2

I see two options here:

string serialization, in the simplest form something like

  write => implode("\x01", (array) $node);
  read  => explode() + $node->payload = $a[0]; $node->value = $a[1] etc

binary serialization with pack()

  write => pack("fnna*", $node->value, $node->le, $node->ri, $node->payload);
  read  => $node = (object) unpack("fvalue/nle/nri/a*payload", $data);

It would be interesting to benchmark both options and compare the results.
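
Here is a runnable round trip of the pack() variant as a sketch; note that 'n' is a 16-bit unsigned integer, so with more than 65,536 nodes the 32-bit 'N' format (assumed below) is needed for the left/right offsets:

// Pack one node: float value, two 32-bit node offsets, variable-length payload.
$data = pack('fNNa*', 3.14, 42, 99, 'example payload');

// Unpack it back into an object with named fields.
$node = (object) unpack('fvalue/Nle/Nri/a*payload', $data);

printf("%.2f %d %d %s\n", $node->value, $node->le, $node->ri, $node->payload);
// prints: 3.14 42 99 example payload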

3 Comments

The tree has a root node. Would it be enough to pack() that root node, I mean would it pack the entire graph?
Then it is not an option, I'm afraid. :-\
@Tomalak I would like to enlist your help on an unrelated question here on stack overflow about passing byte arrays to a COM object method by reference. Here it is stackoverflow.com/questions/42189245/… As I pored over the internet I came across related questions posted by persons who were stuck in the same rut here bugs.php.net/bug.php?id=41286&thanks=3 I am banking on your expertise to show me how to do it if you don't mind please. I will be so grateful for your help.
1

If you want speed, writing to or reading from the file system is less than optimal.

In most cases, a database server will be able to store and retrieve data much more efficiently than a PHP script that is reading/writing files.

Another possibility would be something like Memcached.

Object serialization is not known for its performance but for its ease of use, and it is definitely not suited to handling large amounts of data.
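
A minimal Memcached sketch along these lines, assuming the PECL memcached extension and a local daemon (the cache key scheme is an assumption):

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// Key the cache entry on the CSV's mtime so a changed file invalidates it.
$key = 'csv_index_' . filemtime('data.csv');

$index = $mc->get($key);
if ($index === false) {                            // cache miss
    $index = read_parse_and_index_csv('data.csv'); // the OP's parser
    $mc->set($key, $index, 0);                     // 0 = never expire
}

Keep in mind that Memcached still serializes the value internally and has a default 1 MB item size limit, so a 19 MB index would need the limit raised or the data split up.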

4 Comments

Is there no binary serialization format for PHP that writes memory bytes to the disk and simply reads them back again? If the CSV is all strings and the index object actually contains less info than the text file, why must its serialized form be so bloated?
@Tomalak: check out pack/unpack
@Robert: Looks like pack works for individual values only, not for complex objects.
@Tomalak: serialize is slower because it does a lot of things that you don't always see when it comes to objects and classes. It also relies heavily on recursion to build a string representation of nested data structures, which may also be slow. I think when you already have table-oriented data (CSV), a relational database is the best option.
0

SQLite comes with PHP, so you could use that as your database. Otherwise you could try using sessions; then you don't have to serialize anything, you're just saving the raw PHP object.
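
A minimal PDO/SQLite sketch of the first suggestion; the schema, table name, and the $csv_rows/$query_ip variables are made up for illustration:

$db = new PDO('sqlite:index.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// One-time import: SQLite's own B-tree index replaces the in-memory BST.
$db->exec('CREATE TABLE IF NOT EXISTS ranges (ip_from INTEGER PRIMARY KEY, payload TEXT)');
$insert = $db->prepare('INSERT INTO ranges VALUES (?, ?)');
$db->beginTransaction();
foreach ($csv_rows as [$ip_from, $payload]) {
    $insert->execute([ip2long($ip_from), $payload]);
}
$db->commit();

// Lookup: the largest range start at or below the queried address.
$q = $db->prepare('SELECT payload FROM ranges WHERE ip_from <= ? ORDER BY ip_from DESC LIMIT 1');
$q->execute([ip2long($query_ip)]);
echo $q->fetchColumn();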

5 Comments

Can I share the object between sessions in PHP?
You couldn't share it between different sessions. Although you could probably get everyone using the same session by setting a custom session ID. Otherwise you would have to look into using shared memory. php.net/manual/en/book.shmop.php
Just a quick note in case anyone stumbles upon it - do NOT use sessions for storing large objects, and even more so - do NOT let people share the same session. This defeats the purpose of using a session in the first place - and, since only one user can access one session id at a time, it will effectively limit request processing to only one! Session has to load from disk/database anyway!
@SteveB Admittedly, the contexts were obscure, but i have used large data-sets in shared/fixed sessions in multiple apps before. If you are building a-typical apps, a-typical solutions are often good ones.
@hiburn8 I can agree with that. If you're fixing a particular issue then it might be a sound idea. Exploring every option available is something I would respect. I might have been too prejudiced based on my experiences.
0

What about using something like JSON as the format for storing/loading the data? I have no idea how fast the JSON parser in PHP is, but it's usually a fast operation in most languages, and it's a lightweight format.

http://php.net/manual/en/book.json.php
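
A quick sketch of what that would look like, assuming the parsed index can live in a plain nested array (as the comments below note, json_encode() cannot represent object references, and strings must be UTF-8):

$cache = 'index.json';

// Rebuild the cache only when the CSV is newer than it.
if (!is_file($cache) || filemtime('data.csv') > filemtime($cache)) {
    $data = read_parse_and_index_csv('data.csv'); // the OP's parser
    file_put_contents($cache, json_encode($data));
} else {
    $data = json_decode(file_get_contents($cache), true); // true => arrays
}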

5 Comments

Yes that would work for data, not for object graphs. I was looking for something that dumps the entire object graph to disk so I would have no penalty for re-creating it (in terms of parsing, error checking, object construction).
JSON cannot represent references. It can represent hierarchies. It's not even necessary to have cyclical references, as soon as there is a parent reference, it's over. Besides, serializing/un-serializing is absolutely not what I had in mind.
You are right, it cannot represent references. Though a parent reference would make the object graph cyclic, i.e. you'd be able to get to someplace you had previously been. Hmm... you could have a sibling reference and it would still be acyclic, making my previous statement wrong.
I don't know about fast or memory-efficient, but I have an almost-working implementation of a JSON-serializer (and un-serializer) for object-graphs, which does support cyclical references. I don't know if this is what you're looking for - my gut feeling is, the amount of data you're wrestling with is probably better off in a database.
A restriction for JSON is that json_encode requires that the string values are in UTF-8 encoding.
