Reading Large Excel Files Using Chunks with PHPExcel in SC

What do I need to do to stop getting the error message "Fatal error: Interface 'PHPExcel_Reader_IReadFilter' not found in C:…" and actually get the code to work? The interface it cannot find is declared in PHPExcel/Reader/IReader.php, which I have already included.

I have the need to read very large files that fail if I try to load the whole file at once with PHPExcel.

I set up an internal library called chunkReadFilter.php with the following code:

sc_include_library("sys", "phpexcel", "PHPExcel/Reader/IReader.php", true, true);

class chunkReadFilter implements PHPExcel_Reader_IReadFilter
{
    private $_startRow = 0;
    private $_endRow = 0;

    /** Set the range of rows that we want to read */
    public function setRows($startRow, $chunkSize) {
        $this->_startRow = $startRow;
        $this->_endRow = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '') {
        // Only read the heading row and the configured rows
        if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {
            return true;
        }
        return false;
    }
}

This is referenced in a blank app having:

sc_include_library("sys", "phpexcel", "PHPExcel/IOFactory.php", true, true);

$inputFileName = "sample.xlsx";

if (!file_exists($inputFileName)) {
    sc_error_message("File missing");
    sc_error_exit();
}

$inputFileType = 'Excel2007';

$objReader = PHPExcel_IOFactory::createReader($inputFileType);

/** Define how many rows we want to read for each "chunk" **/
$chunkSize = 2048;

/** Create a new instance of our read filter **/
$chunkFilter = new chunkReadFilter();

/** Tell the reader that we want to use the read filter **/
$objReader->setReadFilter($chunkFilter);

/** Loop to read our worksheet in "chunk size" blocks **/
for ($startRow = 2; $startRow <= 1000000; $startRow += $chunkSize) {
    /** Tell the read filter which rows we want this iteration **/
    $chunkFilter->setRows($startRow, $chunkSize);
    /** Load only the rows that match our filter **/
    $objPHPExcel = $objReader->load($inputFileName);

    $sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, true);
    var_dump($sheetData);
    echo '<br /><br />';
}
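As a side note (my own observation, not part of the original post): the loop above reloads the workbook on every pass and runs all the way to 1,000,000 even when the file is shorter. A hedged tweak, assuming `getHighestRow()` reflects only the rows the filter actually loaded, frees memory each pass and breaks out when a chunk comes back empty:

```php
for ($startRow = 2; $startRow <= 1000000; $startRow += $chunkSize) {
    $chunkFilter->setRows($startRow, $chunkSize);
    $objPHPExcel = $objReader->load($inputFileName);
    $sheet = $objPHPExcel->getActiveSheet();

    // Assumption: with the read filter active, getHighestRow() covers only
    // loaded cells, so a value below $startRow means the file ran out of data
    $lastLoadedRow = $sheet->getHighestRow();

    $sheetData = $sheet->toArray(null, true, true, true);
    // ... process $sheetData ...

    // Free the workbook before the next pass to keep memory usage flat
    $objPHPExcel->disconnectWorksheets();
    unset($objPHPExcel);

    if ($lastLoadedRow < $startRow) {
        break; // past the end of the file
    }
}
```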

Hi Scriptcaser,
Got your message. I feel your pain.

At first glance, I think you do not need to be that specific when including the library - in either place. I use lots of the features of phpexcel, but include it using just this:
sc_include_library("sys", "PHPExcel180", "phpexcel/PHPExcel.php", true, true);


BTW, you will notice I am using a slightly different library: PHPExcel180 - here is why:

Probably unrelated but I have found I get errors using phpexcel sometimes due to the fact that SC did some patches to the original settings.php file in the phpexcel third party lib.

I downloaded the original unpatched library for PHPExcel - the old 1.8.0 one with the original files:

  • zipped it and used Tools – External Libraries to create a new library called PHPExcel180 for version 1.8.0
  • uploaded the zip
  • so I now use this in my blank app to process the files:
    sc_include_library("sys", "PHPExcel180", "phpexcel/PHPExcel.php", true, true); // mod Jamie 06212018 to override patched SC version

That way I am leaving the patched phpexcel as is with the SC patches, so their uses of it on export buttons and such keep working. I am assuming they patched it for a reason.

This cropped up on me again when I switched from SC8.1 to SC9.1.


Also - phpexcel is now deprecated - https://github.com/PHPOffice/PHPExcel

There is a new one to use called https://github.com/PHPOffice/PhpSpreadsheet

I have not tried it yet, as my code is so damn complicated and I am busy on other things. But we should all be looking at this and possibly other libraries out there.
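For anyone landing here later, here is a minimal sketch of the same chunked-read pattern under PhpSpreadsheet (an assumption on my part that you have a Composer install of phpoffice/phpspreadsheet; note the reader name changed from 'Excel2007' to 'Xlsx', and sample.xlsx is a made-up file name):

```php
require "vendor/autoload.php";

use PhpOffice\PhpSpreadsheet\IOFactory;
use PhpOffice\PhpSpreadsheet\Reader\IReadFilter;

class ChunkReadFilter implements IReadFilter
{
    private $startRow = 0;
    private $endRow   = 0;

    public function setRows($startRow, $chunkSize)
    {
        $this->startRow = $startRow;
        $this->endRow   = $startRow + $chunkSize;
    }

    public function readCell($columnAddress, $row, $worksheetName = '')
    {
        // Keep the heading row plus the current chunk window
        return $row == 1 || ($row >= $this->startRow && $row < $this->endRow);
    }
}

$reader = IOFactory::createReader('Xlsx');
$filter = new ChunkReadFilter();
$reader->setReadFilter($filter);

$chunkSize = 2048;
for ($startRow = 2; $startRow <= 65536; $startRow += $chunkSize) {
    $filter->setRows($startRow, $chunkSize);
    $spreadsheet = $reader->load('sample.xlsx');
    // ... process the chunk ...
    $spreadsheet->disconnectWorksheets(); // release memory before next pass
    unset($spreadsheet);
}
```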

Good luck,
Jamie

Thanks so much for this: huge eye-opener as per phpspreadsheet.

To solve my issue, I eventually switched my source files to CSV and read them with Spout. Infinitely more efficient to process!!!
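In case it helps anyone, the CSV read with Spout looks roughly like this (a minimal sketch, assuming the Spout 2.5.0 autoloader path Jamie shows below and a hypothetical sample.csv):

```php
require_once "spout-2.5.0/src/Spout/Autoloader/autoload.php";

use Box\Spout\Reader\ReaderFactory;
use Box\Spout\Common\Type;

$reader = ReaderFactory::create(Type::CSV);
$reader->open("sample.csv"); // streams the file; never loads it all at once

foreach ($reader->getSheetIterator() as $sheet) {   // a CSV has one "sheet"
    foreach ($sheet->getRowIterator() as $row) {
        // in Spout 2.x, $row is a plain array of cell values
        // ... process $row ...
    }
}

$reader->close();
```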

I will now study/ experiment with this new phpspreadsheet thingie!

Hi Scriptcaser,

As I said, I haven't had any time to go back and redo my importing and exporting processes. I am very interested in what you find about phpspreadsheet. I read thru the docs a few months ago when I was converting my project from 8.1 to 9.1 (honestly still debugging). Hopefully we can share info back and forth.

Since you are using Spout for reading csv, I thought I would share some code below. I wrote it about 2 years ago, but it has comments from when I was doing it. I have a user-driven requirement to read in xlsx files only, so I had to do something to make the server 1) not time out PHP and 2) not hog all resources for someone's import while other users are online. It was a nightmare LOL.

In a nutshell, I run a cron job (1 per minute) and look for import (and export) jobs - stored in a database, etc. I didn't post that code here, but I found that Spout has the ability to open an xlsx at arbitrary rows and export those. So basically I made a function, put it in a file at the same level as the Spout folder in the external library I created, and below is the code. I call it and chop up the large xlsx file into a series of csv files. I tuned these csv files (number of rows) to be digestible in less than 1 minute on my server by phpexcel. I have data columns that need to be encrypted for storage in the DB, computations made, validation checked, all for xlsx files that have from 30 - 200 columns, with data going into a complicated and variable database table scheme. In other words, it is very time consuming when reading and writing xlsx. Phpexcel can suck up all the server memory and kill the whole system.

So my cron process (a blank app with a ton of code) basically looks for a job and, if found, sets a flag to preprocess and starts chopping using the function below. It might take multiple 1-minute cron passes. Then, when everything is converted to a series of csv files, the same cron process sets it to postprocess and runs code using phpexcel to import the data, etc.
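The chunk arithmetic behind that chopping (ceil division, clamping the last window to the user's ending row, capping phases per cron pass) can be sketched in isolation - a hypothetical helper for illustration, not Jamie's actual code:

```php
<?php
// Hypothetical helper: compute the (startRow, endRow) windows that one
// chopup-style cron pass would cut from the requested row range.
function phaseWindows($startingRow, $endingRow, $chunkSize, $maxPhases)
{
    $windows = [];
    // round up so a partial final chunk still gets its own phase
    $phasesToDo = (int) ceil(($endingRow - $startingRow + 1) / $chunkSize);
    // throttle: never do more than $maxPhases in a single pass
    $phasesToDo = min($phasesToDo, $maxPhases);

    $from = $startingRow;
    for ($i = 0; $i < $phasesToDo; $i++) {
        // clamp the last window to the user-specified ending row
        $to = min($from + $chunkSize - 1, $endingRow);
        $windows[] = [$from, $to];
        $from = $to + 1;
    }
    return $windows;
}

print_r(phaseWindows(2, 11, 4, 10)); // three windows: 2-5, 6-9, 10-11
```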

I was very impressed with Spout - however, when I was doing this, Spout just didn't have the features of phpexcel. I have to read and write xlsx files that have colored columns, validation rules, etc. phpexcel has that ability.

Hopefully, the code might give you more ideas to streamline what you are doing - unless your users are able to supply csv with no issues. Again, I have to work with xlsx and Excel is full of weirdness!

Let’s keep in touch about the phpspreadsheet or any other solutions.

Peace, Jamie

----------- code below -------------


<?php
//////
// chopup system

// Note - the 'use' statements have to be outside of the function to work

// use external library feature of SC - note this actually gets uploaded _lib/libraries/sys/Spout/  and note sys is public
// note absolute path to the file inside the Library
///    sc_include_library("sys","Spout","spout-2.5.0/src/Spout/Autoloader/autoload.php",true,true);     

require_once("spout-2.5.0/src/Spout/Autoloader/autoload.php"); // using a DB setting in control



// prepare Spout for the type of file

    use Box\Spout\Reader\ReaderFactory;
    use Box\Spout\Writer\WriterFactory;
    use Box\Spout\Common\Type;



//
//////


function chopup($filePathIn, $targetSheet, $phase, $startingRow, $endingRow, $chunckSize, $maxphasesatatime, $theSystem, $filesFolder, $theFile)
{

// filePathIn - absolute path and filename to read in (e.g. xlsx)
// targetSheet - from 0 to N sheet to read in - only one for now
// phase - the previously completed phase number, or 0 if just starting out; the next one done below is incremented by 1
// startingRow - could be the start of the file, below the header row, or where a previous chopup pass left off (the caller is expected to pass last row + 1)
// endingRow - the last user-specified row to be imported in the file; if the user makes this beyond the actual end, chopup adjusts to the real number
// chunckSize - how many rows to cut at each phase; to save server resources and fight against max_execution_time
// maxphasesatatime    - a way to throttle what happens below, to also avoid max_execution_time problems
// theSystem - dev, prod...
// filesFolder - absolute path where in and out files will be stored
// theFile - the input file name by itself

// debug parameters
    ///echo "filePathIn=".$filePathIn."<BR>";
    ///echo "targetSheet=".$targetSheet."<BR>";
    ///echo "phase=".$phase."<BR>";
    ///echo "startingRow=".$startingRow."<BR>";
    ///echo "endingRow=".$endingRow."<BR>";
    ///echo "chunckSize=".$chunckSize."<BR>";
    ///echo "maxphasesatatime=".$maxphasesatatime."<BR>";
    ///echo "theSystem=".$theSystem."<BR>";
    ///echo "filesFolder=".$filesFolder."<BR>";
    ///echo "theFile=".$theFile."<BR>";
    ///echo "<BR>";    

$startDateTime = date("Y-m-d H:i:s");
////echo "Hello process_queue2 at ".$startDateTime."<BR>";

$numberRowsToDo = ($endingRow - $startingRow + 1);
$phasesToDo = ceil($numberRowsToDo/$chunckSize); // round up to integer

if ($phasesToDo > $maxphasesatatime)
{
    // there will be too many phases to do in this one process, so just do max
    $phasesToDo = $maxphasesatatime;
}

$estlastphase = $phase + $phasesToDo; // what was passed in as previous phase completed + new number to do here

///echo "phasesToDo=".$phasesToDo." on numberRowsToDo=".$numberRowsToDo."<BR>";
////echo "previously completed phase passed in phase =".$phase."-- now will run thru to estlastphase =".$estlastphase."<BR>";



    $reader = ReaderFactory::create(Type::XLSX); // for XLSX files
    //$reader = ReaderFactory::create(Type::CSV); // for CSV files
    //$reader = ReaderFactory::create(Type::ODS); // for ODS files

$reader->setShouldFormatDates(true);



$reader->open($filePathIn);


//$filePathOut = substr($filePathIn, 0 , (strrpos($filePathIn, "."))).".csv"; // remove the last extension (and dot), and tack on csv
$writer = WriterFactory::create(Type::CSV);
//$writer->openToFile($filePathOut);

// note - phase now passed in - so we can restart process
// start at phase 0 (minimum of 1)
$phasePointer = $phase;
$phaseCounter = 0;
$rowCount = 0; // init
$rowWritten = 0; // how many actually get processed on this pass

$phasestartingRow = 0; // initialize (previously: $startingRow)
$phaseendingRow = 0;   // initialize (previously: $phasestartingRow + $chunckSize - 1)
$graceful = FALSE;  // initialize this way to detect if we never reached the user-specified end before reaching a real end row
$ReachedEOF = "No"; // will get passed back in the result to the caller - so we know when to change the job Status

foreach ($reader->getSheetIterator() as $sheet) 
{
    // use 0 below to only read data from 1st sheet -- index is 0-based
    if ($sheet->getIndex() == $targetSheet) 
    { 
        $rowCount = 0;
        foreach ($sheet->getRowIterator() as $row) 
        {
            // do something with the row
            $rowCount ++;

            if (($rowCount == $startingRow) AND ($phaseCounter == 0))
            {
                // the first phase passed in is starting now - according to the user-specified startingRow
                $phasePointer ++;
                $phaseCounter ++;
                $phasestartingRow = $startingRow;
                $phaseendingRow = $phasestartingRow + $chunckSize - 1;
                if ($phaseendingRow > $endingRow)
                {
                    // don't need to go as far as chunk size would allow - only to what the user specified
                    $phaseendingRow = $endingRow;
                }

                ///echo "==== Phase ".$phasePointer." -- phasestartingRow=".$phasestartingRow." -- phaseendingRow=".$phaseendingRow." ======================================<BR><BR>"; // debug

                // also open a new write file for this phase 

                    $filePathOut = $filesFolder.substr($theFile, 0 , (strrpos($theFile, ".")))."_phase".$phasePointer.".csv"; // remove the last extension (and dot), and tack on csv
                    $writer->openToFile($filePathOut);

                    // it is first row of this phase so we must save it too
                    ///echo "ROW#: ".$rowCount." ---- ";
                    ////echo implode("|",$row);
                    ///echo "<BR>";

                    // write the data to the CSV file    
                    $writer->addRow($row); //$writer->addRow([$name, $email]);
                    $rowWritten ++;

            }
            elseif (($rowCount >= $phasestartingRow) AND ($phasePointer <> $phase))
            {
                // it is within row range of our starting parameter
                ///echo "ROW#: ".$rowCount." ---- ";
                ////echo implode("|",$row);
                ///echo "<BR>";

                // write the data to the CSV file
                $writer->addRow($row);  
                $rowWritten ++;


                if ($rowCount == $phaseendingRow)
                {
                    // got last row we needed for this phase

                    // close the write file !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                    $writer->close();

                    $phasePointer ++;
                    $phaseCounter ++;
                    if ($phasePointer <= $estlastphase)
                    {
                        // still got more to do
                            $phasestartingRow = $phaseendingRow + 1; // start again where we just left off
                            $phaseendingRow = $phasestartingRow + $chunckSize - 1;
                            if ($phaseendingRow > $endingRow)
                            {
                                // don't need to go as far as chunk size would allow - only to what the user specified
                                $phaseendingRow = $endingRow;
                            }

                            ///echo "==== Phase ".$phasePointer." -- phasestartingRow=".$phasestartingRow." -- phaseendingRow=".$phaseendingRow." ======================================<BR><BR>"; // debug

                            // also open a new write file for this phase 
                            $filePathOut = $filesFolder.substr($theFile, 0 , (strrpos($theFile, ".")))."_phase".$phasePointer.".csv"; // remove the last extension (and dot), and tack on csv
                            $writer->openToFile($filePathOut);

                     }
                    else
                    {
                                // we are done with all phases - no more phases
                                //    this will be a graceful break - as we reached this point
                                $graceful = TRUE;
                                $phaseCounter = $phaseCounter - 1; // back counter up


                                break; // no need to loop thru more rows on this sheet - note we already closed the out file
                    } 


                } 


            }
            // else just skip this row - it must be before we want to start processing rows

        }
        break; // no need to read more sheets
    }

}

$reader->close();

// now check if we reach the ultimate end of the records specified by user    
if ($rowCount == $endingRow)
{    
    $ReachedEOF = "Yes"; // because we hit the last record as specified by the user originally
}

if ($graceful == FALSE)
{
    // we must have had a shorter file in terms of rows than user specified
    $writer->close();

    $ReachedEOF = "Yes"; // because we hit actual EOF before the user-specified record

    // should update the number of actual phases and actual number of rows
    $actualNumberPhases = $phaseCounter;
    $lastphaseDone = $phasePointer;

}
else
{
    // we completed thru the specified number of rows so actual phases = estimated
    $actualNumberPhases = $phasesToDo;
    $lastphaseDone = $phasePointer - 1;

}

///echo "graceful = ".$graceful."<BR>"; 
///echo "phaseCounter = ".$phaseCounter."<BR>"; 
///echo "Target Sheet was ".$targetSheet."<BR>";
///echo "Last Row Processed ".$rowCount."<BR>";
///echo "actualNumberPhases = ".$actualNumberPhases."<BR>";
///echo "lastphaseDone = ".$lastphaseDone."<BR>";

$endDateTime = date("Y-m-d H:i:s");

///echo "Ended process_queue2 at ".$endDateTime."<BR>";

$realstartDateTime = date_create($startDateTime);
$realendDateTime = date_create($endDateTime);
// getTimestamp() gives total elapsed seconds; note that $interval->format('%s')
// would only return the seconds component (0-59) of the interval
$intervalSeconds = $realendDateTime->getTimestamp() - $realstartDateTime->getTimestamp();

///echo "Time for process_queue2 was ".$intervalSeconds." seconds<BR>";

$resultArray['lastphaseDone'] = $lastphaseDone;
$resultArray['intervalSeconds'] = $intervalSeconds;
$resultArray['lastrowDone'] = $rowCount;
$resultArray['ReachedEOF'] = $ReachedEOF;
$resultArray['phaseCounter'] = $phaseCounter;
$resultArray['rowWritten'] = $rowWritten;


return $resultArray;

}
/*
//////
// setup variables and call function to test

// the xlsx source file
    $InfilePathIn = "/home/blahblahblah/import/files/Test-Data-2015 XYZ-ECT - Lead - with errors(51).xlsx"; // index 0

$IntargetSheet = 1; // sheet index starts at 0

$InstartingRow = 2; // rows start at 1
$InendingRow = 3277;
$InchunckSize = 500;

$Inpreviousphase = 5;

// the remaining parameters chopup() expects (example values)
$Inmaxphasesatatime = 4;
$IntheSystem = "dev";
$InfilesFolder = "/home/blahblahblah/import/files/";
$IntheFile = "Test-Data-2015 XYZ-ECT - Lead - with errors(51).xlsx";

// chopup() returns an array of results, not a scalar
$result = chopup($InfilePathIn, $IntargetSheet, $Inpreviousphase, $InstartingRow, $InendingRow, $InchunckSize, $Inmaxphasesatatime, $IntheSystem, $InfilesFolder, $IntheFile);
echo "lastphaseDone = ".$result['lastphaseDone']."<BR>";
//////
*/
?>

Thanks so much: this looks very interesting and promising. I will print out and study line by line.

I had actually attempted Spout to read the Excel but was shocked that, despite the promises of efficiency on their site, it took my code 30 minutes to execute this simple loading statement after instantiation:
$reader->open($inputFileName);
This was on a 21MB file: go figure!

Fortunately, it was easy to switch my source files to csv, hence trying out Spout's csv import, which was a breeze. I also need to generate xlsx; the plan was to use phpexcel, but we will see what wonders phpspreadsheet can perform.
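For the xlsx generation side, here is a minimal PhpSpreadsheet write sketch (assuming a Composer install of phpoffice/phpspreadsheet; the data and output file name are made up):

```php
require "vendor/autoload.php";

use PhpOffice\PhpSpreadsheet\Spreadsheet;
use PhpOffice\PhpSpreadsheet\Writer\Xlsx;

$spreadsheet = new Spreadsheet();
$sheet = $spreadsheet->getActiveSheet();

// fromArray() fills a rectangular block of cells starting at the anchor
$sheet->fromArray(
    [['Name', 'Email'], ['Ann', 'ann@example.com'], ['Bob', 'bob@example.com']],
    null,   // value to treat as an empty cell
    'A1'    // top-left anchor cell
);

$writer = new Xlsx($spreadsheet);
$writer->save('output.xlsx');
```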

Cheers