Code cleanup

From PGVWiki
Revision as of 18:08, 7 June 2010 by Momse (talk | contribs) (→‎PHP)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

This article is intended to provide up-to-date information about code cleanup issues, progress and which developers are working on particular areas. This article has arisen out of at least one thread on the Open Discussion forum:

Article Organisation

  • Please try to help keep this article from exploding in length by providing suitable links to other web resources/information. Also, for possible discussions, use the discussion page.
  • Also, try to use links to relevant directories/files in the development branch of the SVN repo as this is where most of this work will be done before merging it to the trunk.

Consolidating/Separating Programming Languages

Separating the code of several programming languages is advantageous for clean, easy to follow code with minimal redundancy. This make it easier to maintain, modularise and for new developers to get involved as it will be another step closer to pure MVC.

HTML

One way (please add others if you know them) to separate out HTML (View in MVC) from the PHP application code is by using a template engine. The application code does not concern itself with how the data is to be displayed as this is the job of the template. Thus programmers can concentrate on the code and templates can be designed independently to provide different views to the end user - easier for theming/skinning. A good thing in any programming.

Cons for template engines in general:

  • An external dependency - We'd simply add the library to the vendor_branch and include it with PGV. This would allow us to easily update the library when new releases are made by the vendor. Possibly using some sort of check to see if the user has a more up-to-date version.
  • Could there be a speed impact?
    • I just made a quick speed test with "ab" (the content is a table with list of INDIs from a PDO db or declared array (to check DB overhead)) ("ab -n 100 -c 1", average of 3 tests):
Test Requests/sec
PHP + Smarty + DB 11.22
PHP + print() + DB 11.39
PHP + Smarty + array() 11.98
PHP + print() + array() 12.31
Pure HTML (generated file) 911.68

I think that using PHP's OO abilities gives a huge overhead already and adding a template engine would not increase the request times significantly. --Korbendallas1976 00:52, 2 February 2008 (EST)

  • ...

Below is a table for people with any knowledge of template engines to add pros/cons. Our aim is to try to find a possible solution to separating out the view aspect of MVC, one of which could be a template engine.

Suggestions to moving to a template engine such as Smarty. Please add other options below and we can collate pro's and con's for them. There is an open RFE for a Template System that contains some arguments. A search for Smarty in the forums will show some arguments related to this.

Option Website(s) Pros Cons
Smarty

compiles to PHP, caching, functions, plugins

John's Comment #1, #2, #3,

Here is a list of 25 template engines for PHP.

Apart from the links to John's comments this seems to be moving to a discussion about 'which' template engine should be used. I have no experience of any, but feel we should first decide whether a template engine is necessary / useful or not, before thinking about which one. Can anyone help explain their virtues? John has described their pitfalls pretty well.

JavaScript

Needs to be gleamed out of various output methods and HTML files into .js files which are then used in the HTML header section where they belong.

SQL

A collection of lots of SQL statements (like those for creating the tables) belong in their own .SQL files to separate languages (like HTML from PHP). It is also advisable to group all - or most - other SQL in a single file that provides functions to call from the outside - for easier debugging. This should be part of the Model anyway.

Getting MVC Right

A great example that badly needs it is the whole import process (I would love to see the ability to develop a command-line import tool).

Model

Defines an API through which the data of the software can be accessed. This defines everything, from the relations to types, ranges and meta-data. How this data is then stored is the sole concern of the Model, as long as it is uniformly accessible through the API (it can be stored in a DB, a text-file, XML - you name it). Parsers and serialisers thus belong in the Model. If done correctly, we should be able to create unit tests for the API which can be run frequently to ensure the smooth, "bug free", stable running of this part of PGV.

I am not sure about having two storages for the model as is currently the case in PGV (DB and the Gedcom file) because this introduces a whole lot of sync trouble (which one is to be trusted if they are not in sync?). Since the DB is of course a whole lot faster than parsing the Gedcom every time, maybe they should only be synced when the gedcom is im-/exported. - Can someone clarify how the GEDCOM file and DB are used and synced etc during the use of PGV?

(Answer: Data is stored at gedcom level 0 record level. Hence all updates replace one level 0 record with a new version. After each update to the database (following the acceptance of an change), the old record is located in the disk-based file, and replaced with a new record. If the sync-gedcom-file option is turned off, then this action does not occur. The gedcom obtained from the download option will be different to the contents of the file on disc; line endings, record order, header contents, etc. Also, failed file updates are ignored, so the DB copy should always be regarded as the master copy. This process requires that the full file is loaded into memory (plus a second post-edit copy), which can blow the PHP memory limits for large gedcoms.)
My question would then be, why is this required? Is it because it's a relic of when PGV used flat file indicies? Couldn't it all be done in a database, then when someone wants to export a GEDCOM (or any other plugable format for that matter), they click a button and the contents of the database are set to the browser for download in the requested format?

Using classes is not the same thing as OO. OO requires other important concepts such as abstraction and data-hiding. Classes should have a clear well defined function and behavior that forms part of an API for which Unit tests can be written.

Classes

As I see it, we should have classes that represent the data all inheriting from a GEDCOM base class. These "data classes" are simple representations of the data (i.e. they don't need to know that the data has come from a GEDCOM file or some other place entirely) with simple getters/setters for that data. There would be other classes dealing with IO from different file formats, storage backends etc. It would be the responsibility of the "IO classes" to know the structure of the file format, DB backend etc and create the "data classes" from that data or visa versa.

Here is a list of websites I found dealing with a GEDCOM model:

Comments
I have just performed a global search for the string INDI and counted those files where the string appeared to be used in the context of the GEDCOM file format. Ignoring language files I came to a rough estimate of 67 files where a detailed knowledge of the GEDCOM file format is embedded. One could argue that data abstraction is even more important than MVC. For a newcomer to the code this is also an issue that makes the code difficult to understand. A typical example (from person_class) is:
$famlink = get_sub_record(1, "1 FAMC @".$family->getXref()."@", $this->gedrec); 
$ft = preg_match("/2 PEDI (.*)/", $famlink, $fmatch); 

Presumably at some stage in the future, whether in 1, 2, 5 or 10 years time an improved GEDCOM format will appear (GEDCOM XML is already specced). Is this area of data abstraction for the GEDCOM format one that should be treated as a priority in the Code Cleanup? Conveniently of course what you end up with is a good chunk of what is known as the Model in the latest Three Letter Acronym.

View

Describes a "View" of the Model, that is, a particular way of looking at it, as for example, the source-code-view or the UML-view (in the case of code). The view does not concern itself with anything else, e.g. security, just with how to represent the data. Views are exchangeable (i.e. pluggable), and get their data from the Model, so as soon as you change something in one view, the other views of the same data change accordingly. Themes and a little bit more for that matter, belong in the View.

We should remove all presentation and layout from the HTML and put it in CSS.

Comments
Any construct of any complexity should have a comment at its end to help match the end to the beginning. Examples:
} // if edit allowed 
DOM elements should begin and end in the same function or even the same set of braces. Hard to change a set of table rows/cells to a set of divs when the the parts and the tags between them are printed from a utility function.

Controller

*_ctrl.php in the controllers folder should not be concerned with the printing/displaying of content. The controller in MVC is intended to glue the model and view parts together and provides all the programming logic behind the app, like security, user management, file management and the like.

Unit Tests

Unit tests are short (or not so short) scripts that sets up appropriate objects required for the test, and systematically checks that the return value from a method is what was expected i.e. correct. Different method arguments should be used, particularly for boundary conditions to ensure that as many routes through the code are covered (e.g. both sides of an if statement are tested).

A proper unit test framework will run the required test suite(s) and summarise the results as the number of passed tests along with details of those that failed. We are currently, using SimpleTest as our testing framework as it is both PHP4 and 5 compliant and has the ability to test web interfaces. However, the current lack of correct abstraction of objects has hindered the development of simple unit tests.

Test First Development

The tests are written first i.e. before the code itself. While this seems counterintuitive, it helps focus the mind on developing a suitable API and allows you to have a good test suite from the start. For example, the pseudocode of a single unit test may look like this:

$test->assertsTrue( $dog->canBark() )

All this does is to ensure the canBark() method of the $dog object returns true. We know that a dog can bark, so we expect this test to pass. It doesn't matter how the method actually works, we simply want to know if the API works as expected given the relevant input to the various methods - thus we can write the unit tests and design an API at the same time. This also helps us to think about whether the method internals are hidden to the programmer using the classes via the API (a good thing). If for some reason, it returns false then this test fill fail and we fill be provided details about the failure so we can fix the method.

What this means, is we can write tests first, and then write the method code and refine it until the test passes. In the absence of any other information we might come up with the following code for canBark():

function canBark() {
  return true;
}

If however, we know a dog could be mute and thus cannot bark like a normal dog, we might write a test (thus defining an API at the same time) like:

$test->assertsTrue( $dog->canBark() );
$dog->setMute(true);
$test->assertsFalse( $dog->canBark() );

First, we are saying a dog can bark by default. Then we use setMute(true) to say the dog can no longer bark. These tests now check that canBark() returns the correct values but also tests (indirectly) if the isMute(true) method correctly sets the dog to being a mute. Now, we rewrite our methods until the tests pass:

function canBark() {
  if ( isMute() ) {
    return false;
  }
  return true;
}

function setMute($value) {
  $this->mute = $value;
}

function isMute() {
  return $this->mute;
}

While there may be other ways to write the code, and still have the tests pass, what we have done by writing the tests first, is to start creating the API while the implementation is hidden (as it should be). Thus, we may go away and totally rewrite the code underlying the API, but as long as the tests continue to pass, we don't really mind about the implementation.

Running Tests

Unit tests are run frequently to ensure any code changes do not have any adverse affect on the outcome of the tests. Thus ensuring new bugs are not introduced. Where bugs are found, we first write a unit test that exposes the bug while testing for the correct outcome. Then we go away trying to fix the code until the test passes but also ensuring that other tests don't start to fail. Once we achieve this, we have fixed the bug, and have included a test to ensure it doesn't raise it's ugly head again. Thus, over time, a test suite will fully stress test the code (proving that the code does what we say it does) and give confidence that we are as bug free as possible while ensuring that development time doesn't go into fixing bugs that reappear down the line.

GEDCOM Parser

There are two locations in the code where a GEDCOM file is parsed (--GEDCOM functions-- in functions.php and in functions_import.php). This process is very complicated and thus error-prone because it uses RegExes for the parsing. This should be done by a using lexer, which are used in compilers. PEAR has classes for parser- and lexer-building which is all that's needed. I'm sure lexers can handle international character issues as well.

Modules to Plugins

The current approach does not define enough hooks to make PGV really customizable. The very existence of remedy code in the general code-base for the most common modules like googlemap or lightbox (in individual.php) is witness to this. The ultimate goal is for plugin writers to not have to touch or modify any code of the 'core' - preventing them from breaking anything and allowing the core developers to change the underlying functionality without breaking any of their plugins' code. That is, the mod does not have to be re-inserted into every new release. For that hooks need to be defined for every little function call in PGV, every section of the output, in effect - for everything. Look at WP or G2 for examples. We need to compile a list of all possible locations in the code for hooks. It does require some effort on the part of the developers, though, for defining a sensible API that is reasonably stable over long periods of releases (again Unit tests can help ensure it remains stable).

  • Sorry for being dim-witted....but, could you explain what you mean by a hook? --Nath | Talk 23:27, 1 February 2008 (EST)
    • See the - rather short - article in wikipedia. In Programming a hook is generally a specific place in the code where plugins can 'hook' in. For this the plugin programmers define functions and register them with the software for specific hooks. The hokee keeps a list of hooks and functions that are registered for them and makes sure to call the right functions when that place in the code is reached. If there is more than one plugin registered for a hook they get called sequentially (in the order they registered). By using the hook plugins are then able to be notified of something or change something. For that purpose it gets called with the information that is relevant and can decide what to do. For example when a user gets created, plugins can get notified, or before and after the template is rendered into HTML - thus giving plugins the ability to change the output. In WP, for example, there is a hook that gets passed every blog entry before output. It is used also by the core for some modifications to the final output - like HTML-specialchars, NL2BR-ing or linkifying.

General Organisation

General organization: Files ending in .php should be files that need to be executed directly in the browser, like index.php or individual.php. These - and only these - should reside in '/'. Every other file that is included by the main files should reside in a logically named subfolder (includes for example) and have an ending of .inc, class files by convention have the ending of .class; thus preventing them from being executed and accessed directly. Files that logically belong together should be placed in the same subfolder; all files comprising the back-end (admin functions) should be in a folder named something like admin or pgv-admin. This makes it clearly visible to anyone what function can be found where and also which part of the HTML output comes from where.

Comments
I agree with many of the conventions mentioned, but ask that the files retain the .php extension. If the webserver does not recognize the file type, it will do one of two things:
  1. serve it as plain text. This would allow hackers who know the file layout to read our settings
  2. serve it as a binary file. This prompts visitors to download the file. Once on their local machine, it is readable by any text editor.

Instead, I suggest two levels of extensions, based on the function of the file. As an example, the filename gedparser.class.php clearly shows that it contains a class used to parse GEDCOM.

Followup: Since PGV is an open-source project, anyone can go to phpgedview.net, download the source and then know our file layout, and even: The contents of those files.
And because of that you put all the relevant settings in a file appropriately named config, which is the only file that needs to be protected - which is why it is mostly a .php file that displays nothing. The thing that you do not want, is that people are able to execute all those files directly, which is a greater security risk than them reading them. And by making all files have a .php ending you just allowed them to. Because then all you need is one file that does not rigorously check if it was executed or included and you have a much bigger security risk than you would have, had the user just seen the source code (which he can get anyway). Now the attacker can pass variables, try to set options etc.

SVN

Lots more line breaks. SVN can't merge two unrelated changes when they are both in different parts of the same mile long line of code.

PHP

Do we drop support for PHP4 and go with PHP5?

We should use PHP's multibyte and internationalisation libraries instead of wasting time on our own slow/broken version.

DBMS

Store everything in the database. This means everything. It includes all the stuff we currently store in files. We need a data model that models the data, not one that models gedcom files. This will mean dropping support for synchronised gedcom files (which would be a good thing in itself).

Images (inc thumbnails) can easily be stored in a BLOB together with some basic file info/meta-data.

- I don't think we should require people to store images in the database as hosts generally give more space to the filesystem than they do to the db. Also, FTP can be used to upload images to the filesystem. --Ljm 09:25, 2 February 2008 (EST)

Miscellaneous

  • Replace print_XXX() with print XXX() to prevent this sort of construct:
ob_start();
print_list_person(...);
$indi=ob_get_contents();
ob_end_clean();

Stakeholders

Here is a list of possible stakeholders outside the PGV dev team, that should probably be consulted about their experiences with PGV code.

Drupal Integration
http://drupal.org/project/phpgedview
Joomla Integration
http://joomlacode.org/gf/project/phpgedview/

Road Map Questions

Things that need to be considered/decided for the development of a road map are things like:

  • When do we move to support PHP5 and drop PHP4 support? With PGV 4.2?
    • If we move to support only PHP5, do we use this opportunity to make changes that will not be backward compatible?
  • Should a branch be made (from the current dev branch or trunk) where changes can be made which will not be backward compatible and devs that want to make major changes can?
  • Others? I'm sure, please add them here.
  • I think PGV 4.2 could be the bells&whistles version of the GEDCOM-based 4.X line. The new PGV "5" does not need to be backward compatible, so migrating is easier (GEDCOM export/import and some DB scripts to recover data from the "old" 4.X installations).
  • I like this idea: PGV 5.x for supporting PHP5 only and PGV 4.x for the last version supporting PHP4. So how much development effort are people going to put into the 4.2 branch and how do people see this fitting in with starting a 5.1 branch for more of a rewrite or big code cleanup (whichever is easier)? --Nath | Talk 22:07, 31 January 2008 (EST)

Road Map

Main article: 5.x Roadmap

I propose the following as a overview of a possible roadmap (not complete). It might be good for us to team up and have a detailed look at specific items (getting further input from appropriate devs) and make a short report (in the form of a wiki page) to the rest of the devs. Please add specific taks you'd like to be looked into in detail.

SVN

  1. We might want to make a temp branch where we can commit code/script examples etc that demonstrate a particular feature, benchmark, test etc

Backend Review

  1. Review the DB structure

API to Backend

  1. Are we only ever likely to want to use a DB backend? i.e. Do we need to think about having an API to a generic backend which allows us to change the backend with a simple configuration change?
    • If so, can our data classes be uncoupled from from a specific backend?
    • How could/is this be done? Is this done by extending the data classes to tie them to a backend somehow - I'm a bit confused how the model is tied to a backend.
  2. One option might be something like: Easy PHP Data Objects EZPDO.

Data Classes/API

  1. How many current classes are sufficiently uncoupled from the backend, IO and business logic (e.g. privacy) to be used with little effort?
  2. How much current code can be used as is?
  3. How much current code could be used with a small amount of code change?
  4. How much current code would need redesigning?
  5. We should aim to come up with a suitable API before writing the code for the classes. This will allow us to really think how a programmer should use our classes and allow us to write unit tests before the code i.e. test driven development.

Important PGV Features to Consider

  1. i18n - internationalisation

Interested Developers

If you're interested in helping out with things detailed on this page, please add your name with a link to your PGVWiki user page below. Please also included relevant contact details and a brief overview of your skills on your user page so others know how you might be able to help.

Could you now sort yourself under all heading you think you will be able to make the most contributions towards or are most interested in helping out (add new headings you you think it necessary).

DB Backend

MVC

Model

View

Controller

Testing

Unit Testing