Login

10-22-2009, 09:18 AM

[eluser]TornUp[/eluser]
hi all,

First off, let me explain my problem. im building a system that stores partnumbers within a database, the problem i have encountered is that these part numbers are not consistant(a customer could ask for partnumber: "abc 123" but it could be "abc-123"..) the solution to this problem is to strip all Special Chars apart from ABC... and 123... and then get my sales people to mearly search using "abc123"

This is where my problem is.. while inserting the data into the DB, i need to "purify" the part-numbers... but i don't know any Regular Expressions! lol... im guessing its RE is the best thing to use(with Preg_replace(); ) or does CI offer some kind of Lib/Helper that can do the same thing?

Thanks for reading. hope you can help Smile

Tom.

10-22-2009, 10:01 AM

[eluser]djjonex[/eluser]
Me in your case, I'll have a standar format in my DB for the part number. Then I'll try to figure out whats the input format.

Using the preg_match function to determine the input format, and after you have the format, you need to change it to your DB format with the preg_replace function.

I think this link should help you.
http://www.webcheatsheet.com/php/regular...ssions.php

10-22-2009, 12:22 PM

[eluser]jedd[/eluser]
Depending on how much you intend to mung the provided parts codes, I'd possibly look at running a second column for parts IDs, calling it 'munged_part_code' or some such, as populating it as data comes in.

Said munging would be using an algorithm such as you described - removing hyphens, underscores, spaces. I'd only do this so that my users, if they were encouraged to do the same mental munging when searching for part numbers, would be more likely to get a result, and get one faster. It's dubious how much time this would save, though I reckon it would be low-cost to implement it.

Alternatively and/or additionally I'd do this on the search side - anything with a space in it would just become either a string of LIKE queries, or an array of strings (identified as being separated by spaces of course) that were LIKE'd to %$string1%$string2%string3%.

10-22-2009, 12:42 PM

[eluser]TornUp[/eluser]
[quote author="jedd" date="1256253729"]Said munging would be using an algorithm such as you described - removing hyphens, underscores, spaces. I'd only do this so that my users, if they were encouraged to do the same mental munging when searching for part numbers, would be more likely to get a result, and get one faster. It's dubious how much time this would save, though I reckon it would be low-cost to implement it.[/quote]

do you mean you recommend having 1 column with the original part number(EG: "ABC-123") and one column with "purifyed" data (EG: "ABC123") and then run a query on both?(or run query1(purifyed)-> did it return any results? no! -> run query2)

hmmm oviously, after reviewing my problem... im going to also have to also compensate for my end user(who trust me... are stupid! lol) by also runing the same "purification" on their search string..

@Jedd - if im getting the wrong end of your solution, sorry... could you explain a little more detail with an example? :red:

10-22-2009, 02:17 PM

[eluser]jedd[/eluser]
It's a tough call, because I really hate de-normalising a database. Duplicating two fields, with the second field being programmatically determined from the first, means you probably should do it (only) whenever you need it. Ie, at data-retrieval time rather than at data input.

As I said, it depends how much you're going to mung the information on input. My suggestion was as you understood it - query the first (munged or 'local product code') column for an exact hit, if that fails, then use a more expensive query, trying to break down what the user has entered and being a bit fuzzier in your matching to the original product field. This lends itself to an environment where users may search around for product codes the first time, and then get used to whatever the internal (second field) product code is, and use that from then on. Obviously this also depends on just how bad your source data is, and whether it's easier to clean that up (it sounds like it isn't) rather than work around it like this.

If the problem really is as simple as 'abc-123' being in the DB and people entering 'abc 123' instead - then it's quite easy. You just search on those two strings in the same product id. But if they're typing 'abc123' and expecting to hit a field containing 'abc-123' then this becomes more complex. You could start sub-stringing based on changes between alpha and numeric, for example - but then all it takes is an original product code like 'abc-123-456' and a user entering 'abc123456' - there's no sensible algorithm that gets you from what the user entered to the original product code.

Instead, you could start doing sub-string searches and returning possible options to the user for them to select amongst. This is tried and trusted technology, especially if you do some dynamic ajaxy stuff to show the user potential matches as they type.

There are also libraries - I've worked with a particularly expensive one on an Oracle system some years ago - that do the kind of searching you're looking at. Soundex for product codes, as it were. There may well be some free ones out there for PHP/MySQL by now.

10-22-2009, 03:57 PM

[eluser]BrianDHall[/eluser]
You are being too hard on yourself - regex in this instance is super easy to do, I will show. Great resource for future reference: http://www.regular-expressions.info

I highly recommend you have an IDE or text editor that offers regular expression search/replace so you can test things there to see how they work before trying to mess around with PHP functions.

Code:
$dirty = 'partABC!-# 12:/}{\\34';

function part_degreaser($dirty)

{

$pattern = '/' . '[^\w]' . '/';

$clean = strtolower(preg_replace($pattern, '', $dirty));

// $clean is now partabc1234

return $clean

}

The pattern is extremely simple here - it says "if there is anything other than a regular word character (alphanumeric), kill it!".

You can run this exact same function on search string input for part numbers, and before loading part information into the database.

You just have to make sure before-hand that you _really_ don't care about anything other than alphanumeric part numbers. If "abc123" is different than "ab-c123" is different than "abc-12-3" - then you need to store the original unsanitized part number for disambiguation. You still run the same query to pull results, but you know you might have multiple results - so you display them all to the user and let them figure out what they really wanted.

10-22-2009, 04:36 PM

[eluser]TornUp[/eluser]
[quote author="BrianDHall" date="1256266678"]You just have to make sure before-hand that you _really_ don't care about anything other than alphanumeric part numbers. If "abc123" is different than "ab-c123" is different than "abc-12-3" - then you need to store the original unsanitized part number for disambiguation. You still run the same query to pull results, but you know you might have multiple results - so you display them all to the user and let them figure out what they really wanted.[/quote]

i think i need to go back to the source, and chat with the sales manager who is requesting this system, just to see if i can get away with stripping everything down to: “abc123” and hope that we don't have a situation where “ab-c123” is different than “abc-12-3”

i intended to store a original copy of the PN in an additional DB column, and display that on the result page, if i choose to go down the special char stripping route.

[quote author="jedd" date="1256260633"]Instead, you could start doing sub-string searches and returning possible options to the user for them to select amongst. This is tried and trusted technology, especially if you do some dynamic ajaxy stuff to show the user potential matches as they type.[/quote]

yeah.. Ajax "suggestive search" is defo somthing i want to put in! but, as with Regular Expressions... frameworks like Jquery blow my mind! lol.. i have a feeling im going to be reading a lot of tutorials soon! hehe

[quote author="jedd" date="1256260633"]Obviously this also depends on just how bad your source data is, and whether it’s easier to clean that up (it sounds like it isn’t) rather than work around it like this.[/quote]

the source data is being imported from an excel file, thats been changed,CnP'ed and all sorts over the last 5-6 years... the data is NOTHING but clean!

this is the additional problem... im dealing with around about 35-40,000 rows(each with 6-7 columns) so sanitising can't be to CPU/ram intensive..