regex help needed pertaining to url cleaning

Status
Not open for further replies.

aturetsky

Programmer
Apr 28, 2005
12
US
I wonder if you can help me with your regex expertise.
I need to write a Java method that will have the following signature:

String cleanUrl (String regex, String url)

The method itself will likely be easy - something like
return url.replaceAll(regex, "");

but it doesn't have to be that - you can suggest a different regex processing invocation.

My question has more to do with the regexes I will pass into the method. The regexes need to support 5 different url-cleaning operations:

1) remove specified query params
2) keep specified query params
3) remove all query params
4) just leave the full host
5) just leave the specified number of host components


Examples corresponding to the 5 types above:
Let's say the url is
1) remove var1 and var3 ->
2) keep var3 ->
3) remove all ->
4) just leave the host ->
5) just leave 3 host components ->
Any idea how I would write regexes (and the corresponding code) to handle all of these while keeping the Java code exactly the same for all 5 cases, so that only the regex differs? In other words, the client code should not specify which of the 5 operations it needs; all of that has to be implicit in the regex and in the code of the cleanUrl method.
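For reference, here is a minimal sketch of how far the removal-only form `url.replaceAll(regex, "")` can be pushed for the five operations listed above. The class name, the constant names, and the sample URL are hypothetical; only `cleanUrl` and the parameter names var1 to var3 come from the question. The patterns assume a plain `http://host/path?name=value&...` shape with no fragment, and their complexity hints at why a regex-only design is hard to maintain.

```java
class RemovalOnlyCleaner {
    // The method from the question: delete whatever the regex matches.
    static String cleanUrl(String regex, String url) {
        return url.replaceAll(regex, "");
    }

    // 1) remove var1 and var3: either a run of unwanted params reaching the end
    //    (together with the separator before it), or one unwanted param plus the '&' after it.
    static final String REMOVE_VAR1_VAR3 =
        "[?&](?:(?:var1|var3)=[^&]*&?)+$|(?<=[?&])(?:var1|var3)=[^&]*&";

    // 2) keep only var3: same shape, but "unwanted" means any name other than var3.
    static final String KEEP_VAR3 =
        "[?&](?:(?!var3=)[^=&]+=[^&]*&?)+$|(?<=[?&])(?!var3=)[^=&]+=[^&]*&";

    // 3) remove all query params: everything from the '?' onward.
    static final String REMOVE_ALL_PARAMS = "\\?.*$";

    // 4) leave scheme and host: everything from the first single '/' onward.
    static final String HOST_ONLY = "(?<!/)/(?!/).*$";

    // 5) leave the last 3 host components: strip leading components, then the path.
    static final String LAST_3_HOST_COMPONENTS =
        "(?<=://)(?:[^./]+\\.)+(?=(?:[^./]+\\.){2}[^./]+(?:/|$))|" + HOST_ONLY;

    public static void main(String[] args) {
        // Hypothetical sample URL, not taken from the original post.
        String url = "http://www.a.b.example.com/path/page?var1=1&var2=2&var3=3";
        System.out.println(cleanUrl(REMOVE_VAR1_VAR3, url));       // ...page?var2=2
        System.out.println(cleanUrl(KEEP_VAR3, url));              // ...page?var3=3
        System.out.println(cleanUrl(REMOVE_ALL_PARAMS, url));      // http://www.a.b.example.com/path/page
        System.out.println(cleanUrl(HOST_ONLY, url));              // http://www.a.b.example.com
        System.out.println(cleanUrl(LAST_3_HOST_COMPONENTS, url)); // http://b.example.com
    }
}
```

The Java code stays identical for all five cases here, but each regex has to encode its own separator bookkeeping, which is the main cost of this design.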
 
I wouldn't use regex for this. I would write a method that parses the URL into its individual components (a String array or something like that) and then builds the result from them.
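A minimal sketch of that parse-and-rebuild idea, assuming the standard java.net.URI class does the parsing. The class name, method names, and sample URL in the usage note are hypothetical and not from the thread; this is one possible shape, not a definitive implementation.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

class UrlParts {
    // Rebuilds the URL, keeping only the query parameters whose names are in 'keep'.
    // An empty set therefore removes the whole query string.
    static String keepParams(String url, Set<String> keep) {
        URI uri = URI.create(url);
        List<String> kept = new ArrayList<>();
        String query = uri.getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                String name = pair.split("=", 2)[0];
                if (keep.contains(name)) {
                    kept.add(pair);
                }
            }
        }
        String base = uri.getScheme() + "://" + uri.getRawAuthority() + uri.getRawPath();
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }

    // Returns scheme plus the last n dot-separated host components (path and query dropped).
    static String lastHostComponents(String url, int n) {
        URI uri = URI.create(url);
        String[] parts = uri.getHost().split("\\.");
        int from = Math.max(0, parts.length - n);
        String host = String.join(".", Arrays.copyOfRange(parts, from, parts.length));
        return uri.getScheme() + "://" + host;
    }
}
```

With a hypothetical URL such as `http://www.a.b.example.com/path/page?var1=1&var2=2&var3=3`, `keepParams(url, Collections.singleton("var3"))` gives `http://www.a.b.example.com/path/page?var3=3` and `lastHostComponents(url, 3)` gives `http://b.example.com`.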

Cheers,
Dian
 
[0] Since the url, including its query string, can vary quite a bit, I doubt that a single pattern (other than an overworked, hard-to-maintain one) could cover needs as varied as 1-5 and, why not, more. Besides, the result also depends materially on exactly how the replaceAll method is used. Hence, I would make cleanUrl's first parameter an integer that indicates the specific kind of cleanup. The specific pattern and the use of replaceAll are then coded inside the method.

[1] This is a quick implementation of the idea.
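[tt]
// needed at the top of the enclosing class file
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private String cleanUrl(int n, String url) {
    Pattern p;
    Matcher m;
    String r = url; // returned unchanged if n is unknown or the pattern does not match
    switch (n) {
    case 1: // interpreted as keeping only the 2nd query name/value pair
        p = Pattern.compile("^(http://[^?]*)(\\?)([^=]+=[^&]*)(&)([^=]+=[^&]*)(&.+)?$");
        m = p.matcher(url);
        r = m.replaceAll("$1$2$5");
        break;
    case 2: // drops the first two name/value pairs and keeps the rest
        p = Pattern.compile("^(http://[^?]*)(\\?)([^=]+=[^&]*){2}(&)(.+)$");
        m = p.matcher(url);
        r = m.replaceAll("$1$2$5");
        break;
    case 3: // removes the whole query string
        p = Pattern.compile("^(http://[^?]*)(\\?.*)?$");
        m = p.matcher(url);
        r = m.replaceAll("$1");
        break;
    case 4: // leaves only the scheme and the full host
        p = Pattern.compile("^(http://[^/]*)(/.*)?$");
        m = p.matcher(url);
        r = m.replaceAll("$1");
        break;
    case 5: // leaves the last 3 host components; needs a path and more than 3 components to match
        p = Pattern.compile("^([^.]+\\.)+(([^/]*?\\.){2}[^/]*)(/.*)$");
        m = p.matcher(url);
        r = m.replaceAll("$2"); // group 2 holds the 3 components; "$1$3" returned only repeated-group fragments
        break; // was missing
    default:
        // do nothing
        break;
    }
    return r;
}
[/tt]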
[1.1] The patterns above are quick drafts that reflect one particular interpretation of the requirements. Refine them to match your exact interpretation.

[2] Using it is nothing special. For example, suppose x is an instance of the class in question.
[tt]
String surl="[ignore][/ignore]";
String surl_cleaned;
surl_cleaned=x.cleanUrl(1,surl);
System.out.println("[1]\n"+surl+"\n"+surl_cleaned);
surl_cleaned=x.cleanUrl(2,surl);
System.out.println("[2]\n"+surl+"\n"+surl_cleaned);
//etc ...
[/tt]
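Since the cases in the switch-based method above differ only in their pattern and replacement string, one way to get closer to the original requirement (identical Java code, with only the regex part varying) is to pass the two around as a pair. The sketch below is an assumption-laden illustration, not part of the posted solution: the `RegexRule` class and its constant names are hypothetical, while the three patterns are copied from cases 1, 3, and 4.

```java
import java.util.regex.Pattern;

class RegexRule {
    private final Pattern pattern;
    private final String replacement;

    RegexRule(String regex, String replacement) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
    }

    // The same code path for every rule; only the data differs.
    String cleanUrl(String url) {
        return pattern.matcher(url).replaceAll(replacement);
    }

    // Case 1: keep only the 2nd query name/value pair.
    static final RegexRule KEEP_SECOND_PARAM = new RegexRule(
        "^(http://[^?]*)(\\?)([^=]+=[^&]*)(&)([^=]+=[^&]*)(&.+)?$", "$1$2$5");

    // Case 3: remove the whole query string.
    static final RegexRule REMOVE_ALL_PARAMS = new RegexRule(
        "^(http://[^?]*)(\\?.*)?$", "$1");

    // Case 4: leave only the scheme and the full host.
    static final RegexRule HOST_ONLY = new RegexRule(
        "^(http://[^/]*)(/.*)?$", "$1");
}
```

For a hypothetical URL such as `http://www.a.b.example.com/path/page?var1=1&var2=2&var3=3`, `RegexRule.KEEP_SECOND_PARAM.cleanUrl(url)` gives `http://www.a.b.example.com/path/page?var2=2`, and `RegexRule.HOST_ONLY.cleanUrl(url)` gives `http://www.a.b.example.com`.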
 